1. 24 Feb, 2017 3 commits
    • Robbie Ko's avatar
      Btrfs: incremental send, do not issue invalid rmdir operations · 01914101
      Robbie Ko authored
      When both the parent and send snapshots have a directory inode with the
      same number but different generations (therefore they are different
      inodes) and both have an entry with the same name, an incremental send
      stream will contain an invalid rmdir operation that refers to the
      orphanized name of the inode from the parent snapshot.
      
      The following example scenario shows how this happens.
      
      Parent snapshot:
      
       .
       |---- d259_old/               (ino 259, gen 9)
       |         |---- d1/           (ino 258, gen 9)
       |
       |---- f                       (ino 257, gen 9)
      
      Send snapshot:
      
       .
       |---- d258/                   (ino 258, gen 7)
       |---- d259/                   (ino 259, gen 7)
               |---- d1/             (ino 257, gen 7)
      
      When the kernel is processing inode 258 it notices that in both snapshots
      there is an inode numbered 259 that is a parent of an inode 258. However
      it ignores the fact that the inodes numbered 259 have different generations
      in both snapshots, which means they are effectively different inodes.
      Then it checks that both inodes 259 have a dentry named "d1" and because
      of that it issues a rmdir operation with orphanized name of the inode 258
      from the parent snapshot. This happens at send.c:process_record_refs(),
      which calls send.c:did_overwrite_first_ref() that returns true and because
      of that later on at process_recorded_refs() such rmdir operation is issued
      because the inode being currently processed (258) is a directory and it
      was deleted in the send snapshot (and replaced with another inode that has
      the same number and is a directory too).
      Fix this issue by comparing the generations of parent directory inodes
      that have the same number and make send.c:did_overwrite_first_ref() when
      the generations are different.
      
      The following steps reproduce the problem.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ touch /mnt/f
       $ mkdir /mnt/d1
       $ mkdir /mnt/d259_old
       $ mv /mnt/d1 /mnt/d259_old/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ mkdir /mnt/d1
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/d1 /mnt/dir259/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
       $ btrfs receive /mnt/ -f /tmp/1.snap
       # Take note that once the filesystem is created, its current
       # generation has value 7 so the inodes from the second snapshot all have
       # a generation value of 7. And after receiving the first snapshot
       # the filesystem is at a generation value of 10, because the call to
       # create the second snapshot bumps the generation to 8 (the snapshot
       # creation ioctl does a transaction commit), the receive command calls
       # the snapshot creation ioctl to create the first snapshot, which bumps
       # the filesystem's generation to 9, and finally when the receive
       # operation finishes it calls an ioctl to transition the first snapshot
       # (snap1) from RW mode to RO mode, which does another transaction commit
       # and bumps the filesystem's generation to 10. This means all the inodes
       # in the first snapshot (snap1) have a generation value of 9.
       $ rm -f /tmp/1.snap
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdd
       $ mount /dev/sdd /mnt
       $ btrfs receive /mnt -f /tmp/1.snap
       $ btrfs receive -vv /mnt -f /tmp/2.snap
       receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
       utimes
       unlink f
       mkdir o257-7-0
       mkdir o259-7-0
       rename o257-7-0 -> o259-7-0/d1
       chown o259-7-0/d1 - uid=0, gid=0
       chmod o259-7-0/d1 - mode=0755
       utimes o259-7-0/d1
       rmdir o258-9-0
       ERROR: rmdir o258-9-0 failed: No such file or directory
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      01914101
    • Filipe Manana's avatar
      Btrfs: incremental send, do not delay rename when parent inode is new · fe9c798d
      Filipe Manana authored
      When we are checking if we need to delay the rename operation for an
      inode we not checking if a parent inode that exists in the send and
      parent snapshots is really the same inode or not, that is, we are not
      comparing the generation number of the parent inode in the send and
      parent snapshots. Not only this results in unnecessarily delaying a
      rename operation but also can later on make us generate an incorrect
      name for a new inode in the send snapshot that has the same number
      as another inode in the parent snapshot but a different generation.
      
      Here follows an example where this happens.
      
      Parent snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- dir258/                                       (ino 258, gen 7)
       |       |--- dir257/                               (ino 257, gen 7)
       |
       |--- dir259/                                       (ino 259, gen 7)
      
      Send snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- file258                                       (ino 258, gen 10)
       |
       |--- new_dir259/                                   (ino 259, gen 10)
                |--- dir257/                              (ino 257, gen 7)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, its new parent is created using its orphan
         name (o257-21-0), and the rename operation for inode 257 is delayed
         because its new parent (inode 259) was not yet processed - this
         decision to delay the rename operation does not make much sense
         because the inode 259 in the send snapshot is a new inode, it's not
         the same as inode 259 in the parent snapshot.
      
      2) When processing inode 258 we end up delaying its rmdir operation,
         because inode 257 was not yet renamed (moved away from the directory
         inode 258 represents). We also create the new inode 258 using its
         orphan name "o258-10-0", then rename it to its final name of "file258"
         and then issue a truncate operation for it. However this truncate
         operation contains an incorrect name, which corresponds to the orphan
         name and not to the final name, which makes the receiver fail. This
         happens because when we attempt to compute the inode's current name
         we verify that there's another inode with the same number (258) that
         has its rmdir operation pending and because of that we generate an
         orphan name for the new inode 258 (we do this in the function
         get_cur_path()).
      
      Fix this by not delayed the rename operation of an inode if it has parents
      with the same number but different generations in both snapshots.
      
      The following steps reproduce this example scenario.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ mkdir /mnt/dir257
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/dir257 /mnt/dir258/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       $ mv /mnt/dir258/dir257 /mnt/dir257
       $ rmdir /mnt/dir258
       $ rmdir /mnt/dir259
      
       # Remount the filesystem so that the next created inodes will have the
       # numbers 258 and 259. This is because when a filesystem is mounted,
       # btrfs sets the subvolume's inode counter to a value corresponding to
       # the highest inode number in the subvolume plus 1. This inode counter
       # is used to assign a unique number to each new inode and it's
       # incremented by 1 after very inode creation.
       # Note: we unmount and then mount instead of doing a mount with
       # "-o remount" because otherwise the inode counter remains at value 260.
       $ umount /mnt
       $ mount /dev/sdb /mnt
       $ touch /mnt/file258
       $ mkdir /mnt/new_dir259
       $ mv /mnt/dir257 /mnt/new_dir259/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
       $ umount /mnt
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ btrfs receive /mnt -f /tmo/1.snap
       $ btrfs receive /mnt -f /tmo/2.snap -vv
       receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
       utimes
       mkdir o259-10-0
       rename dir258 -> o258-7-0
       utimes
       mkfile o258-10-0
       rename o258-10-0 -> file258
       utimes
       truncate o258-10-0 size=0
       ERROR: truncate o258-10-0 failed: No such file or directory
      Reported-by: default avatarRobbie Ko <robbieko@synology.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      fe9c798d
    • Robbie Ko's avatar
      Btrfs: send, fix failure to rename top level inode due to name collision · 4dd9920d
      Robbie Ko authored
      Under certain situations, an incremental send operation can fail due to a
      premature attempt to create a new top level inode (a direct child of the
      subvolume/snapshot root) whose name collides with another inode that was
      removed from the send snapshot.
      
      Consider the following example scenario.
      
      Parent snapshot:
      
        .                 (ino 256, gen 8)
        |---- a1/         (ino 257, gen 9)
        |---- a2/         (ino 258, gen 9)
      
      Send snapshot:
      
        .                 (ino 256, gen 3)
        |---- a2/         (ino 257, gen 7)
      
      In this scenario, when receiving the incremental send stream, the btrfs
      receive command fails like this (ran in verbose mode, -vv argument):
      
        rmdir a1
        mkfile o257-7-0
        rename o257-7-0 -> a2
        ERROR: rename o257-7-0 -> a2 failed: Is a directory
      
      What happens when computing the incremental send stream is:
      
      1) An operation to remove the directory with inode number 257 and
         generation 9 is issued.
      
      2) An operation to create the inode with number 257 and generation 7 is
         issued. This creates the inode with an orphanized name of "o257-7-0".
      
      3) An operation rename the new inode 257 to its final name, "a2", is
         issued. This is incorrect because inode 258, which has the same name
         and it's a child of the same parent (root inode 256), was not yet
         processed and therefore no rmdir operation for it was yet issued.
         The rename operation is issued because we fail to detect that the
         name of the new inode 257 collides with inode 258, because their
         parent, a subvolume/snapshot root (inode 256) has a different
         generation in both snapshots.
      
      So fix this by ignoring the generation value of a parent directory that
      matches a root inode (number 256) when we are checking if the name of the
      inode currently being processed collides with the name of some other
      inode that was not yet processed.
      
      We can achieve this scenario of different inodes with the same number but
      different generation values either by mounting a filesystem with the inode
      cache option (-o inode_cache) or by creating and sending snapshots across
      different filesystems, like in the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a1
        $ mkdir /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ touch /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
        $ btrfs receive /mnt -f /tmp/1.snap
        # Take note that once the filesystem is created, its current
        # generation has value 7 so the inode from the second snapshot has
        # a generation value of 7. And after receiving the first snapshot
        # the filesystem is at a generation value of 10, because the call to
        # create the second snapshot bumps the generation to 8 (the snapshot
        # creation ioctl does a transaction commit), the receive command calls
        # the snapshot creation ioctl to create the first snapshot, which bumps
        # the filesystem's generation to 9, and finally when the receive
        # operation finishes it calls an ioctl to transition the first snapshot
        # (snap1) from RW mode to RO mode, which does another transaction commit
        # and bumps the filesystem's generation to 10.
        $ rm -f /tmp/1.snap
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt
        $ btrfs receive /mnt /tmp/1.snap
        # Receive of snapshot snap2 used to fail.
        $ btrfs receive /mnt /tmp/2.snap
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      4dd9920d
  2. 22 Feb, 2017 2 commits
    • Liu Bo's avatar
      Btrfs: use the correct type when creating cow dio extent · 6288d6ea
      Liu Bo authored
      'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
      'Btrfs: specify a new ordered extent type for create_io_em',
      but it missed the directIO cow case.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6288d6ea
    • Filipe Manana's avatar
      Btrfs: fix deadlock between dedup on same file and starting writeback · b1517622
      Filipe Manana authored
      If we are deduping two ranges of the same file we need to make sure that
      we lock all pages in ascending order, that is, lock first the pages from
      the range with lower offset and then the pages from the other range, as
      otherwise we can deadlock with a concurrent task that is starting delalloc
      (writeback). Example trace:
      
      [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
      [74073.053889]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.056696] kworker/u32:10  D    0 17997      2 0x00000000
      [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
      [74073.061370]  ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
      [74073.064784]  ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
      [74073.068386]  ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
      [74073.071712] Call Trace:
      [74073.072884]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.075395]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.077511]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.079440]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.081637]  [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
      [74073.083809]  [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
      [74073.086314]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.100654]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.102619]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.104771]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.106969]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.108954]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.110981]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.112833]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.115010]  [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
      [74073.116999]  [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
      [74073.119243]  [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
      [74073.121636]  [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
      [74073.124229]  [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
      [74073.126372]  [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
      [74073.129371]  [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
      [74073.131440]  [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
      [74073.134303]  [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
      [74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
      [74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
      [74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
      [74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
      [74073.136298]  [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
      [74073.138248]  [<ffffffff81138200>] do_writepages+0x23/0x2c
      [74073.139910]  [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
      [74073.142003]  [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
      [74073.143911]  [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
      [74073.145787]  [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
      [74073.147452]  [<ffffffff811b60cd>] wb_workfn+0x194/0x37d
      [74073.149084]  [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
      [74073.150726]  [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
      [74073.152694]  [<ffffffff8106cf96>] process_one_work+0x273/0x4e4
      [74073.154452]  [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
      [74073.156138]  [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
      [74073.157837]  [<ffffffff81072a81>] kthread+0xd5/0xdd
      [74073.159339]  [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
      [74073.161088]  [<ffffffff814c6257>] ret_from_fork+0x27/0x40
      [74073.162680] INFO: lockdep is turned off.
      [74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds.
      [74073.181180]       Tainted: G        W       4.9.0-rc7-btrfs-next-36+ #1
      [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [74073.185296] fdm-stress      D    0 30264  29974 0x00000000
      [74073.186810]  ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
      [74073.188998]  ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
      [74073.191070]  ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
      [74073.193286] Call Trace:
      [74073.193990]  [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
      [74073.195418]  [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
      [74073.196796]  [<ffffffff814c18d2>] schedule+0x8c/0xa0
      [74073.198163]  [<ffffffff814c4b36>] schedule_timeout+0x43/0xff
      [74073.199621]  [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
      [74073.201100]  [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
      [74073.202686]  [<ffffffff810be048>] ? ktime_get+0x41/0x52
      [74073.204051]  [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
      [74073.205585]  [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
      [74073.207123]  [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
      [74073.208238]  [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
      [74073.208871]  [<ffffffff8112b692>] __lock_page+0x6b/0x6d
      [74073.209430]  [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
      [74073.210101]  [<ffffffff8112b800>] lock_page+0x2f/0x32
      [74073.210636]  [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153
      [74073.211270]  [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
      [74073.212166]  [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
      [74073.213257]  [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
      [74073.214086]  [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
      [74073.214767]  [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
      [74073.215619]  [<ffffffff811a7953>] ? __fget+0x6b/0x77
      [74073.216338]  [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
      [74073.217149]  [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
      [74073.218102]  [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
      [74073.218968]  [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
      [74073.219938] INFO: lockdep is turned off.
      
      What happened was the following:
      
            CPU 1                                       CPU 2
      
                                                   btrfs_dedupe_file_range()
                                                     --> using same inode as source
                                                         and target
                                                     --> src range is [768K, 1Mb[
                                                     --> dst range is [0, 256K[
                                                    btrfs_cmp_data_prepare()
                                                     --> calls gather_extent_pages()
                                                         for range [768K, 1Mb[ and
                                                         locks all pages in that range
      
       do_writepages()
        btrfs_writepages()
         extent_writepages()
          extent_write_cache_pages()
           __extent_writepage()
            writepage_delalloc()
             find_lock_delalloc_range()
               --> finds range [0, 1Mb[
               lock_delalloc_pages()
                --> locks all pages in the
                    range [0, 768K[
                --> tries to lock page at
                    offset 768K
                      --> deadlock
      
                                                     --> calls gather_extent_pages()
                                                         to lock pages in the range
                                                         [0, 256K[
                                                          --> deadlock, task at CPU 1
                                                              already locked that
                                                              range and it's trying
                                                              to lock the range we
                                                              locked previously
      
      So fix this by making sure that during a dedup we always lock first the
      pages from the range with lower offset.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b1517622
  3. 17 Feb, 2017 35 commits