1. 08 Sep, 2014 2 commits
    • Filipe Manana's avatar
      Btrfs: fix fsync data loss after a ranged fsync · 49dae1bc
      Filipe Manana authored
      While we're doing a full fsync (when the inode has the flag
      BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
      portion of the file), we might have ordered operations that are started
      before or while we're logging the inode and that fall outside the fsync
      range.
      
      Therefore when a full ranged fsync finishes don't remove every extent
      map from the list of modified extent maps - as for some of them, that
      fall outside our fsync range, their respective ordered operation hasn't
      finished yet, meaning the corresponding file extent item wasn't inserted
      into the fs/subvol tree yet and therefore we didn't log it, and we must
      let the next fast fsync (one that checks only the modified list) see this
      extent map and log a matching file extent item to the log btree and wait
      for its ordered operation to finish (if it's still ongoing).
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      49dae1bc
    • Dan Carpenter's avatar
      Btrfs: kfree()ing ERR_PTRs · c47ca32d
      Dan Carpenter authored
      The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
      btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.
      
      These kind of bugs are "One Err Bugs" where there is just one error
      label that does everything.  I could set the "inherit = NULL" and keep
      the single out label but it ends up being more complicated that way.  It
      makes the code simpler to re-order the unwind so it's in the mirror
      order of the allocation and introduce some new error labels.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c47ca32d
  2. 02 Sep, 2014 2 commits
    • Filipe Manana's avatar
      Btrfs: fix crash while doing a ranged fsync · dac5705c
      Filipe Manana authored
      While doing a ranged fsync, that is, one whose range doesn't cover the
      whole possible file range (0 to LLONG_MAX), we can crash under certain
      circumstances with a trace like the following:
      
      [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      (...)
      [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
      (...)
      [41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
      (...)
      [41074.644919] Stack:
      (...)
      [41074.644919] Call Trace:
      [41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
      [41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
      [41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
      [41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
      [41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
      [41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
      [41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
      [41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
      [41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
      [41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
      [41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
      (...)
      
      The necessary conditions that lead to such crash are:
      
      * an incremental fsync (when the inode doesn't have the
        BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
        a file extent item ending at offset X;
      
      * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
        to a file truncate operation that reduces the file to a size smaller
        than X;
      
      * a ranged fsync call happens (via an msync for example), with a range that
        doesn't cover the whole file and the end of this range, lets call it Y, is
        smaller than X;
      
      * btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
        calls btrfs_truncate_inode_items() to remove all items from the log
        tree that are associated with our file;
      
      * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
        file extent item it removed is the one ending at offset X, where X > 0 and
        X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
        parameter set to X;
      
      * btrfs_ordered_update_i_size() sees that X is greater then the current ordered
        size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
        ordered operation with a range covering the offset X, calling a BUG_ON() if
        such ordered operation exists. This assumption is made because the disk_i_size
        is only increased after the corresponding file extent item is added to the
        btree (btrfs_finish_ordered_io);
      
      * But because our fsync covers only a limited range, such an ordered extent might
        exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
        extent to finish when calling btrfs_wait_ordered_range();
      
      And then by the time btrfs_ordered_update_i_size() is called, via:
      
         btrfs_sync_file() ->
             btrfs_log_dentry_safe() ->
                 btrfs_log_inode_parent() ->
                     btrfs_log_inode() ->
                         btrfs_truncate_inode_items() ->
                             btrfs_ordered_update_i_size()
      
      We hit the BUG_ON(), which could never happen if the fsync range covered the whole
      possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
      finish before calling btrfs_truncate_inode_items().
      
      So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
      from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
      
      Issue found while running xfstests/generic/127 (happens very rarely for me), more
      specifically via the fsx calls that use memory mapped IO (and issue msync calls).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      dac5705c
    • Filipe Manana's avatar
      Btrfs: fix corruption after write/fsync failure + fsync + log recovery · d9f85963
      Filipe Manana authored
      While writing to a file, in inode.c:cow_file_range() (and same applies to
      submit_compressed_extents()), after reserving an extent for the file data,
      we create a new extent map for the written range and insert it into the
      extent map cache. After that, we create an ordered operation, but if it
      fails (due to a transient/temporary-ENOMEM), we return without dropping
      that extent map, which points to a reserved extent that is freed when we
      return. A subsequent incremental fsync (when the btrfs inode doesn't have
      the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
      logs a file extent item based on that extent map, which points to a disk
      extent that doesn't contain valid data - it was freed by us earlier, at this
      point it might contain any random/garbage data.
      
      Therefore, if we reach an error condition when cowing a file range after
      we added the new extent map to the cache, drop it from the cache before
      returning.
      
      Some sequence of steps that lead to this:
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o commit=9999 /dev/sdd /mnt
          $ cd /mnt
      
          $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
          $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
          $ sync
      
          $ od -t x1 foo
          0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
          *
          0020000
      
          $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
      
          # Now this write + fsync fail with -ENOMEM, which was returned by
          # btrfs_add_ordered_extent() in inode.c:cow_file_range().
          $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
          $ xfs_io -c "fsync" foo
          fsync: Cannot allocate memory
      
          # Now do a new write + fsync, which will succeed. Our previous
          # -ENOMEM was a transient/temporary error.
          $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
          $ xfs_io -c "fsync" foo
      
          # Our file content (in page cache) is now:
          $ od -t x1 foo
          0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
          *
          0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # Now reboot the machine, and mount the fs, so that fsync log replay
          # takes place.
      
          # The file content is now weird, in particular the first 8Kb, which
          # do not match our data before nor after the sync command above.
          $ od -t x1 foo
          0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # In fact these first 4Kb are a duplicate of the last 4kb block.
          # The last write got an extent map/file extent item that points to
          # the same disk extent that we got in the write+fsync that failed
          # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
          # verify that:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      		extent data disk byte 12582912 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
      		extent data disk byte 0 nr 0
      		extent data offset 0 nr 8192 ram 8192
      	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
      		extent data disk byte 12582912 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      
          $ umount /dev/sdd
          $ btrfsck /dev/sdd
          Checking filesystem on /dev/sdd
          UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
          checking extents
          extent item 12582912 has multiple extent items
          ref mismatch on [12582912 4096] extent item 1, found 2
          Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
          backpointer mismatch on [12582912 4096]
          Errors found in extent allocation tree or chunk allocation
          checking free space cache
          checking fs roots
          root 5 inode 257 errors 1000, some csum missing
          found 131074 bytes used err is 1
          total csum bytes: 4
          total tree bytes: 131072
          total fs tree bytes: 32768
          total extent tree bytes: 16384
          btree space waste bytes: 123404
          file data blocks allocated: 274432
           referenced 274432
          Btrfs v3.14.1-96-gcc7fd5a-dirty
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d9f85963
  3. 27 Aug, 2014 1 commit
    • Chris Mason's avatar
      Btrfs: fix autodefrag with compression · e9512d72
      Chris Mason authored
      The autodefrag code skips defrag when two extents are adjacent.  But one
      big advantage for autodefrag is cutting down on the number of small
      extents, even when they are adjacent.  This commit changes it to defrag
      all small extents.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e9512d72
  4. 24 Aug, 2014 1 commit
    • Liu Bo's avatar
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo authored
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9e0af237
  5. 21 Aug, 2014 10 commits
    • Chris Mason's avatar
      Btrfs: fix filemap_flush call in btrfs_file_release · f6dc45c7
      Chris Mason authored
      We should only be flushing on close if the file was flagged as needing
      it during truncate.  I broke this with my ordered data vs transaction
      commit deadlock fix.
      
      Thanks to Miao Xie for catching this.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reported-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      f6dc45c7
    • Liu Bo's avatar
      Btrfs: fix crash on endio of reading corrupted block · 38c1c2e4
      Liu Bo authored
      The crash is
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/extent_io.c:2124!
      [...]
      Workqueue: btrfs-endio normal_work_helper [btrfs]
      RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
      
      This is in fact a regression.
      
      It is because we forgot to increase @offset properly in reading corrupted block,
      so that the @offset remains, and this leads to checksum errors while reading
      left blocks queued up in the same bio, and then ends up with hiting the above
      BUG_ON.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      38c1c2e4
    • Eric Sandeen's avatar
      btrfs: fix leak in qgroup_subtree_accounting() error path · a3c10895
      Eric Sandeen authored
      Coverity pointed this out; in the newly added
      qgroup_subtree_accounting(), if btrfs_find_all_roots()
      returns an error, we leak at least the parents pointer,
      and possibly the roots pointer, depending on what failure
      occurs.
      
      If btrfs_find_all_roots() returns an error, we need to
      free up all allocations before we return.  "roots" is
      initialized to NULL, so it should be safe to free
      it unconditionally (ulist_free() handles that case).
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a3c10895
    • Qu Wenruo's avatar
      btrfs: Use right extent length when inserting overlap extent map. · 51f395ad
      Qu Wenruo authored
      When current btrfs finds that a new extent map is going to be insereted
      but failed with -EEXIST, it will try again to insert the extent map
      but with the length of sectorsize.
      This is OK if we don't enable 'no-holes' feature since all extent space
      is continuous, we will not go into the not found->insert routine.
      
      But if we enable 'no-holes' feature, it will make things out of control.
      e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
      btrfs_get_extent() args: start:  27874 len 4100
      28672		  27874		28672	27874+4100	32768
                          |-----------------------|
      |---------hole--------------------|---------data----------|
      
      1) not found and insert
      Since no extent map containing the range, btrfs_get_extent() will go
      into the not_found and insert routine, which will try to insert the
      extent map (27874, 27847 + 4100).
      
      2) first overlap
      But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
      by add_extent_mapping().
      
      3) retry but still overlap
      After catching the -EEXIST, then btrfs_get_extent() will try insert it
      again but with 4K length, which still overlaps, so -EEXIST will be
      returned.
      
      This makes the following patch fail to punch hole.
      d7781546 btrfs: Avoid trucating page or punching hole in a already existed hole.
      
      This patch will use the right length, which is the (exsisting->start -
      em->start) to insert, making the above patch works in 'no-holes' mode.
      Also, some small code style problems in above patch is fixed too.
      Reported-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarFilipe David Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe David Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      51f395ad
    • Filipe Manana's avatar
      Btrfs: clone, don't create invalid hole extent map · 62e2390e
      Filipe Manana authored
      When cloning a file that consists of an inline extent, we were creating
      an extent map that represents a non-existing trailing hole starting at a
      file offset that isn't a multiple of the sector size. This happened because
      when processing an inline extent we weren't aligning the extent's length to
      the sector size, and therefore incorrectly treating the range
      [inline_extent_length; sector_size[ as a hole.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      62e2390e
    • Filipe Manana's avatar
      Btrfs: don't monopolize a core when evicting inode · 7064dd5c
      Filipe Manana authored
      If an inode has a very large number of extent maps, we can spend
      a lot of time freeing them, which triggers a soft lockup warning.
      Therefore reschedule if we need to when freeing the extent maps
      while evicting the inode.
      
      I could trigger this all the time by running xfstests/generic/299 on
      a file system with the no-holes feature enabled. That test creates
      an inode with 11386677 extent maps.
      
          $ mkfs.btrfs -f -O no-holes $TEST_DEV
          $ MKFS_OPTIONS="-O no-holes" ./check generic/299
          generic/299 382s ...
          Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
           kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
           384s
          Ran: generic/299
          Passed all 1 tests
      
          $ dmesg
          (...)
          [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
          (...)
          [86304.300036] Call Trace:
          [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
          [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
          [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
          [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
          [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
          [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
          [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
          [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
          [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
          [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
          [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
          [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
          [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
          [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
          [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
          [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
          (...)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7064dd5c
    • Filipe Manana's avatar
      Btrfs: fix hole detection during file fsync · 74121f7c
      Filipe Manana authored
      The file hole detection logic during a file fsync wasn't correct,
      because it didn't look back (in a previous leaf) for the last file
      extent item that can be in a leaf to the left of our leaf and that
      has a generation lower than the current transaction id. This made it
      assume that a hole exists when it really doesn't exist in the file.
      
      Such false positive hole detection happens in the following scenario:
      
      * We have a file that has many file extent items, covering 3 or more
        btree leafs (the first leaf must contain non file extent items too).
      
      * Two ranges of the file are modified, with their extent items being
        located at 2 different leafs and those leafs aren't consecutive.
      
      * When processing the second modified leaf, we weren't checking if
        some file extent item exists that is located in some leaf that is
        between our 2 modified leafs, and therefore assumed the range defined
        between the last file extent item in the first leaf and the first file
        extent item in the second leaf matched a hole.
      
      Fortunately this didn't result in overriding the log with wrong data,
      instead it made the last loop in copy_items() attempt to insert a
      duplicated key (for a hole file extent item), which makes the file
      fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
      turn ends up doing a full transaction commit, which is much more
      expensive then writing only to the log tree and wait for it to be
      durably persisted (as well as the file's modified extents/pages).
      Therefore fix the hole detection logic, so that we don't pay the
      cost of doing full transaction commits.
      
      I could trigger this issue with the following test for xfstests (which
      never fails, either without or with this patch). The last fsync call
      results in a full transaction commit, due to the -EEXIST error mentioned
      above. I could also observe this behaviour happening frequently when
      running xfstests/generic/075 in a loop.
      
      Test:
      
          _cleanup()
          {
              _cleanup_flakey
              rm -fr $tmp
          }
      
          # get standard environment, filters and checks
          . ./common/rc
          . ./common/filter
          . ./common/dmflakey
      
          # real QA test starts here
          _supported_fs btrfs
          _supported_os Linux
          _require_scratch
          _require_dm_flakey
          _need_to_be_root
      
          rm -f $seqres.full
      
          # Create a file with many file extent items, each representing a 4Kb extent.
          # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
          # as of btrfs-progs 3.12).
          _scratch_mkfs -l 16384 >/dev/null 2>&1
          _init_flakey
          SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
          MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
          _mount_flakey
      
          # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
          $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
      
          # For any of the following fsync calls, inode doesn't have the flag
          # BTRFS_INODE_NEEDS_FULL_SYNC set.
          for ((i = 1; i <= 500; i++)); do
              OFFSET=$((4096 * i))
              LEN=4096
              $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                      $SCRATCH_MNT/foo | _filter_xfs_io
          done
      
          # Commit transaction and bump next transaction's id (to 7).
          sync
      
          # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
          # inode runtime flags.
          $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo
      
          # Commit transaction and bump next transaction's id (to 8).
          sync
      
          # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
          # in the middle, containing only file extent items, isn't touched. So the
          # next fsync, when calling btrfs_search_forward(), won't visit that middle
          # leaf. First and 3rd leaf have now a generation with value 8, while the
          # middle leaf remains with a generation with value 6.
          $XFS_IO_PROG \
              -c "pwrite -S 0xee -b 4096 0 4096" \
              -c "pwrite -S 0xff -b 4096 2043904 4096" \
              -c "fsync" \
              $SCRATCH_MNT/foo | _filter_xfs_io
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          # During mount, we'll replay the log created by the fsync above, and the file's
          # md5 digest should be the same we got before the unmount.
          _mount_flakey
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
          MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
      
          status=0
          exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      74121f7c
    • Filipe Manana's avatar
      Btrfs: ensure tmpfile inode is always persisted with link count of 0 · 5762b5c9
      Filipe Manana authored
      If we open a file with O_TMPFILE, don't do any further operation on
      it (so that the inode item isn't updated) and then force a transaction
      commit, we get a persisted inode item with a link count of 1, and not 0
      as it should be.
      
      Steps to reproduce it (requires a modern xfs_io with -T support):
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o /dev/sdd /mnt
          $ xfs_io -T /mnt &
          $ sync
      
      Then btrfs-debug-tree shows the inode item with a link count of 1:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
          fs tree key (FS_TREE ROOT_ITEM 0)
          leaf 29556736 items 4 free space 15851 generation 6 owner 5
          fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
          chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
          	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
      		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
          	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
          		inode ref index 0 namelen 2 name: ..
          	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
          		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
          	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
      		orphan item
          checksum tree key (CSUM_TREE ROOT_ITEM 0)
          (...)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5762b5c9
    • Filipe Manana's avatar
      Btrfs: race free update of commit root for ro snapshots · 9c3b306e
      Filipe Manana authored
      This is a better solution for the problem addressed in the following
      commit:
      
          Btrfs: update commit root on snapshot creation after orphan cleanup
          (3821f348)
      
      The previous solution wasn't the best because of 2 reasons:
      
          1) It added another full transaction commit, which is more expensive
             than just swapping the commit root with the root;
      
          2) If a reboot happened after the first transaction commit (the one
             that creates the snapshot) and before the second transaction commit,
             then we would end up with the same problem if a send using that
             snapshot was requested before the first transaction commit after
             the reboot.
      
      This change addresses those 2 issues. The second issue is addressed by
      switching the commit root in the dentry lookup VFS callback, which is
      also called by the snapshot/subvol creation ioctl and performs orphan
      cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
      preventing race issues between a dentry lookup and snapshot creation.
      
      Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9c3b306e
    • Liu Bo's avatar
      Btrfs: fix regression of btrfs device replace · 87fa3bb0
      Liu Bo authored
      Commit 49c6f736(
      btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry
      in the process of device replace, but didn't take missing devices into account,
      so now we have
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
      IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
      ...
      
      To reproduce it,
      1. mkfs.btrfs -f disk1 disk2
      2. mkfs.ext4 disk1
      3. mount disk2 /mnt -odegraded
      4. btrfs replace start -B 1 disk3 /mnt
      --------------------------
      
      This fixes the problem.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      87fa3bb0
  6. 19 Aug, 2014 13 commits
  7. 15 Aug, 2014 9 commits
    • Chris Mason's avatar
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason authored
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8d875f95
    • Filipe Manana's avatar
      Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana authored
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
        | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will returns us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      27b9a812
    • Takashi Iwai's avatar
      Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai authored
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4eb1f66d
    • Liu Bo's avatar
      Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo authored
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Tested-By: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ce62003f
    • Mark Fasheh's avatar
      btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh authored
      ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f90e579c
    • Mark Fasheh's avatar
      btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh authored
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on it's
      exclusive counts.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1152651a
    • Filipe Manana's avatar
      Btrfs: read lock extent buffer while walking backrefs · 6f7ff6d7
      Filipe Manana authored
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6f7ff6d7
    • Josef Bacik's avatar
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik authored
      Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out to be not the
      case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call it's process function with no_quota set
      always.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e339a6b0
    • David Sterba's avatar
      btrfs: adjust statfs calculations according to raid profiles · ba7b6e62
      David Sterba authored
      This has been discussed in thread:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
      
      and this patch implements this proposal:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
      
      Works fine for "clean" raid profiles where the raid factor correction
      does the right job. Otherwise it's pessimistic and may show low space
      although there's still some left.
      
      The df nubmers are lightly wrong in case of mixed block groups, but this
      is not a major usecase and can be addressed later.
      
      The RAID56 numbers are wrong almost the same way as before and will be
      addressed separately.
      
      CC: Hugo Mills <hugo@carfax.org.uk>
      CC: cwillu <cwillu@cwillu.com>
      CC: Josef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ba7b6e62
  8. 03 Aug, 2014 2 commits