1. 03 Oct, 2014 10 commits
    • Sage Weil's avatar
      Btrfs: fix race in WAIT_SYNC ioctl · 42383020
      Sage Weil authored
      We check whether transid is already committed via last_trans_committed and
      then search through trans_list for pending transactions.  If
      last_trans_committed is updated by btrfs_commit_transaction after we check
      it (there is no locking), we will fail to find the committed transaction
      and return EINVAL to the caller.  This has been observed occasionally by
      ceph-osd (which uses this ioctl heavily).
      
      Fix by rechecking whether the provided transid <= last_trans_committed
      after the search fails, and if so return 0.
      Signed-off-by: default avatarSage Weil <sage@redhat.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      42383020
    • Filipe Manana's avatar
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana authored
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also makes us achieve the
      goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
      writeback for all dirty pages and before filemap_fdatawait_range() is
      called, the writeback for all dirty pages had already finished with
      errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      656f30db
    • Fabian Frederick's avatar
      Btrfs: remove redundant btrfs_verify_qgroup_counts declaration. · 15b636e1
      Fabian Frederick authored
      Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
      and keep prototype in qgroup.h only
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      15b636e1
    • Fabian Frederick's avatar
      btrfs: fix shadow warning on cmp · b99d9a6a
      Fabian Frederick authored
      cmp was declared twice in btrfs_compare_trees resulting in a shadow
      warning. This patch renames second internal variable.
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b99d9a6a
    • Fabian Frederick's avatar
      Btrfs: fix compilation errors under DEBUG · 1b6e4469
      Fabian Frederick authored
      bi_sector and bi_size moved to bi_iter since commit 4f024f37
      ("block: Abstract out bvec iterator")
      Signed-off-by: default avatarFabian Frederick <fabf@skynet.be>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1b6e4469
    • Liu Bo's avatar
      Btrfs: fix crash of btrfs_release_extent_buffer_page · 81465028
      Liu Bo authored
      This is actually inspired by Filipe's patch.  When write_one_eb() fails on
      submit_extent_page(), it'll give up writing this eb and mark it with
      EXTENT_BUFFER_IOERR.  So if it's not the last page that encounter the failure,
      there are some left pages which remain DIRTY, and if a later COW on this eb
      happens, ie. eb is COWed and freed, it'd run into BUG_ON in
      btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
      
      This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      81465028
    • Filipe Manana's avatar
      Btrfs: add missing end_page_writeback on submit_extent_page failure · 55e3bd2e
      Filipe Manana authored
      If submit_extent_page() fails in write_one_eb(), we end up with the current
      page not marked dirty anymore, unlocked and marked for writeback. But we never
      end up calling end_page_writeback() against the page, which will make calls to
      filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
      for the writeback bit to be cleared from the page.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      55e3bd2e
    • Qu Wenruo's avatar
      btrfs: Fix the wrong condition judgment about subset extent map · 32be3a1a
      Qu Wenruo authored
      Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
      best fitted extent map
      is using wrong condition to judgement whether the range is a subset of a
      existing extent map.
      
      This may cause bug in btrfs no-holes mode.
      
      This patch will correct the judgment and fix the bug.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      32be3a1a
    • Josef Bacik's avatar
      Btrfs: fix build_backref_tree issue with multiple shared blocks · bbe90514
      Josef Bacik authored
      Marc Merlin sent me a broken fs image months ago where it would blow up in the
      upper->checked BUG_ON() in build_backref_tree.  This is because we had a
      scenario like this
      
      block a -- level 4 (not shared)
         |
      block b -- level 3 (reloc block, shared)
         |
      block c -- level 2 (not shared)
         |
      block d -- level 1 (shared)
         |
      block e -- level 0 (shared)
      
      We go to build a backref tree for block e, we notice block d is shared and add
      it to the list of blocks to lookup it's backrefs for.  Now when we loop around
      we will check edges for the block, so we will see we looked up block c last
      time.  So we lookup block d and then see that the block that points to it is
      block c and we can just skip that edge since we've already been up this path.
      The problem is because we clear need_check when we see block d (as it is shared)
      we never add block b as needing to be checked.  And because block c is in our
      path already we bail out before we walk up to block b and add it to the backref
      check list.
      
      To fix this we need to reset need_check if we trip over a block that doesn't
      need to be checked.  This will make sure that any subsequent blocks in the path
      as we're walking up afterwards are added to the list to be processed.  With this
      patch I can now mount Marc's fs image and it'll complete the balance without
      panicing.  Thanks,
      Reported-by: default avatarMarc MERLIN <marc@merlins.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      bbe90514
    • Josef Bacik's avatar
      Btrfs: cleanup error handling in build_backref_tree · 75bfb9af
      Josef Bacik authored
      When balance panics it tends to panic in the
      
      BUG_ON(!upper->checked);
      
      test, because it means it couldn't build the backref tree properly.  This is
      annoying to users and frankly a recoverable error, nothing in this function is
      actually fatal since it is just an in-memory building of the backrefs for a
      given bytenr.  So go through and change all the BUG_ON()'s to ASSERT()'s, and
      fix the BUG_ON(!upper->checked) thing to just return an error.
      
      This patch also fixes the error handling so it tears down the work we've done
      properly.  This code was horribly broken since we always just panic'ed instead
      of actually erroring out, so it needed to be completely re-worked.  With this
      patch my broken image no longer panics when I mount it.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      75bfb9af
  2. 23 Sep, 2014 3 commits
    • Josef Bacik's avatar
      Btrfs: try not to ENOSPC on log replay · 1d52c78a
      Josef Bacik authored
      When doing log replay we may have to update inodes, which traditionally goes
      through our delayed inode stuff.  This will try to move space over from the
      trans handle, but we don't reserve space in our trans handle on replay since we
      don't know how much we will need, so instead we try to flush.  But because we
      have a trans handle open we won't flush anything, so if we are out of reserve
      space we will simply return ENOSPC.  Since we know that if an operation made it
      into the log then we definitely had space before the box bought the farm then we
      don't need to worry about doing this space reservation.  Use the
      fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
      item directly.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1d52c78a
    • Josef Bacik's avatar
      Btrfs: don't do async reclaim during log replay · f6acfd50
      Josef Bacik authored
      Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
      during log replay.  This is because we use fs_info->fs_root as our root for
      shrinking and such.  Technically we can use whatever root we want, but let's
      just not allow async reclaim while we're doing log replay.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f6acfd50
    • Josef Bacik's avatar
      Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik authored
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting or on mount notice that a block
      group has a used count of 0 then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      47ab2a6c
  3. 19 Sep, 2014 2 commits
    • Filipe Manana's avatar
      Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Filipe Manana authored
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO errors for example. When such error happens, it means we have parts
      of the on disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This worked in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree. In this path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return a -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (like
      redo the writes for e.g.) - and definitely not leave any file extent items
      in the log refer to non fully written extents.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8407f553
    • Filipe Manana's avatar
      Btrfs: fix fsync race leading to invalid data after log replay · 669249ee
      Filipe Manana authored
      When the fsync callback (btrfs_sync_file) starts, it first waits for
      the writeback of any dirty pages to start and finish without holding
      the inode's mutex (to reduce contention). After this it acquires the
      inode's mutex and repeats that process via btrfs_wait_ordered_range
      only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
      is set on the inode).
      
      This is not safe for a non full sync - we need to start and wait for
      writeback to finish for any pages that might have been made dirty
      before acquiring the inode's mutex and after that first step mentioned
      before. Why this is needed is explained by the following comment added
      to btrfs_sync_file:
      
        "Right before acquiring the inode's mutex, we might have new
         writes dirtying pages, which won't immediately start the
         respective ordered operations - that is done through the
         fill_delalloc callbacks invoked from the writepage and
         writepages address space operations. So make sure we start
         all ordered operations before starting to log our inode. Not
         doing this means that while logging the inode, writeback
         could start and invoke writepage/writepages, which would call
         the fill_delalloc callbacks (cow_file_range,
         submit_compressed_extents). These callbacks add first an
         extent map to the modified list of extents and then create
         the respective ordered operation, which means in
         tree-log.c:btrfs_log_inode() we might capture all existing
         ordered operations (with btrfs_get_logged_extents()) before
         the fill_delalloc callback adds its ordered operation, and by
         the time we visit the modified list of extent maps (with
         btrfs_log_changed_extents()), we see and process the extent
         map they created. We then use the extent map to construct a
         file extent item for logging without waiting for the
         respective ordered operation to finish - this file extent
         item points to a disk location that might not have yet been
         written to, containing random data - so after a crash a log
         replay will make our inode have file extent items that point
         to disk locations containing invalid data, as we returned
         success to userspace without waiting for the respective
         ordered operation to finish, because it wasn't captured by
         btrfs_get_logged_extents()."
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      669249ee
  4. 18 Sep, 2014 2 commits
    • Liu Bo's avatar
      Btrfs: fix wrong parse of extent map's tracepoint · 254a2d14
      Liu Bo authored
      The tracepoint of extent map doesn't parse @flag correctly, we set @flag via
      set_bit(), so we need to parse it on a bit bias.
      
      Also add the missing flag, EXTENT_FLAG_FS_MAPPING.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      254a2d14
    • Qu Wenruo's avatar
      btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map · e6c4efd8
      Qu Wenruo authored
      The following commit enhanced the merge_extent_mapping() to reduce
      fragment in extent map tree, but it can't handle case which existing
      lies before map_start:
      51f39 btrfs: Use right extent length when inserting overlap extent map.
      
      [BUG]
      When existing extent map's start is before map_start,
      the em->len will be minus, which will corrupt the extent map and fail to
      insert the new extent map.
      This will happen when someone get a large extent map, but when it is
      going to insert it into extent map tree, some one has already commit
      some write and split the huge extent into small parts.
      
      [REPRODUCER]
      It is very easy to tiger using filebench with randomrw personality.
      It is about 100% to reproduce when using 8G preallocated file in 60s
      randonrw test.
      
      [FIX]
      This patch can now handle any existing extent position.
      Since it does not directly use existing->start, now it will find the
      previous and next extent around map_start.
      So the old existing->start < map_start bug will never happen again.
      
      [ENHANCE]
      This patch will insert the best fitted extent map into extent map tree,
      other than the oldest [map_start, map_start + sectorsize) or the
      relatively newer but not perfect [map_start, existing->start).
      
      The patch will first search existing extent that does not intersects with
      the desired map range [map_start, map_start + len).
      The existing extent will be either before or behind map_start, and based
      on the existing extent, we can find out the previous and next extent
      around map_start.
      
      So the best fitted extent would be [prev->end, next->start).
      For prev or next is not found, em->start would be prev->end and em->end
      wold be next->start.
      
      With this patch, the fragment in extent map tree should be reduced much
      more than the 51f39 commit and reduce an unneeded extent map tree search.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e6c4efd8
  5. 17 Sep, 2014 23 commits