  1. 12 May, 2011 3 commits
    • btrfs scrub: make fixups sync · 96e36920
      Ilya Dryomov authored
      btrfs scrub - make fixups sync, don't reuse fixup bios
      
      Fixups are already sync for csum failures; this patch makes them sync
      for the EIO case as well.
      
      Fixups are now sharing pages with the parent sbio - instead of
      allocating a separate page to do a fixup we grab the page from the sbio
      buffer.
      
      Fixup bios are no longer reused.
      
      struct fixup is no longer needed; instead, pass [sbio pointer, index].
      
      Originally this was added to look at the possibility of sharing the code
      between drive swap and scrub, but it actually fixes a serious bug in
      scrub code where errors that could be corrected were ignored and
      reported as uncorrectable.
      
      btrfs scrub - restore bios properly after media errors
      
      The current code reallocates a bio after a media error.  This is a
      temporary measure introduced in v3 after a serious problem related to
      bio reuse was found in v2 of scrub patchset.
      
      Basically we did not reset the bv_offset and bv_len fields of the bio_vec
      structure.  They are changed when an I/O error happens, for example, at
      offset 512 or 1024 into the page.  Also, the bi_flags field wasn't properly
      set up before reusing the bio.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
    • btrfs: new ioctls for scrub · 475f6387
      Jan Schmidt authored
      Adds the ioctls necessary to start and cancel scrubs, to get current
      progress and to get info about devices to be scrubbed.
      Note that the scrub is done per-device and that the ioctl only
      returns after the scrub for this device is finished or has been
      canceled.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
    • btrfs: scrub · a2de733c
      Arne Jansen authored
      This adds an initial implementation for scrub. It works in a fairly
      straightforward way. User mode issues an ioctl for each device in the
      fs. For each device, it enumerates the allocated device chunks. For
      each chunk, the contained extents are enumerated and the data checksums
      fetched. The extents are read sequentially and the checksums verified.
      If an error occurs (checksum or EIO), a good copy is searched for. If
      one is found, the bad copy will be rewritten.
      All enumerations happen from the commit roots. During a transaction
      commit, the scrubs get paused and afterwards continue from the new
      roots.
      
      This commit is based on the series originally posted to linux-btrfs
      with some improvements that resulted from comments from David Sterba,
      Ilya Dryomov and Jan Schmidt.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
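The per-device flow described in the scrub commit (verify each block against its checksum, look for a good copy on error, rewrite the bad copy) can be sketched as a toy model. This is only an illustration under stated assumptions: `toy_csum`, `toy_scrub`, `struct scrub_stats` and the in-memory "mirrors" are hypothetical names invented here, and nothing below is taken from the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NMIRRORS   2
#define NBLOCKS    4
#define BLOCK_SIZE 8

/* Hypothetical toy checksum, used only to keep the sketch self-contained. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

struct scrub_stats {
    int verified;       /* blocks read and checked on the target mirror */
    int corrected;      /* bad blocks rewritten from a good copy */
    int uncorrectable;  /* bad blocks with no good copy anywhere */
};

/* Scrub one mirror ("device"): verify every block, repair from another mirror. */
static void toy_scrub(uint8_t dev[NMIRRORS][NBLOCKS][BLOCK_SIZE],
                      const uint32_t csums[NBLOCKS], int target,
                      struct scrub_stats *st)
{
    assert(target >= 0 && target < NMIRRORS);
    memset(st, 0, sizeof(*st));

    for (int b = 0; b < NBLOCKS; b++) {
        st->verified++;
        if (toy_csum(dev[target][b], BLOCK_SIZE) == csums[b])
            continue;                       /* block is fine */

        int fixed = 0;
        for (int m = 0; m < NMIRRORS && !fixed; m++) {
            if (m == target)
                continue;
            if (toy_csum(dev[m][b], BLOCK_SIZE) == csums[b]) {
                /* Good copy found: rewrite the bad copy. */
                memcpy(dev[target][b], dev[m][b], BLOCK_SIZE);
                fixed = 1;
            }
        }
        if (fixed)
            st->corrected++;
        else
            st->uncorrectable++;
    }
}
```

The sketch leaves out everything that makes the real implementation hard (commit roots, pausing during transaction commits, asynchronous I/O); it only shows the verify, search, rewrite order.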
  2. 25 Apr, 2011 8 commits
  3. 18 Apr, 2011 1 commit
    • Btrfs: fix free space cache leak · f65647c2
      Chris Mason authored
      The free space caching code was recently reworked to
      cache all the pages it needed instead of using find_get_page everywhere.
      
      One loop was missed though, so it ended up leaking pages.  This fixes
      it to use our page array instead of find_get_page.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 16 Apr, 2011 2 commits
  5. 15 Apr, 2011 1 commit
    • Btrfs: don't force chunk allocation in find_free_extent · 0e4f8f88
      Chris Mason authored
      find_free_extent likes to allocate in contiguous clusters,
      which makes writeback faster, especially on SSD storage.  As
      the FS fragments, these clusters become harder to find and we have
      to decide between allocating a new chunk to make more clusters
      or giving up on the cluster to allocate from the free space
      we have.
      
      Right now it creates too many chunks, and you can end up with
      a whole FS that is mostly empty metadata chunks.  This commit
      changes the allocation code to be more strict and only
      allocate new chunks when we've made good use of the chunks we
      already have.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 13 Apr, 2011 5 commits
  7. 12 Apr, 2011 8 commits
  8. 08 Apr, 2011 8 commits
    • Btrfs: check for duplicate iov_base's when doing dio reads · 93a54bc4
      Josef Bacik authored
      Apparently it is ok to submit a read to an IDE device with the same target page
      for different offsets.  This is what Windows does under qemu.  The problem is
      under DIO we expect them to be different buffers for checksumming reasons, and
      so this sort of thing will result in checksum errors, when in reality the file
      is fine.  So when reading, check to make sure that all iov bases are different,
      and if they aren't, fall back to buffered mode, since that will work out right.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
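The duplicate-base check described in this commit can be sketched in user space with the POSIX `struct iovec`. This is a minimal illustration, not the kernel patch: `has_duplicate_iov_base`, `choose_read_path` and the `read_path` enum are hypothetical names, and the quadratic scan is simply the most direct way to express the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Returns true if any two segments point at the same buffer start. */
static bool has_duplicate_iov_base(const struct iovec *iov, size_t nr_segs)
{
    assert(iov != NULL || nr_segs == 0);
    for (size_t i = 0; i < nr_segs; i++)
        for (size_t j = i + 1; j < nr_segs; j++)
            if (iov[i].iov_base == iov[j].iov_base)
                return true;
    return false;
}

/* Hypothetical labels for the two read paths named in the commit message. */
enum read_path { READ_DIRECT, READ_BUFFERED };

/* Fall back to buffered mode when the same target buffer appears twice. */
static enum read_path choose_read_path(const struct iovec *iov, size_t nr_segs)
{
    return has_duplicate_iov_base(iov, nr_segs) ? READ_BUFFERED : READ_DIRECT;
}
```

Two segments that both start at the same address yield `READ_BUFFERED`; distinct buffers keep the direct path.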
    • Btrfs: reuse the extent_map we found when calling btrfs_get_extent · 16d299ac
      Josef Bacik authored
      In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the
      range that we are looking for.  If we don't find an extent, btrfs_get_extent
      will insert an extent_map for that area and mark it as a hole.  So it does the
      job of allocating a new extent map and inserting it into the io tree.  But if
      we're creating a new extent we free it up and redo all of that work.  So instead
      pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
      disk space and set it up properly and bypass the freeing/allocating of a new
      extent map and the expensive operation of inserting the thing into the io_tree.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: do not use async submit for small DIO io's · 1ae39938
      Josef Bacik authored
      When looking at our DIO performance Chris said that for small IO's doing the
      async submit stuff tends to be more overhead than it's worth.  With this on top
      of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
      IO's.  Basically if we don't have to split the bio for the map length it's small
      enough to be directly submitted, otherwise go back to the async submit.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: don't split dio bios if we don't have to · 02f57c7a
      Josef Bacik authored
      We have been unconditionally allocating a new bio and re-adding all pages from
      our original bio to the new bio.  This is needed if our original bio is larger
      than our stripe size, but if it is smaller than the stripe size then there is no
      need to do this.  So check the map length and if we are under that then go ahead
      and submit the original bio.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
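The decision described here reduces to comparing the bio size with the map length. A minimal sketch, assuming that a bio exactly equal to the map length also counts as not needing a split (the commit message only says "under"); `dio_submit_policy` and the enum values are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

enum dio_submit {
    DIO_SUBMIT_ORIGINAL,     /* bio fits within the mapped stripe: submit as-is */
    DIO_SPLIT_AND_RESUBMIT   /* bio crosses the map length: build new bios */
};

/* Decide whether the original bio can be submitted without splitting. */
static enum dio_submit dio_submit_policy(uint64_t bio_bytes, uint64_t map_length)
{
    assert(map_length > 0);
    return bio_bytes <= map_length ? DIO_SUBMIT_ORIGINAL
                                   : DIO_SPLIT_AND_RESUBMIT;
}
```

With a 64 KiB map length, a 4 KiB request takes the direct path, which is the small-I/O case the neighbouring commit on async submission also targets.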
    • Btrfs: do not call btrfs_update_inode in endio if nothing changed · 1ef30be1
      Josef Bacik authored
      In the DIO code we often don't update the i_disk_size because the i_size isn't
      updated until after the DIO is completed, so basically we are allocating a path,
      doing a search, and updating the inode item for no reason since nothing changed.
      btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
      only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
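The return convention described above (1 means "nothing changed", 0 means "i_disk_size was updated") can be illustrated with a toy model. The struct and the `toy_` functions are hypothetical stand-ins, and the size rule used here (on-disk size may advance to the I/O end, capped by i_size) is a simplifying assumption, not the kernel's actual logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
    uint64_t i_size;        /* in-memory file size */
    uint64_t i_disk_size;   /* size recorded in the on-disk inode item */
    int inode_updates;      /* counts the expensive inode item updates */
};

/* Returns 0 if i_disk_size advanced, 1 if there was nothing to update. */
static int toy_ordered_update_i_size(struct toy_inode *inode, uint64_t io_end)
{
    uint64_t new_size = io_end < inode->i_size ? io_end : inode->i_size;

    if (new_size <= inode->i_disk_size)
        return 1;
    inode->i_disk_size = new_size;
    return 0;
}

/* End-of-I/O path: only pay for the inode update when something changed. */
static void toy_finish_ordered_io(struct toy_inode *inode, uint64_t io_end)
{
    assert(inode != NULL);
    if (toy_ordered_update_i_size(inode, io_end) == 0)
        inode->inode_updates++;   /* stands in for btrfs_update_inode */
}
```

While i_size has not yet been raised, completing an I/O past the old end leaves `inode_updates` untouched, which is the saving the commit describes.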
    • Btrfs: map the inode item when doing fill_inode_item · 12ddb96c
      Josef Bacik authored
      Instead of calling kmap_atomic for everything we set in the inode item, map the
      entire inode item at the start and unmap it at the end.  This makes a sequential
      dd of 400mb O_DIRECT something like 1% faster.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: only retry transaction reservation once · 06d5a589
      Josef Bacik authored
      I saw a lockup where we kept getting into this start transaction->commit
      transaction loop because of ENOSPC.  The fact is if we fail to make our
      reservation, we've tried _everything_ several times, so we only need to try and
      commit the transaction once, and if that doesn't work then we really are out of
      space and need to just exit.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
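The bounded retry described in this commit can be sketched as a loop that commits at most once before giving up. The space model below (`struct toy_space`, where a commit releases pinned bytes) and all `toy_` names are hypothetical; only the "try, commit once, try again, then fail with ENOSPC" shape comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

struct toy_space {
    long free_bytes;    /* space available for reservations right now */
    long pinned_bytes;  /* space that a transaction commit would release */
    int commits;        /* number of commits performed */
};

static int toy_reserve(struct toy_space *s, long bytes)
{
    if (s->free_bytes < bytes)
        return -ENOSPC;
    s->free_bytes -= bytes;
    return 0;
}

static void toy_commit(struct toy_space *s)
{
    s->free_bytes += s->pinned_bytes;
    s->pinned_bytes = 0;
    s->commits++;
}

/* Try to reserve; on ENOSPC commit once and retry, then give up. */
static int toy_reserve_retry_once(struct toy_space *s, long bytes)
{
    int committed = 0;

    assert(bytes >= 0);
    for (;;) {
        int ret = toy_reserve(s, bytes);

        if (ret != -ENOSPC)
            return ret;
        if (committed)
            return -ENOSPC;   /* already committed once: really out of space */
        toy_commit(s);
        committed = 1;
    }
}
```

The `committed` flag is what prevents the start-transaction/commit-transaction loop from spinning forever when the filesystem is genuinely full.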
    • Btrfs: deal with the case that we run out of space in the cache · be1a12a0
      Josef Bacik authored
      Currently we don't handle running out of space in the cache, so to fix this we
      keep track of how far in the cache we are.  Then we only dirty the pages if we
      successfully modify all of them, otherwise if we have an error or run out of
      space we can just drop them and not worry about the vm writing them out.
      Thanks,
      
      Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Signed-off-by: Josef Bacik <josef@redhat.com>
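The approach described here, tracking how far into the cache we are and dirtying pages only if every write succeeded, can be sketched as follows. The page layout, `TOY_PAGE_ENTRIES`, and `toy_write_cache` are hypothetical simplifications chosen for the example and do not reflect the real free space cache format.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_ENTRIES 4   /* hypothetical: entries that fit in one page */

struct toy_page {
    uint32_t slots[TOY_PAGE_ENTRIES];
    bool dirty;
};

/*
 * Copy entries into the preallocated cache pages. Pages are marked dirty
 * only after every entry fit; on -ENOSPC they are left clean so they can
 * simply be dropped instead of being written out.
 */
static int toy_write_cache(struct toy_page *pages, size_t nr_pages,
                           const uint32_t *entries, size_t nr_entries)
{
    size_t used = 0;                               /* how far into the cache we are */
    size_t capacity = nr_pages * TOY_PAGE_ENTRIES;

    assert(pages != NULL && entries != NULL);
    for (size_t i = 0; i < nr_entries; i++) {
        if (used == capacity)
            return -ENOSPC;                        /* ran out of cache space */
        pages[used / TOY_PAGE_ENTRIES].slots[used % TOY_PAGE_ENTRIES] = entries[i];
        used++;
    }
    for (size_t p = 0; p < nr_pages; p++)
        pages[p].dirty = true;
    return 0;
}
```

Deferring the dirty marking to the end is the point of the design: a partially written cache never becomes visible to writeback.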
  9. 05 Apr, 2011 4 commits
    • Btrfs: don't warn in btrfs_add_orphan · c9ddec74
      Josef Bacik authored
      When I moved the orphan adding to btrfs_truncate I missed the fact that during
      orphan cleanup we just add the orphan items to the orphan list without going
      through btrfs_orphan_add, which results in lots of warnings on mount if you have
      any orphan items that need to be truncated.  Just remove this warning since it's
      ok; this will allow all of the normal space accounting to take place.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix free space cache when there are pinned extents and clusters V2 · 43be2146
      Josef Bacik authored
      I noticed a huge problem with the free space cache that was presenting
      as an early ENOSPC.  Turns out when writing the free space cache out I
      forgot to take into account pinned extents and more importantly
      clusters.  This would result in us leaking free space every time we
      unmounted the filesystem and remounted it.
      
      I fix this by making sure to check and see if the current block group
      has a cluster and writing out any entries that are in the cluster to the
      cache, as well as writing any pinned extents we currently have to the
      cache since those will be available for us to use the next time the fs
      mounts.
      
      This patch also adds a check to the end of load_free_space_cache to make
      sure we got the right amount of free space cache, and if not make sure
      to clear the cache and re-cache the old fashioned way.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Fix uninitialized root flags for subvolumes · 08fe4db1
      Li Zefan authored
      root_item->flags and root_item->byte_limit are not initialized when
      a subvolume is created. This bug is not revealed until we added
      readonly snapshot support - now you mount a btrfs filesystem and you
      may find the subvolumes in it are readonly.
      
      To work around this problem, we steal a bit from root_item->inode_item->flags,
      and use it to indicate if those fields have been properly initialized.
      When we read a tree root from disk, we check if the bit is set, and if
      not we'll set the flag and initialize the two fields of the root item.
      Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
      cc: stable@kernel.org
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
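The workaround described in this commit, a borrowed flag bit that records whether two fields were ever initialized, can be sketched as below. The struct layout, the bit position, and the names `TOY_INODE_ROOT_ITEM_INIT` and `toy_fixup_root_item` are assumptions made for illustration; the commit message does not state which bit is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position for the "fields are initialized" marker. */
#define TOY_INODE_ROOT_ITEM_INIT (1ULL << 31)

struct toy_root_item {
    uint64_t inode_flags;  /* stands in for root_item->inode_item->flags */
    uint64_t flags;        /* may hold garbage on old subvolumes */
    uint64_t byte_limit;   /* may hold garbage on old subvolumes */
};

/* Run when a tree root is read from disk. */
static void toy_fixup_root_item(struct toy_root_item *ri)
{
    assert(ri != NULL);
    if (ri->inode_flags & TOY_INODE_ROOT_ITEM_INIT)
        return;                      /* fields were written deliberately */

    /* Marker missing: treat both fields as uninitialized and reset them. */
    ri->inode_flags |= TOY_INODE_ROOT_ITEM_INIT;
    ri->flags = 0;
    ri->byte_limit = 0;
}
```

Once the marker is set, later reads trust the stored values, so a subvolume that was intentionally made read-only stays read-only.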
    • btrfs: clear __GFP_FS flag in the space cache inode · adae52b9
      Miao Xie authored
      The object id of the space cache inode's key is allocated from the relative
      root, just like that of a regular file. So we can't identify the space cache inode by
      checking the object id of the inode's key, and we have to clear the __GFP_FS flag
      at the time we look up the space cache inode.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>