1. 15 Jun, 2012 11 commits
    • Btrfs: fix defrag regression · 6c282eb4
      Li Zefan authored
      If a file has 3 small extents:
      
      | ext1 | ext2 | ext3 |
      
      Running "btrfs fi defrag" will only defrag the last two extents if those
      extent mappings haven't been read into memory from disk.
      
      This bug was introduced by commit 17ce6ef8
      ("Btrfs: add a check to decide if we should defrag the range")
      
      The cause is that that commit looked up the previous and next extents using
      lookup_extent_mapping() only, which only sees mappings already cached in
      memory.
      
      While at it, remove the code that checks the previous extent, since
      it's sufficient to check the next extent.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
    • Btrfs: call filemap_fdatawrite twice for compression · 7ddf5a42
      Josef Bacik authored
      I removed this in an earlier commit and I was wrong.  Because compression
      can return from filemap_fdatawrite() without having actually set any of its
      pages as writeback, it can make filemap_fdatawait() do essentially nothing,
      and then we won't find any ordered extents because they may not have been
      created yet.  So not only does this make fsync() completely useless, but it
      will also screw up if you truncate on a non-page aligned offset since we
      zero out the end and then wait on ordered extents and then call drop caches.
      We can drop the cache before the io completes, and then when we try to unpin
      the extent we just wrote we won't find it and everything goes sideways.  So fix
      this by putting it back and put a giant comment there to keep me from trying
      to remove it in the future.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: keep inode pinned when compressing writes · 8180ef88
      Josef Bacik authored
      A user reported lots of problems using compression on the new code and it
      turns out part of the problem was that igrab() was failing when we added a
      new ordered extent.  This is because when writing out an inode under
      compression we immediately return without actually doing anything to the
      pages, and then in another thread at some point down the line actually do
      the ordered dance.  The problem is between the point that we start writeback
      and we actually add the ordered extent we could be trying to reclaim the
      inode, which makes igrab() return NULL.  So we need to do an igrab() when we
      create the async extent and then drop it when we are done with it.  This
      makes sure we stay pinned in memory until the ordered extent can get a
      reference on it and we are good to go.  With this patch we no longer panic
      in btrfs_finish_ordered_io().  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: implement ->show_devname · 9c5085c1
      Josef Bacik authored
      Because btrfs can remove the device that was mounted we need to have a
      ->show_devname so that in this case we can print out some other device in
      the file system to /proc/mounts.  So if there are multiple devices in a btrfs
      file system we will just print the device with the lowest devid that we can
      find.  This will make everything consistent and deal with device removal
      properly.  The drawback is if you mount with a device that is higher than
      the lowest devid it won't show up as the mounted device in /proc/mounts,
      but this is a small price to pay. This was inspired by Miao Xie's patch.
      Thanks,
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: use rcu to protect device->name · 606686ee
      Josef Bacik authored
      Al pointed out that we can just toss out the old name on a device and add a
      new one arbitrarily, so anybody who uses device->name in printk could
      possibly use freed memory.  Instead of adding locking around all of this he
      suggested doing it with RCU, so I've introduced a struct rcu_string that
      does just that and have gone through and protected all accesses to
      device->name that aren't under the uuid_mutex with rcu_read_lock().  This
      protects us and I will use it for dealing with removing the device that we
      used to mount the file system in a later patch.  Thanks,
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: unlock everything properly in the error case for nocow · 17ca04af
      Josef Bacik authored
      I was getting hung on umount when a transaction was aborted because a range
      of one of the free space inodes was still locked.  This is because the nocow
      stuff doesn't unlock anything on error.  This fixed the problem and I
      verified that is what was happening.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix btrfs_destroy_marked_extents · ee670f0a
      Josef Bacik authored
      So we're forcing the eb's to have their ref count set to 1 so invalidatepage
      works but this breaks lots of things, for example root nodes, and is just
      plain wrong; we don't need to just evict all of this stuff.  Also drop the
      invalidatepage altogether and add a page_cache_release().  With this patch
      we no longer hang when trying to access the root nodes after an aborted
      transaction and we no longer leak memory.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: abort the transaction if the commit fails · 7b8b92af
      Josef Bacik authored
      If a transaction commit fails we don't abort it so we don't set an error on
      the file system.  This patch fixes that by actually calling the abort stuff
      and then adding a check for a fs error in the transaction start stuff to
      make sure it is caught properly.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
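The two halves of the fix above can be sketched in userspace C: a failed commit marks the file system as being in an error state, and every later transaction start checks that state first. This is only a minimal sketch under stated assumptions. `fs_sketch` and the `*_sketch` functions are hypothetical names, not the kernel's, and returning `-EROFS` from the start path is an assumption made for the illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified file system state (not the kernel structure). */
struct fs_sketch {
    bool fs_error;    /* set once a transaction has been aborted */
    int  last_error;  /* errno-style code that caused the abort */
};

/* Refuse to start new work once the file system is in an error state. */
int start_transaction_sketch(const struct fs_sketch *fs)
{
    if (fs->fs_error)
        return -EROFS;  /* assumption: error reported as read-only */
    return 0;
}

/* Record the failure so that later callers can see it. */
void abort_transaction_sketch(struct fs_sketch *fs, int err)
{
    fs->fs_error = true;
    fs->last_error = err;
}

/*
 * write_result stands in for the outcome of writing the commit out.
 * A failed commit aborts the transaction instead of being dropped silently.
 */
int commit_transaction_sketch(struct fs_sketch *fs, int write_result)
{
    if (write_result < 0) {
        abort_transaction_sketch(fs, write_result);
        return write_result;
    }
    return 0;
}
```

With this shape, a caller that never looks at the commit's return value still cannot start another transaction after a failure.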
    • Btrfs: wake up transaction waiters when aborting a transaction · d7096fc3
      Josef Bacik authored
      I was getting lots of hung tasks and a NULL pointer dereference because we
      are not cleaning up the transaction properly when it aborts.  First we need
      to reset the running_transaction to NULL so we don't get a bad dereference
      for any start_transaction callers after this.  Also we cannot rely on
      waitqueue_active() since it's just a list_empty(), so just call wake_up()
      directly since that will do the barrier for us and such.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix locking in btrfs_destroy_delayed_refs · b939d1ab
      Josef Bacik authored
      The transaction abort stuff was throwing warnings from the list debugging
      code because we do a list_del_init outside of the delayed_refs spin lock.
      The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
      but we need to take the ref head mutex to make sure it's not being processed
      currently, and so if it is we need to drop the spin lock and then take and
      drop the mutex and do the search again.  If we can take the mutex then we
      can safely remove the head from the list and carry on.  Now when the
      transaction aborts I don't get the list debugging warnings.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error · beb42dd7
      Josef Bacik authored
      While doing my enospc work I got a transaction abort that resulted in a
      panic when we tried to unlock_page() an already unlocked page.  This is
      because we aren't calling extent_clear_unlock_delalloc with the locked page
      so it was unlocking all the pages in the range.  This is wrong since
      __extent_writepage expects to have the page locked still unless we return
      *page_started as 1.  This should keep us from panicking.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  2. 31 May, 2012 6 commits
  3. 30 May, 2012 23 commits
    • Btrfs: use delayed ref sequence numbers for all fs-tree updates · 95a06077
      Jan Schmidt authored
      The sequence number for delayed refs is needed to postpone certain delayed
      refs for a very short period while walking backrefs. Before the tree
      modification log, we thought we'd only have to hold back those references
      that don't have a counter operation.
      
      Now that we have the tree mod log, we're rewinding fs tree blocks to a
      defined consistent state. We cannot know in advance for which tree block
      we'll be doing rewind operations later. Therefore, we must postpone all the
      delayed refs for fs-tree blocks, even those having a counter operation.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
    • Merge branch 'for-chris' of... · cfc442b6
      Chris Mason authored
      Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into HEAD
    • Btrfs: fix false positive in check-integrity on unmount · 48235a68
      Stefan Behrens authored
      During unmount, it could happen that the integrity checker printed a
      warning message "attempt to free ... on umount which is not yet iodone"
      which turned out to be a false positive.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: fix runtime warning in check-integrity check data mode · 86ff7ffc
      Stefan Behrens authored
      If a file_extent_item was located at the very end of a leaf and there was
      not enough space to hold a full item, but there was enough space to hold
      one of type BTRFS_FILE_EXTENT_INLINE or PREALLOC, and it was only such a
      short item, a warning was printed anyway. This check is now fixed.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: set ioprio of scrub readahead to idle · 3d136a11
      Stefan Behrens authored
      Reduce ioprio class of scrub readahead threads to idle priority.
      This setting is fixed. This priority has shown the best performance
      during all measurements.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: fix return code in drop_objectid_items · 5bdbeb21
      Josef Bacik authored
      So dpkg fsync()'s the file and the directory containing the file whenever it
      writes to a file which is really slow in btrfs.  This is partly because
      fsync()'ing a directory _always_ committed the transaction instead of just
      going to the tree log.  This is because drop_objectid_items() would return 1
      since it does a btrfs_search_slot() which returns 1.  In tree-log jargon
      this means that we have to commit the transaction to be safe.  So just check
      if ret is greater than 0 and set it to 0 if it is.  With this patch we now
      use the tree-log instead of committing the entire transaction, which is
      twice as fast on my box.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
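The return-code fix above comes down to not leaking a search routine's positive "key not found" result to a caller that reads any non-zero value as "fall back to the slow path". A minimal userspace sketch, assuming a hypothetical `search_slot()` over a sorted array that follows the same 0 / 1 / negative convention the message describes; neither function is the kernel's:

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a tree search: returns 0 on an exact match,
 * 1 when the key is absent (with *slot set to the insertion point),
 * and a negative value on error.
 */
int search_slot(const int *keys, size_t n, int key, size_t *slot)
{
    size_t lo = 0, hi = n;

    if (keys == NULL)
        return -1;  /* error */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return (lo < n && keys[lo] == key) ? 0 : 1;
}

/*
 * Caller whose own contract is "0 on success, negative on error".
 * A positive "not found" from the search is not a failure here, so it is
 * folded into 0 instead of being passed up the call chain.
 */
int drop_items_sketch(const int *keys, size_t n, int key)
{
    size_t slot;
    int ret = search_slot(keys, n, key, &slot);

    if (ret > 0)
        ret = 0;
    return ret;
}
```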
    • Btrfs: check to see if the inode is in the log before fsyncing · 22ee6985
      Josef Bacik authored
      We have this check down in the actual logging code, but this is after we
      start a transaction and all that good stuff.  So move the helper
      inode_in_log() out so we can call it in fsync() and avoid starting a
      transaction altogether and just exit if we've already fsync()'ed this file
      recently.  You would notice this issue if you fsync()'ed a file over and
      over again until the transaction committed.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
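The early exit described above can be illustrated with a small userspace sketch: the "already logged" test runs before any transaction is started, so repeated fsync calls on an unchanged file do no transaction work. All struct fields and function names here are hypothetical, and the exact condition is an assumption for the illustration rather than the kernel's check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-inode state. */
struct log_inode_sketch {
    uint64_t logged_trans;  /* transaction id in which the inode was last logged */
    uint64_t last_change;   /* counter value of the inode's last modification */
};

/* Hypothetical, simplified per-root state. */
struct log_root_sketch {
    uint64_t running_trans;         /* id of the current transaction */
    uint64_t last_log_commit;       /* changes up to this counter are already logged */
    unsigned transactions_started;  /* instrumentation for the example */
};

/* True if everything this inode changed is already in the log. */
bool inode_in_log_sketch(const struct log_root_sketch *root,
                         const struct log_inode_sketch *inode)
{
    return inode->logged_trans == root->running_trans &&
           inode->last_change <= root->last_log_commit;
}

int fsync_sketch(struct log_root_sketch *root, struct log_inode_sketch *inode)
{
    /* Cheap check first: skip the transaction entirely if nothing is new. */
    if (inode_in_log_sketch(root, inode))
        return 0;

    root->transactions_started++;  /* stands in for starting a transaction */
    inode->logged_trans = root->running_trans;
    if (root->last_log_commit < inode->last_change)
        root->last_log_commit = inode->last_change;
    return 0;
}
```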
    • Btrfs: return value of btrfs_read_buffer is checked correctly · 018642a1
      Tsutomu Itoh authored
      btrfs_read_buffer() can return an error.
      Therefore, add code that checks the return value of btrfs_read_buffer().
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • Btrfs: read device stats on mount, write modified ones during commit · 733f4fbb
      Stefan Behrens authored
      The device statistics are written into the device tree with each
      transaction commit. Only modified statistics are written.
      When a filesystem is mounted, the device statistics for each involved
      device are read from the device tree and used to initialize the
      counters.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: add ioctl to get and reset the device stats · c11d2c23
      Stefan Behrens authored
      An ioctl interface is added to get the device statistic counters.
      A second ioctl is added to atomically get and reset these counters.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
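One way to picture "atomically get and reset" is an atomic exchange: the old value is returned and zero is stored in a single step, so an error counted between the read and the reset cannot be lost. The sketch below uses C11 atomics in userspace. The enum, struct, and function names are hypothetical, and nothing here reflects the actual ioctl layout.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter indices, loosely following the error classes named above. */
enum dev_stat_sketch {
    STAT_READ_ERRS,
    STAT_WRITE_ERRS,
    STAT_FLUSH_ERRS,
    STAT_CSUM_ERRS,
    STAT_CORRUPT_BLOCKS,
    STAT_COUNT
};

struct dev_stats_sketch {
    atomic_uint_fast64_t counters[STAT_COUNT];
};

void dev_stats_init(struct dev_stats_sketch *s)
{
    for (int i = 0; i < STAT_COUNT; i++)
        atomic_init(&s->counters[i], 0);
}

void dev_stat_inc(struct dev_stats_sketch *s, enum dev_stat_sketch which)
{
    atomic_fetch_add(&s->counters[which], 1);
}

uint64_t dev_stat_read(struct dev_stats_sketch *s, enum dev_stat_sketch which)
{
    return atomic_load(&s->counters[which]);
}

/* Return the current value and zero the counter in one atomic step. */
uint64_t dev_stat_read_and_reset(struct dev_stats_sketch *s,
                                 enum dev_stat_sketch which)
{
    return atomic_exchange(&s->counters[which], 0);
}
```

A separate load followed by a store of zero would open a window in which a concurrent increment is silently discarded; the exchange closes that window.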
    • Btrfs: add device counters for detected IO and checksum errors · 442a4f63
      Stefan Behrens authored
      The goal is to detect when drives start to show an increased error rate,
      which indicates that they should be replaced soon. Therefore statistic
      counters are
      added that count IO errors (read, write and flush). Additionally, the
      software detected errors like checksum errors and corrupted blocks are
      counted.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • btrfs: Drop unused function btrfs_abort_devices() · d07eb911
      Asias He authored
      1) This function is not used anywhere.
      
      2) Using blk_abort_queue() to abort the queue does not seem correct.
      blk_abort_queue() is used for timeout handling (block/blk-timeout.c).
      
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Asias He <asias@redhat.com>
    • Btrfs: fix the same inode id problem when doing auto defragment · 762f2263
      Miao Xie authored
      Two files in different subvolumes may have the same inode id, so
      the rb-tree which is used to manage the defragment objects must take it
      into account. This patch fixes this problem.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fall back to non-inline if we don't have enough space · 2adcac1a
      Josef Bacik authored
      If cow_file_range_inline fails with ENOSPC we abort the transaction which
      isn't very nice.  This really shouldn't be happening anyways but there's no
      sense in making it a horrible error when we can easily just go allocate
      normal data space for this stuff.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix how we deal with the orphan block rsv · 8a35d95f
      Josef Bacik authored
      Ceph was hitting this race where we would remove an inode from the per-root
      orphan list before we would release the space we had reserved for the inode.
      We actually don't need a list or anything, we just need to make sure the
      root doesn't try to free up the orphan reserve until after the inodes have
      released their reservations.  So use an atomic counter instead of a list on
      the root and only decrement the counter after we've released our
      reservation.  I've tested this as well as several others and we no longer
      see the warnings that you would see while running ceph.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
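The ordering rule in this message (release the inode's reservation first, decrement the count second, and let the root free its reserve only at zero) can be sketched with a C11 atomic counter. This is a hedged userspace illustration: `orphan_root`, its fields, and the three functions are hypothetical, and the sketch assumes callers already serialize `orphan_add()` against `orphan_try_free_reserve()` with some outer lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified root state. */
struct orphan_root {
    atomic_int  orphan_inodes;  /* inodes that still hold an orphan reservation */
    atomic_bool reserve_live;   /* whether the root's orphan reserve exists */
};

void orphan_root_init(struct orphan_root *r)
{
    atomic_init(&r->orphan_inodes, 0);
    atomic_init(&r->reserve_live, false);
}

/* An inode becomes an orphan: count it and make sure the reserve exists. */
void orphan_add(struct orphan_root *r)
{
    atomic_fetch_add(&r->orphan_inodes, 1);
    atomic_store(&r->reserve_live, true);
}

/* Inode side: release the inode's own reservation first, then drop the count. */
void orphan_del(struct orphan_root *r, uint64_t *inode_reserved)
{
    *inode_reserved = 0;                     /* stands in for releasing the space */
    atomic_fetch_sub(&r->orphan_inodes, 1);  /* only afterwards */
}

/* Root side: free the reserve only when no inode still counts against it. */
bool orphan_try_free_reserve(struct orphan_root *r)
{
    if (atomic_load(&r->orphan_inodes) != 0)
        return false;
    atomic_store(&r->reserve_live, false);
    return true;
}
```

Because the decrement is the last thing an inode does, a zero count implies that every inode has already given its space back, which is the property the list-based version failed to guarantee.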
    • Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Josef Bacik authored
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      its own thing so it can be treated normally.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
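The hazard with C bit fields is that writing one field is a read-modify-write of the whole containing word, so two writers holding different locks can overwrite each other's bits. A single flags word updated with atomic bit operations avoids that. Below is a minimal userspace sketch using C11 atomics as a stand-in for the kernel's bit helpers; the flag names and the struct are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers, one bit each in runtime_flags. */
enum {
    FLAG_IN_DEFRAG,
    FLAG_HAS_ORPHAN_ITEM,
    FLAG_DUMMY
};

struct inode_flags_sketch {
    atomic_ulong runtime_flags;
};

void flags_init(struct inode_flags_sketch *i)
{
    atomic_init(&i->runtime_flags, 0UL);
}

/* Each helper touches only its own bit, atomically. */
void flag_set(struct inode_flags_sketch *i, unsigned bit)
{
    atomic_fetch_or(&i->runtime_flags, 1UL << bit);
}

void flag_clear(struct inode_flags_sketch *i, unsigned bit)
{
    atomic_fetch_and(&i->runtime_flags, ~(1UL << bit));
}

bool flag_test(struct inode_flags_sketch *i, unsigned bit)
{
    return (atomic_load(&i->runtime_flags) & (1UL << bit)) != 0;
}

/* Set the bit and report whether it was already set. */
bool flag_test_and_set(struct inode_flags_sketch *i, unsigned bit)
{
    unsigned long mask = 1UL << bit;

    return (atomic_fetch_or(&i->runtime_flags, mask) & mask) != 0;
}
```

Since every update is a single atomic operation on the word, flags guarded by different locks (or by none) can share it without corrupting one another.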
    • Btrfs: merge contigous regions when loading free space cache · cd023e7b
      Josef Bacik authored
      When we write out the free space cache we will write out everything that is
      in our in memory tree, and then we will just walk the pinned extents tree
      and write anything we see there.  The problem with this is that during
      normal operations the pinned extents will be merged back into the free space
      tree normally, and then we can allocate space from the merged areas and
      commit them to the tree log.  If we crash and replay the tree log we will
      crash again because the tree log will try to free up space from what looks
      like 2 separate but contiguous entries, since one entry is from the original
      free space cache and the other was a pinned extent that was merged back.  To
      fix this we just need to walk the free space tree after we load it and merge
      contiguous entries back together.  This will keep the tree log stuff from
      breaking and it will make the allocator behave more nicely.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
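The merge step described above can be sketched as a single pass over entries sorted by offset: whenever one entry ends exactly where the next begins, the two are folded into one. This is an illustrative userspace version over a plain array. The `free_extent` struct and `merge_contiguous()` are hypothetical, and the sketch ignores anything beyond simple extent entries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free space entry: a byte range [offset, offset + bytes). */
struct free_extent {
    uint64_t offset;
    uint64_t bytes;
};

/*
 * Merge adjacent entries in place. Assumes the array is sorted by offset
 * and that entries do not overlap. Returns the new number of entries.
 */
size_t merge_contiguous(struct free_extent *e, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (e[out].offset + e[out].bytes == e[i].offset)
            e[out].bytes += e[i].bytes;  /* extend the current entry */
        else
            e[++out] = e[i];             /* gap: start a new entry */
    }
    return out + 1;
}
```

After the pass, a range freed in two pieces is represented by one entry, so a later request to remove space spanning both pieces finds a single matching entry.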
    • Btrfs: do not do balance in readonly mode · 9ba1f6e4
      Liu Bo authored
      In normal cases, we would not be allowed to do balance in RO mode.
      However, when we're using a seeding device and adding another device to sprout,
      things will change:
      
      $ mkfs.btrfs /dev/sdb7
      $ btrfstune -S 1 /dev/sdb7
      $ mount /dev/sdb7 /mnt/btrfs -o ro
      $ btrfs fi bal /mnt/btrfs   -----------------------> fail.
      $ btrfs dev add /dev/sdb8 /mnt/btrfs
      $ btrfs fi bal /mnt/btrfs   -----------------------> works!
      
      It should not be designed as an exception, and we'd better add another check for
      mnt flags.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
    • Btrfs: use fastpath in extent state ops as much as possible · d1ac6e41
      Liu Bo authored
      Fully utilize our extent state's new helper functions to use
      fastpath as much as possible.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix wrong error returned by adding a device · f8c5d0b4
      Liu Bo authored
      Reproduce:
      $ mkfs.btrfs /dev/sdb7
      $ mount /dev/sdb7 /mnt/btrfs -o ro
      $ btrfs dev add /dev/sdb8 /mnt/btrfs
      ERROR: error adding the device '/dev/sdb8' - Invalid argument
      
      Since we mount with readonly options, and /dev/sdb7 is not a seeding one,
      a readonly notification is preferred.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
    • Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik authored
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independently of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: do not check delalloc when updating disk_i_size · 4e899152
      Josef Bacik authored
      We are checking delalloc to see if it is ok to update the i_size.  There are
      2 cases it stops us from updating:
      
      1) If there is delalloc between our current disk_i_size and this ordered
      extent
      
      2) If there is delalloc between our current ordered extent and the next
      ordered extent
      
      These tests are racy however since we can set delalloc for these ranges at
      any time.  Also for the first case if we notice there is delalloc between
      disk_i_size and our ordered extent we will not update disk_i_size and assume
      that when that delalloc bit gets written out it will update everything
      properly.  However if we crash before that we will have file extents outside
      of our i_size, which is not good, so this test is dangerous as well as racy.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: avoid buffer overrun in mount option handling · f60d16a8
      Jim Meyering authored
      There is an off-by-one error: allocating room for a maximal result
      string but without room for a trailing NUL.  That can lead to
      returning a transformed string that is not NUL-terminated, and
      then to a caller reading beyond end of the malloc'd buffer.
      
      Rewrite to s/kzalloc/kmalloc/, remove unwarranted use of strncpy
      (the result is guaranteed to fit), remove dead strlen at end, and
      change a few variable names and comments.
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Jim Meyering <meyering@redhat.com>
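The class of bug described above, sizing a buffer for the transformed string but not for its terminator, is easy to show in userspace. The sketch below is not the kernel function. `replace_first()` is a hypothetical helper that replaces one substring of an option string, and it uses `malloc` where kernel code would use its own allocator. The point is the `+ 1` in the allocation and the fact that the lengths are computed up front, so plain `memcpy` is enough.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of `args` with the first occurrence of
 * `from` replaced by `to`. Returns NULL if `from` does not occur or if
 * the allocation fails. The caller frees the result.
 */
char *replace_first(const char *args, const char *from, const char *to)
{
    const char *hit = strstr(args, from);
    size_t from_len, to_len, prefix, suffix;
    char *out;

    if (hit == NULL)
        return NULL;

    from_len = strlen(from);
    to_len = strlen(to);
    prefix = (size_t)(hit - args);    /* bytes before the match */
    suffix = strlen(hit + from_len);  /* bytes after the match */

    /* The result needs prefix + to_len + suffix bytes, plus 1 for the NUL. */
    out = malloc(prefix + to_len + suffix + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, args, prefix);
    memcpy(out + prefix, to, to_len);
    /* suffix + 1 copies the source's terminating NUL as well. */
    memcpy(out + prefix + to_len, hit + from_len, suffix + 1);
    return out;
}
```

Dropping the `+ 1` reproduces the overrun: the final `memcpy` would write the terminator one byte past the end of the allocation.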