1. 16 Jan, 2012 7 commits
    • Btrfs: profiles filter · ed25e9b2
      Ilya Dryomov authored
      Select chunks based on a given profile mask.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
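As a rough illustration of selecting chunks by a profile mask, the Python sketch below models chunk flags as plain integers. The bit values are assumed to mirror the on-disk BTRFS_BLOCK_GROUP_* constants; the helper name `chunk_matches_profiles` and the sample chunk list are hypothetical and are not taken from the patch.

```python
# Profile bits, assumed to mirror the on-disk BTRFS_BLOCK_GROUP_* values.
RAID0 = 1 << 3
RAID1 = 1 << 4
DUP = 1 << 5
RAID10 = 1 << 6
PROFILE_MASK = RAID0 | RAID1 | DUP | RAID10


def chunk_matches_profiles(chunk_flags: int, wanted_profiles: int) -> bool:
    """Return True if the chunk's profile bit is in the wanted mask (hypothetical helper)."""
    return bool(chunk_flags & PROFILE_MASK & wanted_profiles)


# Hypothetical chunks as (name, flags); bit 0 marks data, bit 2 marks metadata.
chunks = [
    ("data-raid0", (1 << 0) | RAID0),
    ("meta-raid1", (1 << 2) | RAID1),
    ("meta-dup", (1 << 2) | DUP),
]

# Select only the chunks whose profile is RAID1 or DUP.
selected = [name for name, flags in chunks if chunk_matches_profiles(flags, RAID1 | DUP)]
print(selected)
```

Note that a SINGLE chunk carries no profile bit in this model, so a plain mask like this one cannot select it.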
    • Btrfs: add basic infrastructure for selective balancing · f43ffb60
      Ilya Dryomov authored
      This allows a separate set of filters for each chunk type
      (data, meta, sys).  The code, however, is generic, and the switch on
      chunk type is done only once.
      
      This commit also adds a type filter: it allows balancing, for example,
      meta and system chunks w/o touching data ones.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • Btrfs: add basic restriper infrastructure · c9e9f97b
      Ilya Dryomov authored
      Add basic restriper infrastructure: extended balancing ioctl and all
      related ioctl data structures, add data structure for tracking
      restriper's state to fs_info, etc.  The semantics of the old balancing
      ioctl are fully preserved.
      
      Explicitly disallow any volume operations when balance is in progress.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • Btrfs: make avail_*_alloc_bits fields dynamic · 10ea00f5
      Ilya Dryomov authored
      Currently, when new chunks are created, the respective avail_alloc_bits
      field is updated to reflect the profiles of all chunks present in the
      system.  However, when chunks are removed, profile bits are never cleared.
      
      This patch clears the profile bit of the respective avail_alloc_bits field
      when the last chunk with that profile is removed.  Restriper needs this to
      operate properly when "downgrading".
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
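The behaviour described above, setting a profile bit when a chunk appears and clearing it only when the last chunk with that profile goes away, can be sketched in Python as follows. This is a model under stated assumptions: the `AvailBits` class is hypothetical, and it tracks chunks with a per-profile counter, which is not necessarily how the patch detects the last chunk.

```python
from collections import Counter

RAID1 = 1 << 4   # assumed on-disk profile bit
DUP = 1 << 5     # assumed on-disk profile bit


class AvailBits:
    """Hypothetical tracker for one avail_*_alloc_bits field."""

    def __init__(self) -> None:
        self.bits = 0
        self._chunks_per_profile = Counter()

    def add_chunk(self, profile: int) -> None:
        self._chunks_per_profile[profile] += 1
        self.bits |= profile

    def remove_chunk(self, profile: int) -> None:
        self._chunks_per_profile[profile] -= 1
        if self._chunks_per_profile[profile] == 0:
            # The last chunk with this profile is gone: clear its bit.
            self.bits &= ~profile


avail = AvailBits()
avail.add_chunk(RAID1)
avail.add_chunk(RAID1)
avail.add_chunk(DUP)
avail.remove_chunk(DUP)     # last DUP chunk removed, DUP bit cleared
avail.remove_chunk(RAID1)   # one RAID1 chunk remains, RAID1 bit stays set
print(bin(avail.bits))
```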
    • Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit · a46d11a8
      Ilya Dryomov authored
      Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
      avail_{data,metadata,system}_alloc_bits fields, which gather info about
      available allocation profiles in the FS.  When chunk is created or read
      from disk, its profile is OR'ed with the corresponding avail_alloc_bits
      field.  Since SINGLE is denoted by 0 in the on-disk format, currently
      there is no way to tell when such chunks become available.  Restriper
      needs that information, so add a separate bit for SINGLE profile.
      
      This bit is going to be in-memory only, it should never be written out
      to disk, so it's not a disk format change.  However to avoid remappings
      in future, reserve corresponding on-disk bit.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
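The problem with a profile encoded as 0 can be shown with a few lines of Python. The sketch assumes RAID1 is bit 4 on disk and uses bit 48 for the in-memory SINGLE bit; both values and the helper `extended_profile` are assumptions for illustration, not quotations from the patch.

```python
RAID1 = 1 << 4                     # assumed on-disk profile bit
SINGLE_ON_DISK = 0                 # SINGLE has no bit in the on-disk format
AVAIL_ALLOC_BIT_SINGLE = 1 << 48   # assumed value, in-memory only


def extended_profile(on_disk_profile: int) -> int:
    """Map an on-disk profile to its in-memory form (hypothetical helper)."""
    return on_disk_profile if on_disk_profile else AVAIL_ALLOC_BIT_SINGLE


# OR-ing the on-disk value leaves no trace of a SINGLE chunk.
avail = 0
avail |= SINGLE_ON_DISK
print(avail == 0)

# With the extra bit, SINGLE chunks become visible next to other profiles.
avail_ext = 0
avail_ext |= extended_profile(SINGLE_ON_DISK)
avail_ext |= extended_profile(RAID1)
print(bool(avail_ext & AVAIL_ALLOC_BIT_SINGLE))
```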
    • Btrfs: introduce masks for chunk type and profile · 52ba6929
      Ilya Dryomov authored
      Chunk's type and profile are encoded in u64 flags field.  Introduce
      masks to easily access them.  Also fix the type of BTRFS_BLOCK_GROUP_*
      constants, it should be ULL.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
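A minimal Python sketch of what such masks do: one combined flags value is split into its type part and its profile part. The bit positions are assumed to mirror the on-disk BTRFS_BLOCK_GROUP_* constants, and `split_flags` is a hypothetical helper.

```python
# Assumed to mirror the on-disk BTRFS_BLOCK_GROUP_* bits.
DATA = 1 << 0
SYSTEM = 1 << 1
METADATA = 1 << 2
RAID0 = 1 << 3
RAID1 = 1 << 4
DUP = 1 << 5
RAID10 = 1 << 6

TYPE_MASK = DATA | SYSTEM | METADATA
PROFILE_MASK = RAID0 | RAID1 | DUP | RAID10


def split_flags(flags: int) -> tuple:
    """Split a chunk's flags into (type bits, profile bits)."""
    return flags & TYPE_MASK, flags & PROFILE_MASK


chunk_type, profile = split_flags(METADATA | RAID1)
print(chunk_type == METADATA, profile == RAID1)
```

Python integers are unbounded, so the sketch cannot show the ULL part of the fix. In C, an unsuffixed constant such as `1 << 3` has type int, which is narrower than the u64 flags field.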
    • Btrfs: get rid of *_alloc_profile fields · 6fef8df1
      Ilya Dryomov authored
      {data,metadata,system}_alloc_profile fields have been unused for a long
      time now.  Get rid of them.
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  2. 23 Dec, 2011 2 commits
  3. 15 Dec, 2011 16 commits
    • Btrfs: unplug every once and a while · d85c8a6f
      Chris Mason authored
      The btrfs io submission threads can build up massive plug lists.  This
      keeps things more reasonable so we don't hand over huge dumps of IO at
      once.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Merge branch 'for-chris' of... · 567a45e9
      Chris Mason authored
      Merge branch 'for-chris' of http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration
      
      Conflicts:
      	fs/btrfs/inode.c
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code · e755d9ab
      Chris Mason authored
      btrfs_update_inode is sometimes called with a null reservation.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: only set cache_generation if we setup the block group · e65cbb94
      Josef Bacik authored
      A user reported a problem booting into a new kernel with the old format inodes.
      He was panicking in cow_file_range while writing out the inode cache.  This is
      because if the block group is not cached we'll just skip writing out the cache,
      however if it gets dirtied again in the same transaction and it finished caching
      we'd go ahead and write it out, but since we set cache_generation to the transid
      we think we've already truncated it and will just carry on, running into
      cow_file_range and blowing up.  We need to make sure we only set
      cache_generation if we've done the truncate.  The user tested this patch and
      verified that the panic no longer occurred.  Thanks,
      Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: don't panic if orphan item already exists · ee4d89f0
      Josef Bacik authored
      I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
      a loop.  This is because we will add an orphan item, do the truncate, the
      truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
      left with an orphan item still in the fs.  Then we come back later to do another
      truncate and it blows up because we already have an orphan item.  This is ok so
      just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
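The error handling described above, treating an already existing orphan item as acceptable and any other failure as fatal, might look like this in a Python model. Both functions are hypothetical stand-ins; only the EEXIST rule comes from the commit message.

```python
import errno


def insert_orphan_item(existing: set, ino: int) -> int:
    """Hypothetical stand-in for the insert; returns 0 or a negative errno."""
    if ino in existing:
        return -errno.EEXIST
    existing.add(ino)
    return 0


def orphan_add(existing: set, ino: int) -> None:
    ret = insert_orphan_item(existing, ino)
    # An orphan item left behind by a failed truncate is fine;
    # any other error is still treated as fatal in this model.
    if ret and ret != -errno.EEXIST:
        raise RuntimeError(f"unexpected error {ret}")


orphans = set()
orphan_add(orphans, 257)
orphan_add(orphans, 257)   # second add after a failed truncate is tolerated
print(sorted(orphans))
```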
    • Btrfs: fix leaked space in truncate · 7041ee97
      Josef Bacik authored
      We were occasionally leaking space when running xfstest 269.  This is because if
      we failed to start the transaction in the truncate loop we'd just goto out, but
      we need to break so that the inode is removed from the orphan list and the space
      is properly freed.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix how we do delalloc reservations and how we free reservations on error · 660d3f6c
      Josef Bacik authored
      Running xfstests 269 with some tracing my scripts kept spitting out errors about
      releasing bytes that we didn't actually have reserved.  This took me down a huge
      rabbit hole and it turns out the way we deal with reserved_extents is wrong,
      we need to only be setting it if the reservation succeeds, otherwise the free()
      method will come in and unreserve space that isn't actually reserved yet, which
      can lead to other warnings and such.  The math was all working out right in the
      end, but it caused all sorts of other issues in addition to making my scripts
      yell and scream and generally make it impossible for me to track down the
      original issue I was looking for.  The other problem is with our error handling
      in the reservation code.  There are two cases that we need to deal with:
      
      1) We raced with free.  In this case free won't free anything because csum_bytes
      is modified before we drop the lock in our reservation path, so free rightly
      doesn't release any space because the reservation code may be depending on that
      reservation.  However if we fail, we need the reservation side to do the free at
      that point since that space is no longer in use.  So as it stands the code was
      doing this fine and it worked out, except in case #2
      
      2) We don't race with free.  Nobody comes in and changes anything, and our
      reservation fails.  In this case we didn't reserve anything anyway and we just
      need to clean up csum_bytes but not free anything.  So we keep track of
      csum_bytes before we drop the lock and if it hasn't changed we know we can just
      decrement csum_bytes and carry on.
      
      Because of the case where we can race with free()'s since we have to drop our
      spin_lock to do the reservation, I'm going to serialize all reservations with
      the i_mutex.  We already get this for free in the heavy use paths, truncate and
      file write all hold the i_mutex, just needed to add it to page_mkwrite and
      various ioctl/balance things.  With this patch my space leak scripts no longer
      scream bloody murder.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: deal with enospc from dirtying inodes properly · 22c44fe6
      Josef Bacik authored
      Now that we're properly keeping track of delayed inode space we've been getting
      a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
      because a bunch of people call mark_inode_dirty, which is void so we can't
      return ENOSPC.  This needs to be fixed in a few areas:
      
      1) file_update_time - this updates the mtime and such when writing to a file,
      which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
      call btrfs_dirty_inode directly and return an error if we get one appropriately.
      
      2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
      setting ->setattr for symlinks, even though we should have been.  This catches
      one of the cases where we were getting errors in mark_inode_dirty.
      
      3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
      instead of mark_inode_dirty.  This lets us return errors properly for truncate
      and chown/anything related to setattr.
      
      4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
      print an error if we have one.  The only remaining user we can't control for
      this is touch_atime(), but we don't really want to keep people from walking
      down the tree if we don't have space to save the atime update, so just complain
      but don't worry about it.
      
      With this patch xfstests 83 complains a handful of times instead of hundreds of
      times.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix num_workers_starting bug and other bugs in async thread · 0dc3b84a
      Josef Bacik authored
      Al pointed out we have some random problems with the way we account for
      num_workers_starting in the async thread stuff.  First of all we need to make
      sure to decrement num_workers_starting if we fail to start the worker, so make
      __btrfs_start_workers do this.  Also fix __btrfs_start_workers so that it
      doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
      failed to create a worker.  Also check_pending_worker_creates needs to call
      __btrfs_start_work in its work function since it already increments
      num_workers_starting.
      
      People only start one worker at a time, so get rid of the num_workers argument
      everywhere, and make btrfs_queue_worker a void since it will always succeed.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
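The accounting rule from the message above, that a failed worker start must undo its share of num_workers_starting without stopping the workers that already run, can be modelled in Python roughly like this. The `WorkerPool` class and the `spawn` callback are hypothetical; the real code deals with kernel threads and locking that the sketch leaves out.

```python
class WorkerPool:
    """Hypothetical model of the two counters named in the commit message."""

    def __init__(self) -> None:
        self.num_workers = 0
        self.num_workers_starting = 0

    def start_worker(self, spawn) -> bool:
        """Start exactly one worker; spawn() raises OSError on failure."""
        self.num_workers_starting += 1
        try:
            spawn()
        except OSError:
            # Creation failed: undo the in-flight count and
            # leave the already running workers alone.
            self.num_workers_starting -= 1
            return False
        self.num_workers_starting -= 1
        self.num_workers += 1
        return True


def failing_spawn() -> None:
    raise OSError("cannot create thread")


pool = WorkerPool()
pool.start_worker(lambda: None)
pool.start_worker(failing_spawn)
print(pool.num_workers, pool.num_workers_starting)
```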
    • BTRFS: Establish i_ops before calling d_instantiate · ad19db71
      Casey Schaufler authored
      The Smack LSM hook for security_d_instantiate checks
      the inode's i_op->getxattr value to determine if the
      containing filesystem supports extended attributes.
      The BTRFS filesystem sets the inode's i_op value only
      after it has instantiated the inode. This results in
      Smack incorrectly giving new BTRFS inodes attributes
      from the filesystem defaults on the assumption that
      values can't be stored on the filesystem. This patch
      moves the assignment of inode operation vectors ahead
      of the calls to d_instantiate, letting Smack know that
      the filesystem supports extended attributes. There
      should be no impact on the performance or behavior of
      BTRFS.
      Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: add a cond_resched() into the worker loop · 8f3b65a3
      Chris Mason authored
      If we have a constant stream of end_io completions or crc work,
      we can hit softlockup messages from the async helper threads.  This
      adds a cond_resched() into the loop to avoid them.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix ctime update of on-disk inode · 306424cc
      Li Zefan authored
      To reproduce the bug:
      
          # touch /mnt/tmp
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:23.412105981 +0800
          # chattr +i /mnt/tmp
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:43.198105295 +0800
          # umount /mnt
          # mount /dev/loop1 /mnt
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:23.412105981 +0800
      
      We should update ctime of in-memory inode before calling
      btrfs_update_inode().
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: keep orphans for subvolume deletion · f8e9e0b0
      Arne Jansen authored
      Since we have the free space caches, btrfs_orphan_cleanup also runs for
      the tree_root. Unfortunately this also cleans up the orphans used to mark
      subvol deletions in progress.
      
      Currently, if a subvol deletion gets interrupted twice by umount/mount, the
      deletion will not be continued and the space is permanently lost, though it
      would be possible to write a tool to recover those lost subvol deletions.
      This patch checks if the orphan belongs to a subvol (dead root) and, if so,
      skips the deletion of that orphan.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix inaccurate available space on raid0 profile · 39fb26c3
      Miao Xie authored
      When we use raid0 as the data profile, the df command may show a very
      inaccurate value for the available space, which may be much less than the
      real one.  This may puzzle users.  Fix it by changing the calculation
      of the available space so that it more closely resembles a fake chunk
      allocation.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix wrong disk space information of the files · 3642320e
      Miao Xie authored
      Btrfsck reports errors after the 83rd case of xfstests is run.  The error
      number is 400, which means the used disk space of the file is wrong.
      
      The reason for this bug is:
      The file truncation may fail when there is not enough space in the file
      system, leaving some file extents whose offsets are beyond the end of the
      files.  When we want to expand those files, we drop those file extents and
      put in dummy file extents, and then we should update the i-node.  But btrfs
      forgets to do it.
      
      This patch adds the forgotten i-node update.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix wrong i_size when truncating a file to a larger size · f4a2f4c5
      Miao Xie authored
      Btrfsck reports error 100 after the 83rd case of xfstests is run, which
      means the i_size of the file is wrong.
      
      The reason for this bug is:
      Btrfs increased the i_size of the file at the beginning, but it failed to
      expand the file and then failed to restore the old i_size because there was
      not enough space in the file system, so we found a wrong i_size.
      
      This patch fixes this bug by updating the i_size only after the file
      expansion succeeds and we get enough space to update the i-node.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 09 Dec, 2011 1 commit
    • Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror · 5dbc8fca
      Chris Mason authored
      btrfs_end_bio checks the number of errors on a bio against the max
      number of errors allowed before sending any EIOs up to the higher
      levels.
      
      If we got enough copies of the bio done for a given raid level, it is
      supposed to clear the bio error flag and return success.
      
      We have pointers to the original bio sent down by the higher layers and
      pointers to any cloned bios we made for raid purposes.  If the original
      bio happens to be the one that got an io error, but not the last one to
      finish, it might not have the BIO_UPTODATE bit set.
      
      Then, when the last bio does finish, we'll call bio_end_io on the
      original bio.  It won't have the uptodate bit set and we'll end up
      sending EIO to the higher layers.
      
      We already had a check for this, it just was conditional on getting the
      IO error on the very last bio.  Make the check unconditional so we eat
      the EIOs properly.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
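The rule the message describes, reporting success whenever the number of failed bios stays within the limit for the raid level regardless of which bio failed or when it completed, reduces to a small decision function. The Python sketch below is a simplified model; `end_io_status` and its arguments are hypothetical and do not reflect the kernel's bio structures.

```python
import errno


def end_io_status(stripe_errors: list, max_errors: int) -> int:
    """Return 0 if enough copies succeeded, else -EIO (hypothetical helper).

    stripe_errors holds one bool per submitted bio, in completion order;
    True means that bio failed.
    """
    if sum(stripe_errors) > max_errors:
        return -errno.EIO
    # Enough copies made it: report success no matter which bio failed
    # or in what order the bios completed.
    return 0


# Mirrored write with two copies, one failure tolerated.
print(end_io_status([True, False], max_errors=1))
print(end_io_status([True, True], max_errors=1))
```

The first call models the case from the message: the bio that failed is not the last one to finish, and the result is still success.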
  5. 08 Dec, 2011 4 commits
  6. 01 Dec, 2011 1 commit
  7. 30 Nov, 2011 9 commits
    • Btrfs: skip allocation attempt from empty cluster · be064d11
      Alexandre Oliva authored
      If we don't have a cluster, don't bother trying to allocate from it,
      jumping right away to the attempt to allocate a new cluster.
      Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: skip block groups without enough space for a cluster · 425d8315
      Alexandre Oliva authored
      We test whether a block group has enough free space to hold the
      requested block, but when we're doing clustered allocation, we can
      save some cycles by testing whether it has enough room for the cluster
      upfront, otherwise we end up attempting to set up a cluster and
      failing.  Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
      allocation, and by then we'll have zeroed the cluster size, so this
      patch won't stop us from using the block group as a last resort.
      Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: start search for new cluster at the beginning · 1b22bad7
      Alexandre Oliva authored
      Instead of starting at zero (offset is always zero), request a cluster
      starting at search_start, which denotes the beginning of the current
      block group.
      Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: reset cluster's max_size when creating bitmap · b78d09bc
      Alexandre Oliva authored
      The field that indicates the size of the largest contiguous chunk of
      free space in the cluster is not initialized when setting up bitmaps,
      it's only increased when we find a larger contiguous chunk.  We end up
      retaining a larger value than appropriate for highly-fragmented
      clusters, which may cause pointless searches for large contiguous
      groups, and even cause clusters that do not meet the density
      requirements to be set up.
      Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
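Why a missing reset matters can be seen in a small Python model: if the maximum is only ever increased, a stale value from an earlier setup survives a highly fragmented bitmap. The `Cluster` class, `largest_free_run`, and the list-of-bits bitmap are hypothetical simplifications, not the kernel's free-space structures.

```python
def largest_free_run(bitmap: list) -> int:
    """Length of the longest run of set (free) bits in a bitmap."""
    max_run = 0   # start from zero for every bitmap, never from a stale value
    run = 0
    for bit in bitmap:
        run = run + 1 if bit else 0
        max_run = max(max_run, run)
    return max_run


class Cluster:
    """Hypothetical cluster holding only the field discussed above."""

    def __init__(self) -> None:
        self.max_size = 0

    def setup_from_bitmap(self, bitmap: list, unit: int) -> None:
        # Overwrite max_size instead of only raising it.
        self.max_size = largest_free_run(bitmap) * unit


cluster = Cluster()
cluster.max_size = 64 * 4096   # stale value left over from an earlier setup
cluster.setup_from_bitmap([1, 0, 1, 1, 0, 1], unit=4096)
print(cluster.max_size)
```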
    • Btrfs: initialize new bitmaps' list · f2d0f676
      Alexandre Oliva authored
      We're failing to create clusters with bitmaps because
      setup_cluster_no_bitmap checks that the list is empty before inserting
      the bitmap entry in the list for setup_cluster_bitmap, but the list
      field is only initialized when it is restored from the on-disk free
      space cache, or when it is written out to disk.
      
      Besides a potential race condition due to the multiple use of the list
      field, filesystem performance severely degrades over time: as we use
      up all non-bitmap free extents, the try-to-set-up-cluster dance is
      done at every metadata block allocation.  For every block group, we
      fail to set up a cluster, and after failing on them all up to twice,
      we fall back to the much slower unclustered allocation.
      
      To make matters worse, before the unclustered allocation, we try to
      create new block groups until we reach the 1% threshold, which
      introduces additional bitmaps and thus block groups that we'll iterate
      over at each metadata block request.
    • Btrfs: fix oops when calling statfs on readonly device · b772a86e
      Li Zefan authored
      To reproduce this bug:
      
        # dd if=/dev/zero of=img bs=1M count=256
        # mkfs.btrfs img
        # losetup -r /dev/loop1 img
        # mount /dev/loop1 /mnt
        OOPS!!
      
      It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().
      
      To fix this, instead of checking only writable devices, we check all open
      devices:
      
        # df -h /dev/loop1
        Filesystem            Size  Used Avail Use% Mounted on
        /dev/loop1            250M   28K  238M   1% /mnt
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • Btrfs: Don't error on resizing FS to same size · ece7d20e
      Mike Fleetwood authored
      It seems overly harsh to fail a resize of a btrfs file system to the
      same size when a shrink or grow would succeed.  User app GParted trips
      over this error.  Allow it by bypassing the shrink or grow operation.
      Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
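The decision described above can be sketched as a three-way comparison in which an unchanged size is a successful no-op rather than an error. The function `resize_action` and its string results are hypothetical and only illustrate the control flow.

```python
def resize_action(old_size: int, new_size: int) -> str:
    """Decide what a resize request has to do (hypothetical helper)."""
    if new_size > old_size:
        return "grow"
    if new_size < old_size:
        return "shrink"
    # Same size: nothing to do, but not an error either.
    return "noop"


for requested in (512, 256, 128):
    print(requested, resize_action(256, requested))
```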
    • Btrfs: fix deadlock on metadata reservation when evicting a inode · aa38a711
      Miao Xie authored
      When I ran the xfstests, I found the test tasks were blocked on meta-data
      reservation.
      
      By debugging, I found the reason for this bug:
         start transaction
              |
              v
         reserve meta-data space
              |
              v
         flush delay allocation -> iput inode -> evict inode
              ^                                       |
              |                                       v
         wait for delay allocation flush <- reserve meta-data space
      
      Besides that, the flush on evicting an inode will block the thread that
      is reclaiming memory, making OOM more likely.
      
      Fix this bug by skipping the flush step when evicting an inode.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Fix URL of btrfs-progs git repository in docs · b52f75a5
      Arnd Hannemann authored
      The location of the btrfs-progs repository has been changed.
      This patch updates the documentation accordingly.
      Signed-off-by: Arnd Hannemann <arnd@arndnet.de>