1. 17 Sep, 2014 40 commits
    • Btrfs: Set real mirror number for read operation on RAID0/5/6 · 28e1cc7d
      Miao Xie authored
      We need the real mirror number for RAID0/5/6 when reading data. Otherwise,
      if a read error happens, we pass 0 as the number of the mirror on which the
      I/O error happened. That is wrong and causes the filesystem to read the
      data from the corrupted mirror again.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
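
A minimal sketch of why the mirror number matters, written in Python rather than kernel C. The function name `next_mirror` and the convention that copies are numbered from 1, with 0 meaning "unknown", are assumptions made for illustration and are not taken from the patch.

```python
from typing import Optional


def next_mirror(failed_mirror: int, num_copies: int) -> Optional[int]:
    """Pick the copy to retry after a read error.

    Copies are numbered 1..num_copies; 0 means "mirror unknown" (assumption).
    Returns None when there is no other copy to try.
    """
    if num_copies < 2:
        return None
    if failed_mirror == 0:
        # Without the real number we can only start over at copy 1,
        # which may be the very copy that just returned bad data.
        return 1
    # Move on to the following copy, wrapping around after the last one.
    return failed_mirror % num_copies + 1
```

With the real number, a failure on copy 1 of 2 leads to a retry on copy 2. With 0, the retry goes back to copy 1, whether or not that is the copy that failed.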
    • Btrfs: modify clean_io_failure and make it suit direct io · 1203b681
      Miao Xie authored
      We could not use clean_io_failure in the direct IO path because it got the
      filesystem information from the page structure, but the pages in a direct
      IO bio do not carry that information. So modify it to take all the
      information it needs as parameters.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: modify repair_io_failure and make it suit direct io · ffdd2018
      Miao Xie authored
      The original repair_io_failure was only usable for buffered reads because
      it got some filesystem data from the page structure, which is safe for
      pages in the page cache. But when we do a direct read, the pages in the bio
      are not in the page cache, so there is no filesystem data in the page
      structure. To implement direct read data repair, modify repair_io_failure
      to take all the filesystem data it needs as function parameters.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: split bio_readpage_error into several functions · 2fe6303e
      Miao Xie authored
      The data repair function for direct reads will be implemented later and
      will reuse some code in bio_readpage_error, so split bio_readpage_error
      into several functions that direct read repair can use.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • 454ff3de
    • Btrfs: fix missing error handler if submiting re-read bio fails · 6c387ab2
      Miao Xie authored
      We forgot to free the failure record and the bio when submitting the
      re-read bio failed. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: do file data check by sub-bio's self · c1dc0896
      Miao Xie authored
      Direct IO splits the original bio into several sub-bios because of the RAID
      stripe limit, and the filesystem waits for all sub-bios before running the
      final end io processing.

      That made it very hard to implement data repair when a dio read failure
      happens, because in the final end io function we did not know which mirror
      the data was read from. So in order to implement data repair, we have to
      move the file data check from the final end io function to the sub-bio end
      io function, where we can get the mirror number of the device we accessed.
      This patch does this work as the first step of the direct io data repair
      implementation.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: load checksum data once when submitting a direct read io · 23ea8e5a
      Miao Xie authored
      The current code loads the checksum data several times when a whole direct
      read io is split because of the RAID stripe limit, which makes us search
      the csum tree several times. That just wastes time and increases
      contention on the csum tree root. This patch fixes the problem by loading
      the data once.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
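
The idea of a single lookup shared by all sub-reads can be sketched as follows. This is an illustrative Python model, not the kernel code: `CsumTree`, `split_read`, `csums_per_sub_read` and the 4 KiB sector size are all assumptions.

```python
SECTOR_SIZE = 4096  # assumed sector size for this sketch


class CsumTree:
    """Stand-in for the csum tree that counts how often it is searched."""

    def __init__(self, csums):
        self._csums = csums   # checksum per sector, indexed by sector number
        self.lookups = 0

    def lookup(self, start, length):
        self.lookups += 1
        first = start // SECTOR_SIZE
        count = length // SECTOR_SIZE
        return [self._csums[i] for i in range(first, first + count)]


def split_read(start, length, stripe_len):
    """Split [start, start + length) at stripe boundaries into sub-reads."""
    pieces = []
    end = start + length
    while start < end:
        stripe_end = (start // stripe_len + 1) * stripe_len
        piece_end = min(end, stripe_end)
        pieces.append((start, piece_end - start))
        start = piece_end
    return pieces


def csums_per_sub_read(tree, start, length, stripe_len):
    """Search the tree once, then hand each sub-read its slice."""
    all_csums = tree.lookup(start, length)
    result = []
    for sub_start, sub_len in split_read(start, length, stripe_len):
        lo = (sub_start - start) // SECTOR_SIZE
        hi = lo + sub_len // SECTOR_SIZE
        result.append(all_csums[lo:hi])
    return result
```

However many sub-reads the stripe limit produces, `tree.lookups` stays at 1 for the whole read.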
    • Btrfs: modify rw_devices counter under chunk_mutex context · c3929c36
      Miao Xie authored
      The rw_devices counter is often used to tune the profile when doing chunk
      allocation, so we should modify it while holding chunk_mutex to avoid
      getting a wrong chunk profile.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: move the missing device to its own fs device list · 5f375835
      Miao Xie authored
      For a missing device, we don't know which fs it belongs to before we read
      its fsid from the chunk tree, so we first add it to the current fs device
      list. Once we get its fsid, we should move it to its own fs device list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs · 416d7b80
      Miao Xie authored
      When we open a seed filesystem with the degraded mount option set, we
      continue to mount the fs even if some devices of the seed filesystem are
      not found. But we should stop mounting if any other error happens. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • 82372bc8
    • Btrfs: fix use-after-free problem of the device during device replace · 67a2c45e
      Miao Xie authored
      The problem is:
      	Task0(device scan task)		Task1(device replace task)
      	scan_one_device()
      	mutex_lock(&uuid_mutex)
      	device = find_device()
      					mutex_lock(&device_list_mutex)
      					lock_chunk()
      					rm_and_free_source_device
      					unlock_chunk()
      					mutex_unlock(&device_list_mutex)
      	check device
      
      Destroying the target device when device replace fails has the same problem.

      We fix this problem by holding uuid_mutex while destroying the source
      device or the target device, just like the device remove operation does.

      This is a temporary solution; in the future we can fix this problem and
      make the code clearer by using an atomic counter.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unprotected device list access when cloning fs devices · adbbb863
      Miao Xie authored
      We can build a new filesystem based on a seed filesystem, and we need to
      clone the fs devices when we open the new filesystem. But someone might
      clear the seed flag of the seed filesystem, then mount that filesystem and
      remove some devices. If we then mount the new filesystem, we might access
      a device list that is being changed while we clone the fs devices.
      Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix misuse of chunk mutex · 2196d6e8
      Miao Xie authored
      There were several problems with chunk mutex usage:
      - Locking the chunk mutex when updating metadata. This could cause a
        nested deadlock because updating metadata might need to allocate new
        chunks, which needs to acquire the chunk mutex. We remove the chunk
        mutex in this case, because the b-tree locks and other lock mechanisms
        protect us.
      - An ABBA deadlock occurred between device_list_mutex and chunk_mutex.
        When we update the device status, we must acquire device_list_mutex
        first, and then we might take chunk_mutex during the device status
        update because we need to allocate new chunks for metadata COW. But in
        most places we acquire chunk_mutex first and then device_list_mutex.
        We need to change the lock order.
      - In some places we don't need to acquire chunk_mutex at all. For example,
        we don't need chunk_mutex when we free an empty seed fs_devices structure.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
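
The ABBA problem described in the list above comes down to two code paths taking the same pair of locks in opposite orders. A small Python sketch of an order check follows; the function `lock_order_inversions` and the path names used with it are hypothetical, and this is not how the kernel detects such problems.

```python
from itertools import combinations


def lock_order_inversions(paths):
    """Return lock pairs that different code paths take in opposite order.

    `paths` maps a (hypothetical) code path name to the locks it takes,
    outermost first. A path that takes the same lock twice is reported
    as well, since that is a nested deadlock on a non-recursive mutex.
    """
    seen = set()          # (outer, inner) orders observed so far
    inversions = set()
    for locks in paths.values():
        for i, j in combinations(range(len(locks)), 2):
            outer, inner = locks[i], locks[j]
            seen.add((outer, inner))
            if (inner, outer) in seen:
                inversions.add(frozenset((outer, inner)))
    return inversions
```

Feeding it one path that takes `device_list_mutex` then `chunk_mutex` and another that takes them the other way round reports the pair; once every path uses the same order, the result is empty.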
    • Btrfs: fix unprotected device list access when getting the fs information · 15484377
      Miao Xie authored
      When getting the fs information, we forgot to acquire the device list
      mutex, so we might access a device that was removed. Fix it by acquiring
      the device list mutex.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unprotected system chunk array insertion · fe48a5c0
      Miao Xie authored
      We didn't protect the system chunk array when we added a new system
      chunk to it, so the array could be corrupted if someone removed or
      added a system chunk at the same time. Fix it by taking the chunk
      lock.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unprotected device's variants on 32bits machine · 7cc8e58d
      Miao Xie authored
      ->total_bytes, ->disk_total_bytes and ->bytes_used are protected by the
      chunk lock when we change them, but sometimes we read them without any
      lock and might get an unexpected value. We fix this problem the same
      way as for the inode's i_size.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
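
On a 32-bit machine a 64-bit field is written as two 32-bit stores, so an unlocked reader can see one new half and one old half. The kernel protects i_size on such machines with a sequence counter. The Python class below is only a single-threaded simulation of that idea: `SeqProtectedU64` and its methods are hypothetical names, and the generator stands in for a writer that can be interrupted between stores.

```python
class SeqProtectedU64:
    """A 64-bit value stored as two 32-bit halves plus a sequence counter."""

    def __init__(self, value: int = 0) -> None:
        self.seq = 0
        self.lo = value & 0xFFFFFFFF
        self.hi = value >> 32

    def write_steps(self, value: int):
        """Update the value one store at a time, yielding between stores."""
        self.seq += 1                   # odd: a write is in progress
        yield
        self.lo = value & 0xFFFFFFFF
        yield
        self.hi = value >> 32
        yield
        self.seq += 1                   # even again: the write is complete

    def read_unprotected(self) -> int:
        """Combine the halves without any check; may return a torn value."""
        return (self.hi << 32) | self.lo

    def try_read(self):
        """Return the value, or None when the reader has to retry."""
        start = self.seq
        if start & 1:
            return None                 # a writer is active
        value = (self.hi << 32) | self.lo
        if self.seq != start:
            return None                 # a write happened while reading
        return value
```

A real reader would loop until `try_read` succeeds. The point of the sketch is that the counter lets the reader notice a half-finished update instead of returning it.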
    • Btrfs: update free_chunk_space during allocting a new chunk · 1c116187
      Miao Xie authored
      We should update free_chunk_space right when we allocate a new chunk, not
      later when we handle the pending device update and block group insertion,
      because we need the real free_chunk_space value to calculate the reserved
      space. If we don't update it in time, we treat disk space that has
      already been allocated as free space and use it for overcommit
      reservation. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unprotected device->bytes_used update · 43530c46
      Miao Xie authored
      We should update device->bytes_used while holding chunk_mutex, or we
      would get wrong data.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Fix wrong free_chunk_space assignment during removing a device · 5d778aae
      Miao Xie authored
      When removing a device, we have already modified free_chunk_space while
      shrinking the device, so we don't need to assign a new value to it after
      the shrink. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong device bytes_used in the super block · ce7213c7
      Miao Xie authored
      device->bytes_used will be changed when allocating a new chunk, and
      disk_total_size will be changed if resizing is successful.
      Meanwhile, the on-disk super blocks of the previous transaction
      might not be updated. Considering the consistency of the metadata
      in the previous transaction, we should use the size in the previous
      transaction to check if the super block is beyond the boundary
      of the device.

      This is not a big problem because we don't use it now, but it is
      better to keep it consistent with the common metadata; maybe we
      will use it in the future.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong disk size when writing super blocks · 935e5cc9
      Miao Xie authored
      total_size will be changed when resizing a device, and disk_total_size
      will be changed if resizing is successful. Meanwhile, the on-disk super
      blocks of the previous transaction might not be updated. Considering
      the consistency of the metadata in the previous transaction, we should
      use the size in the previous transaction to check if the super block is
      beyond the boundary of the device. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
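
A rough sketch of the boundary check, in Python. The super block copy offsets (64 KiB, 64 MiB, 256 GiB) and the 4 KiB super block size reflect my understanding of the btrfs on-disk layout and should be treated as assumptions here; the function names are hypothetical.

```python
SUPER_INFO_OFFSET = 64 * 1024   # assumed offset of the primary super block
SUPER_INFO_SIZE = 4096          # assumed size of one super block copy
SUPER_MIRROR_MAX = 3            # assumed number of super block copies


def sb_offset(mirror: int) -> int:
    """Byte offset of super block copy `mirror` (0 is the primary)."""
    if mirror == 0:
        return SUPER_INFO_OFFSET
    return (16 * 1024) << (12 * mirror)


def super_copies_to_write(device_size: int):
    """Offsets of the copies that fit inside a device of `device_size` bytes."""
    offsets = []
    for mirror in range(SUPER_MIRROR_MAX):
        offset = sb_offset(mirror)
        if offset + SUPER_INFO_SIZE >= device_size:
            break                # this copy would lie beyond the device
        offsets.append(offset)
    return offsets
```

If a device is being shrunk from 128 MiB to 32 MiB and the resize has not been committed yet, checking against the in-flight size skips the copy at 64 MiB, while checking against the size of the previous transaction still writes it.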
    • Btrfs: fix unprotected assignment of the target device · 1c43366d
      Miao Xie authored
      We didn't protect the assignment of the target device, so the super block
      update might be skipped because we could see a wrong size for the target
      device during the assignment. Fix it by moving the assignment statements
      into the initialization function of the target device. Another benefit is
      that we can check earlier whether the target device is suitable.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: cleanup unused num_can_discard in fs_devices · 90180da4
      Miao Xie authored
      The num_can_discard member of the fs_devices structure is set, but
      nothing uses it, so remove it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: remove the wrong comments · 82f70d62
      Li RongQing authored
      These comments became wrong after c3c532 [bdi: add helper function for
      doing init and register of a bdi for a file system], so remove them.
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix directory recovery from fsync log · a2cc11db
      Filipe Manana authored
      When replaying a directory from the fsync log, if a directory entry
      exists both in the fs/subvol tree and in the log, the directory's inode
      got its i_size updated incorrectly, accounting for the dentry's name
      twice.
      
      Reproducer, from a test for xfstests:
      
          _scratch_mkfs >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          touch $SCRATCH_MNT/foo
          sync
      
          touch $SCRATCH_MNT/bar
          xfs_io -c "fsync" $SCRATCH_MNT
          xfs_io -c "fsync" $SCRATCH_MNT/bar
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
          [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"
      
          _unmount_flakey
          _check_scratch_fs $FLAKEY_DEV
      
      The filesystem check at the end failed with the message:
      "root 5 root dir 256 error".
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
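
The double accounting can be illustrated with a small Python model. It assumes that a btrfs directory's i_size is the sum of twice the name length of each entry; that rule, and the function name `replay_dir_entries`, are assumptions of this sketch and not taken from the patch.

```python
def replay_dir_entries(subvol_names, log_names):
    """Return the directory i_size after replaying the logged entries.

    Assumption: i_size is the sum of 2 * len(name) over all entries.
    """
    names = set(subvol_names)
    i_size = sum(2 * len(name) for name in names)
    for name in log_names:
        if name in names:
            continue             # already accounted for; do not add it twice
        names.add(name)
        i_size += 2 * len(name)
    return i_size
```

For the reproducer above, `foo` is already in the fs/subvol tree, so only `bar` should add to the size during replay.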
    • Btrfs: fix loop writing of async reclaim · 25ce459c
      Liu Bo authored
      One of my tests shows that when we really have no space to reclaim via
      flush_space and have also run out of space, this async reclaim work keeps
      adding itself to the workqueue and keeps writing to disk according to
      iostat's results. These writes mainly come from commit_transaction, which
      writes the super_block. This is unacceptable as it can be bad for disks,
      especially memory storage.

      This adds a check to avoid the above situation.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make fiemap not blow when you have lots of snapshots · dc046b10
      Josef Bacik authored
      We have been iterating all references for each extent we have in a file when we
      do fiemap to see if it is shared.  This is fine when you have a few clones or a
      few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
      and stares at you.  So add btrfs_check_shared which will use the backref walking
      code but will short circuit as soon as it finds a root or inode that doesn't
      match the one we currently have.  This makes fiemap on my testbox go from
      looking at me blankly for a day to spitting out actual output in a reasonable
      amount of time.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
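
The short circuit is the whole trick: the walk can stop at the first reference that belongs to a different root or inode. A Python sketch, with `is_shared` as a hypothetical stand-in for btrfs_check_shared and a plain iterable of `(root_id, inode_num)` pairs standing in for the backref walk:

```python
def is_shared(refs, root_id, inode_num):
    """Return True as soon as one reference belongs to another root or inode.

    `refs` is any iterable of (root_id, inode_num) pairs. Passing a
    generator lets the walk stop early instead of collecting every
    reference before deciding.
    """
    for ref_root, ref_inode in refs:
        if ref_root != root_id or ref_inode != inode_num:
            return True
    return False
```

With thousands of snapshots the answer is usually known after the first few references, so the cost no longer grows with the number of snapshots in the common case.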
    • Btrfs: add missing compression property remove in btrfs_ioctl_setflags · 78a017a2
      Filipe Manana authored
      The behaviour of a 'chattr -c' consists of getting the current flags,
      clearing the FS_COMPR_FL bit and then sending the result to the set
      flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
      passed to the ioctl. This results in the compression property not being
      cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
      was set in the received flags.
      
      Reproducer:
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt && cd /mnt
          $ mkdir a
          $ chattr +c a
          $ touch a/file
          $ lsattr a/file
          --------c------- a/file
          $ chattr -c a
          $ touch a/file2
          $ lsattr a/file2
          --------c------- a/file2
          $ lsattr -d a
          ---------------- a
      Reported-by: Andreas Schneider <asn@cryptomilk.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
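
The flag logic can be modelled in a few lines of Python. The two flag values are the ones I believe the Linux UAPI headers define; the functions `clears_property_old` and `clears_property_new` are hypothetical and only model the decision, not the ioctl itself.

```python
FS_COMPR_FL = 0x00000004    # compress file (set by chattr +c)
FS_NOCOMP_FL = 0x00000400   # do not compress


def clears_property_old(flags: int) -> bool:
    """Old behaviour: the property is dropped only if FS_NOCOMP_FL is set."""
    return bool(flags & FS_NOCOMP_FL)


def clears_property_new(flags: int) -> bool:
    """Sketch of the fix: also drop it when FS_COMPR_FL is simply absent."""
    return bool(flags & FS_NOCOMP_FL) or not (flags & FS_COMPR_FL)


def chattr_minus_c(current_flags: int) -> int:
    """What 'chattr -c' sends: the current flags with FS_COMPR_FL cleared."""
    return current_flags & ~FS_COMPR_FL
```

For a directory that only has the compress flag, `chattr -c` sends 0, so the old check never removed the property.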
    • btrfs: Fix a deadlock in btrfs_dev_replace_finishing() · 12b894cb
      Qu Wenruo authored
      btrfs-transacion:5657
      [stack snip]
      btrfs_bio_map()
          btrfs_bio_counter_inc_blocked()
              percpu_counter_inc(&fs_info->bio_counter)  ###bio_counter > 0(A)
              __btrfs_bio_map()
                  btrfs_dev_replace_lock()
                      mutex_lock(dev_replace->lock)	   ###wait mutex(B)
      
      btrfs:32612
      [stack snip]
      btrfs_dev_replace_start()
          btrfs_dev_replace_lock()
      	mutex_lock(dev_replace->lock)		   ###hold mutex(B)
          btrfs_dev_replace_finishing()
              btrfs_rm_dev_replace_blocked()
                  wait until percpu_counter_sum == 0	   ###wait on bio_counter(A)
      
      This bug can be triggered quite easily by the following test script:
      http://pastebin.com/MQmb37Cy
      
      This patch fixes the ABBA problem by calling
      btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked().

      The consistency of the btrfs device list and the superblocks is protected
      by device_list_mutex, not btrfs_dev_replace_lock/unlock().
      So it is safe to move btrfs_dev_replace_unlock() before
      btrfs_rm_dev_replace_blocked().
      Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Cc: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: cleanup the same name in end_bio_extent_readpage · a583c026
      Liu Bo authored
      We've already defined an 'offset' outside of bio_for_each_segment_all.

      This is just a clean rename, no functional changes.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: don't go readonly on existing qgroup items · 0b4699dc
      Mark Fasheh authored
      btrfs_drop_snapshot() leaves subvolume qgroup items on disk after
      completion. This can cause problems with snapshot creation. If a new
      snapshot tries to claim the deleted subvolume's id, btrfs will get -EEXIST
      from add_qgroup_item() and go read-only. The following commands will
      reproduce this problem (assume btrfs is on /dev/sda and is mounted at
      /btrfs):
      
      mkfs.btrfs -f /dev/sda
      mount -t btrfs /dev/sda /btrfs/
      btrfs quota enable /btrfs/
      btrfs su sna /btrfs/ /btrfs/snap
      btrfs su de /btrfs/snap
      sleep 45
      umount /btrfs/
      mount -t btrfs /dev/sda /btrfs/
      
      We can fix this by catching -EEXIST in add_qgroup_item() and
      initializing the existing items. We have the problem of orphaned
      relation items being on disk from an old snapshot but that is outside
      the scope of this patch.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
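
The "catch -EEXIST and initialize the existing item" approach can be sketched in Python with a dictionary standing in for the on-disk qgroup items. The helper `insert_item`, the dictionary layout and the field names `referenced` and `exclusive` are all hypothetical.

```python
import errno


def insert_item(items, qgroupid):
    """Insert an empty item, failing like the tree insert would."""
    if qgroupid in items:
        raise FileExistsError(errno.EEXIST, "qgroup item already exists")
    items[qgroupid] = {}


def add_qgroup_item(items, qgroupid):
    """Add a qgroup item, reusing a stale one left by a deleted subvolume."""
    try:
        insert_item(items, qgroupid)
    except FileExistsError:
        pass                      # stale item found: fall through and reset it
    # Whether new or reused, the item starts from zeroed counters.
    items[qgroupid] = {"referenced": 0, "exclusive": 0}
```

A stale item for the reused id is overwritten with fresh counters instead of turning into an error that forces the filesystem read-only.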
    • Btrfs: show real function name in btrfs workqueue tracepoint · b7831b20
      Liu Bo authored
      Use %pf instead of %p, just like the kernel workqueue tracepoints.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: shrink further sizeof(struct extent_buffer) · 2a39e598
      Filipe Manana authored
      The map_start and map_len fields aren't used anywhere, so just remove
      them. On an x86_64 system, this reduced sizeof(struct extent_buffer)
      from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
      now fit into a page instead of 13.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
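
The 13 versus 14 figure follows from integer division by the page size. A quick check in Python, assuming a 4096-byte page and ignoring any allocator overhead (`objects_per_page` is a name made up for this check):

```python
PAGE_SIZE = 4096  # assumed page size on x86_64


def objects_per_page(object_size: int) -> int:
    """How many objects of `object_size` bytes fit into one page."""
    return PAGE_SIZE // object_size
```

13 structs of 296 bytes take 3848 bytes and a 14th would not fit, while 14 structs of 280 bytes take 3920 bytes.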
    • Btrfs: send, lower mem requirements for processing xattrs · 4395e0c4
      Filipe Manana authored
      Maximum xattr size can be up to nearly the leaf size. For an fs with a
      leaf size larger than the page size, using kmalloc requires allocating
      multiple pages that are contiguous, which might not be possible if
      there's heavy memory fragmentation. Therefore fall back to vmalloc if
      we fail to allocate with kmalloc. Also start with a smaller buffer size,
      since xattr values are typically smaller than a page.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: remove stale define after removing ordered operations · f87c4318
      David Sterba authored
      Last user removed in commit "btrfs: disable strict file flushes for
      renames and truncates" (8d875f95).
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: improve free space cache management and space allocation · 20005523
      Filipe Manana authored
      While under random IO, a block group's free space cache eventually reaches
      a state where it has a mix of extent entries and bitmap entries representing
      free space regions.
      
      As free space regions are later returned to the cache, some of them are merged
      with existing extent entries if they are contiguous with them. Others are not
      merged even though adjacent free space regions exist in the cache, because
      those regions are represented in bitmap entries. Even when new free space
      regions are merged with existing extent entries (enlarging the free space
      range they represent), we may end up with an enlarged region that is
      contiguous with some other region represented in a bitmap entry.

      Both clustered and non-clustered space allocation work by iterating over our
      extent and bitmap entries and skipping any that represents a region smaller
      than the allocation request (giving preference to extent entries over
      bitmap entries). When a contiguous free space region is represented by 2 (or
      more) entries (a mix of extent and bitmap entries), we end up not satisfying
      an allocation request with a size larger than the size of any of the entries
      but no larger than the sum of their sizes. This makes the caller assume we're
      under an ENOSPC condition or forces it to allocate multiple smaller space
      regions (as we do for file data writes), which adds extra overhead and more
      chances of causing fragmentation due to the smaller regions being spread
      apart from each other (more likely under concurrency).
      
      For example, if we have the following in the cache:
      
      * extent entry representing free space range: [128Mb - 256Kb, 128Mb[
      
      * bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
        representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
        space in this 128Mb area is marked as free
      
      An allocation request for 1Mb, starting at an offset not greater than
      128Mb - 256Kb, would fail before, despite the existence of such a contiguous
      free space area in the cache. The caller could only allocate up to 768Kb of
      space at once and later another 256Kb (or vice-versa). In between each smaller
      allocation request, another task working on a different file/inode might come
      in and take that space, preventing the former task from getting a contiguous
      1Mb region of free space.
      
      Therefore this change implements the ability to move free space from bitmap
      entries into existing and new free space regions represented with extent
      entries. This is done when a space region is added to the cache.
      
      A test that also explains the issue in detail was added to the sanity tests.

      Some performance test results with compilebench on a 4-core machine with
      32Gb of RAM and an HDD follow.
      
      Test: compilebench -D /mnt -i 30 -r 1000 --makej
      
      Before this change:
      
         intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
         compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
         read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
         delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)
      
      After this change:
      
         intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
         compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
         read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
         delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
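
The example from the commit message (an extent entry for the 256Kb just below 128Mb, and a bitmap entry with 768Kb free starting at 128Mb) can be worked through with a small Python model. It is a sketch under stated assumptions, not the kernel implementation: one bitmap bit is assumed to cover 4 KiB, the bitmap is modelled as a set of free unit indexes, and `steal_from_bitmap` is a hypothetical helper name.

```python
KIB = 1024
MIB = 1024 * KIB
UNIT = 4 * KIB   # bytes covered by one bitmap bit (assumed)


def steal_from_bitmap(extent, free_bits):
    """Grow `extent` with free space the bitmap holds right next to it.

    extent: (start, length) in bytes, aligned to UNIT.
    free_bits: set of unit indexes that the bitmap marks as free.
    Returns the enlarged extent; absorbed bits are removed from `free_bits`.
    """
    start, length = extent
    end = start + length
    # Absorb free units that directly follow the extent.
    while end // UNIT in free_bits:
        free_bits.remove(end // UNIT)
        end += UNIT
    # Absorb free units that directly precede the extent.
    while start >= UNIT and (start - UNIT) // UNIT in free_bits:
        start -= UNIT
        free_bits.remove(start // UNIT)
    return (start, end - start)
```

Applied to the example, the extent entry grows from 256Kb to a single contiguous 1Mb region and the bitmap no longer holds any of that space, so a 1Mb allocation request can be satisfied by one entry.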