1. 13 Apr, 2015 29 commits
    • Qu Wenruo's avatar
      btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags... · 9c8b35b1
      Qu Wenruo authored
      btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
      
      Operations such as assigning or deleting qgroup relations usually leave
      qgroup data inconsistent, since a full rescan is needed to determine
      whether shared extents are exclusive or still shared in the parent
      qgroups.

      There are exceptions, however. For a qgroup with only exclusive extents
      (qgroup->excl == qgroup->rfer), we only need to modify the excl and
      rfer of all its parents.

      So this patch adds a quick path for such qgroups in the qgroup
      assign/remove routine. If the quick path fails, the qgroup status is
      marked INCONSISTENT and 1 is returned to inform user-land.

      Since the quick path is much the same as qgroup_excl_accounting(),
      move its core to __qgroup_excl_accounting() and reuse it.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9c8b35b1
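The quick path described in this commit can be sketched roughly as follows. This is a simplified illustration, not the kernel code: `struct qgroup_sketch` and `quick_update_parents()` are hypothetical names, and only direct parents are updated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified qgroup: only the two counters from the message. */
struct qgroup_sketch {
	uint64_t rfer; /* referenced bytes */
	uint64_t excl; /* exclusive bytes */
};

/*
 * Try the quick path when a relation is assigned (sign > 0) or removed
 * (sign <= 0). Returns 0 when the parents were updated directly, or 1
 * when the quick path does not apply and the caller would have to mark
 * the qgroup status INCONSISTENT.
 */
static int quick_update_parents(const struct qgroup_sketch *src,
				struct qgroup_sketch **parents,
				size_t nr_parents, int sign)
{
	size_t i;

	/* Shared extents present: only a full rescan can sort them out. */
	if (src->excl != src->rfer)
		return 1;

	for (i = 0; i < nr_parents; i++) {
		if (sign > 0) {
			parents[i]->rfer += src->rfer;
			parents[i]->excl += src->excl;
		} else {
			parents[i]->rfer -= src->rfer;
			parents[i]->excl -= src->excl;
		}
	}
	return 0;
}
```

The return value of 1 mirrors the "return 1 to inform user-land" behaviour mentioned in the message; everything else is an assumption made for the sketch.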
    • Dongsheng Yang's avatar
      btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota. · 8ea0ec9e
      Dongsheng Yang authored
      We forgot to clear STATUS_FLAG_ON in quota_disable(), which
      causes the problem shown below:
      
      	# mount /dev/sdc /mnt
      	# btrfs quota enable /mnt
      	# btrfs quota disable /mnt
      	# btrfs quota rescan /mnt
      	quota rescan started <--- expected to fail here
      	# echo $?
      	0
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8ea0ec9e
    • Qu Wenruo's avatar
      btrfs: Update btrfs qgroup status item when rescan is done. · 53b7cde9
      Qu Wenruo authored
      Update the qgroup status item when a rescan is done.

      Before this patch, the status item was not updated when a rescan
      finished, so the RESCAN and INCONSISTENT flags were never cleared.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      53b7cde9
    • Qu Wenruo's avatar
      btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value. · 3393168d
      Qu Wenruo authored
      The old qgroup_rescan_leaf() comment says that ret == 2 means the rescan
      is complete and the INCONSISTENT flag has been cleared.

      This is not true: the function never returns 2, and no code inside it
      clears the INCONSISTENT flag.
      The flag is cleared in btrfs_qgroup_rescan_work().
      Because of the dead check, the INCONSISTENT flag was never cleared.

      So change the comment and fix the dead check.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3393168d
    • Qu Wenruo's avatar
      btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created · e09fe2d2
      Qu Wenruo authored
      Btrfs creates a qgroup on subvolume creation if quota is enabled, but
      a qgroup id uses its high bits (currently 16 bits) as the level, to build
      the inheritance hierarchy.

      However, it is entirely possible to create a subvolume with a
      subvolume id of 1 << BTRFS_QGROUP_LEVEL_SHIFT or larger, in which case it
      is treated as level 1 and can't be assigned to another qgroup in level 1.

      This patch prevents that, so qgroup inheritance will not be broken.
      The downside is clear: the btrfs subvolume number limit decreases
      from (u64 max - 256 (first free objectid) - 256 (last free objectid)) to
      (u48 max - 256 (first free objectid)).
      But we still have nearly u48 (15 digits in decimal), so that should not
      be a huge problem.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e09fe2d2
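How the level is carved out of a qgroup id, and why large subvolume ids collide with it, can be sketched like this. The shift value of 48 is an assumption derived from the "16 high bits" in the message, and both helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed value: 64 bits minus the 16 level bits mentioned in the message. */
#define SKETCH_QGROUP_LEVEL_SHIFT 48

/* Level encoded in the high bits of a qgroup id. */
static uint64_t sketch_qgroup_level(uint64_t qgroupid)
{
	return qgroupid >> SKETCH_QGROUP_LEVEL_SHIFT;
}

/* Returns 1 if a subvolume id leaves the level bits clear (level 0). */
static int sketch_subvolid_allowed(uint64_t subvolid)
{
	return subvolid < (1ULL << SKETCH_QGROUP_LEVEL_SHIFT);
}
```

Under these assumptions, a subvolume id of exactly `1ULL << 48` would decode as level 1, which is the collision the patch rules out.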
    • Qu Wenruo's avatar
      btrfs: Check qgroup level in kernel qgroup assign. · 8465ecec
      Qu Wenruo authored
      Although btrfs-progs checks the qgroup level, that is not enough,
      since other programs (systemd, for example) may call the ioctl directly
      without using btrfs-progs.

      btrfs-progs is to blame here, since we don't provide a full-featured
      btrfs library (covering things like subvolume creation) with enough
      checks, and rely only on the kernel ioctl.

      So add level checks in the kernel too.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8465ecec
    • Dongsheng Yang's avatar
      btrfs: qgroup: allow to remove qgroup which has parent but no child. · f5a6b1c5
      Dongsheng Yang authored
      When a qgroup has parents but no child, it should in theory be
      removable. But currently, we cannot remove it when it has
      either a parent or a child.
      
      Example:
      	# btrfs quota enable /mnt
      	# btrfs qgroup create 1/0 /mnt
      	# btrfs qgroup create 2/0 /mnt
      	# btrfs qgroup assign 1/0 2/0 /mnt
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer  excl  max_rfer max_excl parent  child
      -------- ----  ----  -------- -------- ------  -----
      0/5      16384 16384 0        0        ---     ---
      1/0      0     0     0        0        2/0     ---
      2/0      0     0     0        0        ---     1/0
      
      At this point, no subvol or qgroup depends on 1/0.
      Qgroup 2/0 is merely its parent, and 2/0 works fine without
      1/0. So 1/0 should be removable. But:
      	# btrfs qgroup destroy 1/0 /mnt
      ERROR: unable to destroy quota group: Device or resource busy
      
      This patch removes the check of qgroup->parent on removal,
      so that a qgroup can be removed while it has a parent.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f5a6b1c5
    • Dongsheng Yang's avatar
      btrfs: qgroup: return EINVAL if level of parent is not higher than child's. · 09870d27
      Dongsheng Yang authored
      When we create a subvol that inherits a qgroup, we need to check their
      levels. Otherwise, a qgroup could inherit another qgroup at the same
      level.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      09870d27
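The level rule enforced by this commit (and by the kernel-side assign check a few entries earlier) amounts to requiring a strictly higher level for the parent. A minimal sketch, assuming the level lives in the top 16 bits of the id; the helper name is hypothetical:

```c
#include <errno.h>
#include <stdint.h>

/* Assumed position of the level bits inside a 64-bit qgroup id. */
#define SKETCH_LEVEL_SHIFT 48

/*
 * Returns 0 when parent_id may act as a parent of child_id, or -EINVAL
 * when the parent's level is not strictly higher than the child's.
 */
static int sketch_check_relation(uint64_t child_id, uint64_t parent_id)
{
	uint64_t child_level = child_id >> SKETCH_LEVEL_SHIFT;
	uint64_t parent_level = parent_id >> SKETCH_LEVEL_SHIFT;

	if (parent_level <= child_level)
		return -EINVAL;
	return 0;
}
```

For example, assigning 0/257 to 1/0 passes this check, while assigning 1/0 to another level-1 qgroup does not.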
    • Dongsheng Yang's avatar
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang authored
      There are two problems in qgroup:

      a) The page cache page size is 4K, so even when we write 1K of data,
      qgroup reserves 4K. As a result, the last 3K in a qgroup
      is not available to the user.

      b) When the user writes inline data, qgroup does not reserve space for
      it, which leaves a window in which we can exceed the limit of a qgroup.

      The main idea of this patch is to reserve write_bytes rather than
      reserve_bytes. That is, qgroup does not care about the amount of space
      btrfs reserves for the user, only about the amount of data the user is
      going to write. It reserves that amount when the user wants to write
      and releases it when the transaction is committed.

      In this way, qgroup is decoupled from the complex procedure in
      btrfs: it only reserves when the user wants to write and accounts
      when the data is written in commit_transaction().
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e2d1f923
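The arithmetic behind problem a) can be sketched as follows. The 4K page size comes from the message; the helper names are hypothetical and the rounding is a simplification of what the write path actually reserves.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* page size assumed from the message */

/* Bytes reserved when the reservation is rounded up to whole pages. */
static uint64_t sketch_page_rounded_bytes(uint64_t write_bytes)
{
	return (write_bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE
		* SKETCH_PAGE_SIZE;
}

/* Quota lost to rounding if qgroup is charged the page-rounded amount. */
static uint64_t sketch_rounding_overhead(uint64_t write_bytes)
{
	return sketch_page_rounded_bytes(write_bytes) - write_bytes;
}
```

For a 1K write the overhead is 3K, which is the unusable tail the message describes; charging `write_bytes` directly makes the overhead zero by construction.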
    • Dongsheng Yang's avatar
      Btrfs: qgroup, Account data space in more proper timings. · 237c0e9f
      Dongsheng Yang authored
      Currently, when writing data, ->reserved is accounted in
      fill_delalloc(), but ->may_use is released in clear_bit_hook(),
      which is called by btrfs_finish_ordered_io(). That is too late:
      between fill_delalloc() and btrfs_finish_ordered_io(),
      the data is accounted twice by qgroup, which causes
      unexpected -EDQUOT errors.
      
      Example:
      	# btrfs quota enable /root/btrfs-auto-test/
      	# btrfs subvolume create /root/btrfs-auto-test//sub
      	Create subvolume '/root/btrfs-auto-test/sub'
      	# btrfs qgroup limit 1G /root/btrfs-auto-test//sub
      	# dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000
      	dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
      	681353+0 records in
      	681352+0 records out
      	697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s
      We got -EDQUOT after 698 MB, although the limit is 1G.

      This patch moves btrfs_qgroup_reserve/free() for data from
      btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space()
      and btrfs_free_reserved_data_space(). The counter in qgroup is then
      updated at the same time as the counter in space_info.
      This eliminates the unexpected -EDQUOT.
      Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      237c0e9f
    • Dongsheng Yang's avatar
      Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. · 31193213
      Dongsheng Yang authored
      Currently, for pre_alloc or delay_alloc, the bytes are accounted
      in space_info by three counters:
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      In qgroup, on the other hand, there are only two counters to account the
      bytes: qgroup->reserved and qgroup->excl. qg->reserved accounts the
      bytes in space_info->bytes_may_use and qg->excl accounts the bytes in
      space_info->used, so the bytes in space_info->reserved are not accounted
      in qgroup. This leaves a window in which we can exceed the quota limit
      while bytes are in space_info->reserved.
      
      Example:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer     excl     max_rfer max_excl parent  child
      -------- ----     ----     -------- -------- ------  -----
      0/5      20987904 20987904 0        10485760 ---     ---
      
      qg->excl (20987904) is larger than max_excl (10485760).

      This patch introduces a new counter named may_use in qgroup, so that
      there are three counters in qgroup to account the bytes in space_info,
      as shown below.
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      qgroup->may_use           --- qgroup->reserved     --- qgroup->excl
      
      With this patch applied:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer    excl    max_rfer max_excl parent  child
      -------- ----    ----    -------- -------- ------  -----
      0/5      9453568 9453568 0        10485760 ---     ---
      Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      31193213
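A rough model of a limit check that counts all three stages is shown below. It is only a sketch: the struct, the function name, and the convention that `max_excl == 0` means "no limit" are assumptions, and metadata overhead is ignored, so the numbers will not match the transcript above exactly.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical qgroup with the three counters named in the message. */
struct sketch_qgroup {
	uint64_t may_use;  /* mirrors space_info->bytes_may_use */
	uint64_t reserved; /* mirrors space_info->reserved */
	uint64_t excl;     /* mirrors space_info->used */
	uint64_t max_excl; /* exclusive limit, 0 meaning "no limit" here */
};

/* Charge bytes to may_use unless all three stages together exceed the limit. */
static int sketch_qgroup_reserve(struct sketch_qgroup *qg, uint64_t bytes)
{
	uint64_t in_flight = qg->may_use + qg->reserved + qg->excl;

	if (qg->max_excl && in_flight + bytes > qg->max_excl)
		return -EDQUOT;
	qg->may_use += bytes;
	return 0;
}
```

Because bytes sitting in any of the three stages count against the limit, a run of 1M preallocations against a 10M limit starts failing once the sum reaches the limit, instead of slipping through while the bytes are in the middle stage.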
    • Dongsheng Yang's avatar
      Btrfs: qgroup: free reserved in exceeding quota. · 804ca127
      Dongsheng Yang authored
      When we exceed the quota limit while writing, we free
      the reserved extent that we need to drop, but we do not free
      the corresponding reservation in qgroup. This means that each time we
      exceed the quota while writing, some space remains in qg->reserved
      that we can no longer use. If this goes on, all of the
      space will be eaten up.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      804ca127
    • Dongsheng Yang's avatar
      btrfs: qgroup: fix limit args override whole limit struct · 03477d94
      Dongsheng Yang authored
      btrfs_limit_qgroup() uses the limit argument to override the whole old
      qgroup_limit of the corresponding qgroup. However, we should override only
      the parts of the old qgroup_limit whose bits are set in the limit argument.
      Signed-off-by: Fan Chengniang <fancn.fnst@cn.fujitsu.com>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      03477d94
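The flag-driven partial update described here can be sketched as below. The struct, the two flag macros, and the function are simplified stand-ins defined locally for illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative flag bits, defined locally for this sketch. */
#define SKETCH_LIMIT_MAX_RFER (1ULL << 0)
#define SKETCH_LIMIT_MAX_EXCL (1ULL << 1)

/* Hypothetical, trimmed-down limit structure. */
struct sketch_limit {
	uint64_t flags;
	uint64_t max_rfer;
	uint64_t max_excl;
};

/* Copy only the fields whose flag bit is set in arg; keep the rest. */
static void sketch_apply_limit(struct sketch_limit *cur,
			       const struct sketch_limit *arg)
{
	if (arg->flags & SKETCH_LIMIT_MAX_RFER)
		cur->max_rfer = arg->max_rfer;
	if (arg->flags & SKETCH_LIMIT_MAX_EXCL)
		cur->max_excl = arg->max_excl;
	cur->flags |= arg->flags;
}
```

With this shape, setting only an exclusive limit leaves a previously configured referenced limit untouched, which is the behaviour the commit asks for.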
    • Dongsheng Yang's avatar
      btrfs: qgroup: update limit info in function btrfs_run_qgroups(). · d3001ed3
      Dongsheng Yang authored
      On commit_transaction(), the qgroups in the btree should be updated,
      but the limit info is currently not considered. This causes a problem
      when the qgroup of a snapshot inherits the limit info from srcqgroup:
      the result is inconsistent.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d3001ed3
    • Dongsheng Yang's avatar
      btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item(). · 1510e71c
      Dongsheng Yang authored
      Cleanup: change the parameters of update_qgroup_limit_item() to match the
      rest of the update_qgroup_xxx_item() family.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1510e71c
    • Dongsheng Yang's avatar
      btrfs: qgroup: update qgroup in memory at the same time when we update it in btree. · e8c8541a
      Dongsheng Yang authored
      When we call btrfs_qgroup_inherit() with BTRFS_QGROUP_INHERIT_SET_LIMITS,
      btrfs updates the limit info of the qgroup in the btree but forgets to update
      the qgroup in the rbtree at the same time, which causes an inconsistency.

      This patch fixes it by updating the rbtree at the same time.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e8c8541a
    • Dongsheng Yang's avatar
      btrfs: qgroup: inherit limit info from srcgroup in creating snapshot. · 3eeb4d59
      Dongsheng Yang authored
      Currently, when we snapshot a subvol, the snapshot does not copy the limits
      from srcqgroup.

      This patch makes the qgroup of the snapshot inherit the limit info when a
      snapshot is created.
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3eeb4d59
    • Zhao Lei's avatar
      btrfs: Support busy loop of write and delete · c99f1b0c
      Zhao Lei authored
      Reproduce:
       while true; do
         dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size]
         rm /mnt/btrfs/file
       done
       The loop above then fails with NO_SPACE.

      This has been a long-standing problem since the very beginning, because
      the delayed iputs after rm are not run.

      We already have commit_transaction() in the alloc_space code, but it is
      not triggered in the above case.
      This patch triggers commit_transaction() to run the delayed iputs and
      refresh pinned space so that the write succeeds.

      It is based on the previous fix of delayed-iput in commit_transaction()
      and needs to be applied on top of:
      btrfs: Fix NO_SPACE bug caused by delayed-iput
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c99f1b0c
    • Zhao Lei's avatar
      btrfs: Fix NO_SPACE bug caused by delayed-iput · d7c15171
      Zhao Lei authored
      Steps to reproduce:
        while true; do
          dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
          rm /btrfs_dir/file
          sync
        done
      
        dd then fails because btrfs returns NO_SPACE.

      Reason:
        Normally, btrfs_commit_transaction() calls btrfs_run_delayed_iputs()
        at its end to free fs space for the next write, but sometimes this is
        not done in time, because the btrfs-cleaner thread has already taken
        the delayed iputs from the list but only does the iput() after the
        next write.

        This is the log:
        [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
      
        [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
        [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
      
        [ 2569.191081] comm=dd begin
        [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
      
        [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
        [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
        ...
        [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
        [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
      
      Fix:
        Make btrfs_commit_transaction() wait at its end until the delayed
        iputs of the currently running btrfs-cleaner are done.

      Test:
        Using a script similar to the above (but more complex),
        before the patch:
          7 failures in 100 * 20 loops.
        after the patch:
          0 failures in 100 * 20 loops.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d7c15171
    • Zhao Lei's avatar
      btrfs: add WARN_ON() to check is space_info op current · 18d018ad
      Zhao Lei authored
      The calculation of space_info's values is somewhat complex and
      bug-prone, so add WARN_ON()s to help debugging.

      Changelog v1->v2:
       Put WARN_ON()s under the ENOSPC_DEBUG mount option.
       Suggested by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      18d018ad
    • Zhao Lei's avatar
      btrfs: Set relative data on clear btrfs_block_group_cache->pinned · c30666d4
      Zhao Lei authored
      Bug1:
        space_info->bytes_readonly was set to a very large (negative) value in
        btrfs_remove_block_group().

      Reason:
        The current code sets block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(),
        but that space was not added to space_info->bytes_readonly.

        Then in btrfs_remove_block_group():
          block_group->space_info->bytes_readonly -= block_group->key.offset;
        We can see the following value in the trace:
          btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728
      
      Bug2:
        space_info->total_bytes_pinned grows to a value larger than the fs size.
        In a 1.2G fs, we can get the following trace log:
        at first:
          ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496
        after some op:
          ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648
        after some op:
          ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064
        ...
      
      Reason:
        Similar to bug1, we also need to adjust space_info->total_bytes_pinned
        in the above code block.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c30666d4
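Why Bug1 shows up as a "very large (negative)" value can be illustrated with plain unsigned arithmetic, using the two numbers from the trace in this commit. The helper names are hypothetical; the point is only that subtracting more than a 64-bit unsigned counter holds wraps around.

```c
#include <stdint.h>

/* Unsigned subtraction as performed on a u64 counter: wraps modulo 2^64. */
static uint64_t sketch_counter_sub(uint64_t counter, uint64_t bytes)
{
	return counter - bytes;
}

/* Returns 1 when subtracting bytes would wrap the counter below zero. */
static int sketch_counter_would_underflow(uint64_t counter, uint64_t bytes)
{
	return bytes > counter;
}
```

With bytes_readonly=12582912 and key.offset=134217728 from the trace, the subtraction wraps to a value just below 2^64, which prints as a small negative number when interpreted as signed.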
    • Zhao Lei's avatar
      btrfs: Adjust commit-transaction condition to avoid NO_SPACE more · 264ca0f6
      Zhao Lei authored
      If we have any chance to make a successful write, we should not give up.

      This patch adjusts the commit-transaction condition from:
        pinned >= wanted
      to
        left + pinned >= wanted
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      264ca0f6
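The difference between the two conditions can be sketched with two small predicates. The parameter names follow the message; the function names are hypothetical and the real decision involves more state than shown here.

```c
#include <stdint.h>

/* Old condition: commit only if pinned space alone covers the request. */
static int sketch_should_commit_old(uint64_t pinned, uint64_t wanted)
{
	return pinned >= wanted;
}

/* New condition: also count the space that is still left. */
static int sketch_should_commit_new(uint64_t left, uint64_t pinned,
				    uint64_t wanted)
{
	return left + pinned >= wanted;
}
```

For instance, with 60 units left, 50 pinned and 100 wanted, the old check gives up even though a commit would free enough space for the write to succeed.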
    • Zhao Lei's avatar
      btrfs: Fix tail space processing in find_free_dev_extent() · f2ab7618
      Zhao Lei authored
      This is another cause of NO_SPACE.

      When we have already found enough free space in the loop and saved it to
      max_hole_start/size, and the tail space contains a pending extent,
      the original, innocent max_hole_start/size are reset on retry.

      As a result, find_free_dev_extent() returns less space than it could,
      which causes NO_SPACE in the user program.
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f2ab7618
    • Zhao Lei's avatar
      btrfs: fix condition of commit transaction · 94b947b2
      Zhao Lei authored
      The old code skips the transaction commit when we don't have enough
      pinned space, but there is another case: block groups freed in the current
      transaction may allow alloc_chunk to succeed.

      This patch modifies the condition to:
      if (have_free_bg || have_pinned_space) commit_transaction()

      Confirmed the above behavior with printk before and after the patch.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      94b947b2
    • Chris Mason's avatar
      Btrfs: fix uninit variable in clone ioctl · de249e66
      Chris Mason authored
      Commit 0d97a64e0 creates a new variable but doesn't always set it up.
      This puts it back to the original method (key.offset + 1) for the cases
      not covered by Filipe's new logic.
      Signed-off-by: Chris Mason <clm@fb.com>
      de249e66
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · ccccf3d6
      Filipe Manana authored
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater than the end offset, which immediately triggers the
      following warning:
      
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      (...)
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      (...)
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      (...)
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      (...)
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
      
        mkfs.btrfs -f $SCRATCH_DEV
        mount $SCRATCH_DEV $SCRATCH_MNT
      
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
      
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      
      A test case for fstests follows soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ccccf3d6
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after extent_same ioctl · 113e8283
      Filipe Manana authored
      If we pass a length of 0 to the extent_same ioctl, we end up locking an
      extent range with a start offset greater than its end offset (if the
      destination file's offset is greater than zero). This results in a warning
      from extent_io.c:insert_state through the following call chain:
      
        btrfs_extent_same()
          btrfs_double_lock()
            lock_extent_range()
              lock_extent(inode->io_tree, offset, offset + len - 1)
                lock_extent_bits()
                  __set_extent_bit()
                    insert_state()
                      --> WARN_ON(end < start)
      
      This leads to an infinite loop when evicting the inode. This is the same
      problem that my previous patch titled
      "Btrfs: fix inode eviction infinite loop after cloning into it" addressed
      but for the extent_same ioctl instead of the clone ioctl.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      113e8283
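The inverted range behind this warning follows directly from the `offset + len - 1` expression in the call chain above. A minimal sketch with hypothetical helper names, showing the arithmetic and an early bail-out for zero-length requests:

```c
#include <stdint.h>

/* Inclusive end offset of a range, computed as offset + len - 1. */
static uint64_t sketch_range_end(uint64_t offset, uint64_t len)
{
	return offset + len - 1;
}

/* Returns 1 if the range may be locked, 0 if the caller should bail out. */
static int sketch_range_is_lockable(uint64_t offset, uint64_t len)
{
	if (len == 0)
		return 0; /* nothing to lock; avoids end < start */
	return sketch_range_end(offset, len) >= offset;
}
```

With offset 4096 and length 0 the computed end is 4095, matching the "end < start 4095 4096" line in the clone ioctl warning quoted earlier. At offset 0 the unsigned end wraps to the maximum value instead, which is consistent with the note that the problem needs a destination offset greater than zero.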
    • Filipe Manana's avatar
      Btrfs: fix range cloning when same inode used as source and destination · df858e76
      Filipe Manana authored
      While searching for extents to clone we might find one where we only use
      a part of it coming from its tail. If our destination inode is the same
      as the source inode, we end up removing the tail part of the extent item
      and inserting after it a new one that points to the same extent with an
      adjusted key file offset and data offset. After this we search for the next
      extent item in the fs/subvol tree with a key that has an offset incremented
      by one. But this second search leaves us at the new extent item we inserted
      previously, and since that extent item has a non-zero data offset, it
      can make us call btrfs_drop_extents with an empty range (start == end)
      which causes the following warning:
      
      [23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]()
      (...)
      [23978.557266] Call Trace:
      [23978.557978]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [23978.559191]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [23978.560699]  [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs]
      [23978.562389]  [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c
      [23978.563613]  [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs]
      [23978.565103]  [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28
      [23978.566294]  [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf
      [23978.567438]  [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs]
      [23978.568702]  [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf
      [23978.569763]  [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb
      [23978.570817]  [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36
      [23978.571872]  [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
      [23978.573466]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
      [23978.574962]  [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs]
      [23978.576179]  [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs]
      [23978.577311]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
      [23978.578520]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
      [23978.580282]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
      (...)
      [23978.591887] ---[ end trace 988ec2a653d03ed3 ]---
      
      Then we attempt to insert a new extent item with a key that already
      exists, which makes btrfs_insert_empty_item return -EEXIST resulting in
      abortion of the current transaction:
      
      [23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      (...)
      [23978.622589] Call Trace:
      [23978.623181]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [23978.624359]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [23978.625573]  [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [23978.626971]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [23978.628003]  [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f
      [23978.629138]  [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [23978.630528]  [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs]
      [23978.631635]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
      [23978.632886]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
      [23978.634119]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
      (...)
      [23978.647714] ---[ end trace 988ec2a653d03ed4 ]---
      
      This is wrong because we should not process the extent item that we just
      inserted; we should instead process the extent item that follows it in
      the tree.
      
      For example, in the test case I wrote for fstests:
      
         bs=$((64 * 1024))
         mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc
         mount /dev/sdc /mnt
      
         xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo
      
         $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo
         $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo
      
      The second clone call fails with -EEXIST, because when we process the
      first extent item (offset 262144), we drop part of it (counting from the
      end) and then insert a new extent item with a key greater than the key we
      found. The next time we search the tree, we search for a key with offset
      262144 + 1, which leaves us at the new extent item we have just inserted,
      but we treat it as if it referred to an extent that we still need to clone.
      
      Fix this by ensuring the next search key uses an offset corresponding to
      the offset of the key we found previously plus the data length of the
      corresponding extent item. This ensures we skip new extent items that we
      inserted and works for the case of implicit holes too (NO_HOLES feature).
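
      The difference between the two search rules can be sketched in a few
      lines of Python. This is only a toy model: the list of key offsets is an
      assumption inferred from the test case above (it is not a dump of a real
      tree), and first_key_at_or_after is a hypothetical helper standing in
      for the tree search.

```python
from bisect import bisect_left

BS = 64 * 1024  # block size used by the test case above

# Assumed key offsets of the file extent items after the item found at
# offset 4 * BS (original data length 263 * BS) was split and a new
# item was inserted for the clone destination. Illustrative only.
key_offsets = [4 * BS, 95 * BS, 145 * BS, 267 * BS]


def first_key_at_or_after(offsets, search_offset):
    """Hypothetical stand-in for the tree search: first key >= search_offset."""
    i = bisect_left(offsets, search_offset)
    return offsets[i] if i < len(offsets) else None


found_offset = 4 * BS   # key offset of the item we just processed
found_len = 263 * BS    # its data length, read before the item was modified

# Old rule: resume one byte past the key we found.
old_next = first_key_at_or_after(key_offsets, found_offset + 1)
# Fixed rule: resume at the end of the extent we found.
new_next = first_key_at_or_after(key_offsets, found_offset + found_len)

print(old_next // BS, new_next // BS)  # prints "95 267"
```

      Under these assumptions the old rule lands on the item at 95 * BS, which
      was just inserted, while the fixed rule resumes at 267 * BS.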
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      df858e76
  2. 10 Apr, 2015 11 commits
    • Btrfs: fix use after free when close_ctree frees the orphan_rsv · cdfb080e
      Chris Mason authored
      Near the end of close_ctree, we're calling btrfs_free_block_rsv
      to free up the orphan rsv.  The problem is this call updates the
      space_info, which has already been freed.
      
      This adds a new __ function that directly calls kfree instead of trying
      to update the space infos.
      Signed-off-by: Chris Mason <clm@fb.com>
      cdfb080e
    • Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason authored
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is correct, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
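
      A rough sketch of the idea, assuming a simple model in which an early
      pass writes every dirty block group and the final pass inside the
      critical section only redoes the ones that changed in the meantime. The
      function name and the block group labels are hypothetical, not kernel
      identifiers.

```python
def write_block_group_caches(dirty, redirtied):
    """Model of a two-pass cache writeout.

    dirty:     block groups that are dirty when the commit starts
    redirtied: block groups modified while the early pass was running
    Returns (early_pass, final_pass) as lists of block groups written.
    """
    # Early pass: runs while other writers are still allowed in.
    early_pass = list(dirty)
    early_set = set(early_pass)

    # Final pass: runs in the critical section and only handles block
    # groups whose cache may be stale or was never written.
    final_pass = [bg for bg in early_pass if bg in redirtied]
    final_pass += [bg for bg in sorted(redirtied) if bg not in early_set]
    return early_pass, final_pass


early, final = write_block_group_caches(["bg0", "bg1", "bg2", "bg3"], {"bg1"})
print(len(early), final)  # prints "4 ['bg1']"
```

      In this model only one of the four block groups has to be rewritten
      while other writers are blocked.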
      Signed-off-by: Chris Mason <clm@fb.com>
      1bbc621e
    • Btrfs: don't use highmem for free space cache pages · 2b108268
      Chris Mason authored
      In order to create the free space cache concurrently with FS modifications,
      we need to take a few block group locks.
      
      The cache code also does kmap, which would schedule with the locks held.
      Instead of going through kmap_atomic, let's just use lowmem for the cache
      pages.
      Signed-off-by: Chris Mason <clm@fb.com>
      2b108268
    • Btrfs: two stage dirty block group writeout · c9dc4c65
      Chris Mason authored
      Block group cache writeout is currently waiting on the pages for each
      block group cache before moving on to writing the next one.  This commit
      switches things around to send down all the caches and then wait on them
      in batches.
      
      The end result is much faster, since we're keeping the disk pipeline
      full.
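
      The change in ordering can be illustrated with a small event log. This
      is a sketch only: submit and wait are hypothetical callbacks standing in
      for starting and waiting on the cache IO, and the waits are modeled as a
      single batch.

```python
def write_serial(caches, submit, wait):
    """Old ordering: wait for each cache before submitting the next."""
    for cache in caches:
        submit(cache)
        wait(cache)


def write_two_stage(caches, submit, wait):
    """New ordering: submit everything first, then wait."""
    for cache in caches:
        submit(cache)
    for cache in caches:
        wait(cache)


def record(write_fn, caches):
    """Run write_fn and return the ordered list of (event, cache) pairs."""
    events = []
    write_fn(caches,
             lambda c: events.append(("submit", c)),
             lambda c: events.append(("wait", c)))
    return events


serial_events = record(write_serial, ["bg0", "bg1", "bg2"])
two_stage_events = record(write_two_stage, ["bg0", "bg1", "bg2"])
print([name for name, _ in two_stage_events])
```

      In the two stage log all submits come before the first wait, which is
      what keeps several requests in flight at once.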
      Signed-off-by: Chris Mason <clm@fb.com>
      c9dc4c65
    • btrfs: move struct io_ctl into ctree.h and rename it · 4c6d1d85
      Chris Mason authored
      We'll need to put the io_ctl into the block_group cache struct, so
      name it struct btrfs_io_ctl and move it into ctree.h.
      Signed-off-by: Chris Mason <clm@fb.com>
      4c6d1d85
    • Btrfs: don't steal from the global reserve if we don't have the space · 3bce876f
      Josef Bacik authored
      btrfs_evict_inode() needs to be more careful about stealing from the
      global_rsv.  We don't want to end up aborting commit with ENOSPC just
      because the evict_inode code was too greedy.
      Signed-off-by: Chris Mason <clm@fb.com>
      3bce876f
    • Btrfs: don't commit the transaction in the async space flushing · 365c5313
      Josef Bacik authored
      We're triggering a huge number of commits from
      btrfs_async_reclaim_metadata_space.  These aren't really required,
      because everyone calling the async reclaim code is going to end up
      triggering a commit on their own.
      Signed-off-by: Chris Mason <clm@fb.com>
      365c5313
    • Btrfs: reserve space for block groups · cb723e49
      Josef Bacik authored
      This changes our delayed refs calculations to include the space needed
      to write back dirty block groups.
      Signed-off-by: Chris Mason <clm@fb.com>
      cb723e49
    • Btrfs: refill block reserves during truncate · 28f75a0e
      Chris Mason authored
      When truncate starts, it allocates some space in the block reserves so
      that we'll have enough to update metadata along the way.
      
      For very large files, we can easily go through all of that space as we
      loop through the extents.  This changes truncate to refill the space
      reservation as it progresses through the file.
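
      As a rough illustration, the loop below models a reservation that is
      topped up whenever the next extent would not fit. The units and the
      function name are hypothetical, and the refill is assumed to always
      succeed and to cover at least one extent.

```python
def truncate_extents(extent_costs, initial_reserve, refill_to):
    """Walk the extents of a file, refilling the reservation on demand.

    extent_costs:    metadata space each extent needs (abstract units)
    initial_reserve: space reserved when truncate starts
    refill_to:       level the reservation is topped up to (>= any cost)
    Returns (number_of_refills, space_left_at_the_end).
    """
    reserve = initial_reserve
    refills = 0
    for cost in extent_costs:
        if reserve < cost:
            # Stand-in for refilling the block reserve mid-truncate.
            reserve = refill_to
            refills += 1
        reserve -= cost
    return refills, reserve


print(truncate_extents([1] * 10, initial_reserve=4, refill_to=4))  # prints "(2, 2)"
```

      Without the refill step, the same walk would run out of reserved space
      after the fourth extent.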
      Signed-off-by: Chris Mason <clm@fb.com>
      28f75a0e
    • Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik authored
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: Chris Mason <clm@fb.com>
      1262133b
    • btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason authored
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
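
      The effect of checking more often can be sketched with a counter of
      pending delayed refs. The threshold, the per-extent ref count, and the
      function name are assumptions made for illustration; the commit message
      does not give the actual values or the check that is used.

```python
def delete_extents(num_extents, refs_per_extent, threshold):
    """Model deleting extents while periodically running delayed refs.

    Returns (flushes, max_pending, pending_at_end).
    """
    pending = 0
    max_pending = 0
    flushes = 0
    for _ in range(num_extents):
        pending += refs_per_extent
        max_pending = max(max_pending, pending)
        if pending >= threshold:
            # Stand-in for backing off and running the delayed refs.
            pending = 0
            flushes += 1
    return flushes, max_pending, pending


print(delete_extents(10, refs_per_extent=3, threshold=8))    # prints "(3, 9, 3)"
print(delete_extents(10, refs_per_extent=3, threshold=1000)) # prints "(0, 30, 30)"
```

      With the check in the loop the backlog stays bounded; without it, all of
      the work is left for commit time.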
      Signed-off-by: Chris Mason <clm@fb.com>
      28ed1345