1. 10 Mar, 2014 40 commits
    • Filipe Manana's avatar
      Btrfs: avoid unnecessary utimes update in incremental send · fcbd2154
      Filipe Manana authored
      When we're finishing processing of an inode, if we're dealing with a
      directory inode that has a pending move/rename operation, we don't
      need to send a utimes update instruction to the send stream, as we'll
      do it later after doing the move/rename operation. Therefore we save
      some time here building paths and doing btree lookups.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      fcbd2154
    • Filipe Manana's avatar
      Btrfs: make defrag not fragment files when using prealloc extents · e2127cf0
      Filipe Manana authored
      When using prealloc extents, a file defragment operation may actually
      fragment the file and increase the amount of data space used by the file.
      This change fixes that behaviour.
      
      Example:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt
      $ cd /mnt
      $ xfs_io -f -c 'falloc 0 1048576' foobar && sync
      $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
      
      Before defragmenting the file:
      
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.25MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 0 nr 4096
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 4096 nr 102400 ram 1048576
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 106496 nr 90112
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 196608 nr 106496 ram 1048576
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 897024 nr 106496 ram 1048576
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 1003520 nr 45056
      (...)
      
      Now defragmenting the file results in more data space used than before:
      
      $ btrfs filesystem defragment -f foobar && sync
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.55MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      And the corresponding file extent items are now no longer perfectly sequential
      as before, and we're now needlessly using more space from data block groups:
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 0 nr 4096 ram 1048576
      		extent compression 0
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 13893632 nr 102400
      		extent data offset 0 nr 102400 ram 102400
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 106496 nr 90112 ram 1048576
      		extent compression 0
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 13996032 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 14102528 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 1003520 nr 45056 ram 1048576
      		extent compression 0
      (...)
      
      With this change, the above example will no longer cause allocation of new data
      space nor change the sequentiality of the file extents, that is, defragment will
      be effectless, leaving all extent items pointing to the extent starting at disk
      byte 12845056.
      
      In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
      400Mb each, initially consisting of a single prealloc extent of 400Mb, having
      random writes happening at a low rate, lead to a total of over ~17Gb of data
      space used, not far from eventually reaching an ENOSPC state.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e2127cf0
    • Filipe Manana's avatar
      Btrfs: correctly flush data on defrag when compression is enabled · dec8ef90
      Filipe Manana authored
      When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
      enabled, we weren't flushing completely, as writing compressed extents
      is a 2 steps process, one to compress the data and another one to write
      the compressed data to disk.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      dec8ef90
    • Qu Wenruo's avatar
      btrfs: Cleanup the "_struct" suffix in btrfs_workequeue · d458b054
      Qu Wenruo authored
      Since the "_struct" suffix is mainly used for distinguish the differnt
      btrfs_work between the original and the newly created one,
      there is no need using the suffix since all btrfs_workers are changed
      into btrfs_workqueue.
      
      Also this patch fixed some codes whose code style is changed due to the
      too long "_struct" suffix.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      d458b054
    • Qu Wenruo's avatar
      btrfs: Cleanup the old btrfs_worker. · a046e9c8
      Qu Wenruo authored
      Since all the btrfs_worker is replaced with the newly created
      btrfs_workqueue, the old codes can be easily remove.
      Signed-off-by: default avatarQuwenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      a046e9c8
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue. · 0339ef2f
      Qu Wenruo authored
      Replace the fs_info->scrub_* with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      0339ef2f
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue. · fc97fab0
      Qu Wenruo authored
      Replace the fs_info->qgroup_rescan_worker with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      fc97fab0
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue. · 5b3bc44e
      Qu Wenruo authored
      Replace the fs_info->delayed_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      5b3bc44e
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue. · dc6e3209
      Qu Wenruo authored
      Replace the fs_info->fixup_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      dc6e3209
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue. · 736cfa15
      Qu Wenruo authored
      Replace the fs_info->readahead_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      736cfa15
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue. · e66f0bb1
      Qu Wenruo authored
      Replace the fs_info->cache_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e66f0bb1
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue. · d05a33ac
      Qu Wenruo authored
      Replace the fs_info->rmw_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      d05a33ac
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue. · fccb5d86
      Qu Wenruo authored
      Replace the fs_info->endio_* workqueues with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      fccb5d86
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->flush_workers with btrfs_workqueue. · a44903ab
      Qu Wenruo authored
      Replace the fs_info->submit_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      a44903ab
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->submit_workers with btrfs_workqueue. · a8c93d4e
      Qu Wenruo authored
      Much like the fs_info->workers, replace the fs_info->submit_workers
      use the same btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      a8c93d4e
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue · afe3d242
      Qu Wenruo authored
      Much like the fs_info->workers, replace the fs_info->delalloc_workers
      use the same btrfs_workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      afe3d242
    • Qu Wenruo's avatar
      btrfs: Replace fs_info->workers with btrfs_workqueue. · 5cdc7ad3
      Qu Wenruo authored
      Use the newly created btrfs_workqueue_struct to replace the original
      fs_info->workers
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      5cdc7ad3
    • Qu Wenruo's avatar
      btrfs: Add threshold workqueue based on kernel workqueue · 0bd9289c
      Qu Wenruo authored
      The original btrfs_workers has thresholding functions to dynamically
      create or destroy kthreads.
      
      Though there is no such function in kernel workqueue because the worker
      is not created manually, we can still use the workqueue_set_max_active
      to simulated the behavior, mainly to achieve a better HDD performance by
      setting a high threshold on submit_workers.
      (Sadly, no resource can be saved)
      
      So in this patch, extra workqueue pending counters are introduced to
      dynamically change the max active of each btrfs_workqueue_struct, hoping
      to restore the behavior of the original thresholding function.
      
      Also, workqueue_set_max_active use a mutex to protect workqueue_struct,
      which is not meant to be called too frequently, so a new interval
      mechanism is applied, that will only call workqueue_set_max_active after
      a count of work is queued. Hoping to balance both the random and
      sequence performance on HDD.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      0bd9289c
    • Qu Wenruo's avatar
      btrfs: Add high priority workqueue support for btrfs_workqueue_struct · 1ca08976
      Qu Wenruo authored
      Add high priority function to btrfs_workqueue.
      
      This is implemented by embedding a btrfs_workqueue into a
      btrfs_workqueue and use some helper functions to differ the normal
      priority wq and high priority wq.
      So the high priority wq is completely independent from the normal
      workqueue.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      1ca08976
    • Qu Wenruo's avatar
      btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue · 08a9ff32
      Qu Wenruo authored
      Use kernel workqueue to implement a new btrfs_workqueue_struct, which
      has the ordering execution feature like the btrfs_worker.
      
      The func is executed in a concurrency way, and the
      ordred_func/ordered_free is executed in the sequence them are queued
      after the corresponding func is done.
      
      The new btrfs_workqueue works much like the original one, one workqueue
      for normal work and a list for ordered work.
      When a work is queued, ordered work will be added to the list and helper
      function will be queued into the workqueue.
      The helper function will execute a normal work and then check and execute as many
      ordered work as possible in the sequence they were queued.
      
      At this patch, high priority work queue or thresholding is not added yet.
      The high priority feature and thresholding will be added in the following patches.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      08a9ff32
    • Qu Wenruo's avatar
      btrfs: Cleanup the unused struct async_sched. · f5961d41
      Qu Wenruo authored
      The struct async_sched is not used by any codes and can be removed.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Tested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      f5961d41
    • Liu Bo's avatar
      Btrfs: skip search tree for REG files · 644d1940
      Liu Bo authored
      It is really unnecessary to search tree again for @gen, @mode and @rdev
      in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
      and @gen, @mode and @rdev can easily be fetched.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      644d1940
    • Miao Xie's avatar
      Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie authored
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by end nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      7b2b7085
    • Miao Xie's avatar
      Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie authored
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But according to the old code, we used the size of write range to calculate
      the lock range directly, not considered the offset, we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      c933956d
    • David Sterba's avatar
      9c9ca00b
    • David Sterba's avatar
      btrfs: send: fix old buffer length in fs_path_ensure_buf · 1b2782c8
      David Sterba authored
      In "btrfs: send: lower memory requirements in common case" the code to
      save the old_buf_len was incorrectly moved to a wrong place and broke
      the original logic.
      Reported-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Reviewed-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      1b2782c8
    • Filipe Manana's avatar
      Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana authored
      While droping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram capturing the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      176840b3
    • Filipe Manana's avatar
      Btrfs: more efficient split extent state insertion · f2071b21
      Filipe Manana authored
      When we split an extent state there's no need to start the rbtree search
      from the root node - we can start it from the original extent state node,
      since we would end up in its subtree if we do the search starting at the
      root node anyway.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      f2071b21
    • Filipe Manana's avatar
      Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928
      Filipe Manana authored
      We don't need to have an unsigned int field in the extent_map struct
      to tell us whether the extent map is in the inode's extent_map tree or
      not. We can use the rb_node struct field and the RB_CLEAR_NODE and
      RB_EMPTY_NODE macros to achieve the same task.
      
      This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
      64 bits system).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      cbc0e928
    • Wang Shilong's avatar
      Btrfs: skip locking when searching commit root · e84752d4
      Wang Shilong authored
      We won't change commit root, skip locking dance with commit root
      when walking backrefs, this can speed up btrfs send operations.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e84752d4
    • Wang Shilong's avatar
      Btrfs: wake up @scrub_pause_wait as much as we can · 32a44789
      Wang Shilong authored
      check if @scrubs_running=@scrubs_paused condition inside wait_event()
      is not an atomic operation which means we may inc/dec @scrub_running/
      paused at any time. Let's wake up @scrub_pause_wait as much as we can
      to let commit transaction blocked less.
      
      An example below:
      
      Thread1				Thread2
      |->scrub_blocked_if_needed()	|->scrub_pending_trans_workers_inc
        |->increase @scrub_paused
                                             |->increase @scrub_running
        |->wake up scrub_pause_wait list
                                             |->scrub blocked
                                             |->increase @scrub_paused
      
      Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
      So after Thread2 increase @scrub_paused, we meet the condition
      @scrub_paused=@scrub_running, but transaction will be still blocked until
      another calling to wake up @scrub_pause_wait.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      32a44789
    • Wang Shilong's avatar
      Btrfs: cancel scrub on transaction abortion · c0af8f0b
      Wang Shilong authored
      If we fail to commit transaction, we'd better
      cancel scrub operations.
      Suggested-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      c0af8f0b
    • Wang Shilong's avatar
      Btrfs: device_replace: fix deadlock for nocow case · 12cf9372
      Wang Shilong authored
      commit cb7ab021 cause a following deadlock found by
      xfstests,btrfs/011:
      
      Thread1 is commiting transaction which is blocked at
      btrfs_scrub_pause().
      
      Thread2 is calling btrfs_file_aio_write() which has held
      inode's @i_mutex and commit transaction(blocked because
      Thread1 is committing transaction).
      
      Thread3 is copy_nocow_page worker which will also try to
      hold inode @i_mutex, so thread3 will wait Thread1 finished.
      
      Thread4 is waiting pending workers finished which will wait
      Thread3 finished. So the problem is like this:
      
      Thread1--->Thread4--->Thread3--->Thread2---->Thread1
      
      Deadlock happens! we fix it by letting Thread1 go firstly,
      which means we won't block transaction commit while we are
      waiting pending workers finished.
      Reported-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      12cf9372
    • Wang Shilong's avatar
      Btrfs: fix a possible deadlock between scrub and transaction committing · 6cf7f77e
      Wang Shilong authored
      btrfs_scrub_continue() will be called when cleaning up transaction.However,
      this can only be called if btrfs_scrub_pause() is called before.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6cf7f77e
    • Sachin Kamat's avatar
      btrfs: Use PTR_ERR_OR_ZERO · 886322e8
      Sachin Kamat authored
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: default avatarSachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      886322e8
    • Filipe Manana's avatar
      Btrfs: fix send issuing outdated paths for utimes, chown and chmod · bf0d1f44
      Filipe Manana authored
      When doing an incremental send, if we had a directory pending a move/rename
      operation and none of its parents, except for the immediate parent, were
      pending a move/rename, after processing the directory's references, we would
      be issuing utimes, chown and chmod intructions against am outdated path - a
      path which matched the one in the parent root.
      
      This change also simplifies a bit the code that deals with building a path
      for a directory which has a move/rename operation delayed.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/d/e
          $ mkdir /mnt/btrfs/a/b/c/f
          $ chmod 0777 /mnt/btrfs/a/b/c/d/e
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
          $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
          $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
          $ chmod 0700 /mnt/btrfs/a/b/f2/e2
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: chmod a/b/c/d/e failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      bf0d1f44
    • Filipe Manana's avatar
      Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293
      Filipe Manana authored
      Just comparing the pointers (logical disk addresses) of the btree nodes is
      not completely bullet proof, we have to check if their generation numbers
      match too.
      
      It is guaranteed that a COW operation will result in a block with a different
      logical disk address than the original block's address, but over time we can
      reuse that former logical disk address.
      
      For example, creating a 2Gb filesystem on a loop device, and having a script
      running in a loop always updating the access timestamp of a file, resulted in
      the same logical disk address being reused for the same fs btree block in about
      only 4 minutes.
      
      This could make us skip entire subtrees when doing an incremental send (which
      is currently the only user of btrfs_compare_trees). However the odds of getting
      2 blocks at the same tree level, with the same logical disk address, equal first
      slot keys and different generations, should hopefully be very low.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6baa4293
    • Filipe Manana's avatar
      Btrfs: fix send attempting to rmdir non-empty directories · 9dc44214
      Filipe Manana authored
      The incremental send algorithm assumed that it was possible to issue
      a directory remove (rmdir) if the the inode number it was currently
      processing was greater than (or equal) to any inode that referenced
      the directory's inode. This wasn't a valid assumption because any such
      inode might be a child directory that is pending a move/rename operation,
      because it was moved into a directory that has a higher inode number and
      was moved/renamed too - in other words, the case the following commit
      addressed:
      
          9f03740a
          (Btrfs: fix infinite path build loops in incremental send)
      
      This made an incremental send issue an rmdir operation before the
      target directory was actually empty, which made btrfs receive fail.
      Therefore it needs to wait for all pending child directory inodes to
      be moved/renamed before sending an rmdir operation.
      
      Simple steps to reproduce this issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/x
          $ mkdir /mnt/btrfs/a/b/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
          $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. Directory not empty
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      9dc44214
    • Filipe Manana's avatar
      Btrfs: send, don't send rmdir for same target multiple times · 29d6d30f
      Filipe Manana authored
      When doing an incremental send, if we delete a directory that has N > 1
      hardlinks for the same file and that file has the highest inode number
      inside the directory contents, an incremental send would send N times an
      rmdir operation against the directory. This made the btrfs receive command
      fail on the second rmdir instruction, as the target directory didn't exist
      anymore.
      
      Steps to reproduce the issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c
          $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
          $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ rm -f /mnt/btrfs/a/b/c/foo.txt
          $ rm -f /mnt/btrfs/a/b/c/bar.txt
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      29d6d30f
    • Filipe Manana's avatar
      Btrfs: incremental send, fix invalid path after dir rename · 2b863a13
      Filipe Manana authored
      This fixes yet one more case not caught by the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      In this case, even before the initial full send, we have a directory
      which is a child of a directory with a higher inode number. Then we
      perform the initial send, and after we rename both the child and the
      parent, without moving them around. After doing these 2 renames, an
      incremental send sent a rename instruction for the child directory
      which contained an invalid "from" path (referenced the parent's old
      name, not the new one), which made the btrfs receive command fail.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b
          $ mkdir /mnt/btrfs/d
          $ mkdir /mnt/btrfs/a/b/c
          $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
          $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umout /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
        "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      2b863a13