1. 19 Apr, 2021 40 commits
    • Josef Bacik's avatar
      btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation · ffe30dd8
      Josef Bacik authored
      A few of these are checking for correctness, and won't be triggered by
      corrupted file systems, so convert them to ASSERT() instead of BUG_ON()
      and add a comment explaining their existence.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffe30dd8
    • Matthew Wilcox (Oracle)'s avatar
      btrfs: add and use readahead_batch_length · 32c0a6bc
      Matthew Wilcox (Oracle) authored
      Implement readahead_batch_length() to determine the number of bytes in
      the current batch of readahead pages and use it in btrfs. Also use the
      readahead_pos to get the offset.
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32c0a6bc
    • Wan Jiabing's avatar
      btrfs: move forward declarations to the beginning of extent_io.h · 183ebab7
      Wan Jiabing authored
      There are two forward declarations deep in extent_io.h, move them
      to the beginning and remove the duplicate one.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarWan Jiabing <wanjiabing@vivo.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      183ebab7
    • Qu Wenruo's avatar
      btrfs: subpage: add overview comments · 894d1378
      Qu Wenruo authored
      This patch adds an overview how btrfs subpage support works:
      
      - limitations
      - behavior
      - basic implementation points
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      894d1378
    • Qu Wenruo's avatar
      btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible · 5a2c6075
      Qu Wenruo authored
      Current set_btree_ioerr() only accepts @page parameter and grabs extent
      buffer from page::private.  This works fine for sector size == PAGE_SIZE
      case, but not for subpage case.
      
      Add an extra parameter, @eb, for callers to pass extent buffer to this
      function, so that subpage code can reuse this function.
      
      And also add subpage special handling to update
      btrfs_subpage::error_bitmap.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a2c6075
    • Qu Wenruo's avatar
      btrfs: make set/clear_extent_buffer_dirty() subpage compatible · 0d27797e
      Qu Wenruo authored
      For set_extent_buffer_dirty() to support subpage sized metadata, just
      call btrfs_page_set_dirty() to handle both cases.
      
      For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
      only if all extent buffers in the page range are no longer dirty.
      Also do the same for page error.
      
      This is pretty different from the existing clear_extent_buffer_dirty()
      routine, so add a new helper function,
      clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
      
      Also since the main part of clearing page dirty code is still the same,
      extract that into btree_clear_page_dirty() so that it can be utilized
      for both cases.
      
      But there is a special race between set_extent_buffer_dirty() and
      clear_extent_buffer_dirty(), where we can clear the page dirty.
      
      [POSSIBLE RACE WINDOW]
      For the race window between clear_subpage_extent_buffer_dirty() and
      set_extent_buffer_dirty(), due to the fact that we can't call
      clear_page_dirty_for_io() under subpage spin lock, we can race like
      below:
      
         T1 (eb1 in the same page)	|  T2 (eb2 in the same page)
       -------------------------------+------------------------------
       set_extent_buffer_dirty()	| clear_extent_buffer_dirty()
       |- was_dirty = false;		| |- clear_subpagE_extent_buffer_dirty()
       |				|    |- btrfs_clear_and_test_dirty()
       |				|    |  Since eb2 is the last dirty page
       |				|    |  we got:
       |				|    |  last == true;
       |				|    |
       |- btrfs_page_set_dirty()	|    |
       |  We set the page dirty and   |    |
       |  subpage dirty bitmap	|    |
       |				|    |- if (last)
       |				|    |  Since we don't have subpage lock
       |				|    |  held, now @last is no longer
       |				|    |  correct
       |				|    |- btree_clear_page_dirty()
       |				|	Now PageDirty == false, even if
       |				|       we have dirty_bitmap not zero.
       |- ASSERT(PageDirty());	|
          ^^^^ CRASH
      
      The solution here is to also lock the eb->pages[0] for subpage case of
      set_extent_buffer_dirty(), to prevent racing with
      clear_extent_buffer_dirty().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d27797e
    • Qu Wenruo's avatar
      btrfs: support page uptodate assertions in subpage mode · b8f95771
      Qu Wenruo authored
      There are quite some assert checks on page uptodate in extent buffer
      write accessors.  They ensure the destination page is already uptodate.
      
      This is fine for regular sector size case, but not for subpage case, as
      for subpage we only mark the page uptodate if the page contains no hole
      and all its extent buffers are uptodate.
      
      So instead of checking PageUptodate(), for subpage case we check the
      uptodate bitmap of btrfs_subpage structure.
      
      To make the check more elegant, introduce a helper,
      assert_eb_page_uptodate() to do the check for both subpage and regular
      sector size cases.
      
      The following functions are involved:
      
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer()
      - extent_buffer_test_bit()
      - extent_buffer_bitmap_set()
      - extent_buffer_bitmap_clear()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8f95771
    • Qu Wenruo's avatar
      btrfs: make alloc_extent_buffer() check subpage dirty bitmap · 1e5eb3d6
      Qu Wenruo authored
      In alloc_extent_buffer(), we make sure that the newly allocated page is
      never dirty.
      
      This is fine for sector size == PAGE_SIZE case, but for subpage it's
      possible that one extent buffer in the page is dirty, thus the whole
      page is marked dirty, and could cause false alert.
      
      To support subpage, call btrfs_page_test_dirty() to handle both cases.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e5eb3d6
    • Qu Wenruo's avatar
      btrfs: subpage: support metadata checksum calculation at write time · eca0f6f6
      Qu Wenruo authored
      Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
      dirty extent buffers in one bvec.
      
      Also extract the code of calculating csum for one extent buffer into
      csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
      and the new csum_dirty_subpage_buffers() can reuse the same routine.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eca0f6f6
    • Qu Wenruo's avatar
      btrfs: subpage: do more sanity checks on metadata page dirtying · 139e8cd3
      Qu Wenruo authored
      For btree_set_page_dirty(), we should also check the extent buffer
      sanity for subpage support.
      
      Unlike the regular sector size case, since one page can contain multiple
      extent buffers, we need to make sure there is at least one dirty extent
      buffer in the page.
      
      So this patch will iterate through the btrfs_subpage::dirty_bitmap
      to get the extent buffers, and check if any dirty extent buffer in the page
      range has EXTENT_BUFFER_DIRTY and proper refs.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      139e8cd3
    • Qu Wenruo's avatar
      btrfs: subpage: introduce helpers for writeback status · 3470da3b
      Qu Wenruo authored
      Introduces the following functions to handle subpage writeback status:
      
      - btrfs_subpage_set_writeback()
      - btrfs_subpage_clear_writeback()
      - btrfs_subpage_test_writeback()
        These helpers can only be called when the range is ensured to be
        inside the page.
      
      - btrfs_page_set_writeback()
      - btrfs_page_clear_writeback()
      - btrfs_page_test_writeback()
        These helpers can handle both regular sector size and subpage without
        problem.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3470da3b
    • Qu Wenruo's avatar
      btrfs: subpage: introduce helpers for dirty status · d8a5713e
      Qu Wenruo authored
      Introduce the following functions to handle subpage dirty status:
      
      - btrfs_subpage_set_dirty()
      - btrfs_subpage_clear_dirty()
      - btrfs_subpage_test_dirty()
        These helpers can only be called when the range is ensured to be
        inside the page.
      
      - btrfs_page_set_dirty()
      - btrfs_page_clear_dirty()
      - btrfs_page_test_dirty()
        These helpers can handle both regular sector size and subpage without
        problem.
        Thus they would be used to replace PageDirty() related calls in
        later patches.
      
      There is one special point to note here, just like set_page_dirty() and
      clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
      btrfs_*page_clear_dirty() must be called with page locked.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8a5713e
    • Qu Wenruo's avatar
      btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage() · d239bcb8
      Qu Wenruo authored
      In btrfs_invalidatepage() we re-declare @tree variable as
      btrfs_ordered_inode_tree.
      
      Since it's only used to do the spinlock, we can grab it from inode
      directly, and remove the unnecessary declaration completely.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d239bcb8
    • Qu Wenruo's avatar
      btrfs: use min() to replace open-code in btrfs_invalidatepage() · ac5804eb
      Qu Wenruo authored
      In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
      update ordered->truncated_len.  But we can use min() to replace it
      completely and no need for the variable.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac5804eb
    • Qu Wenruo's avatar
      btrfs: add sysfs interface for supported sectorsize · fc57ad8d
      Qu Wenruo authored
      Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes.
      
      Currently all architectures have PAGE_SIZE, There's some disparity
      between read-only and read-write support but that will be unified in the
      future so there's only one file exporting the size.
      
      The read-only support for systems with 64K pages also works for 4K
      sector size.
      
      This new sysfs interface would help eg. mkfs.btrfs to print more
      accurate warnings about potentially incompatible option combinations.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc57ad8d
    • Filipe Manana's avatar
      btrfs: improve btree readahead for full send operations · ace75066
      Filipe Manana authored
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree, which despite bringing good
      performance benefits, it could be improved in a few aspects for use cases
      such as full send operations, which are guaranteed to visit every node
      and leaf of a btree, in ascending and sequential order. The limitations
      of that standard btree readahead implementation are the following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ace75066
    • Filipe Manana's avatar
      btrfs: fix exhaustion of the system chunk array due to concurrent allocations · eafa4fd0
      Filipe Manana authored
      When we are running out of space for updating the chunk tree, that is,
      when we are low on available space in the system space info, if we have
      many task concurrently allocating block groups, via fallocate for example,
      many of them can end up all allocating new system chunks when only one is
      needed. In extreme cases this can lead to exhaustion of the system chunk
      array, which has a size limit of 2048 bytes, and results in a transaction
      abort with errno EFBIG, producing a trace in dmesg like the following,
      which was triggered on a PowerPC machine with a node/leaf size of 64K:
      
        [1359.518899] ------------[ cut here ]------------
        [1359.518980] BTRFS: Transaction aborted (error -27)
        [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
        [1359.519152] Modules linked in: (...)
        [1359.519239] Supported: Yes, External
        [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G               X    5.3.18-47-default #1 SLE15-SP3
        [1359.519274] NIP:  c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
        [1359.519293] REGS: c00000056890b700 TRAP: 0700   Tainted: G               X     (5.3.18-47-default)
        [1359.519317] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48008222  XER: 00000007
        [1359.519356] CFAR: c00000000013e170 IRQMASK: 0
        [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
        [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
        [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
        [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
        [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
        [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
        [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
        [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
        [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
        [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
        [1359.519626] Call Trace:
        [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
        [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
        [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
        [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
        [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
        [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
        [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
        [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
        [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
        [1359.520037] Instruction dump:
        [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
        [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
        [1359.520122] ---[ end trace d6c186e151022e20 ]---
      
      The following steps explain how we can end up in this situation:
      
      1) Task A is at check_system_chunk(), either because it is allocating a
         new data or metadata block group, at btrfs_chunk_alloc(), or because
         it is removing a block group or turning a block group RO. It does not
         matter why;
      
      2) Task A sees that there is not enough free space in the system
         space_info object, that is 'left' is < 'thresh'. And at this point
         the system space_info has a value of 0 for its 'bytes_may_use'
         counter;
      
      3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
         a new system block group (chunk) and then reserves 'thresh' bytes in
         the chunk block reserve with the call to btrfs_block_rsv_add(). This
         changes the chunk block reserve's 'reserved' and 'size' counters by an
         amount of 'thresh', and changes the 'bytes_may_use' counter of the
         system space_info object from 0 to 'thresh'.
      
         Also during its call to btrfs_alloc_chunk(), we end up increasing the
         value of the 'total_bytes' counter of the system space_info object by
         8MiB (the size of a system chunk stripe). This happens through the
         call chain:
      
         btrfs_alloc_chunk()
             create_chunk()
                 btrfs_make_block_group()
                     btrfs_update_space_info()
      
      4) After it finishes the first phase of the block group allocation, at
         btrfs_chunk_alloc(), task A unlocks the chunk mutex;
      
      5) At this point the new system block group was added to the transaction
         handle's list of new block groups, but its block group item, device
         items and chunk item were not yet inserted in the extent, device and
         chunk trees, respectively. That only happens later when we call
         btrfs_finish_chunk_alloc() through a call to
         btrfs_create_pending_block_groups();
      
         Note that only when we update the chunk tree, through the call to
         btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
         of the chunk block reserve as we COW/allocate extent buffers,
         through:
      
         btrfs_alloc_tree_block()
            btrfs_use_block_rsv()
               btrfs_block_rsv_use_bytes()
      
         And the system space_info's 'bytes_may_use' is decremented everytime
         we allocate an extent buffer for COW operations on the chunk tree,
         through:
      
         btrfs_alloc_tree_block()
            btrfs_reserve_extent()
               find_free_extent()
                  btrfs_add_reserved_bytes()
      
         If we end up COWing less chunk btree nodes/leaves than expected, which
         is the typical case since the amount of space we reserve is always
         pessimistic to account for the worst possible case, we release the
         unused space through:
      
         btrfs_create_pending_block_groups()
            btrfs_trans_release_chunk_metadata()
               btrfs_block_rsv_release()
                  block_rsv_release_bytes()
                      btrfs_space_info_free_bytes_may_use()
      
         But before task A gets into btrfs_create_pending_block_groups()...
      
      6) Many other tasks start allocating new block groups through fallocate,
         each one does the first phase of block group allocation in a
         serialized way, since btrfs_chunk_alloc() takes the chunk mutex
         before calling check_system_chunk() and btrfs_alloc_chunk().
      
         However before everyone enters the final phase of the block group
         allocation, that is, before calling btrfs_create_pending_block_groups(),
         new tasks keep coming to allocate new block groups and while at
         check_system_chunk(), the system space_info's 'bytes_may_use' keeps
         increasing each time a task reserves space in the chunk block reserve.
         This means that eventually some other task can end up not seeing enough
         free space in the system space_info and decide to allocate yet another
         system chunk.
      
         This may repeat several times if yet more new tasks keep allocating
         new block groups before task A, and all the other tasks, finish the
         creation of the pending block groups, which is when reserved space
         in excess is released. Eventually this can result in exhaustion of
         system chunk array in the superblock, with btrfs_add_system_chunk()
         returning EFBIG, resulting later in a transaction abort.
      
         Even when we don't reach the extreme case of exhausting the system
         array, most, if not all, unnecessarily created system block groups
         end up being unused since when finishing creation of the first
         pending system block group, the creation of the following ones end
         up not needing to COW nodes/leaves of the chunk tree, so we never
         allocate and deallocate from them, resulting in them never being
         added to the list of unused block groups - as a consequence they
         don't get deleted by the cleaner kthread - the only exceptions are
         if we unmount and mount the filesystem again, which adds any unused
         block groups to the list of unused block groups, if a scrub is
         run, which also adds unused block groups to the unused list, and
         under some circumstances when using a zoned filesystem or async
         discard, which may also add unused block groups to the unused list.
      
      So fix this by:
      
      *) Tracking the number of reserved bytes for the chunk tree per
         transaction, which is the sum of reserved chunk bytes by each
         transaction handle currently being used;
      
      *) When there is not enough free space in the system space_info,
         if there are other transaction handles which reserved chunk space,
         wait for some of them to complete in order to have enough excess
         reserved space released, and then try again. Otherwise proceed with
         the creation of a new system chunk.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eafa4fd0
    • Filipe Manana's avatar
      btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags · b7a7a834
      Filipe Manana authored
      If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
      file that has the S_SYNC attribute set, we totally ignore that and do not
      durably persist the reflink changes. Since a reflink can change the data
      readable from a file (and mtime/ctime, or a file size), it makes sense to
      durably persist (fsync) the source and destination files/ranges.
      
      This was previously discussed at:
      
      https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
      
      The recently introduced test case generic/628, from fstests, exercises
      these scenarios and currently fails without this change.
      
      So make sure we fsync the source and destination files/ranges when either
      of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
      just like XFS already does.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7a7a834
    • Arnd Bergmann's avatar
      btrfs: zoned: bail out in btrfs_alloc_chunk for bad input · bb05b298
      Arnd Bergmann authored
      gcc complains that the ctl->max_chunk_size member might be used
      uninitialized when none of the three conditions for initializing it in
      init_alloc_chunk_ctl_policy_zoned() are true:
      
      In function ‘init_alloc_chunk_ctl_policy_zoned’,
          inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
          inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
      include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
       4998 |         ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
            |                               ^~~
      fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
      fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
       5316 |         struct alloc_chunk_ctl ctl;
            |                                ^~~
      
      If we ever get into this condition, something is seriously
      wrong, as validity is checked in the callers
      
        btrfs_alloc_chunk
          init_alloc_chunk_ctl
            init_alloc_chunk_ctl_policy_zoned
      
      so the same logic as in init_alloc_chunk_ctl_policy_regular()
      and a few other places should be applied. This avoids both further
      data corruption, and the compile-time warning.
      
      Fixes: 1cd6121f ("btrfs: zoned: implement zoned chunk allocator")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb05b298
    • BingJing Chang's avatar
      btrfs: fix a potential hole punching failure · 3227788c
      BingJing Chang authored
      In commit d7781546 ("btrfs: Avoid trucating page or punching hole
      in a already existed hole."), existing holes can be skipped by calling
      find_first_non_hole() to adjust start and len. However, if the given len
      is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
      not be set to zero because (em->start + em->len) is less than
      (start + len). Then the ret will be 1 but len will not be set to 0.
      The propagated non-zero ret will result in fallocate failure.
      
      In the while-loop of btrfs_replace_file_extents(), len is not updated
      every time before it calls find_first_non_hole(). That is, after
      btrfs_drop_extents() successfully drops the last non-hole file extent,
      it may fail with ENOSPC when attempting to drop a file extent item
      representing a hole. The problem can happen. After it calls
      find_first_non_hole(), the cur_offset will be adjusted to be larger
      than or equal to end. However, since the len is not set to zero, the
      break-loop condition (ret && !len) will not be met. After it leaves the
      while-loop, fallocate will return 1, which is an unexpected return
      value.
      
      We're not able to construct a reproducible way to let
      btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
      file extent but with remaining holes left. However, it's quite easy to
      fix. We just need to update and check the len every time before we call
      find_first_non_hole(). To make the while loop more readable, we also
      pull the variable updates to the bottom of loop like this:
        while (cur_offset < end) {
      	  ...
      	  // update cur_offset & len
      	  // advance cur_offset & len in hole-punching case if needed
        }
      Reported-by: default avatarRobbie Ko <robbieko@synology.com>
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarChung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3227788c
    • Naohiro Aota's avatar
      btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex · e75f9fd1
      Naohiro Aota authored
      Commit 6e37d245 ("btrfs: zoned: fix deadlock on log sync") pointed out
      a deadlock warning and removed mutex_{lock,unlock} of fs_info::tree_root->log_mutex.
      While it looks like it always cause a deadlock, we didn't see actual
      deadlock in fstests runs. The reason is log_root_tree->log_mutex !=
      fs_info->tree_root->log_mutex, not taking the same lock. So, the warning
      was actually a false-positive.
      
      Since btrfs_alloc_log_tree_node() is protected only by
      fs_info->tree_root->log_mutex, we can (and should) move the code out of
      the lock scope of log_root_tree->log_mutex and silence the warning.
      
      Fixes: 6e37d245 ("btrfs: zoned: fix deadlock on log sync")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e75f9fd1
    • Josef Bacik's avatar
      btrfs: use percpu_read_positive instead of sum_positive for need_preempt · 2cdb3909
      Josef Bacik authored
      Looking at perf data for a fio workload I noticed that we were spending
      a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
      need_preemptive_reclaim.  This is silly, as we only want to know if we
      have more ordered than delalloc to see if we should be counting the
      delayed items in our threshold calculation.  Change this to
      percpu_read_positive() to avoid the overhead.
      
      I ran this through fsperf to validate the changes, obviously the latency
      numbers in dbench and fio are quite jittery, so take them as you wish,
      but overall the improvements on throughput, iops, and bw are all
      positive.  Each test was run two times, the given value is the average
      of both runs for their respective column.
      
        btrfs ssd normal test results
      
        bufferedrandwrite16g results
             metric         baseline   current          diff
        ==========================================================
        write_io_kbytes     16777216   16777216     0.00%
        read_clat_ns_p99           0          0     0.00%
        write_bw_bytes      1.04e+08   1.05e+08     1.12%
        read_iops                  0          0     0.00%
        write_clat_ns_p50      13888      11840   -14.75%
        read_io_kbytes             0          0     0.00%
        read_io_bytes              0          0     0.00%
        write_clat_ns_p99      35008      29312   -16.27%
        read_bw_bytes              0          0     0.00%
        elapsed                  170        167    -1.76%
        write_lat_ns_min     4221.50    3762.50   -10.87%
        sys_cpu                39.65      35.37   -10.79%
        write_lat_ns_max    2.67e+10   2.50e+10    -6.63%
        read_lat_ns_min            0          0     0.00%
        write_iops          25270.10   25553.43     1.12%
        read_lat_ns_max            0          0     0.00%
        read_clat_ns_p50           0          0     0.00%
      
        dbench60 results
          metric     baseline   current         diff
        ==================================================
        qpathinfo       11.12     12.73    14.52%
        throughput     416.09    445.66     7.11%
        flush         3485.63   1887.55   -45.85%
        qfileinfo        0.70      1.92   173.86%
        ntcreatex      992.60    695.76   -29.91%
        qfsinfo          2.43      3.71    52.48%
        close            1.67      3.14    88.09%
        sfileinfo       66.54    105.20    58.10%
        rename         809.23    619.59   -23.43%
        find            16.88     15.46    -8.41%
        unlink         820.54    670.86   -18.24%
        writex        3375.20   2637.91   -21.84%
        deltree        386.33    449.98    16.48%
        readx            3.43      3.41    -0.60%
        mkdir            0.05      0.03   -38.46%
        lockx            0.26      0.26    -0.76%
        unlockx          0.81      0.32   -60.33%
      
        dio4kbs16threads results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes         5249676       3357150   -36.05%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      89583501.50   57291192.50   -36.05%
        read_iops                     0             0     0.00%
        write_clat_ns_p50        242688        263680     8.65%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99      15826944      36732928   132.09%
        read_bw_bytes                 0             0     0.00%
        elapsed                      61            61     0.00%
        write_lat_ns_min          42704         42095    -1.43%
        sys_cpu                    5.27          3.45   -34.52%
        write_lat_ns_max       7.43e+08      9.27e+08    24.71%
        read_lat_ns_min               0             0     0.00%
        write_iops             21870.97      13987.11   -36.05%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        randwrite2xram results
             metric          baseline       current           diff
        ================================================================
        write_io_kbytes        24831972      28876262    16.29%
        read_clat_ns_p99              0             0     0.00%
        write_bw_bytes      83745273.50   92182192.50    10.07%
        read_iops                     0             0     0.00%
        write_clat_ns_p50         13952         11648   -16.51%
        read_io_kbytes                0             0     0.00%
        read_io_bytes                 0             0     0.00%
        write_clat_ns_p99         50176         52992     5.61%
        read_bw_bytes                 0             0     0.00%
        elapsed                     314           332     5.73%
        write_lat_ns_min        5920.50          5127   -13.40%
        sys_cpu                    7.82          7.35    -6.07%
        write_lat_ns_max       5.27e+10      3.88e+10   -26.44%
        read_lat_ns_min               0             0     0.00%
        write_iops             20445.62      22505.42    10.07%
        read_lat_ns_max               0             0     0.00%
        read_clat_ns_p50              0             0     0.00%
      
        untarfirefox results
        metric    baseline   current        diff
        ==============================================
        elapsed      47.41     47.40   -0.03%
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2cdb3909
    • Filipe Manana's avatar
      btrfs: update outdated comment at btrfs_replace_file_extents() · e2b84217
      Filipe Manana authored
      There is a comment at btrfs_replace_file_extents() that mentions that we
      set the full sync flag on an inode when cloning into a file with a size
      greater than or equals to 16MiB, through try_release_extent_mapping() when
      we truncate the page cache after replacing file extents during a clone
      operation.
      
      That is not true anymore since commit 5e548b32 ("btrfs: do not set
      the full sync flag on the inode during page release"), so update the
      comment to remove that part and rephrase it slightly to make it more
      clear why the full sync flag is set at btrfs_replace_file_extents().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e2b84217
    • Filipe Manana's avatar
      btrfs: update outdated comment at btrfs_orphan_cleanup() · 0c0218e9
      Filipe Manana authored
      btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
      function does not exists since commit cb517eab ("Btrfs: cleanup the
      similar code of the fs root read"). What we use now to find and load dead
      roots is btrfs_find_orphan_roots(). So update the comment and make it a
      bit more detailed about why we can not delete an orphan item for a root.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c0218e9
    • Filipe Manana's avatar
      btrfs: update debug message when checking seq number of a delayed ref · ffbc10a1
      Filipe Manana authored
      We used to encode two different numbers in the tree mod log counter used
      for sequence numbers, one in the upper 32 bits and the other one in the
      lower 32 bits. However that is no longer the case, we stopped doing that
      since commit fcebe456 ("Btrfs: rework qgroup accounting").
      
      So update the debug message at btrfs_check_delayed_seq to stop extracting
      the two 32 bits counters and print instead the 64 bits sequence numbers.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffbc10a1
    • Filipe Manana's avatar
      btrfs: add and use helper to get lowest sequence number for the tree mod log · 4bae7880
      Filipe Manana authored
      There are two places outside the tree mod log module that extract the
      lowest sequence number of the tree mod log. These places end up
      duplicating code and open coding the logic and internal implementation
      details of the tree mod log. So add a helper to the tree mod log module
      and header that returns the lowest sequence number or 0 if there aren't
      any tree mod log users at the moment.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4bae7880
    • Filipe Manana's avatar
      btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() · ffe1d039
      Filipe Manana authored
      At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
      and if so, return immediately and do nothing. However this check can be
      removed, because after it we call tree_mod_need_log(), which returns
      false when given an extent buffer that corresponds to a leaf.
      
      So just remove the leaf check and pass the extent buffer to
      tree_mod_need_log().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffe1d039
    • Filipe Manana's avatar
      btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() · 888dd183
      Filipe Manana authored
      Instead of exposing implementation details of the tree mod log to check
      if there are active tree mod log users at btrfs_free_tree_block(), use
      the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
      way extent-tree.c does not need to known about any of the internals of
      the tree mod log and avoids taking a lock unnecessarily as well.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      888dd183
    • Filipe Manana's avatar
      btrfs: use a bit to track the existence of tree mod log users · bc03f39e
      Filipe Manana authored
      The tree modification log functions are called very frequently, basically
      they are called every time a btree is modified (a pointer added or removed
      to a node, a new root for a btree is set, etc). Because of that, to avoid
      heavy lock contention on the lock that protects the list of tree mod log
      users, we have checks that test the emptiness of the list with a full
      memory barrier before the checks, so that when there are no tree mod log
      users we avoid taking the lock.
      
      Replace the memory barrier and list emptiness check with a test for a new
      bit set at fs_info->flags. This bit is used to indicate when there are
      tree mod log users, set whenever a user is added to the list and cleared
      when the last user is removed from the list. This makes the intention a
      bit more obvious and possibly more efficient (assuming test_bit() may be
      cheaper than a full memory barrier on some architectures).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bc03f39e
    • Filipe Manana's avatar
      btrfs: use booleans where appropriate for the tree mod log functions · 406808ab
      Filipe Manana authored
      Several functions of the tree modification log use integers as booleans,
      so change them to use booleans instead, making their use more clear.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      406808ab
    • Filipe Manana's avatar
      btrfs: move the tree mod log code into its own file · f3a84ccd
      Filipe Manana authored
      The tree modification log, which records modifications done to btrees, is
      quite large and currently spread all over ctree.c, which is a huge file
      already.
      
      To make things better organized, move all that code into its own separate
      source and header files. Functions and definitions that are used outside
      of the module (mostly by ctree.c) are renamed so that they start with a
      "btrfs_" prefix. Everything else remains unchanged.
      
      This makes it easier to go over the tree modification log code every
      time I need to go read it to fix a bug.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3a84ccd
    • Ira Weiny's avatar
      btrfs: integrity-checker: convert block context kmap's to kmap_local_page · 9a002d53
      Ira Weiny authored
      btrfsic_read_block() (which calls kmap()) and
      btrfsic_release_block_ctx() (which calls kunmap()) are always called
      within a single thread of execution.
      
      Therefore the mappings created within these calls can be a thread local
      mapping.
      
      Convert the kmap() of bloc_ctx->pagev to kmap_local_page().  Luckily the
      unmap loops backwards through the array pointer so no adjustment needs
      to be made to the unmapping order.
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a002d53
    • Ira Weiny's avatar
      btrfs: integrity-checker: use kmap_local_page in __btrfsic_submit_bio · 3e037efd
      Ira Weiny authored
      Again there is an array of pointers which must be unmapped in the correct
      order.
      
      Convert the kmap()'s to kmap_local_page() and adjust the unmapping
      to work backwards through the unmapping loop.
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e037efd
    • Ira Weiny's avatar
      btrfs: raid56: convert kmaps to kmap_local_page · 94a0b58d
      Ira Weiny authored
      These kmaps are thread local and don't need to be atomic.  So they can use
      the more efficient kmap_local_page().  However, the mapping of pages in
      the stripes and the additional parity and qstripe pages are a bit
      trickier because the unmapping must occur in the opposite order from the
      mapping.  Furthermore, the pointer array in __raid_recover_end_io() may
      get reordered.
      
      Convert these calls to kmap_local_page() taking care to reverse the
      unmappings of any page arrays as well as being careful with the mappings
      of any special pages such as the parity and qstripe pages.
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94a0b58d
    • Ira Weiny's avatar
      btrfs: convert kmap to kmap_local_page, simple cases · 58c1a35c
      Ira Weiny authored
      Use a simple coccinelle script to help convert the most common
      kmap()/kunmap() patterns to kmap_local_page()/kunmap_local().
      
      Note that some kmaps which were caught by this script needed to be
      handled by hand because of the strict unmapping order of kunmap_local()
      so they are not included in this patch.  But this script got us started.
      
      There's another temp variable added for the final length write to the
      first page so it does not interfere with cpage_out that is used for
      mapping other pages.
      
      The development of this patch was aided by the follow script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap and replace with kmap_local_page then mark kunmap
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      
      @ catch_all @
      expression e, e2;
      @@
      
      (
      -kmap(e)
      +kmap_local_page(e)
      )
      ...
      (
      -kunmap(...)
      +kunmap_local()
      )
      
      // </smpl>
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58c1a35c
    • Johannes Thumshirn's avatar
      btrfs: remove duplicated in_range() macro · cea62800
      Johannes Thumshirn authored
      The in_range() macro is defined twice in btrfs' source, once in ctree.h
      and once in misc.h.
      
      Remove the definition in ctree.h and include misc.h in the files depending
      on it.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cea62800
    • Filipe Manana's avatar
      btrfs: remove stale comment and logic from btrfs_inode_in_log() · 209ecbb8
      Filipe Manana authored
      Currently btrfs_inode_in_log() checks the list of modified extents of the
      inode, and has a comment mentioning why, as it used to be necessary to
      make sure if we did something like the following:
      
        mmap write range A
        mmap write range B
        msync range A (ranged fsync)
        msync range B (ranged fsync)
      
      we ended up with both ranges being logged.
      
      If we did not check it, then the second fsync would do nothing because
      btrfs_inode_in_log() would return true. This was added in 125c4cf9
      ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
      test case generic/325 from fstests exercises that scenario.
      
      However, as of commit 48778179 ("btrfs: make fast fsyncs wait only
      for writeback"), every ranged fsync is now turned into a full ranged fsync
      (operates on the range from 0 to LLONG_MAX), so it is now pointless to
      test of emptiness of the list of modified extents, and the comment is
      clearly outdated.
      
      So just remove the comment and list emptiness check, while also changing
      the function's return type to be a boolean instead of an integer.
      In case one day we get support for ranged fsyncs again, it will be easy
      to notice the check is necessary again, because it will make generic/325
      always fail.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      209ecbb8
    • Filipe Manana's avatar
      btrfs: fix race between marking inode needs to be logged and log syncing · bc0939fc
      Filipe Manana authored
      We have a race between marking that an inode needs to be logged, either
      at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
      btrfs_sync_log(). The following steps describe how the race happens.
      
      1) We are at transaction N;
      
      2) Inode I was previously fsynced in the current transaction so it has:
      
          inode->logged_trans set to N;
      
      3) The inode's root currently has:
      
         root->log_transid set to 1
         root->last_log_commit set to 0
      
         Which means only one log transaction was committed to far, log
         transaction 0. When a log tree is created we set ->log_transid and
         ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
      
      4) One more range of pages is dirtied in inode I;
      
      5) Some task A starts an fsync against some other inode J (same root), and
         so it joins log transaction 1.
      
         Before task A calls btrfs_sync_log()...
      
      6) Task B starts an fsync against inode I, which currently has the full
         sync flag set, so it starts delalloc and waits for the ordered extent
         to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
      
      7) During ordered extent completion we have btrfs_update_inode() called
         against inode I, which in turn calls btrfs_set_inode_last_trans(),
         which does the following:
      
           spin_lock(&inode->lock);
           inode->last_trans = trans->transaction->transid;
           inode->last_sub_trans = inode->root->log_transid;
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         So ->last_trans is set to N and ->last_sub_trans set to 1.
         But before setting ->last_log_commit...
      
      8) Task A is at btrfs_sync_log():
      
         - it increments root->log_transid to 2
         - starts writeback for all log tree extent buffers
         - waits for the writeback to complete
         - writes the super blocks
         - updates root->last_log_commit to 1
      
         It's a lot of slow steps between updating root->log_transid and
         root->last_log_commit;
      
      9) The task doing the ordered extent completion, currently at
         btrfs_set_inode_last_trans(), then finally runs:
      
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         Which results in inode->last_log_commit being set to 1.
         The ordered extent completes;
      
      10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
          true because we have all the following conditions met:
      
          inode->logged_trans == N which matches fs_info->generation &&
          inode->last_subtrans (1) <= inode->last_log_commit (1) &&
          inode->last_subtrans (1) <= root->last_log_commit (1) &&
          list inode->extent_tree.modified_extents is empty
      
          And as a consequence we return without logging the inode, so the
          existing logged version of the inode does not point to the extent
          that was written after the previous fsync.
      
      It should be impossible in practice for one task be able to do so much
      progress in btrfs_sync_log() while another task is at
      btrfs_set_inode_last_trans() right after it reads root->log_transid and
      before it reads root->last_log_commit. Even if kernel preemption is enabled
      we know the task at btrfs_set_inode_last_trans() can not be preempted
      because it is holding the inode's spinlock.
      
      However there is another place where we do the same without holding the
      spinlock, which is in the memory mapped write path at:
      
        vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
        {
           (...)
           BTRFS_I(inode)->last_trans = fs_info->generation;
           BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
           BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
           (...)
      
      So with preemption happening after setting ->last_sub_trans and before
      setting ->last_log_commit, it is less of a stretch to have another task
      do enough progress at btrfs_sync_log() such that the task doing the memory
      mapped write ends up with ->last_sub_trans and ->last_log_commit set to
      the same value. It is still a big stretch to get there, as the task doing
      btrfs_sync_log() has to start writeback, wait for its completion and write
      the super blocks.
      
      So fix this in two different ways:
      
      1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
         value of ->last_sub_trans minus 1;
      
      2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
         like we do for buffered and direct writes at btrfs_file_write_iter(),
         which is all we need to make sure multiple writes and fsyncs to an
         inode in the same transaction never result in an fsync missing that
         the inode changed and needs to be logged. Turn this into a helper
         function and use it both at btrfs_page_mkwrite() and at
         btrfs_file_write_iter() - this also fixes the problem that at
         btrfs_page_mkwrite() we were setting those fields without the
         protection of the inode's spinlock.
      
      This is an extremely unlikely race to happen in practice.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bc0939fc
    • Filipe Manana's avatar
      btrfs: fix race between memory mapped writes and fsync · 885f46d8
      Filipe Manana authored
      When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
      any new delalloc that might have been created before taking the lock and
      then wait either for the ordered extents to complete or just for the
      writeback to complete (depending on whether the full sync flag is set or
      not). We then start logging the inode and assume that while we are doing it
      no one else is touching the inode's file extent items (or adding new ones).
      
      That is generally true because all operations that modify an inode acquire
      the inode's lock first, including buffered and direct IO writes. However
      there is one exception: memory mapped writes, which do not and can not
      acquire the inode's lock.
      
      This can cause two types of issues: ending up logging file extent items
      with overlapping ranges, which is detected by the tree checker and will
      result in aborting the transaction when starting writeback for a log
      tree's extent buffers, or a silent corruption where we log a version of
      the file that never existed.
      
      Scenario 1 - logging overlapping extents
      
      The following steps explain how we can end up with file extents items with
      overlapping ranges in a log tree due to a race between a fsync and memory
      mapped writes:
      
      1) Task A starts an fsync on inode X, which has the full sync runtime flag
         set. First it starts by flushing all delalloc for the inode;
      
      2) Task A then locks the inode and flushes any other delalloc that might
         have been created after the previous flush and waits for all ordered
         extents to complete;
      
      3) In the inode's root we have the following leaf:
      
         Leaf N, generation == current transaction id:
      
         ---------------------------------------------------------
         | (...)  [ file extent item, offset 640K, length 128K ] |
         ---------------------------------------------------------
      
         The last file extent item in leaf N covers the file range from 640K to
         768K;
      
      4) Task B does a memory mapped write for the page corresponding to the
         file range from 764K to 768K;
      
      5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
         btrfs_search_forward() to search for leafs modified in the current
         transaction that contain items for the inode. It finds leaf N and copies
         all the inode items from that leaf into the log tree.
      
         Now the log tree has a copy of the last file extent item from leaf N.
      
         At the end of the while loop at copy_inode_items_to_log(), we have the
         minimum key set to:
      
         min_key.objectid = <inode X number>
         min_key.type = BTRFS_EXTENT_DATA_KEY
         min_key.offset = 640K
      
         Then we increment the key's offset by 1 so that the next call to
         btrfs_search_forward() leaves us at the first key greater than the key
         we just processed.
      
         But before btrfs_search_forward() is called again...
      
      6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
         It can be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) When the respective ordered extent completes, it trims the length of
         the existing file extent item for file offset 640K from 128K to 124K,
         and a new file extent item is added with a key offset of 764K and a
         length of 4K;
      
      8) Task A calls btrfs_search_forward(), which returns us a path pointing
         to the leaf (can be leaf N or some other) containing the new file extent
         item for file offset 764K.
      
         We end up copying this item to the log tree, which overlaps with the
         last copied file extent item, which covers the file range from 640K to
         768K.
      
         When writeback is triggered for log tree's extent buffers, the issue
         will be detected by the tree checker which will dump a trace and an
         error message on dmesg/syslog. If the writeback is triggered when
         syncing the log, which typically is, then we also end up aborting the
         current transaction.
      
      This is the same type of problem fixed in 0c713cba ("Btrfs: fix race
      between ranged fsync and writeback of adjacent ranges").
      
      Scenario 2 - logging a version of the file that never existed
      
      This scenario only happens when using the NO_HOLES feature and results in
      a silent corruption, in the sense that is not detectable by 'btrfs check'
      or the tree checker:
      
      1) We have an inode I with a size of 1M and two file extent items, one
         covering an extent with disk_bytenr == X for the file range [0, 512K)
         and another one covering another extent with disk_bytenr == Y for the
         file range [512K, 1M);
      
      2) A hole is punched for the file range [512K, 1M);
      
      3) Task A starts an fsync of inode I, which has the full sync runtime flag
         set. It starts by flushing all existing delalloc, locks the inode (VFS
         lock), starts any new delalloc that might have been created before
         taking the lock and waits for all ordered extents to complete;
      
      4) Some other task does a memory mapped write for the page corresponding to
         the file range [640K, 644K) for example;
      
      5) Task A then logs all items of the inode with the call to
         copy_inode_items_to_log();
      
      6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
         be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) The ordered extent for the range [640K, 644K) completes and a file
         extent item for that range is added to the subvolume tree, pointing
         to a 4K extent with a disk_bytenr == Z;
      
      8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
         the subvolume tree. It finds two implicit holes:
      
         - one for the file range [512K, 640K)
         - one for the file range [644K, 1M)
      
         As a result we end up neither logging a hole for the range [640K, 644K)
         nor logging the file extent item with a disk_bytenr == Z.
         This means that if we have a power failure and replay the log tree we
         end up getting the following file extent layout:
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Y ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
         Which does not corresponding to any layout the file ever had before
         the power failure. The only two valid layouts would be:
      
         [ disk_bytenr X ]    [   hole   ]
         0             512K  512K        1M
      
         and
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Z ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
      This can be fixed by serializing memory mapped writes with fsync, and there
      are two ways to do it:
      
      1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
         in the inode's io tree. This prevents the race but also blocks any reads
         during the duration of the fsync, which has a negative impact for many
         common workloads;
      
      2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
         semaphore was recently added by Josef's patch set:
      
         btrfs: add a i_mmap_lock to our inode
         btrfs: cleanup inode_lock/inode_unlock uses
         btrfs: exclude mmaps while doing remap
         btrfs: exclude mmap from happening during all fallocate operations
      
         and is used to solve races between memory mapped writes and
         clone/dedupe/fallocate. This also makes us have the same behaviour we
         have regarding other writes (buffered and direct IO) and fsync - block
         them while the inode logging is in progress.
      
      This change uses the second approach due to the performance impact of the
      first one.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      885f46d8
    • Josef Bacik's avatar
      btrfs: exclude mmap from happening during all fallocate operations · 8d9b4a16
      Josef Bacik authored
      There's a small window where a deadlock can happen between fallocate and
      mmap.  This is described in detail by Filipe:
      
      """
      When doing a fallocate operation we lock the inode, flush delalloc within
      the target range, wait for any ordered extents to complete and then lock
      the file range. Before we lock the range and after we flush delalloc,
      there is a time window where another task can come in and do a memory
      mapped write for a page within the fallocate range.
      
      This means that after fallocate locks the range, there can be a dirty page
      in the range. More often than not, this does not cause any problem.
      The exception is when we are low on available metadata space, because an
      fallocate operation needs to start a transaction while holding the file
      range locked, either through btrfs_prealloc_file_range() or through the
      call to btrfs_fallocate_update_isize(). If that's the case, we can end up
      in a deadlock. The following list of steps explains how that happens:
      
      1) A fallocate operation starts, locks the inode, flushes delalloc in the
         range and waits for ordered extents in the range to complete;
      
      2) Before the fallocate task locks the file range, another task does a
         memory mapped write for a page in the fallocate target range. This is
         possible since memory mapped writes do not (and can not) lock the
         inode;
      
      3) The fallocate task locks the file range. At this point there is one
         dirty page in the range (due to the memory mapped write);
      
      4) When the fallocate task attempts to start a transaction, it blocks when
         attempting to reserve metadata space, since we are low on available
         metadata space. Before blocking (wait on its reservation ticket), it
         starts the async reclaim task (if not running already);
      
      5) The async reclaim task is not able to release space through any other
         means, so it decides to flush delalloc for inodes with dirty pages.
         It finds that the inode used in the fallocate operation has a dirty
         page and therefore queues a job (fs_info->flush_workers workqueue) to
         flush delalloc for that inode and waits on that job to complete;
      
      6) The flush job blocks when attempting to lock the file range because
         it is currently locked by the fallocate task;
      
      7) The fallocate task keeps waiting for its metadata reservation, waiting
         for a wakeup on its reservation ticket. The async reclaim task is
         waiting on the flush job, which in turn is waiting for locking the file
         range that is currently locked by the fallocate task. So unless some
         other task is able to release enough metadata space, for example an
         ordered extent for some other inode completes, we end up in a deadlock
         between all these tasks.
      
      When this happens stack traces like the following show up in dmesg/syslog:
      
       INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
       Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        schedule+0x45/0xe0
        lock_extent_bits+0x1e6/0x2d0 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_invalidatepage+0x32c/0x390 [btrfs]
        ? __mod_memcg_state+0x8e/0x160
        __extent_writepage+0x2d4/0x400 [btrfs]
        extent_write_cache_pages+0x2b2/0x500 [btrfs]
        ? lock_release+0x20e/0x4c0
        ? trace_hardirqs_on+0x1b/0xf0
        extent_writepages+0x43/0x90 [btrfs]
        ? lock_acquire+0x1a3/0x490
        do_writepages+0x43/0xe0
        ? __filemap_fdatawrite_range+0xa4/0x100
        __filemap_fdatawrite_range+0xc5/0x100
        btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        btrfs_work_helper+0xf1/0x600 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x50/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
       INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
       Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? kvm_clock_read+0x14/0x30
        ? wait_for_completion+0x81/0x110
        schedule+0x45/0xe0
        schedule_timeout+0x30c/0x580
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? lock_acquire+0x1a3/0x490
        ? try_to_wake_up+0x7a/0xa20
        ? lock_release+0x20e/0x4c0
        ? lock_acquired+0x199/0x490
        ? wait_for_completion+0x81/0x110
        wait_for_completion+0xab/0x110
        start_delalloc_inodes+0x2af/0x390 [btrfs]
        btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        flush_space+0x24f/0x660 [btrfs]
        btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x20f/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
      (...)
      several tasks waiting for the inode lock held by the fallocate task below
      (...)
       RIP: 0033:0x7f61efe73fff
       Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
       RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
       RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
       RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
       RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
       R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
       R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
       task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        schedule+0x45/0xe0
        __reserve_bytes+0x4a4/0xb10 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        start_transaction+0x2d1/0x760 [btrfs]
        btrfs_replace_file_extents+0x120/0x930 [btrfs]
        ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
        btrfs_fallocate+0xdfb/0x1260 [btrfs]
        ? filename_lookup+0xf1/0x180
        vfs_fallocate+0x14f/0x440
        ioctl_preallocate+0x92/0xc0
        do_vfs_ioctl+0x66b/0x750
        ? __do_sys_newfstat+0x53/0x60
        __x64_sys_ioctl+0x62/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      """
      
      Fix this by disallowing mmaps from happening while we're doing any of
      the fallocate operations on this inode.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d9b4a16