    btrfs: avoid double search for block group during NOCOW writes · 2306e83e
    Authored by Filipe Manana
    When doing a NOCOW write, either through direct IO or buffered IO, we do
    two lookups for the block group that contains the target extent: once
    when we call btrfs_inc_nocow_writers() and then later again when we call
    btrfs_dec_nocow_writers() after creating the ordered extent.
    
    The lookups require taking a lock and navigating the red black tree used
    to track all block groups, which can take a non-negligible amount of time
    for a large filesystem with thousands of block groups, and also causes
    lock contention and cache line bouncing.
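
    For reference, the call pattern in the NOCOW path currently looks roughly
    like the following (a simplified sketch, not the exact kernel code):

       /*
        * First lookup: btrfs_inc_nocow_writers() searches the block group
        * tree for the group containing @bytenr and increments its number
        * of NOCOW writers.
        */
       if (btrfs_inc_nocow_writers(fs_info, bytenr)) {
               /* ... create the ordered extent for the NOCOW write ... */

               /*
                * Second lookup: btrfs_dec_nocow_writers() searches the tree
                * by @bytenr again, just to decrement that counter.
                */
               btrfs_dec_nocow_writers(fs_info, bytenr);
       }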
    
    Improve on this by doing a single block group search: make
    btrfs_inc_nocow_writers() return the block group to its caller, and then
    have the caller pass that block group to btrfs_dec_nocow_writers().
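
    As an illustration, a minimal sketch of what the call pattern becomes
    with this change (again simplified, the in-tree signatures may differ in
    detail):

       struct btrfs_block_group *bg;

       /*
        * Single search: returns the block group containing @bytenr, with
        * its number of NOCOW writers already incremented, or NULL if that
        * was not possible.
        */
       bg = btrfs_inc_nocow_writers(fs_info, bytenr);
       if (bg) {
               /* ... create the ordered extent for the NOCOW write ... */

               /* Reuse the block group we already have, no second search. */
               btrfs_dec_nocow_writers(bg);
       }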
    
    This is part of a patchset comprised of the following patches:
    
      btrfs: remove search start argument from first_logical_byte()
      btrfs: use rbtree with leftmost node cached for tracking lowest block group
      btrfs: use a read/write lock for protecting the block groups tree
      btrfs: return block group directly at btrfs_next_block_group()
      btrfs: avoid double search for block group during NOCOW writes
    
    The following test was used to test these changes from a performance
    perspective:
    
       $ cat test.sh
       #!/bin/bash
    
       modprobe null_blk nr_devices=0
    
       NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
       mkdir $NULL_DEV_PATH
       if [ $? -ne 0 ]; then
           echo "Failed to create nullb0 directory."
           exit 1
       fi
       echo 2 > $NULL_DEV_PATH/submit_queues
       echo 16384 > $NULL_DEV_PATH/size # 16G
       echo 1 > $NULL_DEV_PATH/memory_backed
       echo 1 > $NULL_DEV_PATH/power
    
       DEV=/dev/nullb0
       MNT=/mnt/nullb0
       LOOP_MNT="$MNT/loop"
       MOUNT_OPTIONS="-o ssd -o nodatacow"
       MKFS_OPTIONS="-R free-space-tree -O no-holes"
    
       cat <<EOF > /tmp/fio-job.ini
       [io_uring_writes]
       rw=randwrite
       fsync=0
       fallocate=posix
       group_reporting=1
       direct=1
       ioengine=io_uring
       iodepth=64
       bs=64k
       filesize=1g
       runtime=300
       time_based
       directory=$LOOP_MNT
       numjobs=8
       thread
       EOF
    
       echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
       echo
       echo "Using config:"
       echo
       cat /tmp/fio-job.ini
       echo
    
       umount $MNT &> /dev/null
       mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
       mount $MOUNT_OPTIONS $DEV $MNT
    
       mkdir $LOOP_MNT
    
       truncate -s 4T $MNT/loopfile
       mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
       mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
    
       # Trigger the allocation of about 3500 data block groups, without
       # actually consuming space on the underlying filesystem, just to
       # make the tree of block groups large.
       fallocate -l 3500G $LOOP_MNT/filler
    
       fio /tmp/fio-job.ini
    
       umount $LOOP_MNT
       umount $MNT
    
       echo 0 > $NULL_DEV_PATH/power
       rmdir $NULL_DEV_PATH
    
    The test was run on a non-debug kernel (Debian's default kernel config),
    and the results were the following.
    
    Before patchset:
    
      WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
    
    After patchset:
    
      WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
    
      +3.3% write throughput and +3.3% IO done in the same time period.
    
    The test has a somewhat limited coverage scope, as with only NOCOW writes
    we get less contention on the red black tree of block groups, since we
    don't have the extra contention caused by COW writes, namely when
    allocating data extents and when pinning and unpinning data extents. On
    the other hand, the NOCOW path still accesses the tree when incrementing
    a block group's number of NOCOW writers.
    Reviewed-by: Nikolay Borisov <nborisov@suse.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>