    btrfs: stop copying old dir items when logging a directory · 732d591a
    Filipe Manana authored
    When logging a directory, we go over every leaf of the subvolume tree that
    was changed in the current transaction and copy all its dir index keys to
    the log tree.
    
    That includes copying dir index keys created in past transactions. This is
    done mostly for simplicity, as after logging the keys we log an item that
    specifies the start and end ranges of the keys we logged. That item is
    then used during log replay to figure out which keys need to be deleted -
    every key in that range that we find in the subvolume tree but that is
    not in the log tree needs to be deleted.
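
The replay rule described above can be modeled roughly as follows. This is a simplified Python sketch, not the kernel's replay code: index keys are reduced to plain integer offsets, and the function name `keys_to_delete` and the `(start, end)` tuple form of a range item are assumptions made for illustration only.

```python
def keys_to_delete(subvol_keys, logged_keys, logged_ranges):
    """Return the index offsets that replay should delete.

    subvol_keys   -- dir index offsets found in the subvolume tree
    logged_keys   -- dir index offsets found in the log tree
    logged_ranges -- inclusive (start, end) ranges taken from range items
    """
    logged = set(logged_keys)
    doomed = []
    for key in sorted(subvol_keys):
        # Only keys covered by a logged range are candidates for deletion.
        in_range = any(start <= key <= end for start, end in logged_ranges)
        if in_range and key not in logged:
            doomed.append(key)
    return doomed


if __name__ == "__main__":
    # Offsets 3 and 5 were unlinked before the fsync, so they are missing
    # from the log tree although the logged range covers them.
    print(keys_to_delete([2, 3, 4, 5, 6], [2, 4, 6], [(2, 6)]))
```

Keys outside every logged range are left alone, which is why the range item has to be logged together with the keys.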
    
    Now that we log only dir index keys, and not dir item keys anymore, when
    we remove dentries from a directory (due to unlink and rename operations),
    we can get entire leaves that we changed only for deleting old dir index
    keys, or that have few dir index keys that are new - this is due to the
    fact that the offset for new index keys comes from a monotonically
    increasing counter.
    
    We can avoid logging dir index keys from past transactions, and in order
    to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
    type) when we find gaps between consecutive index keys. This massively
    reduces the amount of logged metadata when we have deleted directory
    entries, even if it's a small percentage of the total number of entries.
    The reduction comes both from logging fewer items and from the fact
    that, instead of logging many dir index items (struct btrfs_dir_item),
    which have a size of 30 bytes plus a file name, we typically log just a
    few range items (struct btrfs_dir_log_item), which take only 8 bytes each.
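
As a rough illustration of the size difference, the item payload can be estimated as below. The two sizes come from the text above; the leaf contents (99 surviving entries with 11-byte names and a single gap) are a made-up example, the function names are hypothetical, and per-item header overhead in the leaf is ignored.

```python
DIR_ITEM_FIXED_SIZE = 30   # struct btrfs_dir_item without the name (from the text)
DIR_LOG_ITEM_SIZE = 8      # struct btrfs_dir_log_item (from the text)


def payload_bytes_old(name_lengths):
    """Payload when every dir index item in a changed leaf is copied."""
    return sum(DIR_ITEM_FIXED_SIZE + n for n in name_lengths)


def payload_bytes_new(new_name_lengths, num_gaps):
    """Payload when only new items plus one range item per gap are logged."""
    return payload_bytes_old(new_name_lengths) + num_gaps * DIR_LOG_ITEM_SIZE


if __name__ == "__main__":
    # Hypothetical leaf: 100 old entries with 11-byte names, one deleted.
    print(payload_bytes_old([11] * 99))  # 99 surviving old entries copied
    print(payload_bytes_new([], 1))      # one range item for the single gap
```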
    
    Even if no entries were deleted from a directory and only new entries
    were added, we typically still get a reduction in the amount of logged
    metadata, because it's very likely the first leaf that got the new
    dir index entries also has several old dir index entries.
    
    So change the logging logic to not log dir index keys created in past
    transactions and log a range item for every gap it finds between each
    pair of consecutive index keys, to ensure deletions are tracked and
    replayed on log replay.
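
The new logging rule can be sketched in the same simplified way. This is a Python model under stated assumptions, not the code in tree-log.c: `IndexKey`, `log_directory` and the inclusive `(start, end)` range tuples are hypothetical names and representations, and boundary handling before the first key and after the last key is left out.

```python
from typing import NamedTuple


class IndexKey(NamedTuple):
    offset: int   # dir index offset, taken from the increasing counter
    transid: int  # transaction in which the entry was created


def log_directory(index_keys, current_transid):
    """Return (logged_offsets, range_items) for one directory."""
    logged_offsets = []
    range_items = []  # inclusive (start, end) ranges holding no index key
    prev_offset = None
    for key in sorted(index_keys):
        # A hole between consecutive offsets means no entry exists there
        # anymore, so record it for replay to delete whatever it finds.
        if prev_offset is not None and key.offset > prev_offset + 1:
            range_items.append((prev_offset + 1, key.offset - 1))
        # Only entries created in the current transaction are copied.
        if key.transid == current_transid:
            logged_offsets.append(key.offset)
        prev_offset = key.offset
    return logged_offsets, range_items


if __name__ == "__main__":
    # Entries 2..6 were created in transaction 7. In transaction 8,
    # entries 3 and 5 were unlinked and entry 7 was created.
    keys = [IndexKey(2, 7), IndexKey(4, 7), IndexKey(6, 7), IndexKey(7, 8)]
    print(log_directory(keys, current_transid=8))
```

In this model the old entries at offsets 2, 4 and 6 are neither copied nor covered by any range item, so replay leaves them untouched, while the two single-offset ranges are enough to replay the deletions.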
    
    This patch is part of a patchset composed of the following patches:
    
     1/4 btrfs: don't log unnecessary boundary keys when logging directory
     2/4 btrfs: put initial index value of a directory in a constant
     3/4 btrfs: stop copying old dir items when logging a directory
     4/4 btrfs: stop trying to log subdirectories created in past transactions
    
    The following test was run on a branch without this patchset and on a
    branch with the first three patches applied:
    
      $ cat test.sh
      #!/bin/bash
    
      DEV=/dev/nvme0n1
      MNT=/mnt/nvme0n1
    
      NUM_FILES=1000000
      NUM_FILE_DELETES=10000
    
      MKFS_OPTIONS="-O no-holes -R free-space-tree"
      MOUNT_OPTIONS="-o ssd"
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      mkdir $MNT/testdir
      for ((i = 1; i <= $NUM_FILES; i++)); do
          echo -n > $MNT/testdir/file_$i
      done
    
      sync
    
      del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
      for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
          rm -f $MNT/testdir/file_$i
      done
    
      start=$(date +%s%N)
      xfs_io -c "fsync" $MNT/testdir
      end=$(date +%s%N)
    
      dur=$(( (end - start) / 1000000 ))
      echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      echo
    
      umount $MNT
    
    The test was run on a non-debug kernel (Debian's default kernel config),
    and the results were the following for various values of NUM_FILES and
    NUM_FILE_DELETES:
    
    ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
    
    dir fsync took 585 ms after deleting 10000 files
    
    ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
    
    dir fsync took 34 ms after deleting 10000 files   (-94.2%)
    
    ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 50 ms after deleting 1000 files
    
    ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
    
    dir fsync took 7 ms after deleting 1000 files    (-86.0%)
    
    ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
    
    dir fsync took 9 ms after deleting 100 files
    
    ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
    
    dir fsync took 5 ms after deleting 100 files     (-44.4%)
    
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>