• Filipe Manana's avatar
    btrfs: avoid inode logging during rename and link when possible · 0f8ce498
    Filipe Manana authored
    During a rename or link operation, we need to determine if an inode was
    previously logged or not, and if it was, do some update to the logged
    inode. We used to rely exclusively on the logged_trans field of struct
    btrfs_inode to determine that, but that was not reliable because the
    value of that field is not persisted in the inode item, so it's lost
    when an inode is evicted and loaded back again. That led to several
    issues in the past, such as not persisting deletions (such as the case
    fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry
    deletions due to inode evictions")), or resulting in losing a file
    after an inode eviction followed by a rename (commit ecc64fab
    ("btrfs: fix lost inode on log replay after mix of fsync, rename and
    inode eviction")), besides other issues.
    
    So the inode_logged() helper was introduced and used to determine if an
    inode was possibly logged before in the current transaction, with the
    caveat that it could return false positives, in the sense that even if an
    inode was not logged before in the current transaction, it could still
    return true, but never to return false in case the inode was logged.
    >From a functional point of view that is fine, but from a performance
    perspective it can introduce significant latencies to rename and link
    operations, as they will end up doing inode logging even when it is not
    necessary.
    
    Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
    installations and upgrades, with the zypper tool, were often taking a
    long time to complete. With strace it could be observed that zypper was
    spending about 99% of its time on rename operations, and then with
    further analysis we checked that directory logging was happening too
    frequently. Taking into account that installation/upgrade of some of the
    packages needed a few thousand file renames, the slowdown was very
    noticeable for the user.
    
    The issue was caused indirectly due to an excessive number of inode
    evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
    a 5.16-rc8 kernel. While triggering the inode evictions if something
    outside btrfs' control, btrfs could still behave better by eliminating
    the false positives from the inode_logged() helper.
    
    So change inode_logged() to actually eliminate such false positives caused
    by inode eviction and when an inode was never logged since the filesystem
    was mounted, as both cases relate to when the logged_trans field of struct
    btrfs_inode has a value of zero. When it can not determine if the inode
    was logged based only on the logged_trans value, lookup for the existence
    of the inode item in the log tree - if it's there then we known the inode
    was logged, if it's not there then it can not have been logged in the
    current transaction. Once we determine if the inode was logged, update
    the logged_trans value to avoid future calls to have to search in the log
    tree again.
    
    Alternatively, we could start storing logged_trans in the on disk inode
    item structure (struct btrfs_inode_item) in the unused space it still has,
    but that would be a bit odd because:
    
    1) We only care about logged_trans since the filesystem was mounted, we
       don't care about its value from a previous mount. Having it persisted
       in the inode item structure would not make the best use of the precious
       unused space;
    
    2) In order to get logged_trans persisted before inode eviction, we would
       have to update the delayed inode when we finish logging the inode and
       update its logged_trans in struct btrfs_inode, which makes it a bit
       cumbersome since we need to check if the delayed inode exists, if not
       create it and populate it and deal with any errors (-ENOMEM mostly).
    
    This change is part of a patchset comprised of the following patches:
    
      1/5 btrfs: add helper to delete a dir entry from a log tree
      2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
      3/5 btrfs: avoid logging all directory changes during renames
      4/5 btrfs: stop doing unnecessary log updates during a rename
      5/5 btrfs: avoid inode logging during rename and link when possible
    
    The following test script mimics part of what the zypper tool does during
    package installations/upgrades. It does not triggers inode evictions, but
    it's similar because it triggers false positives from the inode_logged()
    helper, because the inodes have a logged_trans of 0, there's a log tree
    due to a fsync of an unrelated file and the directory inode has its
    last_trans field set to the current transaction:
    
      $ cat test.sh
    
      #!/bin/bash
    
      DEV=/dev/nvme0n1
      MNT=/mnt/nvme0n1
    
      NUM_FILES=10000
    
      mkfs.btrfs -f $DEV
      mount $DEV $MNT
    
      mkdir $MNT/testdir
    
      for ((i = 1; i <= $NUM_FILES; i++)); do
          echo -n > $MNT/testdir/file_$i
      done
    
      sync
    
      # Now do some change to an unrelated file and fsync it.
      # This is just to create a log tree to make sure that inode_logged()
      # does not return false when called against "testdir".
      xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
    
      # Do some change to testdir. This is to make sure inode_logged()
      # will return true when called against "testdir", because its
      # logged_trans is 0, it was changed in the current transaction
      # and there's a log tree.
      echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
    
      echo "Renaming $NUM_FILES files..."
      start=$(date +%s%N)
      for ((i = 1; i <= $NUM_FILES; i++)); do
          mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
      done
      end=$(date +%s%N)
    
      dur=$(( (end - start) / 1000000 ))
      echo "Renames took $dur milliseconds"
    
      umount $MNT
    
    Testing this change on a box using a non-debug kernel (Debian's default
    kernel config) gave the following results:
    
    NUM_FILES=10000, before patchset:                   27837 ms
    NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
    NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)
    
    NUM_FILES=5000, before patchset:                     9127 ms
    NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
    NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)
    
    NUM_FILES=2000, before patchset:                     2528 ms
    NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
    NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)
    
    NUM_FILES=1000, before patchset:                     1085 ms
    NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
    NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)
    
    Running dbench on the same physical machine with the following script:
    
      $ cat run-dbench.sh
      #!/bin/bash
    
      NUM_JOBS=$(nproc --all)
    
      DEV=/dev/nvme0n1
      MNT=/mnt/nvme0n1
      MOUNT_OPTIONS="-o ssd"
      MKFS_OPTIONS="-O no-holes -R free-space-tree"
    
      echo "performance" | \
          tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
      mkfs.btrfs -f $MKFS_OPTIONS $DEV
      mount $MOUNT_OPTIONS $DEV $MNT
    
      dbench -D $MNT -t 120 $NUM_JOBS
    
      umount $MNT
    
    Before patchset:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    3761352     0.032   143.843
     Close        2762770     0.002     2.273
     Rename        159304     0.291    67.037
     Unlink        759784     0.207   143.998
     Deltree           72     4.028    15.977
     Mkdir             36     0.003     0.006
     Qpathinfo    3409780     0.013     9.678
     Qfileinfo     596772     0.001     0.878
     Qfsinfo       625189     0.003     1.245
     Sfileinfo     306443     0.006     1.840
     Find         13181061     0.063    19.798
     WriteX       1871137     0.021     8.532
     ReadX        5897325     0.003     3.567
     LockX          12252     0.003     0.258
     UnlockX        12252     0.002     0.100
     Flush         263666     3.327   155.632
    
    Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms
    
    After whole patchset applied:
    
     Operation      Count    AvgLat    MaxLat
     ----------------------------------------
     NTCreateX    4195584     0.033   107.742
     Close        3081932     0.002     1.935
     Rename        177641     0.218    14.905
     Unlink        847333     0.166   107.822
     Deltree          118     5.315    15.247
     Mkdir             59     0.004     0.048
     Qpathinfo    3802612     0.014    10.302
     Qfileinfo     666748     0.001     1.034
     Qfsinfo       697329     0.003     0.944
     Sfileinfo     341712     0.006     2.099
     Find         1470365     0.065     9.359
     WriteX       2093921     0.021     8.087
     ReadX        6576234     0.003     3.407
     LockX          13660     0.003     0.308
     UnlockX        13660     0.002     0.114
     Flush         294090     2.906   115.539
    
    Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms
    
    +11.5% throughput    -25.8% max latency   rename max latency -77.8%
    
    Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    0f8ce498
tree-log.h 2.98 KB