Commit 0f8ce498 authored by Filipe Manana's avatar Filipe Manana Committed by David Sterba

btrfs: avoid inode logging during rename and link when possible

During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit 803f0f64 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit ecc64fab
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.

So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.

Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.

The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.

So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.

Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:

1) We only care about logged_trans since the filesystem was mounted, we
   don't care about its value from a previous mount. Having it persisted
   in the inode item structure would not make the best use of the precious
   unused space;

2) In order to get logged_trans persisted before inode eviction, we would
   have to update the delayed inode when we finish logging the inode and
   update its logged_trans in struct btrfs_inode, which makes it a bit
   cumbersome since we need to check if the delayed inode exists, if not
   create it and populate it and deal with any errors (-ENOMEM mostly).

This change is part of a patchset comprised of the following patches:

  1/5 btrfs: add helper to delete a dir entry from a log tree
  2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
  3/5 btrfs: avoid logging all directory changes during renames
  4/5 btrfs: stop doing unnecessary log updates during a rename
  5/5 btrfs: avoid inode logging during rename and link when possible

The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:

  $ cat test.sh

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1

  NUM_FILES=10000

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  mkdir $MNT/testdir

  for ((i = 1; i <= $NUM_FILES; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  sync

  # Now do some change to an unrelated file and fsync it.
  # This is just to create a log tree to make sure that inode_logged()
  # does not return false when called against "testdir".
  xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo

  # Do some change to testdir. This is to make sure inode_logged()
  # will return true when called against "testdir", because its
  # logged_trans is 0, it was changed in the current transaction
  # and there's a log tree.
  echo -n > $MNT/testdir/file_$((NUM_FILES + 1))

  echo "Renaming $NUM_FILES files..."
  start=$(date +%s%N)
  for ((i = 1; i <= $NUM_FILES; i++)); do
      mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
  done
  end=$(date +%s%N)

  dur=$(( (end - start) / 1000000 ))
  echo "Renames took $dur milliseconds"

  umount $MNT

Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:

NUM_FILES=10000, before patchset:                   27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)

NUM_FILES=5000, before patchset:                     9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)

NUM_FILES=2000, before patchset:                     2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)

NUM_FILES=1000, before patchset:                     1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)

Running dbench on the same physical machine with the following script:

  $ cat run-dbench.sh
  #!/bin/bash

  NUM_JOBS=$(nproc --all)

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-O no-holes -R free-space-tree"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 $NUM_JOBS

  umount $MNT

Before patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    3761352     0.032   143.843
 Close        2762770     0.002     2.273
 Rename        159304     0.291    67.037
 Unlink        759784     0.207   143.998
 Deltree           72     4.028    15.977
 Mkdir             36     0.003     0.006
 Qpathinfo    3409780     0.013     9.678
 Qfileinfo     596772     0.001     0.878
 Qfsinfo       625189     0.003     1.245
 Sfileinfo     306443     0.006     1.840
 Find         13181061     0.063    19.798
 WriteX       1871137     0.021     8.532
 ReadX        5897325     0.003     3.567
 LockX          12252     0.003     0.258
 UnlockX        12252     0.002     0.100
 Flush         263666     3.327   155.632

Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms

After whole patchset applied:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4195584     0.033   107.742
 Close        3081932     0.002     1.935
 Rename        177641     0.218    14.905
 Unlink        847333     0.166   107.822
 Deltree          118     5.315    15.247
 Mkdir             59     0.004     0.048
 Qpathinfo    3802612     0.014    10.302
 Qfileinfo     666748     0.001     1.034
 Qfsinfo       697329     0.003     0.944
 Sfileinfo     341712     0.006     2.099
 Find         1470365     0.065     9.359
 WriteX       2093921     0.021     8.087
 ReadX        6576234     0.003     3.407
 LockX          13660     0.003     0.308
 UnlockX        13660     0.002     0.114
 Flush         294090     2.906   115.539

Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms

+11.5% throughput    -25.8% max latency   rename max latency -77.8%

Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
parent 259c4b96
......@@ -3477,35 +3477,121 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
}
/*
* Check if an inode was logged in the current transaction. This may often
* return some false positives, because logged_trans is an in memory only field,
* not persisted anywhere. This is meant to be used in contexts where a false
* positive has no functional consequences.
* Check if an inode was logged in the current transaction. This correctly deals
* with the case where the inode was logged but has a logged_trans of 0, which
* happens if the inode is evicted and loaded again, as logged_trans is an in
* memory only field (not persisted).
*
* Returns 1 if the inode was logged before in the transaction, 0 if it was not,
* and < 0 on error.
*/
static bool inode_logged(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
static int inode_logged(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode,
struct btrfs_path *path_in)
{
struct btrfs_path *path = path_in;
struct btrfs_key key;
int ret;
if (inode->logged_trans == trans->transid)
return true;
return 1;
if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state))
return false;
/*
* If logged_trans is not 0, then we know the inode logged was not logged
* in this transaction, so we can return false right away.
*/
if (inode->logged_trans > 0)
return 0;
/*
* If no log tree was created for this root in this transaction, then
* the inode can not have been logged in this transaction. In that case
* set logged_trans to anything greater than 0 and less than the current
* transaction's ID, to avoid the search below in a future call in case
* a log tree gets created after this.
*/
if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state)) {
inode->logged_trans = trans->transid - 1;
return 0;
}
/*
* We have a log tree and the inode's logged_trans is 0. We can't tell
* for sure if the inode was logged before in this transaction by looking
* only at logged_trans. We could be pessimistic and assume it was, but
* that can lead to unnecessarily logging an inode during rename and link
* operations, and then further updating the log in followup rename and
* link operations, specially if it's a directory, which adds latency
* visible to applications doing a series of rename or link operations.
*
* A logged_trans of 0 here can mean several things:
*
* 1) The inode was never logged since the filesystem was mounted, and may
* or may have not been evicted and loaded again;
*
* 2) The inode was logged in a previous transaction, then evicted and
* then loaded again;
*
* 3) The inode was logged in the current transaction, then evicted and
* then loaded again.
*
* For cases 1) and 2) we don't want to return true, but we need to detect
* case 3) and return true. So we do a search in the log root for the inode
* item.
*/
key.objectid = btrfs_ino(inode);
key.type = BTRFS_INODE_ITEM_KEY;
key.offset = 0;
if (!path) {
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
}
ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0);
if (path_in)
btrfs_release_path(path);
else
btrfs_free_path(path);
/*
* The inode's logged_trans is always 0 when we load it (because it is
* not persisted in the inode item or elsewhere). So if it is 0, the
* inode was last modified in the current transaction then the inode may
* have been logged before in the current transaction, then evicted and
* loaded again in the current transaction - or may have never been logged
* in the current transaction, but since we can not be sure, we have to
* assume it was, otherwise our callers can leave an inconsistent log.
* Logging an inode always results in logging its inode item. So if we
* did not find the item we know the inode was not logged for sure.
*/
if (inode->logged_trans == 0 &&
inode->last_trans == trans->transid &&
!test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
return true;
if (ret < 0) {
return ret;
} else if (ret > 0) {
/*
* Set logged_trans to a value greater than 0 and less then the
* current transaction to avoid doing the search in future calls.
*/
inode->logged_trans = trans->transid - 1;
return 0;
}
/*
* The inode was previously logged and then evicted, set logged_trans to
* the current transacion's ID, to avoid future tree searches as long as
* the inode is not evicted again.
*/
inode->logged_trans = trans->transid;
/*
* If it's a directory, then we must set last_dir_index_offset to the
* maximum possible value, so that the next attempt to log the inode does
* not skip checking if dir index keys found in modified subvolume tree
* leaves have been logged before, otherwise it would result in attempts
* to insert duplicate dir index keys in the log tree. This must be done
* because last_dir_index_offset is an in-memory only field, not persisted
* in the inode item or any other on-disk structure, so its value is lost
* once the inode is evicted.
*/
if (S_ISDIR(inode->vfs_inode.i_mode))
inode->last_dir_index_offset = (u64)-1;
return false;
return 1;
}
/*
......@@ -3572,8 +3658,13 @@ void btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
struct btrfs_path *path;
int ret;
if (!inode_logged(trans, dir))
ret = inode_logged(trans, dir, NULL);
if (ret == 0)
return;
else if (ret < 0) {
btrfs_set_log_full_commit(trans);
return;
}
ret = join_running_log_trans(root);
if (ret)
......@@ -3607,8 +3698,13 @@ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
u64 index;
int ret;
if (!inode_logged(trans, inode))
ret = inode_logged(trans, inode, NULL);
if (ret == 0)
return;
else if (ret < 0) {
btrfs_set_log_full_commit(trans);
return;
}
ret = join_running_log_trans(root);
if (ret)
......@@ -3740,7 +3836,6 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans,
struct extent_buffer *src = path->nodes[0];
const int nritems = btrfs_header_nritems(src);
const u64 ino = btrfs_ino(inode);
const bool inode_logged_before = inode_logged(trans, inode);
bool last_found = false;
int batch_start = 0;
int batch_size = 0;
......@@ -3816,14 +3911,16 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans,
ctx->log_new_dentries = true;
}
if (!inode_logged_before)
if (!ctx->logged_before)
goto add_to_batch;
/*
* If we were logged before and have logged dir items, we can skip
* checking if any item with a key offset larger than the last one
* we logged is in the log tree, saving time and avoiding adding
* contention on the log tree.
* contention on the log tree. We can only rely on the value of
* last_dir_index_offset when we know for sure that the inode was
* previously logged in the current transaction.
*/
if (key.offset > inode->last_dir_index_offset)
goto add_to_batch;
......@@ -4066,21 +4163,6 @@ static noinline int log_directory_changes(struct btrfs_trans_handle *trans,
u64 max_key;
int ret;
/*
* If this is the first time we are being logged in the current
* transaction, or we were logged before but the inode was evicted and
* reloaded later, in which case its logged_trans is 0, reset the value
* of the last logged key offset. Note that we don't use the helper
* function inode_logged() here - that is because the function returns
* true after an inode eviction, assuming the worst case as it can not
* know for sure if the inode was logged before. So we can not skip key
* searches in the case the inode was evicted, because it may not have
* been logged in this transaction and may have been logged in a past
* transaction, so we need to reset the last dir index offset to (u64)-1.
*/
if (inode->logged_trans != trans->transid)
inode->last_dir_index_offset = (u64)-1;
min_key = BTRFS_DIR_START_INDEX;
max_key = 0;
ctx->last_dir_item_offset = inode->last_dir_index_offset;
......@@ -4117,9 +4199,6 @@ static int drop_inode_items(struct btrfs_trans_handle *trans,
struct btrfs_key found_key;
int start_slot;
if (!inode_logged(trans, inode))
return 0;
key.objectid = btrfs_ino(inode);
key.type = max_key_type;
key.offset = (u64)-1;
......@@ -4617,7 +4696,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
* are small, with a root at level 2 or 3 at most, due to their short
* life span.
*/
if (inode_logged(trans, inode)) {
if (ctx->logged_before) {
drop_args.path = path;
drop_args.start = em->start;
drop_args.end = em->start + em->len;
......@@ -5633,6 +5712,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
bool xattrs_logged = false;
bool recursive_logging = false;
bool inode_item_dropped = true;
const bool orig_logged_before = ctx->logged_before;
path = btrfs_alloc_path();
if (!path)
......@@ -5682,6 +5762,18 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
mutex_lock(&inode->log_mutex);
}
/*
* Before logging the inode item, cache the value returned by
* inode_logged(), because after that we have the need to figure out if
* the inode was previously logged in this transaction.
*/
ret = inode_logged(trans, inode, path);
if (ret < 0) {
err = ret;
goto out_unlock;
}
ctx->logged_before = (ret == 1);
/*
* This is for cases where logging a directory could result in losing a
* a file after replaying the log. For example, if we move a file from a
......@@ -5707,9 +5799,11 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
if (inode_only == LOG_INODE_EXISTS)
max_key_type = BTRFS_XATTR_ITEM_KEY;
ret = drop_inode_items(trans, log, path, inode, max_key_type);
if (ctx->logged_before)
ret = drop_inode_items(trans, log, path, inode,
max_key_type);
} else {
if (inode_only == LOG_INODE_EXISTS && inode_logged(trans, inode)) {
if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) {
/*
* Make sure the new inode item we write to the log has
* the same isize as the current one (if it exists).
......@@ -5731,14 +5825,15 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
&inode->runtime_flags)) {
if (inode_only == LOG_INODE_EXISTS) {
max_key.type = BTRFS_XATTR_ITEM_KEY;
ret = drop_inode_items(trans, log, path, inode,
max_key.type);
if (ctx->logged_before)
ret = drop_inode_items(trans, log, path,
inode, max_key.type);
} else {
clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
&inode->runtime_flags);
clear_bit(BTRFS_INODE_COPY_EVERYTHING,
&inode->runtime_flags);
if (inode_logged(trans, inode))
if (ctx->logged_before)
ret = truncate_inode_items(trans, log,
inode, 0, 0);
}
......@@ -5748,8 +5843,9 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
if (inode_only == LOG_INODE_ALL)
fast_search = true;
max_key.type = BTRFS_XATTR_ITEM_KEY;
ret = drop_inode_items(trans, log, path, inode,
max_key.type);
if (ctx->logged_before)
ret = drop_inode_items(trans, log, path, inode,
max_key.type);
} else {
if (inode_only == LOG_INODE_ALL)
fast_search = true;
......@@ -5869,6 +5965,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
out:
btrfs_free_path(path);
btrfs_free_path(dst_path);
if (recursive_logging)
ctx->logged_before = orig_logged_before;
return err;
}
......@@ -6813,7 +6913,7 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
struct btrfs_root *root = inode->root;
struct btrfs_log_ctx ctx;
bool log_pinned = false;
int ret = 0;
int ret;
/*
* this will force the logging code to walk the dentry chain
......@@ -6826,9 +6926,24 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
* if this inode hasn't been logged and directory we're renaming it
* from hasn't been logged, we don't need to log it
*/
if (!inode_logged(trans, inode) &&
(!old_dir || !inode_logged(trans, old_dir)))
return;
ret = inode_logged(trans, inode, NULL);
if (ret < 0) {
goto out;
} else if (ret == 0) {
if (!old_dir)
return;
/*
* If the inode was not logged and we are doing a rename (old_dir is not
* NULL), check if old_dir was logged - if it was not we can return and
* do nothing.
*/
ret = inode_logged(trans, old_dir, NULL);
if (ret < 0)
goto out;
else if (ret == 0)
return;
}
ret = 0;
/*
* If we are doing a rename (old_dir is not NULL) from a directory that
......@@ -6900,15 +7015,15 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
*/
btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx);
out:
if (log_pinned) {
/*
* If an error happened mark the log for a full commit because
* it's not consistent and up to date. Do it before unpinning the
* log, to avoid any races with someone else trying to commit it.
*/
if (ret < 0)
btrfs_set_log_full_commit(trans);
/*
* If an error happened mark the log for a full commit because it's not
* consistent and up to date or we couldn't find out if one of the
* inodes was logged before in this transaction. Do it before unpinning
* the log, to avoid any races with someone else trying to commit it.
*/
if (ret < 0)
btrfs_set_log_full_commit(trans);
if (log_pinned)
btrfs_end_log_trans(root);
}
}
......@@ -17,6 +17,8 @@ struct btrfs_log_ctx {
int log_transid;
bool log_new_dentries;
bool logging_new_name;
/* Indicate if the inode being logged was logged before. */
bool logged_before;
/* Tracks the last logged dir item/index key offset. */
u64 last_dir_item_offset;
struct inode *inode;
......@@ -32,6 +34,7 @@ static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx,
ctx->log_transid = 0;
ctx->log_new_dentries = false;
ctx->logging_new_name = false;
ctx->logged_before = false;
ctx->inode = inode;
INIT_LIST_HEAD(&ctx->list);
INIT_LIST_HEAD(&ctx->ordered_extents);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment