1. 14 Mar, 2022 40 commits
    • Filipe Manana's avatar
      btrfs: stop copying old file extents when doing a full fsync · 7f30c072
      Filipe Manana authored
      When logging an inode in full sync mode, we go over every leaf that was
      modified in the current transaction and has items associated to our inode,
      and then copy all those items into the log tree. This includes copying
      file extent items that were created and added to the inode in past
      transactions, which is useless and only makes use more leaf space in the
      log tree.
      
      It's common to have a file with many file extent items spanning many
      leaves where only a few file extent items are new and need to be logged,
      and in such case we log all the file extent items we find in the modified
      leaves.
      
      So change the full sync behaviour to skip over file extent items that are
      not needed. Those are the ones that match the following criteria:
      
      1) Have a generation older than the current transaction and the inode
         was not a target of a reflink operation, as that can copy file extent
         items from a past generation from some other inode into our inode, so
         we have to log them;
      
      2) Start at an offset within i_size - we must log anything at or beyond
         i_size, otherwise we would lose prealloc extents after log replay.
      
      The following script exercises a scenario where this happens, and it's
      somehow close enough to what happened often on a SQL Server workload which
      I had to debug sometime ago to fix an issue where a pattern of writes to
      prealloc extents and fsync resulted in fsync failing with -EIO (that was
      commit ea7036de ("btrfs: fix fsync failure and transaction abort
      after writes to prealloc extents")). In that particular case, we had large
      files that had random writes and were often truncated, which made the
      next fsync be a full sync.
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
        # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
        # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create a file with many extents. Use direct IO to make it faster
        # to create the file - using buffered IO we would have to fsync
        # after each write (terribly slow).
        echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
        xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
      
        # Commit the transaction, so every extent after this is from an
        # old generation.
        sync
      
        # Now rewrite only a few extents, which are all far spread apart from
        # each other (e.g. 1G / 32M = 32 extents).
        # After this only a few extents have a new generation, while all other
        # ones have an old generation.
        echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
        for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
            xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
        done
      
        # Fsync, the inode logged in full sync mode since it was never fsynced
        # before.
        echo "Fsyncing file..."
        xfs_io -c "fsync" $MNT/foobar
      
        umount $MNT
      
      And the following bpftrace program was running when executing the test
      script:
      
        $ cat bpf-script.sh
        #!/usr/bin/bpftrace
      
        k:btrfs_log_inode
        {
            @start_log_inode[tid] = nsecs;
        }
      
        kr:btrfs_log_inode
        /@start_log_inode[tid]/
        {
            @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
            delete(@start_log_inode[tid]);
        }
      
        k:btrfs_sync_log
        {
            @start_sync_log[tid] = nsecs;
        }
      
        kr:btrfs_sync_log
        /@start_sync_log[tid]/
        {
            $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
            printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
            printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
            delete(@start_sync_log[tid]);
            delete(@log_inode_dur[tid]);
            exit();
        }
      
      With 512M test file, before this patch:
      
        btrfs_log_inode() took 15218 us
        btrfs_sync_log()  took 1328 us
      
        Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
      
      With 512M test file, after this patch:
      
        btrfs_log_inode() took 14760 us
        btrfs_sync_log()  took 588 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 1G test file, before this patch:
      
        btrfs_log_inode() took 27301 us
        btrfs_sync_log()  took 1767 us
      
        Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
      
      With 1G test file, after this patch:
      
        btrfs_log_inode() took 26166 us
        btrfs_sync_log()  took 593 us
      
        Log tree has a single leaf, its total size is 16K
      
      With 2G test file, before this patch:
      
        btrfs_log_inode() took 50892 us
        btrfs_sync_log()  took 3127 us
      
        Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
      
      With 2G test file, after this patch:
      
        btrfs_log_inode() took 50126 us
        btrfs_sync_log()  took 586 us
      
        Log tree has a single leaf, its total size is 16K.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f30c072
    • Josef Bacik's avatar
      btrfs: do not clean up repair bio if submit fails · 8cbc3001
      Josef Bacik authored
      The submit helper will always run bio_endio() on the bio if it fails to
      submit, so cleaning up the bio just leads to a variety of use-after-free
      and NULL pointer dereference bugs because we race with the endio
      function that is cleaning up the bio.  Instead just return BLK_STS_OK as
      the repair function has to continue to process the rest of the pages,
      and the endio for the repair bio will do the appropriate cleanup for the
      page that it was given.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cbc3001
    • Josef Bacik's avatar
      btrfs: do not try to repair bio that has no mirror set · 510671d2
      Josef Bacik authored
      If we fail to submit a bio for whatever reason, we may not have setup a
      mirror_num for that bio.  This means we shouldn't try to do the repair
      workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in
      clean_io_failure.  Instead simply skip the repair workflow if we have no
      mirror set, and add an assert to btrfs_check_repairable() to make it
      easier to catch what is happening in the future.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      510671d2
    • Josef Bacik's avatar
      btrfs: do not double complete bio on errors during compressed reads · f9f15de8
      Josef Bacik authored
      I hit some weird panics while fixing up the error handling from
      btrfs_lookup_bio_sums().  Turns out the compression path will complete
      the bio we use if we set up any of the compression bios and then return
      an error, and then btrfs_submit_data_bio() will also call bio_endio() on
      the bio.
      
      Fix this by making btrfs_submit_compressed_read() responsible for
      calling bio_endio() on the bio if there are any errors.  Currently it
      was only doing it if we created the compression bios, otherwise it was
      depending on btrfs_submit_data_bio() to do the right thing.  This
      creates the above problem, so fix up btrfs_submit_compressed_read() to
      always call bio_endio() in case of an error, and then simply return from
      btrfs_submit_data_bio() if we had to call
      btrfs_submit_compressed_read().
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9f15de8
    • Josef Bacik's avatar
      btrfs: track compressed bio errors as blk_status_t · 606f82e7
      Josef Bacik authored
      Right now we just have a binary "errors" flag, so any error we get on
      the compressed bio's gets translated to EIO.  This isn't necessarily a
      bad thing, but if we get an ENOMEM it may be nice to know that's what
      happened instead of an EIO.  Track our errors as a blk_status_t, and do
      the appropriate setting of the errors accordingly.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      606f82e7
    • Josef Bacik's avatar
      btrfs: remove the bio argument from finish_compressed_bio_read · e14bfdb5
      Josef Bacik authored
      This bio is usually one of the compressed bio's, and we don't actually
      need it in this function, so remove the argument and stop passing it
      around.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e14bfdb5
    • Josef Bacik's avatar
      btrfs: check correct bio in finish_compressed_bio_read · b0bbc8a3
      Josef Bacik authored
      Commit c09abff8 ("btrfs: cloned bios must not be iterated by
      bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
      calling bio_for_each_segment_all() on a RAID5/6 bio.  However it was
      checking the bio that the compression code passed in, not the
      cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
      check the correct bio.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0bbc8a3
    • Josef Bacik's avatar
      btrfs: handle csum lookup errors properly on reads · 1784b7d5
      Josef Bacik authored
      Currently any error we get while trying to lookup csums during reads
      shows up as a missing csum, and then on the read completion side we
      print an error saying there was a csum mismatch and we increase the
      device corruption count.
      
      However we could have gotten an EIO from the lookup.  We could also be
      inside of a memory constrained container and gotten a ENOMEM while
      trying to do the read.  In either case we don't want to make this look
      like a file system corruption problem, we want to make it look like the
      actual error it is.  Capture any negative value, convert it to the
      appropriate blk_status_t, free the csum array if we have one and bail.
      
      Note: a possible improvement would be to make the relocation code look
      up the owning inode and see if it's marked as NODATASUM and set
      EXTENT_NODATASUM there, that way if there's corruption and there isn't a
      checksum when we want it we can fail here rather than later.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1784b7d5
    • Josef Bacik's avatar
      btrfs: make search_csum_tree return 0 if we get -EFBIG · 03ddb19d
      Josef Bacik authored
      We can either fail to find a csum entry at all and return -ENOENT, or we
      can find a range that is close, but return -EFBIG.  In essence these
      both mean the same thing when we are doing a lookup for a csum in an
      existing range, we didn't find a csum.  We want to treat both of these
      errors the same way, complain loudly that there wasn't a csum.  This
      currently happens anyway because we do
      
      	count = search_csum_tree();
      	if (count <= 0) {
      		// reloc and error handling
      	}
      
      However it forces us to incorrectly treat EIO or ENOMEM errors as on
      disk corruption.  Fix this by returning 0 if we get either -ENOENT or
      -EFBIG from btrfs_lookup_csum() so we can do proper error handling.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03ddb19d
    • Omar Sandoval's avatar
      btrfs: add BTRFS_IOC_ENCODED_WRITE · 7c0c7269
      Omar Sandoval authored
      The implementation resembles direct I/O: we have to flush any ordered
      extents, invalidate the page cache, and do the io tree/delalloc/extent
      map/ordered extent dance. From there, we can reuse the compression code
      with a minor modification to distinguish the write from writeback. This
      also creates inline extents when possible.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c0c7269
    • Omar Sandoval's avatar
      btrfs: add BTRFS_IOC_ENCODED_READ ioctl · 1881fba8
      Omar Sandoval authored
      There are 4 main cases:
      
      1. Inline extents: we copy the data straight out of the extent buffer.
      2. Hole/preallocated extents: we fill in zeroes.
      3. Regular, uncompressed extents: we read the sectors we need directly
         from disk.
      4. Regular, compressed extents: we read the entire compressed extent
         from disk and indicate what subset of the decompressed extent is in
         the file.
      
      This initial implementation simplifies a few things that can be improved
      in the future:
      
      - Cases 1, 3, and 4 allocate temporary memory to read into before
        copying out to userspace.
      - We don't do read repair, because it turns out that read repair is
        currently broken for compressed data.
      - We hold the inode lock during the operation.
      
      Note that we don't need to hold the mmap lock. We may race with
      btrfs_page_mkwrite() and read the old data from before the page was
      dirtied:
      
      btrfs_page_mkwrite         btrfs_encoded_read
      ---------------------------------------------------
      (enter)                    (enter)
                                 btrfs_wait_ordered_range
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      (exit)
                                 lock_extent_bits
                                 read extent (dirty page hasn't been flushed,
                                              so this is the old data)
                                 unlock_extent_cached
                                 (exit)
      
      we read the old data from before the page was dirtied. But, that's true
      even if we were to hold the mmap lock:
      
      btrfs_page_mkwrite               btrfs_encoded_read
      -------------------------------------------------------------------
      (enter)                          (enter)
                                       btrfs_inode_lock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) (blocked)
                                       btrfs_wait_ordered_range
                                       lock_extent_bits
      				 read extent (page hasn't been dirtied,
                                                    so this is the old data)
                                       unlock_extent_cached
                                       btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
      down_read(i_mmap_lock) returns
      lock_extent_bits
      btrfs_page_set_dirty
      unlock_extent_cached
      
      In other words, this is inherently racy, so it's fine that we return the
      old data in this tiny window.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1881fba8
    • Omar Sandoval's avatar
      btrfs: add definitions and documentation for encoded I/O ioctls · dcb77a9a
      Omar Sandoval authored
      In order to allow sending and receiving compressed data without
      decompressing it, we need an interface to write pre-compressed data
      directly to the filesystem and the matching interface to read compressed
      data without decompressing it. This adds the definitions for ioctls to
      do that and detailed explanations of how to use them.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dcb77a9a
    • Omar Sandoval's avatar
      btrfs: optionally extend i_size in cow_file_range_inline() · d9496e8a
      Omar Sandoval authored
      Currently, an inline extent is always created after i_size is extended
      from btrfs_dirty_pages(). However, for encoded writes, we only want to
      update i_size after we successfully created the inline extent. Add an
      update_i_size parameter to cow_file_range_inline() and
      insert_inline_extent() and pass in the size of the extent rather than
      determining it from i_size.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9496e8a
    • Omar Sandoval's avatar
      btrfs: clean up cow_file_range_inline() · 8dd9872d
      Omar Sandoval authored
      The start parameter to cow_file_range_inline() (and
      insert_inline_extent()) is always 0, so get rid of it and simplify the
      logic in those two functions. Pass btrfs_inode to insert_inline_extent()
      and remove the redundant root parameter. Also document the requirements
      for creating an inline extent. No functional change.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8dd9872d
    • Omar Sandoval's avatar
      btrfs: support different disk extent size for delalloc · 28c9b1e7
      Omar Sandoval authored
      Currently, we always reserve the same extent size in the file and extent
      size on disk for delalloc because the former is the worst case for the
      latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
      the extent on disk, which may be less than or greater than (for
      bookends) the size in the file. Add a disk_num_bytes parameter to
      btrfs_delalloc_reserve_metadata() so that we can reserve the correct
      amount of csum bytes. No functional change.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28c9b1e7
    • Omar Sandoval's avatar
      btrfs: add ram_bytes and offset to btrfs_ordered_extent · cb36a9bb
      Omar Sandoval authored
      Currently, we only create ordered extents when ram_bytes == num_bytes
      and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
      extents which only refer to a subset of the full unencoded extent, so we
      need to plumb these fields through the ordered extent infrastructure and
      pass them down to insert_reserved_file_extent().
      
      Since we're changing the btrfs_add_ordered_extent* signature, let's get
      rid of the trivial wrappers and add a kernel-doc.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb36a9bb
    • Omar Sandoval's avatar
      btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() · e331f6b1
      Omar Sandoval authored
      btrfs_csum_one_bio() loops over each filesystem block in the bio while
      keeping a cursor of its current logical position in the file in order to
      look up the ordered extent to add the checksums to. However, this
      doesn't make much sense for compressed extents, as a sector on disk does
      not correspond to a sector of decompressed file data. It happens to work
      because:
      
      1) the compressed bio always covers one ordered extent
      2) the size of the bio is always less than the size of the ordered
         extent
      
      However, the second point will not always be true for encoded writes.
      
      Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
      it can assume that the bio only covers one ordered extent. Since we're
      already changing the signature, let's get rid of the contig parameter
      and make it implied by the offset parameter, similar to the change we
      recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
      nr_sectors to blockcount to make it clear that it's the number of
      filesystem blocks, not the number of 512-byte sectors.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e331f6b1
    • Omar Sandoval's avatar
      fs: export variant of generic_write_checks without iov_iter · f6f7a25a
      Omar Sandoval authored
      Encoded I/O in Btrfs needs to check a write with a given logical size
      without an iov_iter that matches that size (because the iov_iter we have
      is for the compressed data). So, factor out the parts of
      generic_write_check() that don't need an iov_iter into a new
      generic_write_checks_count() function and export that.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6f7a25a
    • Omar Sandoval's avatar
      fs: export rw_verify_area() · 87112933
      Omar Sandoval authored
      I'm adding btrfs ioctls to read and write compressed data, and rather
      than duplicating the checks in rw_verify_area(), let's just export it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      87112933
    • Sidong Yang's avatar
      btrfs: qgroup: remove outdated TODO comments · 457b0a3d
      Sidong Yang authored
      These comments are old, outdated and not very specific. It seems that it
      doesn't help to inspire anybody to work on that.  So we remove them.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarSidong Yang <realwakka@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      457b0a3d
    • Sidong Yang's avatar
      btrfs: qgroup: remove duplicated check in adding qgroup relations · a8f6f619
      Sidong Yang authored
      Removes duplicated check when adding qgroup relations.
      btrfs_add_qgroup_relations function adds relations by calling
      add_relation_rb(). add_relation_rb() checks that member/parentid exists
      in current qgroup_tree. But it already checked before calling the
      function. It seems that we don't need to double check.
      
      Add new function __add_relation_rb() that adds relations with
      qgroup structures and makes old function use the new one. And it makes
      btrfs_add_qgroup_relation() function work without double checks by
      calling the new function.
      Signed-off-by: default avatarSidong Yang <realwakka@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8f6f619
    • Dāvis Mosāns's avatar
      btrfs: add lzo workspace buffer length constants · dc4a4bdb
      Dāvis Mosāns authored
      It makes it more readable for length checking and is be used repeatedly.
      Signed-off-by: default avatarDāvis Mosāns <davispuh@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc4a4bdb
    • Qu Wenruo's avatar
      btrfs: populate extent_map::generation when reading from disk · 40e7efe0
      Qu Wenruo authored
      When btrfs_get_extent() tries to get some file extent from disk, it
      never populates extent_map::generation, leaving the value to be 0.
      
      On the other hand, for extent map generated by IO, it will get its
      generation properly set at finish_ordered_io()
      
       finish_ordered_io()
       |- unpin_extent_cache(gen = trans->transid)
          |- em->generation = gen;
      
      [CAUSE]
      Since extent_map::generation is mostly used by fsync code, and for fsync
      they only care about modified extents, which all have their
      em::generation > 0.
      
      Thus it's fine to not populate em read from disk for fsync.
      
      [CORNER CASE]
      However autodefrag also relies on em::generation to determine if one
      extent needs to be defragged.
      
      This unpopulated extent_map::generation can prevent the following
      autodefrag case from working:
      
      	mkfs.btrfs -f $dev
      	mount $dev $mnt -o autodefrag
      
      	# initial write to queue the inode for autodefrag
      	xfs_io -f -c "pwrite 0 4k" $mnt/file
      	sync
      
      	# Real fragmented write
      	xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
      	sync
      	echo "=== before autodefrag ==="
      	xfs_io -c "fiemap -v" $mnt/file
      
      	# Drop cache to force em to be read from disk
      	echo 3 > /proc/sys/vm/drop_caches
      	mount -o remount,commit=1 $mnt
      	sleep 3
      	sync
      
      	echo "=== After autodefrag ==="
      	xfs_io -c "fiemap -v" $mnt/file
      	umount $mnt
      
      The result looks like this:
      
        === before autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
        === After autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
      
      This fragmented 32K will not be defragged by autodefrag.
      
      [FIX]
      To make things less weird, just populate extent_map::generation when
      reading file extents from disk.
      
      This would make above fragmented extents to be properly defragged:
      
        == before autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..15]:         26672..26687        16   0x0
           1: [16..31]:        26656..26671        16   0x0
           2: [32..47]:        26640..26655        16   0x0
           3: [48..63]:        26624..26639        16   0x1
        === After autodefrag ===
        /mnt/btrfs/file:
         EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
           0: [0..63]:         26688..26751        64   0x1
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40e7efe0
    • Filipe Manana's avatar
      btrfs: assert we have a write lock when removing and replacing extent maps · 6d3b050e
      Filipe Manana authored
      Removing or replacing an extent map requires holding a write lock on the
      extent map's tree. We currently do that everywhere, except in one of the
      self tests, where it's harmless since there's no concurrency.
      
      In order to catch possible races in the future, assert that we are holding
      a write lock on the extent map tree before removing or replacing an extent
      map in the tree, and update the self test to obtain a write lock before
      removing extent maps.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d3b050e
    • Filipe Manana's avatar
      btrfs: remove no longer used counter when reading data page · ad3fc794
      Filipe Manana authored
      After commit 92082d40 ("btrfs: integrate page status update for
      data read path into begin/end_page_read"), the 'nr' counter at
      btrfs_do_readpage() is no longer used, we increment it but we never
      read from it. So just remove it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad3fc794
    • Filipe Manana's avatar
      btrfs: fix lost error return value when reading a data page · bbf0ea7e
      Filipe Manana authored
      At btrfs_do_readpage(), if we get an error when trying to lookup for an
      extent map, we end up marking the page with the error bit, clearing
      the uptodate bit on it, and doing everything else that should be done.
      However we return success (0) to the caller, when we should return the
      error encoded in the extent map pointer. So fix that by returning the
      error encoded in the pointer.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbf0ea7e
    • Filipe Manana's avatar
      btrfs: stop checking for NULL return from btrfs_get_extent() · c0347550
      Filipe Manana authored
      At extent_io.c, in the read page and write page code paths, we are testing
      if the return value from btrfs_get_extent() can be NULL. However that is
      not possible, as btrfs_get_extent() always returns either an error pointer
      or a (non-NULL) pointer to an extent map structure.
      
      Everywhere else outside extent_io.c we never check for NULL, we always
      treat any returned value as a non-NULL pointer if it does not encode an
      error.
      
      So check only for the IS_ERR() case at extent_io.c.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0347550
    • Filipe Manana's avatar
      btrfs: prepare extents to be logged before locking a log tree path · e1f53ed8
      Filipe Manana authored
      When we want to log an extent, in the fast fsync path, we obtain a path
      to the leaf that will hold the file extent item either through a deletion
      search, via btrfs_drop_extents(), or through an insertion search using
      btrfs_insert_empty_item(). After that we fill the file extent item's
      fields one by one directly on the leaf.
      
      Instead of doing that, we could prepare the file extent item before
      obtaining a btree path, and then copy the prepared extent item with a
      single operation once we get the path. This helps avoid some contention
      on the log tree, since we are holding write locks for longer than
      necessary, especially in the case where the path is obtained via
      btrfs_drop_extents() through a deletion search, which always keeps a
      write lock on the nodes at levels 1 and 2 (besides the leaf).
      
      This change does that, we prepare the file extent item that is going to
      be inserted before acquiring a path, and then copy it into a leaf using
      a single copy operation once we get a path.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The following test was run to measure the impact of the whole patchset:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=8
        FILE_SIZE=128M
        RUN_TIME=200
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=1
        fallocate=none
        group_reporting=1
        direct=0
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        ioengine=sync
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test ran inside a VM (8 cores, 32G of RAM) with the target disk
      mapping to a raw NVMe device, and using a non-debug kernel config
      (Debian's default config).
      
      Before the patchset:
      
      WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
      
      After the patchset:
      
      WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
      
      A 7.8% gain on throughput and +7.0% more IO done in the same period of
      time (200 seconds).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1f53ed8
    • Filipe Manana's avatar
      btrfs: remove useless path release in the fast fsync path · d8457531
      Filipe Manana authored
      There's no point in calling btrfs_release_path() after finishing the loop
      that logs the modified extents, since log_one_extent() returns with the
      path released. In case the list of extents is empty, the path is already
      released, so there's no need for that case as well.
      So just remove that unnecessary btrfs_release_path() call.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8457531
    • Filipe Manana's avatar
      btrfs: remove constraint on number of visited leaves when replacing extents · 7ecb4c31
      Filipe Manana authored
      At btrfs_drop_extents(), we try to replace a range of file extent items
      with a new file extent in a single btree search, to avoid the need to do
      a search for deletion, followed by a path release and followed by yet
      another search for insertion.
      
      When I originally added that optimization, in commit 1acae57b
      ("Btrfs: faster file extent item replace operations"), I left a constraint
      to do the fast replace only if we visited a single leaf. That was because
      in the most common case we find all file extent items that need to be
      deleted (or trimmed) in a single leaf, however it can work for other
      common cases like when we need to delete a few file extent items located
      at the end of a leaf and a few more located at the beginning of the next
      leaf. The key for the new file extent item is greater than the key of
      any deleted or trimmed file extent item from previous leaves, so we are
      fine to use the last leaf that we found as long as we are holding a
      write lock on it - even if the new key ends up at slot 0, as if that's
      the case, the btree search has obtained a write lock on any upper nodes
      that need to have a key pointer updated.
      
      So removed the constraint that limits the optimization to the case where
      we visited only a single leaf.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ecb4c31
    • Filipe Manana's avatar
      btrfs: avoid unnecessary computation when deleting items from a leaf · 0cae23b6
      Filipe Manana authored
      When deleting items from a leaf, we always compute the sum of the data
      sizes of the items that are going to be deleted. However we only use
      that sum when the last item to delete is behind the last item in the
      leaf. This unnecessarily wastes CPU time when we are deleting either
      the whole leaf or from some slot > 0 up to the last item in the leaf,
      and both of these cases are common (e.g. truncation operation, either
      as a result of truncate(2) or when logging inodes, deleting checksums
      after removing a large enough extent, etc).
      
      So compute only the sum of the data sizes if the last item to be
      deleted does not match the last item in the leaf.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cae23b6
    • Filipe Manana's avatar
      btrfs: avoid unnecessary COW of leaves when deleting items from a leaf · 7c4063d1
      Filipe Manana authored
      When we delete items from a leaf, if we end up with more than two thirds
      of unused leaf space, we try to delete the leaf by moving all its items
      into its left and right neighbour leaves. Sometimes that is not possible
      because there is not enough free space in the left and right leaves, and
      in that case we end up not deleting our leaf.
      
      The way we are doing this is not ideal and can be improved in the
      following ways:
      
      1) When we call push_leaf_left(), we pass a value of 1 byte to the data
         size parameter of push_leaf_left(). This is not realistic value because
         no item can have a size less than 25 bytes, which is the size of struct
         btrfs_item. This means that means that if the left leaf has not enough
         free space to push any item, we end up COWing it even if we end up not
         changing its content at all.
      
         COWing that leaf means allocating a new metadata extent, marking it
         dirty and doing more IO when committing a transaction or when syncing a
         log tree. For a log tree case, it's particularly more important to
         avoid the useless COW operation, as more IO can imply a higher latency
         for an fsync operation.
      
         So instead of passing 1 as the minimum data size for push_leaf_left(),
         pass the size of the first item in our leaf, as we don't want to COW
         the left leaf if we can't at least push the first item of our leaf;
      
      2) When we call push_leaf_right(), we also pass a value of 1 byte as the
         data size parameter of push_leaf_right(). Like the previous case, it
         will also result in COWing the right leaf even if we are not able to
         move any items into it, since there can't be any item with a size
         smaller than 25 bytes (the size of struct btrfs_item).
      
         So instead of passing 1 as the minimum data size to push_leaf_right(),
         pass a size that corresponds to the sum of the size of all the
         remaining items in our leaf. We are not interested in moving less than
         that, because if we do, we are not able to delete our leaf and we have
         COWed the right leaf for nothing. Plus, moving only some of the items
         of our leaf, it means an even less balanced tree.
      
         Just like the previous case, we want to avoid the useless COW of the
         right leaf, this way we don't have to spend time allocating one new
         metadata extent, and doing more IO when committing a transaction or
         syncing a log tree. For the log tree case it's specially more important
         because more IO can result in a higher latency for a fsync operation.
      
      So adjust the minimum data size passed to push_leaf_left() and
      push_leaf_right() as mentioned above.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      Not being able to delete a leaf that became less than 1/3 full after
      deleting items from it is actually common. For example, for the fio test
      mentioned in the changelog of patch 6/6, we are only able to delete a
      leaf at btrfs_del_items() about 5.3% of the time, due to its left and
      right neighbour leaves not having enough free space to push all the
      remaining items into them.
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c4063d1
    • Filipe Manana's avatar
      btrfs: remove unnecessary leaf free space checks when pushing items · b4e098a9
      Filipe Manana authored
      When trying to push items from a leaf into its left and right neighbours,
      we lock the left or right leaf, check if it has the required minimum free
      space, COW the leaf and then check again if it has the minimum required
      free space. This second check is pointless:
      
      1) Most and foremost because it's not needed. We have a write lock on the
         leaf and on its parent node, so no one can come in and change either
         the pre-COW or post-COW version of the leaf for the whole duration of
         the push_leaf_left() and push_leaf_right() calls;
      
      2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
         amount of arithmetic operations and access to fields in the leaf's
         header and items, so it's not very cheap.
      
      So remove the duplicated free space checks.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4e098a9
    • Johannes Thumshirn's avatar
      btrfs: stop checking for NULL return from btrfs_get_extent_fiemap() · 6b5b7a41
      Johannes Thumshirn authored
      In get_extent_skip_holes() we're checking the return of
      btrfs_get_extent_fiemap() for an error pointer or NULL, but
      btrfs_get_extent_fiemap() will never return NULL, only error pointers or
      a valid extent_map.
      
      The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
      correctly only checks for error-pointers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b5b7a41
    • Pankaj Raghav's avatar
      btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode · f716fa47
      Pankaj Raghav authored
      Remove the redundant assignment to zone_info variable in
      btrfs_check_zoned_mode function.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f716fa47
    • David Sterba's avatar
      btrfs: replace BUILD_BUG_ON by static_assert · a55e65b8
      David Sterba authored
      The static_assert introduced in 6bab69c6 ("build_bug.h: add wrapper
      for _Static_assert") has been supported by compilers for a long time
      (gcc 4.6, clang 3.0) and can be used in header files. We don't need to
      put BUILD_BUG_ON to random functions but rather keep it next to the
      definition.
      
      The exception here is the UAPI header btrfs_tree.h that could be
      potentially included by userspace code and the static assert is not
      defined (nor used in any other header).
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a55e65b8
    • Johannes Thumshirn's avatar
      btrfs: zoned: allow DUP on meta-data block groups · 265f7237
      Johannes Thumshirn authored
      Allow creating or reading block-groups on a zoned device with DUP as a
      meta-data profile.
      
      This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
      operations for meta-data on zoned btrfs, so all writes to meta-data zones
      are aligned to the zone's write-pointer.
      
      Upon loading of the block-group, it is ensured both zones do have the same
      zone capacity and write-pointer offsets, so no extra machinery is needed
      to keep the write-pointers in sync for the meta-data zones. If this
      prerequisite is not met, loading of the block-group is refused.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      265f7237
    • Johannes Thumshirn's avatar
      btrfs: zoned: prepare for allowing DUP on zoned · dbfcc18f
      Johannes Thumshirn authored
      Allow for a block-group to be placed on more than one physical zone.
      
      This is a preparation for allowing DUP profiles for meta-data on a zoned
      file-system.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dbfcc18f
    • Johannes Thumshirn's avatar
      btrfs: zoned: make zone finishing multi stripe capable · 4dcbb8ab
      Johannes Thumshirn authored
      Currently finishing of a zone only works if the block group isn't
      spanning more than one zone.
      
      This limitation is purely artificial and can be easily expanded to block
      groups being places across multiple zones.
      
      This is a preparation for allowing DUP and later more complex block-group
      profiles on zoned btrfs.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4dcbb8ab
    • Johannes Thumshirn's avatar
      btrfs: zoned: make zone activation multi stripe capable · f9a912a3
      Johannes Thumshirn authored
      Currently activation of a zone only works if the block group isn't
      spanning more than one zone.
      
      This limitation is purely artificial and can be easily expanded to block
      groups being places across multiple zones.
      
      This is a preparation for allowing DUP and later more complex block-group
      profiles on zoned btrfs.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9a912a3