1. 25 May, 2020 40 commits
    • Filipe Manana's avatar
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana authored
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group, so that if another task
      deletes the block group and then another task allocates a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b7304af
    • Filipe Manana's avatar
      btrfs: fix a race between scrub and block group removal/allocation · 2473d24f
      Filipe Manana authored
      When scrub is verifying the extents of a block group for a device, it is
      possible that the corresponding block group gets removed and its logical
      address and device extents get used for a new block group allocation.
      When this happens scrub incorrectly reports that errors were detected
      and, if the the new block group has a different profile then the old one,
      deleted block group, we can crash due to a null pointer dereference.
      Possibly other unexpected and weird consequences can happen as well.
      
      Consider the following sequence of actions that leads to the null pointer
      dereference crash when scrub is running in parallel with balance:
      
      1) Balance sets block group X to read-only mode and starts relocating it.
         Block group X is a metadata block group, has a raid1 profile (two
         device extents, each one in a different device) and a logical address
         of 19424870400;
      
      2) Scrub is running and finds device extent E, which belongs to block
         group X. It enters scrub_stripe() to find all extents allocated to
         block group X, the search is done using the extent tree;
      
      3) Balance finishes relocating block group X and removes block group X;
      
      4) Balance starts relocating another block group and when trying to
         commit the current transaction as part of the preparation step
         (prepare_to_relocate()), it blocks because scrub is running;
      
      5) The scrub task finds the metadata extent at the logical address
         19425001472 and marks the pages of the extent to be read by a bio
         (struct scrub_bio). The extent item's flags, which have the bit
         BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
         scrub_page). It is these flags in the scrub pages that tells the
         bio's end io function (scrub_bio_end_io_worker) which type of extent
         it is dealing with. At this point we end up with 4 pages in a bio
         which is ready for submission (the metadata extent has a size of
         16Kb, so that gives 4 pages on x86);
      
      6) At the next iteration of scrub_stripe(), scrub checks that there is a
         pause request from the relocation task trying to commit a transaction,
         therefore it submits the pending bio and pauses, waiting for the
         transaction commit to complete before resuming;
      
      7) The relocation task commits the transaction. The device extent E, that
         was used by our block group X, is now available for allocation, since
         the commit root for the device tree was swapped by the transaction
         commit;
      
      8) Another task doing a direct IO write allocates a new data block group Y
         which ends using device extent E. This new block group Y also ends up
         getting the same logical address that block group X had: 19424870400.
         This happens because block group X was the block group with the highest
         logical address and, when allocating Y, find_next_chunk() returns the
         end offset of the current last block group to be used as the logical
         address for the new block group, which is
      
              18351128576 + 1073741824 = 19424870400
      
         So our new block group Y has the same logical address and device extent
         that block group X had. However Y is a data block group, while X was
         a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
      
      9) After allocating block group Y, the direct IO submits a bio to write
         to device extent E;
      
      10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
          extent E, which now correspond to the data written by the task that
          did a direct IO write. Then at the end io function associated with
          the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
          which calls scrub_checksum(). This later function checks the flags
          of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
          is set in the flags, so it assumes it has a metadata extent and
          then calls scrub_checksum_tree_block(). That functions returns an
          error, since interpreting data as a metadata extent causes the
          checksum verification to fail.
      
          So this makes scrub_checksum() call scrub_handle_errored_block(),
          which determines 'failed_mirror_index' to be 1, since the device
          extent E was allocated as the second mirror of block group X.
      
          It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
          'sblocks_for_recheck', and all the memory is initialized to zeroes by
          kcalloc().
      
          After that it calls scrub_setup_recheck_block(), which is responsible
          for filling each of those structures. However, when that function
          calls btrfs_map_sblock() against the logical address of the metadata
          extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
          the current block group Y. However block group Y has a raid0 profile
          and not a raid1 profile like X had, so the following call returns 1:
      
             scrub_nr_raid_mirrors(bbio)
      
          And as a result scrub_setup_recheck_block() only initializes the
          first (index 0) scrub_block structure in 'sblocks_for_recheck'.
      
          Then scrub_recheck_block() is called by scrub_handle_errored_block()
          with the second (index 1) scrub_block structure as the argument,
          because 'failed_mirror_index' was previously set to 1.
          This scrub_block was not initialized by scrub_setup_recheck_block(),
          so it has zero pages, its 'page_count' member is 0 and its 'pagev'
          page array has all members pointing to NULL.
      
          Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
          we have a NULL pointer dereference when accessing the flags of the first
          page, as pavev[0] is NULL:
      
          static void scrub_recheck_block_checksum(struct scrub_block *sblock)
          {
              (...)
              if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
                  scrub_checksum_data(sblock);
              (...)
          }
      
          Producing a stack trace like the following:
      
          [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
          [542998.010238] #PF: supervisor read access in kernel mode
          [542998.010878] #PF: error_code(0x0000) - not-present page
          [542998.011516] PGD 0 P4D 0
          [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
          [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G    B   W         5.6.0-rc7-btrfs-next-58 #1
          [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
          [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
          [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
          [542998.018474] Code: 4c 89 e6 ...
          [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
          [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
          [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
          [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
          [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
          [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
          [542998.027237] FS:  0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
          [542998.028327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
          [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          [542998.032380] Call Trace:
          [542998.032752]  scrub_recheck_block+0x162/0x400 [btrfs]
          [542998.033500]  ? __alloc_pages_nodemask+0x31e/0x460
          [542998.034228]  scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
          [542998.035170]  scrub_bio_end_io_worker+0x100/0x520 [btrfs]
          [542998.035991]  btrfs_work_helper+0xaa/0x720 [btrfs]
          [542998.036735]  process_one_work+0x26d/0x6a0
          [542998.037275]  worker_thread+0x4f/0x3e0
          [542998.037740]  ? process_one_work+0x6a0/0x6a0
          [542998.038378]  kthread+0x103/0x140
          [542998.038789]  ? kthread_create_worker_on_cpu+0x70/0x70
          [542998.039419]  ret_from_fork+0x3a/0x50
          [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
          [542998.047288] CR2: 0000000000000028
          [542998.047724] ---[ end trace bde186e176c7f96a ]---
      
      This issue has been around for a long time, possibly since scrub exists.
      The last time I ran into it was over 2 years ago. After recently fixing
      fstests to pass the "--full-balance" command line option to btrfs-progs
      when doing balance, several tests started to more heavily exercise balance
      with fsstress, scrub and other operations in parallel, and therefore
      started to hit this issue again (with btrfs/061 for example).
      
      Fix this by having scrub increment the 'trimming' counter of the block
      group, which pins the block group in such a way that it guarantees neither
      its logical address nor device extents can be reused by future block group
      allocations until we decrement the 'trimming' counter. Also make sure that
      on each iteration of scrub_stripe() we stop scrubbing the block group if
      it was removed already.
      
      A later patch in the series will rename the block group's 'trimming'
      counter and its helpers to a more generic name, since now it is not used
      exclusively for pinning while trimming anymore.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2473d24f
    • David Sterba's avatar
      btrfs: remove more obsolete v0 extent ref declarations · 31344b2f
      David Sterba authored
      The extent references v0 have been superseded long time go, there are
      some unused declarations of access helpers. We can safely remove them
      now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct
      btrfs_extent_item_v0 is still part of a backward compatibility check in
      relocation.c and thus not removed.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      31344b2f
    • YueHaibing's avatar
      btrfs: remove unused function btrfs_dev_extent_chunk_tree_uuid · 943aeb0d
      YueHaibing authored
      There's no callers in-tree anymore since
      commit d24ee97b ("btrfs: use new helpers to set uuids in eb")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      943aeb0d
    • Qu Wenruo's avatar
      btrfs: qgroup: mark qgroup inconsistent if we're inherting snapshot to a new qgroup · cbab8ade
      Qu Wenruo authored
      [BUG]
      For the following operation, qgroup is guaranteed to be screwed up due
      to snapshot adding to a new qgroup:
      
        # mkfs.btrfs -f $dev
        # mount $dev $mnt
        # btrfs qgroup en $mnt
        # btrfs subv create $mnt/src
        # xfs_io -f -c "pwrite 0 1m" $mnt/src/file
        # sync
        # btrfs qgroup create 1/0 $mnt/src
        # btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot
        # btrfs qgroup show -prce $mnt/src
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257         1.02MiB     16.00KiB         none         none ---     ---
        0/258         1.02MiB     16.00KiB         none         none 1/0     ---
        1/0             0.00B        0.00B         none         none ---     0/258
      	        ^^^^^^^^^^^^^^^^^^^^
      
      [CAUSE]
      The problem is in btrfs_qgroup_inherit(), we don't have good enough
      check to determine if the new relation would break the existing
      accounting.
      
      Unlike btrfs_add_qgroup_relation(), which has proper check to determine
      if we can do quick update without a rescan, in btrfs_qgroup_inherit() we
      can even assign a snapshot to multiple qgroups.
      
      [FIX]
      Fix it by manually marking qgroup inconsistent for snapshot inheritance.
      
      For subvolume creation, since all its extents are exclusively owned, we
      don't need to rescan.
      
      In theory, we should call relation check like quick_update_accounting()
      when doing qgroup inheritance and inform user about qgroup accounting
      inconsistency.
      
      But we don't have good mechanism to relay that back to the user in the
      snapshot creation context, thus we can only silently mark the qgroup
      inconsistent.
      
      Anyway, user shouldn't use qgroup inheritance during snapshot creation,
      and should add qgroup relationship after snapshot creation by 'btrfs
      qgroup assign', which has a much better UI to inform user about qgroup
      inconsistent and kick in rescan automatically.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbab8ade
    • Robbie Ko's avatar
      btrfs: speedup dead root detection during orphan cleanup · a619b3c7
      Robbie Ko authored
      When mounting, we handle deleted subvolume and orphan items.  First,
      find add orphan roots, then add them to fs_root radix tree.  Second, in
      tree-root, process each orphan item, skip if it is dead root.
      
      The original algorithm is based on the list of dead_roots, one by one to
      visit and check whether the objectid is consistent, the time complexity
      is O (n ^ 2).  When processing 50000 deleted subvols, it takes about
      120s.
      
      Because btrfs_find_orphan_roots has already ran before us, and added
      deleted subvol to fs_roots radix tree.
      
      The fs root will only be removed from the fs_roots radix tree after the
      cleaner process is started, and the cleaner will only start execution
      after the mount is complete.
      
      btrfs_orphan_cleanup can be called during the whole filesystem mount
      lifetime, but only "tree root" will be used in this section of code, and
      only mount time will be brought into tree root.
      
      So we can quickly check whether the orphan item is dead root through the
      fs_roots radix tree.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarRobbie Ko <robbieko@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a619b3c7
    • YueHaibing's avatar
      btrfs: remove unused function heads_to_leaves · eec5b6e0
      YueHaibing authored
      There's no callers in-tree anymore since commit 64403612 ("btrfs:
      rework btrfs_check_space_for_delayed_refs")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eec5b6e0
    • David Sterba's avatar
      btrfs: add more codes to decoder table · fb8521ca
      David Sterba authored
      I've grepped logs for 'errno=.*unknown' and found -95, -117 and -122,
      now added to the table. The wording is adjusted so it makes sense in
      context of filesystem.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb8521ca
    • David Sterba's avatar
      btrfs: sort error decoder entries · d54f8144
      David Sterba authored
      Add the raw errnos and sort them accordingly.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d54f8144
    • Anand Jain's avatar
      btrfs: free alien device after device add · 7f551d96
      Anand Jain authored
      When an old device has new fsid through 'btrfs device add -f <dev>' our
      fs_devices list has an alien device in one of the fs_devices lists.
      
      By having an alien device in fs_devices, we have two issues so far
      
      1. missing device does not not show as missing in the userland
      
      2. degraded mount will fail
      
      Both issues are caused by the fact that there's an alien device in the
      fs_devices list. (Alien means that it does not belong to the filesystem,
      identified by fsid, or does not contain btrfs filesystem at all, eg. due
      to overwrite).
      
      A device can be scanned/added through the control device ioctls
      SCAN_DEV, DEVICES_READY or by ADD_DEV.
      
      And device coming through the control device is checked against the all
      other devices in the lists, but this was not the case for ADD_DEV.
      
      This patch fixes both issues above by removing the alien device.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f551d96
    • Anand Jain's avatar
      btrfs: include non-missing as a qualifier for the latest_bdev · 998a0671
      Anand Jain authored
      btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
      the bdev with greatest device::generation number.  For a typical-missing
      device the generation number is zero so fs_devices::latest_bdev will
      never point to it.
      
      But if the missing device is due to alienation [1], then
      device::generation is not zero and if it is greater or equal to the rest
      of device  generations in the list, then fs_devices::latest_bdev ends up
      pointing to the missing device and reports the error like [2].
      
      [1] We maintain devices of a fsid (as in fs_device::fsid) in the
      fs_devices::devices list, a device is considered as an alien device
      if its fsid does not match with the fs_device::fsid
      
      Consider a working filesystem with raid1:
      
        $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
        $ mount /dev/sda /mnt-raid1
        $ umount /mnt-raid1
      
      While mnt-raid1 was unmounted the user force-adds one of its devices to
      another btrfs filesystem:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt-single
        $ btrfs dev add -f /dev/sda /mnt-single
      
      Now the original mnt-raid1 fails to mount in degraded mode, because
      fs_devices::latest_bdev is pointing to the alien device.
      
        $ mount -o degraded /dev/sdb /mnt-raid1
      
      [2]
      mount: wrong fs type, bad option, bad superblock on /dev/sdb,
             missing codepage or helper program, or other error
      
             In some cases useful info is found in syslog - try
             dmesg | tail or so.
      
        kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
        kernel: BTRFS error (device sdb): failed to read devices
        kernel: BTRFS error (device sdb): open_ctree failed
      
      Fix the root cause by checking if the device is not missing before it
      can be considered for the fs_devices::latest_bdev.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      998a0671
    • Eric Biggers's avatar
      btrfs: use crypto_shash_digest() instead of open coding · fd08001f
      Eric Biggers authored
      Use crypto_shash_digest() instead of crypto_shash_init() +
      crypto_shash_update() + crypto_shash_final().  This is more efficient.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd08001f
    • Anand Jain's avatar
      btrfs: drop useless goto in open_fs_devices · 1ed802c9
      Anand Jain authored
      There is no need of goto out in open_fs_devices() as there is nothing
      special done there.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ed802c9
    • Filipe Manana's avatar
      btrfs: remove useless check for copy_items() return value · 0bc2d3c0
      Filipe Manana authored
      At btrfs_log_prealloc_extents() we are checking if copy_items() returns a
      value greater than 0. That used to happen in the past to signal the caller
      that the path given to it was released and reused for other searches, but
      as of commit 0e56315c ("Btrfs: fix missing hole after hole punching
      and fsync when using NO_HOLES"), the copy_items() function does not have
      that behaviour anymore and always returns 0 or a negative value. So just
      remove that check at btrfs_log_prealloc_extents(), which the previously
      mentioned commit forgot to remove.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0bc2d3c0
    • Omar Sandoval's avatar
      btrfs: unify buffered and direct I/O read repair · 77d5d689
      Omar Sandoval authored
      Currently, direct I/O has its own versions of bio_readpage_error() and
      btrfs_check_repairable() (dio_read_error() and
      btrfs_check_dio_repairable(), respectively). The main difference is that
      the direct I/O version doesn't do read validation. The rework of direct
      I/O repair makes it possible to do validation, so we can get rid of
      btrfs_check_dio_repairable() and combine bio_readpage_error() and
      dio_read_error() into a new helper, btrfs_submit_read_repair().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      77d5d689
    • Omar Sandoval's avatar
      btrfs: get rid of endio_repair_workers · 5c047a69
      Omar Sandoval authored
      This was originally added in commit 8b110e39 ("Btrfs: implement
      repair function when direct read fails") to avoid a deadlock. In that
      commit, the direct I/O read endio executes on the endio_workers
      workqueue, submits a repair bio, and waits for it to complete. The
      repair bio endio must execute on a different workqueue, otherwise it
      could block on the endio_workers workqueue becoming available, which
      won't happen because the original endio is blocked on the repair bio.
      
      As of the previous commit, the original endio doesn't wait for the
      repair bio, so this separate workqueue is unnecessary.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c047a69
    • Omar Sandoval's avatar
      btrfs: simplify direct I/O read repair · fd9d6670
      Omar Sandoval authored
      Direct I/O read repair was originally implemented in commit 8b110e39
      ("Btrfs: implement repair function when direct read fails"). This
      implementation is unnecessarily complicated. There is major code
      duplication between __btrfs_subio_endio_read() (checks checksums and
      handles I/O errors for files with checksums),
      __btrfs_correct_data_nocsum() (handles I/O errors for files without
      checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
      for retries of files with checksums), and btrfs_retry_endio_nocsum()
      (handles I/O errors for retries of files without checksum). If it sounds
      like these should be one function, that's because they should.
      Additionally, these functions are very hard to follow due to their
      excessive use of goto.
      
      This commit replaces the original implementation. After the previous
      commit getting rid of orig_bio, we can reuse the same endio callback for
      repair I/O and the original I/O, we just need to track the file offset
      and original iterator in the repair bio. We can also unify the handling
      of files with and without checksums and simplify the control flow. We
      also no longer have to wait for each repair I/O to complete one by one.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd9d6670
    • Omar Sandoval's avatar
      btrfs: get rid of one layer of bios in direct I/O · 769b4f24
      Omar Sandoval authored
      In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
      path:
      
      1. The bio created by the generic direct I/O code (dio_bio).
      2. A clone of dio_bio we create in btrfs_submit_direct() to represent
         the entire direct I/O range (orig_bio).
      3. A partial clone of orig_bio limited to the size of a RAID stripe that
         we create in btrfs_submit_direct_hook().
      4. Clones of each of those split bios for each RAID stripe that we
         create in btrfs_map_bio().
      
      As of the previous commit, the second layer (orig_bio) is no longer
      needed for anything: we can split dio_bio instead, and complete dio_bio
      directly when all of the cloned bios complete. This lets us clean up a
      bunch of cruft, including dip->subio_endio and dip->errors (we can use
      dio_bio->bi_status instead). It also enables the next big cleanup of
      direct I/O read repair.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      769b4f24
    • Omar Sandoval's avatar
      btrfs: put direct I/O checksums in btrfs_dio_private instead of bio · 85879573
      Omar Sandoval authored
      The next commit will get rid of btrfs_dio_private->orig_bio. The only
      thing we really need it for is containing all of the checksums, but we
      can easily put the checksum array in btrfs_dio_private and have the
      submitted bios reference the array. We can also look the checksums up
      while we're setting up instead of the current awkward logic that looks
      them up for orig_bio when the first split bio is submitted.
      
      (Interestingly, btrfs_dio_private did contain the
      checksums before commit 23ea8e5a ("Btrfs: load checksum data once
      when submitting a direct read io"), but it didn't look them up up
      front.)
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      85879573
    • Omar Sandoval's avatar
      btrfs: convert btrfs_dio_private->pending_bios to refcount_t · e3b318d1
      Omar Sandoval authored
      This is really a reference count now, so convert it to refcount_t and
      rename it to refs.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3b318d1
    • Omar Sandoval's avatar
      btrfs: remove unused btrfs_dio_private::private · 2390a6da
      Omar Sandoval authored
      We haven't used this since commit 9be3395b ("Btrfs: use a btrfs
      bioset instead of abusing bio internals").
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2390a6da
    • Omar Sandoval's avatar
      btrfs: make btrfs_check_repairable() static · ce06d3ec
      Omar Sandoval authored
      Since its introduction in commit 2fe6303e ("Btrfs: split
      bio_readpage_error into several functions"), btrfs_check_repairable()
      has only been used from extent_io.c where it is defined.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce06d3ec
    • Omar Sandoval's avatar
      btrfs: rename __readpage_endio_check to check_data_csum · 47df7765
      Omar Sandoval authored
      __readpage_endio_check() is also used from the direct I/O read code, so
      give it a more descriptive name.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47df7765
    • Omar Sandoval's avatar
      btrfs: clarify btrfs_lookup_bio_sums documentation · fb30f470
      Omar Sandoval authored
      Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
      
      * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
        the declaration in the code to make that clear, too.
      * dst must be large enough to hold nblocks * csum_size, not just
        csum_size.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb30f470
    • Omar Sandoval's avatar
      btrfs: don't do repair validation for checksum errors · f337bd74
      Omar Sandoval authored
      The purpose of the validation step is to distinguish between good and
      bad sectors in a failed multi-sector read. If a multi-sector read
      succeeded but some of those sectors had checksum errors, we don't need
      to validate anything; we know the sectors with bad checksums need to be
      repaired.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f337bd74
    • Omar Sandoval's avatar
      btrfs: look at full bi_io_vec for repair decision · c7333972
      Omar Sandoval authored
      Read repair does two things: it finds a good copy of data to return to
      the reader, and it corrects the bad copy on disk. If a read of multiple
      sectors has an I/O error, repair does an extra "validation" step that
      issues a separate read for each sector. This allows us to find the exact
      failing sectors and only rewrite those.
      
      This heuristic is implemented in
      bio_readpage_error()/btrfs_check_repairable() as:
      
      	failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
      	if (failed_bio_pages > 1)
      		do validation
      
      However, at this point, bi_iter may have already been advanced. This
      means that we'll skip the validation step and rewrite the entire failed
      read.
      
      Fix it by getting the actual size from the biovec (which we can do
      because this is only called for non-cloned bios, although that will
      change in a later commit).
      
      Fixes: 8a2ee44a ("btrfs: look at bi_size for repair decisions")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7333972
    • Omar Sandoval's avatar
      btrfs: fix double __endio_write_update_ordered in direct I/O · c36cac28
      Omar Sandoval authored
      In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
      we complete the ordered extent range. However, we don't mark that the
      range doesn't need to be cleaned up from btrfs_direct_IO() until later.
      Therefore, if we fail to allocate the btrfs_dio_private, we complete the
      ordered extent range twice. We could fix this by updating
      unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
      that creating the btrfs_dio_private and submitting the bios are
      separate, and once the btrfs_dio_private is created, cleanup always
      happens through the btrfs_dio_private.
      
      The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
      is really subtle. We have the following:
      
        1. btrfs_direct_IO sets those two to the same value.
      
        2. When we call __blockdev_direct_IO unless
           btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
           modify unsubmitted_oe_range_start so that start < end. Cleanup
           won't happen.
      
        3. We come into btrfs_submit_direct - if it dip allocation fails we'd
           return with oe_range_end now modified so cleanup will happen.
      
        4. If we manage to allocate the dip we reset the unsubmitted range
           members to be equal so that cleanup happens from
           btrfs_endio_direct_write.
      
      This 4-step logic is not really obvious, especially given it's scattered
      across 3 functions.
      
      Fixes: f28a4928 ("Btrfs: fix leaking of ordered extents after direct IO write error")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      [ add range start/end logic explanation from Nikolay ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c36cac28
    • Omar Sandoval's avatar
      btrfs: fix error handling when submitting direct I/O bio · 6d3113a1
      Omar Sandoval authored
      In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
      stripe or chunk, we submit orig_bio without cloning it. In this case, we
      don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
      decrement pending_bios to -1, and we never complete orig_bio. Fix it by
      initializing pending_bios to 1 instead of incrementing later.
      
      Fixing this exposes another bug: we put orig_bio prematurely and then
      put it again from end_io. Fix it by not putting orig_bio.
      
      After this change, pending_bios is really more of a reference count, but
      I'll leave that cleanup separate to keep the fix small.
      
      Fixes: e65e1535 ("btrfs: fix panic caused by direct IO")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d3113a1
    • Omar Sandoval's avatar
      block: add bio_for_each_bvec_all() · 1072c12d
      Omar Sandoval authored
      An upcoming Btrfs fix needs to know the original size of a non-cloned
      bios. Rather than accessing the bvec table directly, let's add a
      bio_for_each_bvec_all() accessor.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1072c12d
    • Filipe Manana's avatar
      btrfs: simplify error handling of clean_pinned_extents() · 534cf531
      Filipe Manana authored
      At clean_pinned_extents(), whether we end up returning success or failure,
      we pretty much have to do the same things:
      
      1) unlock unused_bg_unpin_mutex
      2) decrement reference count on the previous transaction
      
      We also call btrfs_dec_block_group_ro() in case of failure, but that is
      better done in its caller, btrfs_delete_unused_bgs(), since its the
      caller that calls inc_block_group_ro(), so it should be responsible for
      the decrement operation, as it is in case any of the other functions it
      calls fail.
      
      So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents()
      into  btrfs_delete_unused_bgs() and unify the error and success return
      paths for clean_pinned_extents(), reducing duplicated code and making it
      simpler.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      534cf531
    • Qu Wenruo's avatar
      btrfs: remove the redundant parameter level in btrfs_bin_search() · e3b83361
      Qu Wenruo authored
      All callers pass the eb::level so we can get read it directly inside the
      btrfs_bin_search and key_search.
      
      This is inspired by the work of Marek in U-boot.
      
      CC: Marek Behun <marek.behun@nic.cz>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3b83361
    • Nikolay Borisov's avatar
      btrfs: make btrfs_read_disk_super return struct btrfs_disk_super · b335eab8
      Nikolay Borisov authored
      Instead of returning both the page and the super block structure, make
      btrfs_read_disk_super just return a pointer to struct btrfs_disk_super.
      As a result the function signature is simplified. Also,
      read_cache_page_gfp can never return NULL so check its return value only
      for IS_ERR.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b335eab8
    • Nikolay Borisov's avatar
      btrfs: use list_for_each_entry_safe in free_reloc_roots · a7571232
      Nikolay Borisov authored
      The function always works on a local copy of the reloc root list, which
      cannot be modified outside of it so using list_for_each_entry is fine.
      Additionally the macro handles empty lists so drop list_empty checks of
      callers. No semantic changes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7571232
    • David Sterba's avatar
      btrfs: don't force read-only after error in drop snapshot · 7c09c030
      David Sterba authored
      Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
      forced read-only. This is not a transaction abort and the filesystem is
      otherwise ok, so the error should be just propagated to the callers.
      
      This is caused by unnecessary call to btrfs_handle_fs_error for all
      errors, except EAGAIN. This does not make sense as the standard
      transaction abort mechanism is in btrfs_drop_snapshot so all relevant
      failures are handled.
      
      Originally in commit cb1b69f4 ("Btrfs: forced readonly when
      btrfs_drop_snapshot() fails") there was no return value at all, so the
      btrfs_std_error made some sense but once the error handling and
      propagation has been implemented we don't need it anymore.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c09c030
    • Filipe Manana's avatar
      btrfs: remove pointless assertion on reclaim_size counter · 2d9faa5a
      Filipe Manana authored
      The reclaim_size counter of a space_info object is unsigned. So its value
      can never be negative, it's pointless to have an assertion that checks
      its value is >= 0, therefore remove it.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d9faa5a
    • Zheng Wei's avatar
      btrfs: tree-checker: remove duplicate definition of 'inode_item_err' · 72f4f078
      Zheng Wei authored
      Remove the duplicate definition of 'inode_item_err' in the file
      tree-checker.c that got there by accident in c23c77b0 ("btrfs:
      tree-checker: Refactor inode key check into seperate function").
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarZheng Wei <wei.zheng@vivo.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72f4f078
    • Josef Bacik's avatar
      btrfs: force chunk allocation if our global rsv is larger than metadata · 9c343784
      Josef Bacik authored
      Nikolay noticed a bunch of test failures with my global rsv steal
      patches.  At first he thought they were introduced by them, but they've
      been failing for a while with 64k nodes.
      
      The problem is with 64k nodes we have a global reserve that calculates
      out to 13MiB on a freshly made file system, which only has 8MiB of
      metadata space.  Because of changes I previously made we no longer
      account for the global reserve in the overcommit logic, which means we
      correctly allow overcommit to happen even though we are already
      overcommitted.
      
      However in some corner cases, for example btrfs/170, we will allocate
      the entire file system up with data chunks before we have enough space
      pressure to allocate a metadata chunk.  Then once the fs is full we
      ENOSPC out because we cannot overcommit and the global reserve is taking
      up all of the available space.
      
      The most ideal way to deal with this is to change our space reservation
      stuff to take into account the height of the tree's that we're
      modifying, so that our global reserve calculation does not end up so
      obscenely large.
      
      However that is a huge undertaking.  Instead fix this by forcing a chunk
      allocation if the global reserve is larger than the total metadata
      space.  This gives us essentially the same behavior that happened
      before, we get a chunk allocated and these tests can pass.
      
      This is meant to be a stop-gap measure until we can tackle the "tree
      height only" project.
      
      Fixes: 0096420a ("btrfs: do not account global reserve in can_overcommit")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c343784
    • Josef Bacik's avatar
      btrfs: run btrfs_try_granting_tickets if a priority ticket fails · 42a72cb7
      Josef Bacik authored
      With normal tickets we could have a large reservation at the front of
      the list that is unable to be satisfied, but a smaller ticket later on
      that can be satisfied.  The way we handle this is to run
      btrfs_try_granting_tickets() in maybe_fail_all_tickets().
      
      However no such protection exists for priority tickets.  Fix this by
      handling it in handle_reserve_ticket().  If we've returned after
      attempting to flush space in a priority related way, we'll still be on
      the priority list and need to be removed.
      
      We rely on the flushing to free up space and wake the ticket, but if
      there is not enough space to reclaim _but_ there's enough space in the
      space_info to handle subsequent reservations then we would have gotten
      an ENOSPC erroneously.
      
      Address this by catching where we are still on the list, meaning we were
      a priority ticket, and removing ourselves and then running
      btrfs_try_granting_tickets().  This will handle this particular corner
      case.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42a72cb7
    • Josef Bacik's avatar
      btrfs: only check priority tickets for priority flushing · 666daa9f
      Josef Bacik authored
      In debugging a generic/320 failure on ppc64, Nikolay noticed that
      sometimes we'd ENOSPC out with plenty of space to reclaim if we had
      committed the transaction.  He further discovered that this was because
      there was a priority ticket that was small enough to fit in the free
      space currently in the space_info.
      
      Consider the following scenario.  There is no more space to reclaim in
      the fs without committing the transaction.  Assume there's 1MiB of space
      free in the space info, but there are pending normal tickets with 2MiB
      reservations.
      
      Now a priority ticket comes in with a .5MiB reservation.  Because we
      have normal tickets pending we add ourselves to the priority list,
      despite the fact that we could satisfy this reservation.
      
      The flushing machinery now gets to the point where it wants to commit
      the transaction, but because there's a .5MiB ticket on the priority list
      and we have 1MiB of free space we assume the ticket will be granted
      soon, so we bail without committing the transaction.
      
      Meanwhile the priority flushing does not commit the transaction, and
      eventually fails with an ENOSPC.  Then all other tickets are failed with
      ENOSPC because we were never able to actually commit the transaction.
      
      The fix for this is we should have simply granted the priority flusher
      his reservation, because there was space to make the reservation.
      Priority flushers by definition take priority, so they are allowed to
      make their reservations before any previous normal tickets.  By not
      adding this priority ticket to the list the normal flushing mechanisms
      will then commit the transaction and everything will continue normally.
      
      We still need to serialize ourselves with other priority tickets, so if
      there are any tickets on the priority list then we need to add ourselves
      to that list in order to maintain the serialization between priority
      tickets.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      666daa9f
    • Josef Bacik's avatar
      btrfs: account for trans_block_rsv in may_commit_transaction · bb4f58a7
      Josef Bacik authored
      On ppc64le with 64k page size (respectively 64k block size) generic/320
      was failing and debug output showed we were getting a premature ENOSPC
      with a bunch of space in btrfs_fs_info::trans_block_rsv.
      
      This meant there were still open transaction handles holding space, yet
      the flusher didn't commit the transaction because it deemed the freed
      space won't be enough to satisfy the current reserve ticket. Fix this
      by accounting for space in trans_block_rsv when deciding whether the
      current transaction should be committed or not.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb4f58a7