1. 18 Nov, 2019 40 commits
    • Anand Jain's avatar
      btrfs: use bool argument in free_root_pointers() · 4273eaff
      Anand Jain authored
      We don't need int argument bool shall do in free_root_pointers().  And
      rename the argument as it confused two people.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4273eaff
    • Chengguang Xu's avatar
      btrfs: use better definition of number of compression type · ce96b7ff
      Chengguang Xu authored
      The compression type upper limit constant is the same as the last value
      and this is confusing.  In order to keep coding style consistent, use
      BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
      'NR' being one more than the last value.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce96b7ff
    • Chengguang Xu's avatar
      btrfs: use enum for extent type defines · b9b1a53e
      Chengguang Xu authored
      Use enum to replace macro definitions of extent types.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9b1a53e
    • Chengguang Xu's avatar
      btrfs: props: remove unnecessary hash_init() · b2cd2959
      Chengguang Xu authored
      DEFINE_HASHTABLE itself has already included initialization code,
      we don't have to call hash_init() again, so remove it.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2cd2959
    • Nikolay Borisov's avatar
      btrfs: Rename btrfs_join_transaction_nolock · 8d510121
      Nikolay Borisov authored
      This function is used only during the final phase of freespace cache
      writeout. This is necessary since using the plain btrfs_join_transaction
      api is deadlock prone. The deadlock looks like:
      
      T1:
      btrfs_commit_transaction
        commit_cowonly_roots
          btrfs_write_dirty_block_groups
            btrfs_wait_cache_io
              __btrfs_wait_cache_io
             btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
                                          inode and blocks transaction commit
      				    until freespace cache writeout
      
      T2: <-- after T1 has triggered the writeout
      finish_ordered_fn
        btrfs_finish_ordered_io
          btrfs_join_transaction <--- this would block waiting for current
                                      transaction to commit, but since trans
      				commit is waiting for this writeout to
      				finish
      
      The special purpose functions prevents it by simply skipping the "wait
      for writeout" since it's guaranteed the transaction won't proceed until
      we are done.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8d510121
    • Nikolay Borisov's avatar
      btrfs: User assert to document transaction requirement · ce6d3eb6
      Nikolay Borisov authored
      Using an ASSERT in btrfs_pin_extent allows to more stringently observe
      whether the function is called under a transaction or not.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce6d3eb6
    • David Sterba's avatar
      btrfs: opencode extent_buffer_get · 67439dad
      David Sterba authored
      The helper is trivial and we can understand what the atomic_inc on
      something named refs does.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      67439dad
    • Tejun Heo's avatar
      btrfs: Avoid getting stuck during cyclic writebacks · f7bddf1e
      Tejun Heo authored
      During a cyclic writeback, extent_write_cache_pages() uses done_index
      to update the writeback_index after the current run is over.  However,
      instead of current index + 1, it gets to to the current index itself.
      
      Unfortunately, this, combined with returning on EOF instead of looping
      back, can lead to the following pathlogical behavior.
      
      1. There is a single file which has accumulated enough dirty pages to
         trigger balance_dirty_pages() and the writer appending to the file
         with a series of short writes.
      
      2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
      
      3. Writeback kicks in and the cursor is on the last page of the dirty
         file.  Writeback is started or skipped if already in progress.  As
         it's EOF, extent_write_cache_pages() returns and the cursor is set
         to done_index which is pointing to the last page.
      
      4. Writeback is done.  Nothing happens till balance_dirty_pages
         finishes, at which point we go back to #1.
      
      This can almost completely stall out writing back of the file and keep
      the system over dirty threshold for a long time which can mess up the
      whole system.  We encountered this issue in production with a package
      handling application which can reliably reproduce the issue when
      running under tight memory limits.
      
      Reading the comment in the error handling section, this seems to be to
      avoid accidentally skipping a page in case the write attempt on the
      page doesn't succeed.  However, this concern seems bogus.
      
      On each page, the code either:
      
      * Skips and moves onto the next page.
      
      * Fails issue and sets done_index to index + 1.
      
      * Successfully issues and continue to the next page if budget allows
        and not EOF.
      
      IOW, as long as it's not EOF and there's budget, the code never
      retries writing back the same page.  Only when a page happens to be
      the last page of a particular run, we end up retrying the page, which
      can't possibly guarantee anything data integrity related.  Besides,
      cyclic writes are only used for non-syncing writebacks meaning that
      there's no data integrity implication to begin with.
      
      Fix it by always setting done_index past the current page being
      processed.
      
      Note that this problem exists in other writepages too.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7bddf1e
    • Marcos Paulo de Souza's avatar
      btrfs: block-group: Rework documentation of check_system_chunk function · a9143bd3
      Marcos Paulo de Souza authored
      Commit 4617ea3a (" Btrfs: fix necessary chunk tree space calculation
      when allocating a chunk") removed the is_allocation argument from
      check_system_chunk, since the formula for reserving the necessary space
      for allocation or removing a chunk would be the same.
      
      So, rework the comment by removing the mention of is_allocation
      argument.
      Signed-off-by: default avatarMarcos Paulo de Souza <marcos.souza.org@gmail.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a9143bd3
    • Qu Wenruo's avatar
      btrfs: Enhance error output for write time tree checker · c06631b0
      Qu Wenruo authored
      Unlike read time tree checker errors, write time error can't be
      inspected by "btrfs inspect dump-tree", so we need extra information to
      determine what's going wrong.
      
      The patch will add the following output for write time tree checker
      error:
      
      - The content of the offending tree block
        To help determining if it's a false alert.
      
      - Kernel WARN_ON() for debug build
        This is helpful for us to detect unexpected write time tree checker
        error, especially fstests could catch the dmesg.
        Since the WARN_ON() is only triggered for write time tree checker,
        test cases utilizing dm-error won't trigger this WARN_ON(), thus no
        extra noise.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c06631b0
    • Qu Wenruo's avatar
      btrfs: tree-checker: Refactor prev_key check for ino into a function · 80d7fd1e
      Qu Wenruo authored
      Refactor the check for prev_key->objectid of the following key types
      into one function, check_prev_ino():
      
      - EXTENT_DATA
      - INODE_REF
      - DIR_INDEX
      - DIR_ITEM
      - XATTR_ITEM
      
      Also add the check of prev_key for INODE_REF.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      80d7fd1e
    • Chris Mason's avatar
      Btrfs: extent_write_locked_range() should attach inode->i_wb · dbb70bec
      Chris Mason authored
      extent_write_locked_range() is used when we're falling back to buffered
      IO from inside of compression.  It allocates its own wbc and should
      associate it with the inode's i_wb to make sure the IO goes down from
      the correct cgroup.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dbb70bec
    • Chris Mason's avatar
      Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Chris Mason authored
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Modified-and-reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec39f769
    • Chris Mason's avatar
      Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Chris Mason authored
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racey.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page() and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobufers() on chunkB->locked_page.
      
      Without cgroups in use, this creates as a phantom dirty page, which
      isn't great but isn't the end of the world. What can happen, it can go
      through the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make asyc_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d53c9e6
    • Chris Mason's avatar
      Btrfs: delete the entire async bio submission framework · ba8a9d07
      Chris Mason authored
      Now that we're not using btrfs_schedule_bio() anymore, delete all the
      code that supported it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba8a9d07
    • Chris Mason's avatar
      Btrfs: stop using btrfs_schedule_bio() · 08635bae
      Chris Mason authored
      btrfs_schedule_bio() hands IO off to a helper thread to do the actual
      submit_bio() call.  This has been used to make sure async crc and
      compression helpers don't get stuck on IO submission.  To maintain good
      performance, over time the IO submission threads duplicated some IO
      scheduler characteristics such as high and low priority IOs and they
      also made some ugly assumptions about request allocation batch sizes.
      
      All of this cost at least one extra context switch during IO submission,
      and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
      using btrfs_schedule_bio().  We may need to adjust the number of async
      helper threads for crcs and compression, but long term it's a better
      path.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08635bae
    • David Sterba's avatar
      btrfs: add __pure attribute to functions · e1f60a65
      David Sterba authored
      The attribute is more relaxed than const and the functions could
      dereference pointers, as long as the observable state is not changed. We
      do have such functions, based on -Wsuggest-attribute=pure .
      
      The visible effects of this patch are negligible, there are differences
      in the assembly but hard to summarize.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1f60a65
    • David Sterba's avatar
      btrfs: add const function attribute · 4143cb8b
      David Sterba authored
      For some reason the attribute is called __attribute_const__ and not
      __const, marks functions that have no observable effects on program
      state, IOW not reading pointers, just the arguments and calculating a
      value. Allows the compiler to do some optimizations, based on
      -Wsuggest-attribute=const . The effects are rather small, though, about
      60 bytes decrese of btrfs.ko.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4143cb8b
    • David Sterba's avatar
      btrfs: add __cold attribute to more functions · b105e927
      David Sterba authored
      The attribute can mark functions supposed to be called rarely if at all
      and the text can be moved to sections far from the other code. The
      attribute has been added to several functions already, this patch is
      based on hints given by gcc -Wsuggest-attribute=cold.
      
      The net effect of this patch is decrease of btrfs.ko by 1000-1300,
      depending on the config options.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b105e927
    • David Sterba's avatar
      btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      David Sterba authored
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c66e0d4
    • Josef Bacik's avatar
      btrfs: use refcount_inc_not_zero in kill_all_nodes · baf320b9
      Josef Bacik authored
      We hit the following warning while running down a different problem
      
      [ 6197.175850] ------------[ cut here ]------------
      [ 6197.185082] refcount_t: underflow; use-after-free.
      [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
      [ 6197.521792] Call Trace:
      [ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
      [ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
      [ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
      [ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
      [ 6197.566910]  cleaner_kthread+0xfa/0x120
      [ 6197.574573]  kthread+0x111/0x130
      [ 6197.581022]  ? kthread_create_on_node+0x60/0x60
      [ 6197.590086]  ret_from_fork+0x1f/0x30
      [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
      
      This is because the free side drops the ref without the lock, and then
      takes the lock if our refcount is 0.  So you can have nodes on the tree
      that have a refcount of 0.  Fix this by zero'ing out that element in our
      temporary array so we don't try to kill it again.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      baf320b9
    • Anand Jain's avatar
      btrfs: print process name and pid that calls device scanning · aa6c0df7
      Anand Jain authored
      Its very helpful if we had logged the device scanner process name to
      debug the race condition between the systemd-udevd scan and the user
      initiated device forget command.
      
      This patch adds process name and pid to the scan message.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add pid to the message ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aa6c0df7
    • Nikolay Borisov's avatar
      btrfs: Open-code name_in_log_ref in replay_one_name · 725af92a
      Nikolay Borisov authored
      That function adds unnecessary indirection between backref_in_log and
      the caller. Furthermore it also "downgrades" backref_in_log's return
      value to a boolean, when in fact it could very well be an error.
      
      Rectify the situation by simply opencoding name_in_log_ref in
      replay_one_name and properly handling possible return codes from
      backref_in_log.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      725af92a
    • Nikolay Borisov's avatar
      btrfs: Properly handle backref_in_log retval · d3316c82
      Nikolay Borisov authored
      This function can return a negative error value if btrfs_search_slot
      errors for whatever reason or if btrfs_alloc_path runs out of memory.
      This is currently problemattic because backref_in_log is treated by its
      callers as if it returns boolean.
      
      Fix this by adding proper error handling in callers. That also enables
      the function to return the direct error code from btrfs_search_slot.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3316c82
    • Nikolay Borisov's avatar
      btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_log · 89cbf5f6
      Nikolay Borisov authored
      Direct replacement, though note that the inside of the loop in
      btrfs_find_name_in_backref is organized in a slightly different way but
      is equvalent.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      89cbf5f6
    • Qu Wenruo's avatar
      btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED · 3296bf56
      Qu Wenruo authored
      The state was introduced in commit 4a9d8bde ("Btrfs: make the state
      of the transaction more readable"), then in commit 302167c5
      ("btrfs: don't end the transaction for delayed refs in throttle") the
      state is completely removed.
      
      So we can just clean up the state since it's only compared but never
      set.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3296bf56
    • Qu Wenruo's avatar
      btrfs: transaction: describe transaction states and transitions · 61c047b5
      Qu Wenruo authored
      Add an overview of the basic btrfs transaction transitions, including
      the following states:
      
      - No transaction states
      - Transaction N [[TRANS_STATE_RUNNING]]
      - Transaction N [[TRANS_STATE_COMMIT_START]]
      - Transaction N [[TRANS_STATE_COMMIT_DOING]]
      - Transaction N [[TRANS_STATE_UNBLOCKED]]
      - Transaction N [[TRANS_STATE_COMPLETED]]
      
      For each state, the comment will include:
      
      - Basic explaination about current state
      - How to go next stage
      - What will happen if we call various start_transaction() functions
      - Relationship to transaction N+1
      
      This doesn't provide tech details, but serves as a cheat sheet for
      reader to get into the code a little easier.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61c047b5
    • David Sterba's avatar
      btrfs: use has_single_bit_set for clarity · c1499166
      David Sterba authored
      Replace is_power_of_2 with the helper that is self-documenting and
      remove the open coded call in alloc_profile_is_valid.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1499166
    • David Sterba's avatar
      btrfs: add 64bit safe helper for power of two checks · 79c8264e
      David Sterba authored
      As is_power_of_two takes unsigned long, it's not safe on 32bit
      architectures, but we could pass any u64 value in seveal places. Add a
      separate helper and also an alias that better expresses the purpose for
      which the helper is used.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79c8264e
    • Anand Jain's avatar
      btrfs: balance: use term redundancy instead of integrity in message · e62869be
      Anand Jain authored
      When balance reduces the number of copies of metadata, it reduces the
      redundancy, use the term redundancy instead of integrity.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e62869be
    • David Sterba's avatar
      btrfs: move btrfs_unlock_up_safe to other locking functions · 1f95ec01
      David Sterba authored
      The function belongs to the family of locking functions, so move it
      there. The 'noinline' keyword is dropped as it's now an exported
      function that does not need it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f95ec01
    • David Sterba's avatar
      btrfs: move btrfs_set_path_blocking to other locking functions · ed2b1d36
      David Sterba authored
      The function belongs to the family of locking functions, so move it
      there. The 'noinline' keyword is dropped as it's now an exported
      function that does not need it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ed2b1d36
    • David Sterba's avatar
      btrfs: make btrfs_assert_tree_locked static inline · 31f6e769
      David Sterba authored
      The function btrfs_assert_tree_locked is used outside of the locking
      code so it is exported, however we can make it static inine as it's
      fairly trivial.
      
      This is the only locking assertion used in release builds, inlining
      improves the text size by 174 bytes and reduces stack consumption in the
      callers.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      31f6e769
    • David Sterba's avatar
      btrfs: make locking assertion helpers static inline · d6156218
      David Sterba authored
      I've noticed that none of the btrfs_assert_*lock* debugging helpers is
      inlined, despite they're short and mostly a value update. Making them
      inline shaves 67 from the text size, reduces stack consumption and
      perhaps also slightly improves the performance due to avoiding
      unnecessary calls.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d6156218
    • Omar Sandoval's avatar
      btrfs: get rid of pointless wtag variable in async-thread.c · c9eb55db
      Omar Sandoval authored
      Commit ac0c7cf8 ("btrfs: fix crash when tracepoint arguments are
      freed by wq callbacks") added a void pointer, wtag, which is passed into
      trace_btrfs_all_work_done() instead of the freed work item. This is
      silly for a few reasons:
      
      1. The freed work item still has the same address.
      2. work is still in scope after it's freed, so assigning wtag doesn't
         stop anyone from using it.
      3. The tracepoint has always taken a void * argument, so assigning wtag
         doesn't actually make things any more type-safe. (Note that the
         original bug in commit bc074524 ("btrfs: prefix fsid to all trace
         events") was that the void * was implicitly casted when it was passed
         to btrfs_work_owner() in the trace point itself).
      
      Instead, let's add some clearer warnings as comments.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9eb55db
    • Omar Sandoval's avatar
      btrfs: get rid of unique workqueue helper functions · a0cac0ec
      Omar Sandoval authored
      Commit 9e0af237 ("Btrfs: fix task hang under heavy compressed
      write") worked around the issue that a recycled work item could get a
      false dependency on the original work item due to how the workqueue code
      guarantees non-reentrancy. It did so by giving different work functions
      to different types of work.
      
      However, the fixes in the previous few patches are more complete, as
      they prevent a work item from being recycled at all (except for a tiny
      window that the kernel workqueue code handles for us). This obsoletes
      the previous fix, so we don't need the unique helpers for correctness.
      The only other reason to keep them would be so they show up in stack
      traces, but they always seem to be optimized to a tail call, so they
      don't show up anyways. So, let's just get rid of the extra indirection.
      
      While we're here, rename normal_work_helper() to the more informative
      btrfs_work_helper().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a0cac0ec
    • Omar Sandoval's avatar
      btrfs: don't prematurely free work in scrub_missing_raid56_worker() · 57d4f0b8
      Omar Sandoval authored
      Currently, scrub_missing_raid56_worker() puts and potentially frees
      sblock (which embeds the work item) and then submits a bio through
      scrub_wr_submit(). This is another potential instance of the bug in
      "btrfs: don't prematurely free work in run_ordered_work()". Fix it by
      dropping the reference after we submit the bio.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57d4f0b8
    • Omar Sandoval's avatar
      btrfs: don't prematurely free work in reada_start_machine_worker() · e732fe95
      Omar Sandoval authored
      Currently, reada_start_machine_worker() frees the reada_machine_work and
      then calls __reada_start_machine() to do readahead. This is another
      potential instance of the bug in "btrfs: don't prematurely free work in
      run_ordered_work()".
      
      There _might_ already be a deadlock here: reada_start_machine_worker()
      can depend on itself through stacked filesystems (__read_start_machine()
      -> reada_start_machine_dev() -> reada_tree_block_flagged() ->
      read_extent_buffer_pages() -> submit_one_bio() ->
      btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
      submit_bio() onto a loop device can trigger readahead on the lower
      filesystem).
      
      Either way, let's fix it by freeing the work at the end.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e732fe95
    • Omar Sandoval's avatar
      btrfs: don't prematurely free work in end_workqueue_fn() · 9be490f1
      Omar Sandoval authored
      Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
      the work item) and then calls bio_endio(). This is another potential
      instance of the bug in "btrfs: don't prematurely free work in
      run_ordered_work()".
      
      In particular, the endio call may depend on other work items. For
      example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
      __btrfs_correct_data_nocsum() -> dio_read_error() ->
      submit_dio_repair_bio(), which submits a bio that is also completed
      through a end_workqueue_fn() work item. However,
      __btrfs_correct_data_nocsum() waits for the newly submitted bio to
      complete, thus it depends on another work item.
      
      This example currently usually works because we use different workqueue
      helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
      However, it may deadlock with stacked filesystems and is fragile
      overall. The proper fix is to free the work item at the very end of the
      work function, so let's do that.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9be490f1
    • Omar Sandoval's avatar
      btrfs: don't prematurely free work in run_ordered_work() · c495dcd6
      Omar Sandoval authored
      We hit the following very strange deadlock on a system with Btrfs on a
      loop device backed by another Btrfs filesystem:
      
      1. The top (loop device) filesystem queues an async_cow work item from
         cow_file_range_async(). We'll call this work X.
      2. Worker thread A starts work X (normal_work_helper()).
      3. Worker thread A executes the ordered work for the top filesystem
         (run_ordered_work()).
      4. Worker thread A finishes the ordered work for work X and frees X
         (work->ordered_free()).
      5. Worker thread A executes another ordered work and gets blocked on I/O
         to the bottom filesystem (still in run_ordered_work()).
      6. Meanwhile, the bottom filesystem allocates and queues an async_cow
         work item which happens to be the recently-freed X.
      7. The workqueue code sees that X is already being executed by worker
         thread A, so it schedules X to be executed _after_ worker thread A
         finishes (see the find_worker_executing_work() call in
         process_one_work()).
      
      Now, the top filesystem is waiting for I/O on the bottom filesystem, but
      the bottom filesystem is waiting for the top filesystem to finish, so we
      deadlock.
      
      This happens because we are breaking the workqueue assumption that a
      work item cannot be recycled while it still depends on other work. Fix
      it by waiting to free the work item until we are done with all of the
      related ordered work.
      
      P.S.:
      
      One might ask why the workqueue code doesn't try to detect a recycled
      work item. It actually does try by checking whether the work item has
      the same work function (find_worker_executing_work()), but in our case
      the function is the same. This is the only key that the workqueue code
      has available to compare, short of adding an additional, layer-violating
      "custom key". Considering that we're the only ones that have ever hit
      this, we should just play by the rules.
      
      Unfortunately, we haven't been able to create a minimal reproducer other
      than our full container setup using a compress-force=zstd filesystem on
      top of another compress-force=zstd filesystem.
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c495dcd6