1. 26 Apr, 2018 1 commit
    • Qu Wenruo's avatar
      btrfs: Fix wrong first_key parameter in replace_path · 17515f1b
      Qu Wenruo authored
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") introduced new @first_key parameter for read_tree_block(), however
      caller in replace_path() is parasing wrong key to read_tree_block().
      
      It should use parameter @first_key other than @key.
      
      Normally it won't expose problem as @key is normally initialzied to the
      same value of @first_key we expect.
      However in relocation recovery case, @key can be set to (0, 0, 0), and
      since no valid key in relocation tree can be (0, 0, 0), it will cause
      read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
      
      Fix it by setting @first_key correctly.
      
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      17515f1b
  2. 20 Apr, 2018 2 commits
    • Qu Wenruo's avatar
      btrfs: print-tree: debugging output enhancement · c0872323
      Qu Wenruo authored
      This patch enhances the following things:
      
      - tree block header
        * add generation and owner output for node and leaf
      - node pointer generation output
      - allow btrfs_print_tree() to not follow nodes
        * just like btrfs-progs
      
      Please note that, although function btrfs_print_tree() is not called by
      anyone right now, it's still a pretty useful function to debug kernel.
      So that function is still kept for later use.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c0872323
    • Nikolay Borisov's avatar
      btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov authored
      When the delayed refs for a head are all run, eventually
      cleanup_ref_head is called which (in case of deletion) obtains a
      reference for the relevant btrfs_space_info struct by querying the bg
      for the range. This is problematic because when the last extent of a
      bg is deleted a race window emerges between removal of that bg and the
      subsequent invocation of cleanup_ref_head. This can result in cache being null
      and either a null pointer dereference or assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag "is_system" to head_ref structs,
      which is populated at insertion time. This allows to decouple the
      querying for the spaceinfo from querying the possibly deleted bg.
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5e388e95
  3. 18 Apr, 2018 5 commits
    • David Sterba's avatar
      btrfs: fix unaligned access in readdir · 92d32170
      David Sterba authored
      The last update to readdir introduced a temporary buffer to store the
      emitted readdir data, but as there are file names of variable length,
      there's a lot of unaligned access.
      
      This was observed on a sparc64 machine:
      
        Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]
      
      Fixes: 23b5ec74 ("btrfs: fix readdir deadlock with pagefault")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: default avatarRené Rebe <rene@exactcode.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92d32170
    • Qu Wenruo's avatar
      btrfs: Fix wrong btrfs_delalloc_release_extents parameter · 336a8bb8
      Qu Wenruo authored
      Commit 43b18595 ("btrfs: qgroup: Use separate meta reservation type
      for delalloc") merged into mainline is not the latest version submitted
      to mail list in Dec 2017.
      
      It has a fatal wrong @qgroup_free parameter, which results increasing
      qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT.
      
      Fix it by applying the correct diff on top of current branch.
      
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      336a8bb8
    • Qu Wenruo's avatar
      btrfs: delayed-inode: Remove wrong qgroup meta reservation calls · f218ea6c
      Qu Wenruo authored
      Commit 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for
      delayed inode and item") merged into mainline was not latest version
      submitted to the mail list in Dec 2017.
      
      Which lacks the following fixes:
      
      1) Remove btrfs_qgroup_convert_reserved_meta() call in
         btrfs_delayed_item_release_metadata()
      2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
         btrfs_delayed_inode_reserve_metadata()
      
      Those fixes will resolve unexpected EDQUOT problems.
      
      Fixes: 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f218ea6c
    • Qu Wenruo's avatar
      btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e
      Qu Wenruo authored
      Unlike reservation calculation used in inode rsv for metadata, qgroup
      doesn't really need to care about things like csum size or extent usage
      for the whole tree COW.
      
      Qgroups care more about net change of the extent usage.
      That's to say, if we're going to insert one file extent, it will mostly
      find its place in COWed tree block, leaving no change in extent usage.
      Or causing a leaf split, resulting in one new net extent and increasing
      qgroup number by nodesize.
      Or in an even more rare case, increase the tree level, increasing qgroup
      number by 2 * nodesize.
      
      So here instead of using the complicated calculation for extent
      allocator, which cares more about accuracy and no error, qgroup doesn't
      need that over-estimated reservation.
      
      This patch will maintain 2 new members in btrfs_block_rsv structure for
      qgroup, using much smaller calculation for qgroup rsv, reducing false
      EDQUOT.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      ff6bc37e
    • Qu Wenruo's avatar
      btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo authored
      Unlike previous method that tries to commit transaction inside
      qgroup_reserve(), this time we will try to commit transaction using
      fs_info->transaction_kthread to avoid nested transaction and no need to
      worry about locking context.
      
      Since it's an asynchronous function call and we won't wait for
      transaction commit, unlike previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch will use the ratio and size of qgroup meta_pertrans
      reservation as indicator to check if we should trigger a transaction
      commit.  (meta_prealloc won't be cleaned in transaction committ, it's
      useless anyway)
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a514d638
  4. 13 Apr, 2018 1 commit
    • Qu Wenruo's avatar
      btrfs: Only check first key for committed tree blocks · 5d41be6f
      Qu Wenruo authored
      When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
      kernel warning due to first key verification:
      
      [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.523830] Modules linked in:
      [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.527101] Call Trace:
      [ 4239.527251]  read_tree_block+0x42/0x70
      [ 4239.527434]  read_node_slot+0xd2/0x110
      [ 4239.527632]  push_leaf_right+0xad/0x1b0
      [ 4239.527809]  split_leaf+0x4ea/0x700
      [ 4239.527988]  ? leaf_space_used+0xbc/0xe0
      [ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
      [ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
      [ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
      [ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
      [ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
      [ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
      [ 4239.529445]  ? start_transaction+0xa8/0x4f0
      [ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
      [ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
      [ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
      [ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
      [ 4239.532098]  generic_file_direct_write+0xcb/0x170
      [ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
      [ 4239.532491]  aio_write+0xe2/0x180
      [ 4239.532669]  ? lock_acquire+0xac/0x1e0
      [ 4239.532839]  ? __might_fault+0x3e/0x90
      [ 4239.533032]  do_io_submit+0x594/0x860
      [ 4239.533223]  ? do_io_submit+0x594/0x860
      [ 4239.533398]  SyS_io_submit+0x10/0x20
      [ 4239.533560]  ? SyS_io_submit+0x10/0x20
      [ 4239.533729]  do_syscall_64+0x75/0x1d0
      [ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [ 4239.534182] RIP: 0033:0x7f8519741697
      
      The problem here is, at btree_read_extent_buffer_pages() we don't have
      acquired read/write lock on that extent buffer, only basic info like
      level/bytenr is reliable.
      
      So race condition leads to such false alert.
      
      However in current call site, it's impossible to acquire proper lock
      without race window.
      To fix the problem, we only verify first key for committed tree blocks
      (whose generation is no larger than fs_info->last_trans_committed), so
      the content of such tree blocks will not change and there is no need to
      get read/write lock.
      Reported-by: default avatarNikolay Borisov <nborisov@suse.com>
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d41be6f
  5. 12 Apr, 2018 5 commits
    • David Sterba's avatar
      btrfs: add SPDX header to Kconfig · 852eb3ae
      David Sterba authored
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      852eb3ae
    • David Sterba's avatar
      btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1d7c514
    • David Sterba's avatar
      btrfs: replace GPL boilerplate by SPDX -- headers · 9888c340
      David Sterba authored
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      
      Unify the include protection macros to match the file names.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9888c340
    • Filipe Manana's avatar
      Btrfs: fix loss of prealloc extents past i_size after fsync log replay · 471d557a
      Filipe Manana authored
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour happens since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However reverting that behaviour is not enough, because
      we can not simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fallback to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      471d557a
    • Liu Bo's avatar
      Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo authored
      Currently if some fatal errors occur, like all IO get -EIO, resources
      would be cleaned up when
      a) transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left alone after transaction
      gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaciton(), there're several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, then only trans hanlde is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      all resources won't stay in memory after umount.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af722733
  6. 05 Apr, 2018 3 commits
  7. 31 Mar, 2018 18 commits
    • David Sterba's avatar
      btrfs: lift errors from add_extent_changeset to the callers · 57599c7e
      David Sterba authored
      The missing error handling in add_extent_changeset was hidden, so make
      it at least visible in the callers.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57599c7e
    • Liu Bo's avatar
      Btrfs: print error messages when failing to read trees · f50f4353
      Liu Bo authored
      When mount fails to read trees like fs tree, checksum tree, extent
      tree, etc, there is not enough information about where went wrong.
      
      With this, messages like
      
      "BTRFS warning (device sdf): failed to read root (objectid=7): -5"
      
      would help us a bit.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f50f4353
    • David Sterba's avatar
      btrfs: user proper type for btrfs_mask_flags flags · 38e82de8
      David Sterba authored
      All users pass a local unsigned int and not the __uXX types that are
      supposed to be used for userspace interfaces.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38e82de8
    • David Sterba's avatar
      btrfs: split dev-replace locking helpers for read and write · 7e79cb86
      David Sterba authored
      The current calls are unclear in what way btrfs_dev_replace_lock takes
      the locks, so drop the argument, split the helpers and use similar
      naming as for read and write locks.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e79cb86
    • David Sterba's avatar
      btrfs: remove stale comments about fs_mutex · e7ab0af6
      David Sterba authored
      The fs_mutex has been killed in 2008, a2135011 ("Btrfs: Replace
      the big fs_mutex with a collection of other locks"), still remembered in
      some comments.
      
      We don't have any extra needs for locking in the ACL handlers.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7ab0af6
    • David Sterba's avatar
      btrfs: use RCU in btrfs_show_devname for device list traversal · 88c14590
      David Sterba authored
      The show_devname callback is used to print device name in
      /proc/self/mounts, we need to traverse the device list consistently and
      read the name that's copied to a seq buffer so we don't need further
      locking.
      
      If the first device is being deleted at the same time, the RCU will
      allow us to read the device name, though it will become stale right
      after the RCU protection ends. This is unavoidable and the user can
      expect that the device will disappear from the filesystem's list at some
      point.
      
      The device_list_mutex was pretty heavy as it is used eg. for writing
      superblock and a few other IO related contexts. This can stall any
      application that reads the proc file for no reason.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88c14590
    • David Sterba's avatar
      btrfs: update barrier in should_cow_block · d1980131
      David Sterba authored
      Once there was a simple int force_cow that was used with the plain
      barriers, and then converted to a bit, so we should use the appropriate
      barrier helper.
      
      Other variables in the complex if condition do not depend on a barrier,
      so we should be fine in case the atomic barrier becomes a no-op.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1980131
    • David Sterba's avatar
      btrfs: use lockdep_assert_held for mutexes · a32bf9a3
      David Sterba authored
      Using lockdep_assert_held is preferred, replace mutex_is_locked.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a32bf9a3
    • David Sterba's avatar
      btrfs: use lockdep_assert_held for spinlocks · a4666e68
      David Sterba authored
      Using lockdep_assert_held is preferred, replace assert_spin_locked.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4666e68
    • Qu Wenruo's avatar
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo authored
      We have several reports about node pointer points to incorrect child
      tree blocks, which could have even wrong owner and level but still with
      valid generation and checksum.
      
      Although btrfs check could handle it and print error message like:
      leaf parent key incorrect 60670574592
      
      Kernel doesn't have enough check on this type of corruption correctly.
      At least add such check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      581c1760
    • Qu Wenruo's avatar
      btrfs: tests/qgroup: Fix wrong tree backref level · 3c0efdf0
      Qu Wenruo authored
      The extent tree of the test fs is like the following:
      
       BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
        item 0 key (4096 168 4096) itemoff 3944 itemsize 51
                extent refs 1 gen 1 flags 2
                tree block key (68719476736 0 0) level 1
                                                 ^^^^^^^
                ref#0: tree block backref root 5
      
      And it's using an empty tree for fs tree, so there is no way that its
      level can be 1.
      
      For REAL (created by mkfs) fs tree backref with no skinny metadata, the
      result should look like:
      
       item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
               refs 1 gen 4 flags TREE_BLOCK
               tree block key (256 INODE_ITEM 0) level 0
                                                 ^^^^^^^
               tree block backref root 5
      
      Fix the level to 0, so it won't break later tree level checker.
      
      Fixes: faa2dbf0 ("Btrfs: add sanity tests for new qgroup accounting code")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c0efdf0
    • Filipe Manana's avatar
      Btrfs: fix copy_items() return value when logging an inode · 8434ec46
      Filipe Manana authored
      When logging an inode, at tree-log.c:copy_items(), if we call
      btrfs_next_leaf() at the loop which checks for the need to log holes, we
      need to make sure copy_items() returns the value 1 to its caller and
      not 0 (on success). This is because the path the caller passed was
      released and is now different from what is was before, and the caller
      expects a return value of 0 to mean both success and that the path
      has not changed, while a return value of 1 means both success and
      signals the caller that it can not reuse the path, it has to perform
      another tree search.
      
      Even though this is a case that should not be triggered on normal
      circumstances or very rare at least, its consequences can be very
      unpredictable (especially when replaying a log tree).
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8434ec46
    • Filipe Manana's avatar
      Btrfs: fix fsync after hole punching when using no-holes feature · 4ee3fad3
      Filipe Manana authored
      When we have the no-holes mode enabled and fsync a file after punching a
      hole in it, we can end up not logging the whole hole range in the log tree.
      This happens if the file has extent items that span more than one leaf and
      we punch a hole that covers a range that starts in a leaf but does not go
      beyond the offset of the first extent in the next leaf.
      
      Example:
      
        $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
        $ mount /dev/sdb /mnt
        $ for ((i = 0; i <= 831; i++)); do
      	offset=$((i * 2 * 256 * 1024))
      	xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
      		/mnt/foobar >/dev/null
          done
        $ sync
      
        # We now have 2 leafs in our filesystem fs tree, the first leaf has an
        # item corresponding the extent at file offset 216530944 and the second
        # leaf has a first item corresponding to the extent at offset 217055232.
        # Now we punch a hole that partially covers the range of the extent at
        # offset 216530944 but does go beyond the offset 217055232.
      
        $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        <power fail>
      
        # mount to replay the log
        $ mount /dev/sdb /mnt
      
        # Before this patch, only the subrange [216658016, 216662016[ (length of
        # 4000 bytes) was logged, leaving an incorrect file layout after log
        # replay.
      
      Fix this by checking if there is a hole between the last extent item that
      we processed and the first extent item in the next leaf, and if there is
      one, log an explicit hole extent item.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ee3fad3
    • David Sterba's avatar
      btrfs: use helper to set ulist aux from a qgroup · a1840b50
      David Sterba authored
      We have a nice helper to do proper casting of a qgroup to a ulist aux
      value. And several places that could make use of it.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a1840b50
    • Qu Wenruo's avatar
      Revert "btrfs: qgroups: Retry after commit on getting EDQUOT" · 0b78877a
      Qu Wenruo authored
      This reverts commit 48a89bc4.
      
      The idea to commit transaction and free some space after hitting qgroup
      limit is good, although the problem is it can easily cause deadlocks.
      
      One deadlock example is caused by trying to flush data while still
      holding it:
      
      Call Trace:
       __schedule+0x49d/0x10f0
       schedule+0xc6/0x290
       schedule_timeout+0x187/0x1c0
       wait_for_completion+0x204/0x3a0
       btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
       qgroup_reserve+0x913/0xa10 [btrfs]
       btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
       btrfs_check_data_free_space+0x96/0xd0 [btrfs]
       __btrfs_buffered_write+0x3ac/0xd40 [btrfs]
       btrfs_file_write_iter+0x62a/0xba0 [btrfs]
       __vfs_write+0x320/0x430
       vfs_write+0x107/0x270
       SyS_write+0xbf/0x150
       do_syscall_64+0x1b0/0x3d0
       entry_SYSCALL64_slow_path+0x25/0x25
      
      Another can be caused by trying to commit one transaction while nesting
      with trans handle held by ourselves:
      
      btrfs_start_transaction()
      |- btrfs_qgroup_reserve_meta_pertrans()
         |- qgroup_reserve()
            |- btrfs_join_transaction()
            |- btrfs_commit_transaction()
      
      The retry is causing more problems than exppected when limit is enabled.
      At least a graceful EDQUOT is way better than deadlock.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b78877a
    • Qu Wenruo's avatar
      btrfs: qgroup: Update trace events for metadata reservation · 4ee0d883
      Qu Wenruo authored
      Now trace_qgroup_meta_reserve() will have extra type parameter.
      
      And introduce two new trace events:
      
      1) trace_qgroup_meta_free_all_pertrans()
         For btrfs_qgroup_free_meta_all_pertrans()
      
      2) trace_qgroup_meta_convert()
         For btrfs_qgroup_convert_reserved_meta()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ee0d883
    • Qu Wenruo's avatar
      btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space · 8287475a
      Qu Wenruo authored
      For quota disabled->enable case, it's possible that at reservation time
      quota was not enabled so no bytes were really reserved, while at release
      time, quota was enabled so we will try to release some bytes we didn't
      really own.
      
      Such situation can cause metadata reserveation underflow, for both types,
      also less possible for per-trans type since quota enable will commit
      transaction.
      
      To address this, record qgroup meta reserved bytes into
      root::qgroup_meta_rsv_pertrans and ::prealloc.
      So at releasing time we won't free any bytes we didn't reserve.
      
      For DATA, it's already handled by io_tree, so nothing needs to be done
      there.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8287475a
    • Qu Wenruo's avatar
      btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item · 4f5427cc
      Qu Wenruo authored
      Quite similar for delalloc, some modification to delayed-inode and
      delayed-item reservation.  Also needs extra parameter for release case
      to distinguish normal release and error release.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4f5427cc
  8. 30 Mar, 2018 5 commits
    • Qu Wenruo's avatar
      btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595
      Qu Wenruo authored
      Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
      preallocated meta rsv, making it quite easy to underflow qgroup meta
      reservation.
      
      Since we have the new qgroup meta rsv types, apply it to delalloc
      reservation.
      
      Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
      rsv type.
      
      And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
      they will convert corresponding META_PREALLOC reservation to
      META_PERTRANS.
      
      This is mainly due to the fact that current qgroup numbers will only be
      updated in btrfs_commit_transaction(), that's to say if we don't keep
      such placeholder reservation, we can exceed qgroup limitation.
      
      And for callers freeing outstanding extent in error handler, we will
      just free META_PREALLOC bytes.
      
      This behavior makes callers of btrfs_qgroup_release_meta() or
      btrfs_qgroup_convert_meta() to be aware of which type they are.
      So in this patch, btrfs_delalloc_release_metadata() and its callers get
      an extra parameter to info qgroup to do correct meta convert/release.
      
      The good news is, even we use the wrong type (convert or free), it won't
      cause obvious bug, as prealloc type is always in good shape, and the
      type only affects how per-trans meta is increased or not.
      
      So the worst case will be at most metadata limitation can be sometimes
      exceeded (no convert at all) or metadata limitation is reached too soon
      (no free at all).
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      43b18595
    • Qu Wenruo's avatar
      btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS · 64cfaef6
      Qu Wenruo authored
      For meta_prealloc reservation users, after btrfs_join_transaction()
      caller will modify tree so part (or even all) meta_prealloc reservation
      should be converted to meta_pertrans until transaction commit time.
      
      This patch introduces a new function,
      btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC
      reservation user.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64cfaef6
    • Qu Wenruo's avatar
      btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup · e1211d0e
      Qu Wenruo authored
      Since qgroup has seperate metadata reservation types now, we can
      completely get rid of the old root->qgroup_meta_rsv, which mostly acts
      as current META_PERTRANS reservation type.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1211d0e
    • Qu Wenruo's avatar
      btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans · 733e03a0
      Qu Wenruo authored
      Btrfs uses 2 different methods to reseve metadata qgroup space.
      
      1) Reserve at btrfs_start_transaction() time
         This is quite straightforward, caller will use the trans handler
         allocated to modify b-trees.
      
         In this case, reserved metadata should be kept until qgroup numbers
         are updated.
      
      2) Reserve by using block_rsv first, and later btrfs_join_transaction()
         This is more complicated, caller will reserve space using block_rsv
         first, and then later call btrfs_join_transaction() to get a trans
         handle.
      
         In this case, before we modify trees, the reserved space can be
         modified on demand, and after btrfs_join_transaction(), such reserved
         space should also be kept until qgroup numbers are updated.
      
      Since these two types behave differently, split the original "META"
      reservation type into 2 sub-types:
      
        META_PERTRANS:
          For above case 1)
      
        META_PREALLOC:
          For reservations that happened before btrfs_join_transaction() of
          case 2)
      
      NOTE: This patch will only convert existing qgroup meta reservation
      callers according to its situation, not ensuring all callers are at
      correct timing.
      Such fix will be added in later patches.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [ update comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      733e03a0
    • Qu Wenruo's avatar
      btrfs: qgroup: Cleanup the remaining old reservation counters · 5c40507f
      Qu Wenruo authored
      So qgroup is switched to new separate types reservation system.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c40507f