1. 06 Sep, 2016 1 commit
    • btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress · ce129655
      Wang Xiaoguang authored
      In btrfs_async_reclaim_metadata_space(), we use ticket's address to
      determine whether asynchronous metadata reclaim work is making progress.
      
      	ticket = list_first_entry(&space_info->tickets,
      				  struct reserve_ticket, list);
      	if (last_ticket == ticket) {
      		flush_state++;
      	} else {
      		last_ticket = ticket;
      		flush_state = FLUSH_DELAYED_ITEMS_NR;
      		if (commit_cycles)
      			commit_cycles--;
      	}
      
      This is wrong: we should not rely on the address of a local variable
      for this check, because successive tickets can have the same address.
      In my test environment, I used dd to write one 168MB file to a 256MB
      fs and found that, for this file, the local variable ticket had the
      same address every time wait_reserve_ticket() was called.
      
      For the code above, assume a previous ticket's address is addrA, so
      last_ticket is addrA. btrfs_async_reclaim_metadata_space() finishes
      this ticket and wakes up its waiter, then another ticket is added with
      the same address addrA. Now last_ticket equals the current ticket, so
      the current ticket's flush work starts from the current flush_state
      rather than the initial FLUSH_DELAYED_ITEMS_NR, which may result in
      ENOSPC issues (I have seen this on my test machine).
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
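
The progress check described in this commit can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the struct names, the `last_tickets_id` field and the helper functions are hypothetical stand-ins, and only the idea of a `tickets_id` counter comes from the commit title. The counter changes whenever a ticket is satisfied, so a reused address can no longer be mistaken for "no progress".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum { FLUSH_DELAYED_ITEMS_NR = 1 };

struct space_info_sketch {
	uint64_t tickets_id;      /* bumped every time a ticket is satisfied */
};

struct reclaim_state {
	uint64_t last_tickets_id; /* value seen on the previous pass */
	int flush_state;
};

/* Called when a ticket is completed and its waiter is woken. */
static void ticket_done(struct space_info_sketch *si)
{
	si->tickets_id++;
}

/* One iteration of the progress check in the reclaim loop. */
static void reclaim_step(struct reclaim_state *rs,
			 const struct space_info_sketch *si)
{
	assert(rs->flush_state >= FLUSH_DELAYED_ITEMS_NR);
	if (rs->last_tickets_id == si->tickets_id) {
		/* No ticket finished since the last pass: escalate. */
		rs->flush_state++;
	} else {
		/* Progress was made: restart from the cheapest state. */
		rs->last_tickets_id = si->tickets_id;
		rs->flush_state = FLUSH_DELAYED_ITEMS_NR;
	}
}
```

Unlike a pointer comparison, the counter comparison does not depend on where the next ticket happens to be allocated.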
  2. 05 Sep, 2016 1 commit
  3. 02 Sep, 2016 1 commit
  4. 01 Sep, 2016 3 commits
  5. 25 Aug, 2016 20 commits
    • Btrfs: fix lockdep warning on deadlock against an inode's log mutex · 28a23593
      Filipe Manana authored
      Commit 44f714da ("Btrfs: improve performance on fsync against new
      inode after rename/unlink"), which landed in 4.8-rc2, introduced a
      possibility for a deadlock due to double locking of an inode's log mutex
      by the same task, which lockdep reports with:
      
      [23045.433975] =============================================
      [23045.434748] [ INFO: possible recursive locking detected ]
      [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
      [23045.436044] ---------------------------------------------
      [23045.436044] xfs_io/3688 is trying to acquire lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     but task is already holding lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     other info that might help us debug this:
      [23045.436044]  Possible unsafe locking scenario:
      
      [23045.436044]        CPU0
      [23045.436044]        ----
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]
                      *** DEADLOCK ***
      
      [23045.436044]  May be due to missing lock nesting notation
      
      [23045.436044] 3 locks held by xfs_io/3688:
      [23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
      [23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
      [23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     stack backtrace:
      [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
      [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
      [23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
      [23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
      [23045.436044] Call Trace:
      [23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
      [23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
      [23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
      [23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
      [23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
      [23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
      [23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
      [23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
      [23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
      [23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
      [23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
      [23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
      [23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
      [23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
      [23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa
      
      An example reproducer for this is:
      
         $ mkfs.btrfs -f /dev/sdb
         $ mount /dev/sdb /mnt
         $ mkdir /mnt/dir
         $ touch /mnt/dir/foo
         $ sync
         $ mv /mnt/dir/foo /mnt/dir/bar
         $ touch /mnt/dir/foo
         $ xfs_io -c "fsync" /mnt/dir/bar
      
      This is because while logging the inode of file bar we end up logging its
      parent directory (since its inode has an unlink_trans field matching the
      current transaction id due to the rename operation), which in turn logs
      the inodes for all its new dentries, so that the new inode for the new
      file named foo gets logged which in turn triggered another logging attempt
      for the inode we are fsync'ing, since that inode had an old name that
      corresponds to the name of the new inode.
      
      So fix this by ensuring that when logging the inode for a new dentry that
      has a name matching an old name of some other inode, we don't log again
      the original inode that we are fsync'ing.
      
      Fixes: 44f714da ("Btrfs: improve performance on fsync against new inode after rename/unlink")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: detect corruption when non-root leaf has zero item · 1ba98d08
      Liu Bo authored
      Right now we treat a leaf with zero items as valid, because we
      could have an empty tree, that is, a root that is also a leaf
      without any items. However, when such a leaf is not a root, we
      can end up hitting the BUG_ON(1) in btrfs_extend_item() called by
      setup_inline_extent_backref().
      
      This makes us treat the situation as corruption if the leaf is
      not its own root.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check btree node's nritems · 053ab70f
      Liu Bo authored
      When a btree node (level = 1) has nritems equal to zero,
      we can end up with a panic due to insert_ptr()'s
      
      BUG_ON(slot > nritems);
      
      where slot is 1 and nritems is 0, as copy_for_split() calls
      insert_ptr(.., path->slots[1] + 1, ...);
      
      An invalid value results in the whole mess, so this adds a check
      for the btree node's nritems so that we stop reading the block
      when something is wrong.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
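
The two nritems checks described in this commit and the previous one (zero-item non-root leaf, zero-item node) might be summarised by a predicate like the one below. This is a hedged sketch: `struct header_sketch` and `header_looks_sane()` are hypothetical names, and the real validation in btrfs looks at more fields than shown here.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a tree block header. */
struct header_sketch {
	unsigned int level;    /* 0 for a leaf, >0 for an internal node */
	unsigned int nritems;  /* number of items or key pointers */
	bool is_root;          /* block is the root of its tree */
};

/* Returns false for the empty-block cases the commits call corruption. */
static bool header_looks_sane(const struct header_sketch *h)
{
	if (h->nritems > 0)
		return true;
	/* An empty block is only legitimate as the root leaf of an empty tree. */
	return h->level == 0 && h->is_root;
}
```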
    • btrfs: don't create or leak aliased root while cleaning up orphans · 35bbb97f
      Jeff Mahoney authored
      commit 909c3a22 (Btrfs: fix loading of orphan roots leading to BUG_ON)
      avoids the BUG_ON but can add an aliased root to the dead_roots list or
      leak the root.
      
      Since we've already been loading roots into the radix tree, we should
      use it before looking the root up on disk.
      
      Cc: <stable@vger.kernel.org> # 4.5
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix em leak in find_first_block_group · 187ee58c
      Josef Bacik authored
      We need to call free_extent_map() on the em we look up.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: do not background blkdev_put() · 14238819
      Anand Jain authored
      At the end of unmount/dev-delete, if the device's exclusive open is
      not actually closed, there might be a race with another userland
      program that is trying to open the device in exclusive mode, and
      that open may fail, e.g.:
            umount /btrfs; fsck /dev/x
            btrfs dev del /dev/x /btrfs; fsck /dev/x
      so running blkdev_put() in the background is not an option here.
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: clarify do_chunk_alloc()'s return value · 28b737f6
      Liu Bo authored
      Function start_transaction() can return ERR_PTR(1) when flush is
      BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
      
      start_transaction (return ERR_PTR(1))
        -> btrfs_block_rsv_add (return 1)
           -> reserve_metadata_bytes (return 1)
              -> flush_space (return 1)
                 -> do_chunk_alloc  (return 1)
      
      With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
      flush_state of ALLOC_CHUNK and it successfully allocates a new
      chunk, then instead of trying to reserve space again,
      reserve_metadata_bytes returns 1 immediately.
      
      Callers of start_transaction() usually just do the IS_ERR() check,
      which ERR_PTR(1) passes, and then panic when dereferencing a
      pointer that is ERR_PTR(1).
      
      The following patch fixes the above problem.
      "btrfs: flush_space: treat return value of do_chunk_alloc properly"
      https://patchwork.kernel.org/patch/7778651/
      
      This adds comments to clarify do_chunk_alloc()'s return value.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
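
To see why ERR_PTR(1) slips past an IS_ERR() check, here is a userspace re-creation of the two helpers, modelled on the kernel's include/linux/err.h. The lowercase names `err_ptr()` and `is_err()` are mine, and the sketch assumes the usual convention that only the top MAX_ERRNO addresses encode errors.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-in for ERR_PTR(): encode an error code as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Userspace stand-in for IS_ERR(): true only for -MAX_ERRNO..-1. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

A negative errno such as -28 (ENOSPC on Linux) lands in the top 4095 addresses and is detected, while a positive 1 becomes the pointer value 0x1, which looks like an ordinary (but invalid) pointer to the caller.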
    • btrfs: fix fsfreeze hang caused by delayed iputs deal · 9e7cc91a
      Wang Xiaoguang authored
      When running fstests generic/068, we sometimes got the deadlock below:
        xfs_io          D ffff8800331dbb20     0  6697   6693 0x00000080
        ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
        ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
        ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
        Call Trace:
        [<ffffffff816a9045>] schedule+0x35/0x80
        [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
        [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
        [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
        [<ffffffff810d32b5>] percpu_down_read+0x35/0x50
        [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
        [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
        [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
        [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
        [<ffffffff81230a1a>] evict+0xba/0x1a0
        [<ffffffff812316b6>] iput+0x196/0x200
        [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
        [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
        [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
        [<ffffffff81218040>] freeze_super+0xf0/0x190
        [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
        [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
        [<ffffffff81229409>] SyS_ioctl+0x79/0x90
        [<ffffffff81003c12>] do_syscall_64+0x62/0x110
        [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25
      
      From this trace, freeze_super() already holds SB_FREEZE_FS, but
      btrfs_freeze() will call btrfs_commit_transaction() again. If
      btrfs_commit_transaction() finds that it has delayed iputs to handle,
      it will call start_transaction(), which will try to take the
      SB_FREEZE_FS lock again, and a deadlock occurs.
      
      The root cause is that in btrfs, sync_filesystem(sb) does not make
      sure all metadata is updated. Some code may still be adding
      delayed iputs; see the sample race window below:
      
               CPU1                                  |         CPU2
      |-> freeze_super()                             |
          |-> sync_filesystem(sb);                   |
          |                                          |-> cleaner_kthread()
          |                                          |   |-> btrfs_delete_unused_bgs()
          |                                          |       |-> btrfs_remove_chunk()
          |                                          |           |-> btrfs_remove_block_group()
          |                                          |               |-> btrfs_add_delayed_iput()
          |                                          |
          |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
          |-> sb_wait_write(sb, SB_FREEZE_FS);       |
          |   acquire SB_FREEZE_FS lock.             |
          |                                          |
          |-> btrfs_freeze()                         |
              |-> btrfs_commit_transaction()         |
                  |-> btrfs_run_delayed_iputs()      |
                  |   will handle delayed iputs,     |
                  |   that means start_transaction() |
                  |   will be called, which will try |
                  |   to get SB_FREEZE_FS lock.      |
      
      To fix this issue, introduce an "int fs_frozen" to record internally whether
      the fs has been frozen. If the fs has been frozen, we cannot handle delayed iputs.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to btrfs_freeze ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
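
The guard described in this commit can be pictured with a small model. Everything here except the `fs_frozen` name is a hypothetical stand-in: the struct, the counters and both functions only mimic the shape of the commit path, on the assumption that running a delayed iput may need to start a transaction.

```c
/* Hypothetical model of the state relevant to the freeze deadlock. */
struct fs_info_sketch {
	int fs_frozen;             /* set on freeze, cleared on unfreeze */
	int pending_iputs;         /* delayed iputs waiting to be run */
	int transactions_started;  /* how often we would take SB_FREEZE_FS */
};

/* Each delayed iput may start a transaction, which blocks on a frozen fs. */
static void run_delayed_iputs(struct fs_info_sketch *fs)
{
	while (fs->pending_iputs > 0) {
		fs->pending_iputs--;
		fs->transactions_started++;
	}
}

/* Commit path: skip the delayed iputs entirely while frozen. */
static void commit_transaction(struct fs_info_sketch *fs)
{
	if (!fs->fs_frozen)
		run_delayed_iputs(fs);
}
```

In this model the pending iputs simply stay queued while the filesystem is frozen and are handled by a later commit after unfreeze.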
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang authored
      This patch fixes some false ENOSPC errors. The test script below
      reproduces one of them:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The above script fails with ENOSPC, but the fs actually still has free
      space to satisfy this request. See the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      On CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc().
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data
      and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of
      data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk,
      call btrfs_start_delalloc_roots(), or commit the current transaction in
      order to reserve some free space, which is obviously a lot of work. It
      is not necessary, though: if bytes_may_use were decreased by 128M in
      a timely manner, we would still have free space.
      
      To fix this issue, this patch updates bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For the compress path,
      the real extent length may not equal the file content length, so
      introduce a ram_bytes argument for btrfs_reserve_extent(),
      find_free_extent() and btrfs_add_reserved_bytes(), because
      bytes_may_use is increased by the file content length. Then the
      compress path can update bytes_may_use correctly. We can now also
      discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.
      
      As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
      run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked
      as PREALLOC, we also need to update bytes_may_use, but cannot pass
      EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So
      here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both the successful and
      failed paths, so btrfs_prealloc_file_range()'s callers do not need to call
      btrfs_free_reserved_data_space() any more.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
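
The double counting behind the false ENOSPC can be reproduced with a toy accounting model. This is a sketch, not the kernel code: the struct is a reduced, hypothetical version of btrfs_space_info, the helper names are made up, and a 100 MiB data space is assumed purely for the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical, reduced version of the space accounting fields. */
struct space_info_sketch {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

static bool overcommitted(const struct space_info_sketch *si)
{
	return si->bytes_used + si->bytes_reserved + si->bytes_may_use >
	       si->total_bytes;
}

/* Old behaviour: the earlier bytes_may_use estimate is left in place. */
static void add_reserved_old(struct space_info_sketch *si, uint64_t len)
{
	si->bytes_reserved += len;
}

/* New behaviour: the reservation replaces the estimate immediately. */
static void add_reserved_new(struct space_info_sketch *si,
			     uint64_t ram_bytes, uint64_t len)
{
	assert(si->bytes_may_use >= ram_bytes);
	si->bytes_reserved += len;
	si->bytes_may_use -= ram_bytes;
}
```

With a 64 MiB estimate followed by a 64 MiB reservation, the old scheme counts 128 MiB against the assumed 100 MiB space, while the new scheme counts 64 MiB.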
    • btrfs: divide btrfs_update_reserved_bytes() into two functions · 4824f1f4
      Wang Xiaoguang authored
      This patch divides btrfs_update_reserved_bytes() into
      btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(). The
      next patch will extend btrfs_add_reserved_bytes() to fix some
      false ENOSPC errors; please see that patch for details.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() · dcb40c19
      Wang Xiaoguang authored
      In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
      the wrong file offset for reloc_inode: it uses cluster->start and
      cluster->end, which are actually the extents' bytenr. The correct value
      is cluster->[start|end] minus the block group's start bytenr.
      
      start bytenr   cluster->start
      |              |     extent      |   extent   | ...| extent |
      |----------------------------------------------------------------|
      |                block group reloc_inode                         |
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
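
The offset translation described in this commit is a single subtraction. A minimal sketch, with a hypothetical helper name and plain integers instead of the kernel's cluster and block group structures:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate an extent bytenr (logical address) into a file offset inside
 * the relocation inode, which maps the block group starting at offset 0.
 */
static uint64_t reloc_file_offset(uint64_t bytenr, uint64_t block_group_start)
{
	assert(bytenr >= block_group_start);
	return bytenr - block_group_start;
}
```

For example, an extent that starts 4096 bytes into its block group maps to file offset 4096 in reloc_inode, whatever the block group's own bytenr is.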
    • btrfs: qgroup: Fix qgroup incorrectness caused by log replay · df2c95f3
      Qu Wenruo authored
      When doing log replay at mount time (after power loss), qgroup will leak
      the numbers of replayed data extents.
      
      The cause is almost the same as for balance.
      So fix it by manually informing qgroup of extents whose owner changed.
      
      The bug can be detected by btrfs/119 test case.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: relocation: Fix leaking qgroups numbers on data extents · 62b99540
      Qu Wenruo authored
      This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
      rework.
      
      When balancing data extents, qgroup will leak all its numbers for
      relocated data extents.
      
      The relocation is done in the following steps for data extents:
      1) Create data reloc tree and inode
      2) Copy all data extents to data reloc tree
         And commit transaction
      3) Create tree reloc tree(special snapshot) for any related subvolumes
      4) Replace file extent in tree reloc tree with new extents in data reloc
         tree
         And commit transaction
      5) Merge tree reloc tree with original fs, by swapping tree blocks
      
      For 1)~4), since the tree reloc tree and data reloc tree don't count
      toward qgroup, everything is OK.
      
      But for 5), the swapping of tree blocks will only inform qgroup to track
      metadata extents.
      
      If metadata extents contain file extents, the qgroup numbers for file
      extents will get lost, leading to corrupted qgroup accounting.
      
      The fix is, before the transaction commit of step 5), to manually inform
      qgroup to track all file extents in the data reloc tree.
      At transaction commit time the tree swapping is done, so qgroup
      will account these data extents correctly.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Reported-by: Mark Fasheh <mfasheh@suse.de>
      Reported-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo authored
      Refactor btrfs_qgroup_insert_dirty_extent() into two functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same as the original code.
         For delayed_ref usage, which has delayed refs locked.
      
         Change the return value type to int, since the caller never needs the
         pointer, but only needs to know whether it must free the allocated
         memory.
      
      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.
      
         Does the delayed_refs locking, memory allocation, quota enabled check
         and other things.
      
      The original design was to keep exported functions to a minimum, but since
      more btrfs hacks have been exposed, like replacing path in balance, we need
      to record dirty extents manually, so we have to add such functions.
      
      Also, add comments for both functions, to inform developers how to keep
      qgroup correct when doing hacks.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: waiting on qgroup rescan should not always be interruptible · d06f23d6
      Jeff Mahoney authored
      We wait on qgroup rescan completion in three places: file system
      shutdown, the quota disable ioctl, and the rescan wait ioctl.  If the
      user sends a signal while we're waiting, we continue happily along.  This
      is expected behavior for the rescan wait ioctl.  It's racy in the shutdown
      path but mostly works due to other unrelated synchronization points.
      In the quota disable path, it Oopses the kernel pretty much immediately.
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: properly track when rescan worker is running · d2c609b8
      Jeff Mahoney authored
      The qgroup_flags field is overloaded such that it reflects the on-disk
      status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
      flag is used to indicate that a rescan operation is in progress, but if
      the file system is unmounted while a rescan is running, the rescan
      operation is paused.  If the file system is then mounted read-only,
      the flag will still be present but the rescan operation will not have
      been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
      will see the flag and interpret it to mean that the rescan worker is
      still running and will wait for a completion that will never come.
      
      This patch uses a separate flag to indicate when the worker is
      running.  The locking and state surrounding the qgroup rescan worker
      needs a lot of attention beyond this patch but this is enough to
      avoid a hung umount.
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
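
The difference between the persistent flag and the runtime state described in this commit can be sketched as below. The struct, the flag constant and both predicates are hypothetical illustrations; only the idea of a separate "worker is running" indicator comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for BTRFS_QGROUP_STATUS_FLAG_RESCAN. */
#define STATUS_FLAG_RESCAN_SKETCH (1u << 1)

struct qgroup_state_sketch {
	unsigned int on_disk_flags; /* persisted; survives umount/mount */
	bool rescan_running;        /* runtime only; true while the worker runs */
};

/* Old decision: trust the persisted flag. */
static bool must_wait_old(const struct qgroup_state_sketch *s)
{
	return (s->on_disk_flags & STATUS_FLAG_RESCAN_SKETCH) != 0;
}

/* New decision: wait only if a worker is actually running. */
static bool must_wait_new(const struct qgroup_state_sketch *s)
{
	return s->rescan_running;
}
```

After a read-only mount with a paused rescan, the persisted flag is set but no worker exists, so the old predicate would wait forever while the new one returns immediately.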
    • btrfs: flush_space: treat return value of do_chunk_alloc properly · eecba891
      Alex Lyakas authored
      do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
      But flush_space will not convert this to 0, and will also return 1.
      As a result, reserve_metadata_bytes will think that flush_space failed,
      and may return this value "1" to the caller (depending on how
      reserve_metadata_bytes was called). The caller will also treat this as an error.
      For example, btrfs_block_rsv_refill does:
      
      int ret = -ENOSPC;
      ...
      ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
      if (!ret) {
              block_rsv_add_bytes(block_rsv, num_bytes, 0);
              return 0;
      }
      
      return ret;
      
      So it will return -ENOSPC.
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
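
One way to picture the fix is a wrapper that folds the "chunk allocated" result into plain success before it propagates. This is a sketch under assumptions: the function-pointer interface and all names are hypothetical, and the real flush_space() may normalise other return values as well.

```c
#include <errno.h>

/* Stand-in allocator contract: <0 error, 0 nothing done, 1 chunk allocated. */
typedef int (*chunk_alloc_fn)(void);

static int chunk_allocated(void) { return 1; }
static int chunk_failed(void) { return -ENOMEM; }

/* Sketch of the ALLOC_CHUNK step: a positive result means success. */
static int flush_space_alloc_chunk(chunk_alloc_fn alloc)
{
	int ret = alloc();

	if (ret > 0)
		ret = 0;
	return ret;
}
```

Callers that test `if (!ret)` then see success for an allocated chunk, and real errors still come back as negative errno values.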
    • Btrfs: add ASSERT for block group's memory leak · f3bca802
      Liu Bo authored
      This adds several ASSERT()s to report memory leaks of the block group cache.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: backref: Fix soft lockup in __merge_refs function · d8422ba3
      Qu Wenruo authored
      When over 1000 file extents refer to one extent, find_parent_nodes()
      will be noticeably slow, due to the O(n^2)~O(n^3) loops inside
      __merge_refs().
      
      The following ftrace shows the cubic growth of execution time:
      
      256 refs
       5) + 91.768 us   |  __add_keyed_refs.isra.12 [btrfs]();
       5)   1.447 us    |  __add_missing_keys.isra.13 [btrfs]();
       5) ! 114.544 us  |  __merge_refs [btrfs]();
       5) ! 136.399 us  |  __merge_refs [btrfs]();
      
      512 refs
       6) ! 279.859 us  |  __add_keyed_refs.isra.12 [btrfs]();
       6)   3.164 us    |  __add_missing_keys.isra.13 [btrfs]();
       6) ! 442.498 us  |  __merge_refs [btrfs]();
       6) # 2091.073 us |  __merge_refs [btrfs]();
      
      and 1024 refs
       7) ! 368.683 us  |  __add_keyed_refs.isra.12 [btrfs]();
       7)   4.810 us    |  __add_missing_keys.isra.13 [btrfs]();
       7) # 2043.428 us |  __merge_refs [btrfs]();
       7) * 18964.23 us |  __merge_refs [btrfs]();
      
      And sort them into the following chart:
      (Unit: us)
      ------------------------------------------------------------------------
       Trace function        | 256 refs       | 512 refs      | 1024 refs    |
      ------------------------------------------------------------------------
       __add_keyed_refs      | 91             | 279           | 368          |
       __add_missing_keys    | 1              | 3             | 4            |
       __merge_refs 1st call | 114            | 442           | 2043         |
       __merge_refs 2nd call | 136            | 2091          | 18964        |
      ------------------------------------------------------------------------
      
      We can see that __add_keyed_refs() grows almost linearly.
      And __add_missing_keys() in this case doesn't change much or take much
      time.
      
      The 1st __merge_refs() call shows quadratic growth, while
      the 2nd __merge_refs() call shows cubic growth.
      
      There is no doubt that __merge_refs() will take a very long time to
      execute if the number of refs continues to grow.
      
      So add a cond_resched() into the loop of __merge_refs().
      
      Although this will solve the problem of soft lockup, we need to use the
      new rb_tree based structure introduced by Lu Fengqi to really solve the
      long execution time.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d8422ba3
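The idea behind the __merge_refs() change above can be sketched in user space. This is a minimal model, not the kernel code: `struct ref` and `merge_refs_model()` are hypothetical names, the merge rule (same key folds into the first entry) is an assumption chosen for illustration, and `sched_yield()` stands in for the kernel's cond_resched(). It shows only that a quadratic pairwise loop can offer the scheduler a chance to run other work without changing the result.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a backref entry; not a kernel structure. */
struct ref {
	unsigned long key;
	int count;
	int merged;	/* 1 once folded into an earlier entry */
};

/*
 * Quadratic pairwise merge: fold every later entry with the same key
 * into the first one. Returns the number of surviving entries.
 */
static size_t merge_refs_model(struct ref *refs, size_t n)
{
	size_t live = 0;

	assert(refs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (refs[i].merged)
			continue;
		live++;
		for (size_t j = i + 1; j < n; j++) {
			if (!refs[j].merged && refs[j].key == refs[i].key) {
				refs[i].count += refs[j].count;
				refs[j].merged = 1;
			}
		}
		/* User-space analogue of cond_resched(): yield once per outer pass. */
		sched_yield();
	}
	return live;
}
```

As the commit message notes, yielding only avoids the soft lockup; the loop is still quadratic, so the total execution time is unchanged.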
    • Liu Bo's avatar
      Btrfs: fix memory leak of reloc_root · 1c1ea4f7
      Liu Bo authored
      When some critical errors occur and FS would be flipped into RO,
      if we have an on-going balance, we can end up with a memory leak
      of root->reloc_root since btrfs_drop_snapshots() bails out
      without freeing reloc_root at the very early start.
      
      However, we're not able to free reloc_root in btrfs_drop_snapshots()
      because its caller, merge_reloc_roots(), still needs to access it to
      cleanup reloc_root's rbtree.
      
      This makes us free reloc_root when we're going to free fs/file roots.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1c1ea4f7
  6. 05 Aug, 2016 1 commit
  7. 03 Aug, 2016 2 commits
  8. 01 Aug, 2016 11 commits
    • Filipe Manana's avatar
      Btrfs: improve performance on fsync against new inode after rename/unlink · 44f714da
      Filipe Manana authored
      With commit 56f23fdb ("Btrfs: fix file/data loss caused by fsync after
      rename and new inode") we got a simple fix for a functional issue when the
      following sequence of actions is done:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        move/rename existing file A from directory D to directory E
        create a new file named A at directory D
        fsync the new file
        power fail
      
      The solution was to simply detect such a scenario and fall back to a full
      transaction commit when we detect it. However this turned out to have a
      significant impact on throughput (and a bit on latency too) for benchmarks
      using the dbench tool, which simulates real workloads from smbd (Samba)
      servers. For example on a test vm (with a debug kernel):
      
      Unpatched:
      Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms
      
      Patched:
      Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms
      
      The patched results (this patch is applied) are similar to the results of
      a kernel with the commit 56f23fdb ("Btrfs: fix file/data loss caused
      by fsync after rename and new inode") reverted.
      
      This change avoids the fallback to a transaction commit and instead makes
      sure all the names of the conflicting inode (the one that had a name in a
      past transaction that matches the name of the new file in the same parent
      directory) are logged so that at log replay time we lose neither the
      new file nor the old file, and the old file gets the name it was renamed
      to.
      
      This also ends up avoiding a full transaction commit for a similar case
      that involves an unlink instead of a rename of the old file:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        remove file A
        create a new file named A at directory D
        fsync the new file
        power fail
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      44f714da
    • Filipe Manana's avatar
      Btrfs: be more precise on errors when getting an inode from disk · 67710892
      Filipe Manana authored
      When we attempt to read an inode from disk, we end up always returning an
      -ESTALE error to the caller regardless of the actual failure reason, which
      can be an out of memory problem (when allocating a path), some error found
      when reading from the fs/subvolume btree (like a genuine IO error) or the
      inode does not exist. So let's start returning the real error code to the
      callers so that they don't treat all -ESTALE errors as meaning that the
      inode does not exist (such as during orphan cleanup). This will also be
      needed for a subsequent patch in the same series dealing with a special
      fsync case.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      67710892
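The difference between collapsing every failure into -ESTALE and returning the real cause can be illustrated with a small user-space sketch. Everything here is hypothetical: the `read_result` enum, the function names, and in particular the choice of -ENOENT for a missing inode are assumptions made for illustration, since the commit message does not name the specific codes returned.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of reading an inode item; illustrative names only. */
enum read_result { READ_OK, READ_NOMEM, READ_IO_ERROR, READ_NOT_FOUND };

/* Old behaviour as described: every failure collapses to -ESTALE. */
static int inode_read_error_old(enum read_result r)
{
	return r == READ_OK ? 0 : -ESTALE;
}

/* New behaviour as described: report the real cause (codes are assumed). */
static int inode_read_error_new(enum read_result r)
{
	switch (r) {
	case READ_OK:
		return 0;
	case READ_NOMEM:
		return -ENOMEM;
	case READ_IO_ERROR:
		return -EIO;
	case READ_NOT_FOUND:
		return -ENOENT;
	}
	return -EINVAL;	/* not reached for valid enum values */
}

/* A caller such as orphan cleanup can now tell "gone" from other failures. */
static int inode_is_really_missing(int err)
{
	assert(err <= 0);
	return err == -ENOENT;
}
```

With the old mapping a caller cannot distinguish an allocation failure from a missing inode; with the new one only the genuinely missing case is treated as such.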
    • Filipe Manana's avatar
      Btrfs: send, don't bug on inconsistent snapshots · 95155585
      Filipe Manana authored
      When doing an incremental send, if we find a new/modified/deleted extent,
      reference or xattr without having previously processed the corresponding
      inode item we end up executing a BUG_ON(). This is because whenever an
      extent, xattr or reference is added, modified or deleted, we always expect
      to have the corresponding inode item updated. However there are situations
      where this will not happen due to transient -ENOMEM or -ENOSPC errors when
      doing delayed inode updates.
      
      For example, when punching holes we can succeed in deleting and modifying
      (shrinking) extents but later fail to do the delayed inode update. So after
      such failure we close our transaction handle and right after a snapshot of
      the fs/subvol tree can be made and used later for a send operation. The
      same thing can happen during truncate, link, unlink, and xattr related
      operations.
      
      So instead of executing a BUG_ON, make send return an -EIO error and print
      an informative error message to dmesg/syslog.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      95155585
    • Filipe Manana's avatar
      Btrfs: send, avoid incorrect leaf accesses when sending utimes operations · 15b253ea
      Filipe Manana authored
      The caller of send_utimes() is supposed to be sure that the inode number
      it passes to this function actually exists in the send snapshot.
      However due to logic/algorithm bugs (such as the one fixed by the patch
      titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes
      operations"), this might not be the case and when that happens it makes
      send_utimes() use an unrelated leaf item as the target inode item
      or access beyond a leaf's boundaries (when the leaf is full and
      path->slots[0] matches the number of items in the leaf).
      
      So if the call to btrfs_search_slot() done by send_utimes() does not find
      the inode item, just make sure send_utimes() returns -ENOENT and does not
      silently access unrelated leaf items or do invalid leaf accesses, also
      allowing us to easily and deterministically catch such algorithmic/logic
      bugs.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      15b253ea
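The failure mode described for send_utimes() comes from using the slot of a failed search as if it pointed at the target item. The sketch below models that with a sorted array instead of a btree leaf. `struct leaf`, `search_slot_model()` and `utimes_target_model()` are hypothetical names; the return convention (0 for found, 1 for not found, slot set to the insertion position) mirrors the one btrfs_search_slot() uses, but nothing else here is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy leaf: a sorted array of inode numbers. Not the btrfs on-disk format. */
struct leaf {
	const unsigned long *inos;
	size_t nritems;
};

/*
 * Returns 0 when ino is found, 1 when it is not. On a miss *slot is the
 * insertion position, which can equal nritems (one past the last item).
 */
static int search_slot_model(const struct leaf *l, unsigned long ino,
			     size_t *slot)
{
	size_t i = 0;

	while (i < l->nritems && l->inos[i] < ino)
		i++;
	*slot = i;
	return (i < l->nritems && l->inos[i] == ino) ? 0 : 1;
}

/* Checked lookup: a miss becomes -ENOENT instead of a blind read of slot. */
static int utimes_target_model(const struct leaf *l, unsigned long ino,
			       unsigned long *found)
{
	size_t slot;
	int ret = search_slot_model(l, ino, &slot);

	if (ret > 0)
		return -ENOENT;
	assert(slot < l->nritems);
	*found = l->inos[slot];
	return 0;
}
```

Searching a three-item leaf for an inode number larger than all of its items leaves the slot equal to the item count, which is exactly the out-of-bounds case the commit message describes.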
    • Robbie Ko's avatar
      Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations · 764433a1
      Robbie Ko authored
      During an incremental send, if we have delayed rename operations for inodes
      that were children of directories which were removed in the send snapshot,
      we can end up accessing incorrect items in a leaf or accessing beyond the
      last item of the leaf due to issuing utimes operations for the removed
      inodes. Consider the following example:
      
        Parent snapshot:
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 262)
        |
        |--- b/                                                       (ino 258)
        |    |--- d/                                                  (ino 263)
        |
        |--- del/                                                     (ino 261)
              |--- x/                                                 (ino 259)
              |--- y/                                                 (ino 260)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |
        |--- b/                                                       (ino 258)
        |
        |--- c/                                                       (ino 262)
        |    |--- y/                                                  (ino 260)
        |
        |--- d/                                                       (ino 263)
             |--- x/                                                  (ino 259)
      
      1) When processing inodes 259 and 260, we end up delaying their rename
         operations because their parents, inodes 263 and 262 respectively, were
         not yet processed and therefore not yet renamed;
      
      2) When processing inode 262, its rename operation is issued and right
         after the rename operation for inode 260 is issued. However right after
         issuing the rename operation for inode 260, at send.c:apply_dir_move(),
         we issue utimes operations for all current and past parents of inode
         260. This means we try to send a utimes operation for its old parent,
         inode 261 (deleted in the send snapshot), which does not cause any
         immediate and deterministic failure, because when the target inode is
         not found in the send snapshot, the send.c:send_utimes() function
         ignores it and uses the leaf region pointed to by path->slots[0],
         which can be any unrelated item (belonging to other inode) or it can
         be a region outside the leaf boundaries, if the leaf is full and
         path->slots[0] matches the number of items in the leaf. So we end
         up either successfully sending a utimes operation, which is fine
         and irrelevant because the old parent (inode 261) will end up being
         deleted later, or we end up doing an invalid memory access that
         crashes the kernel.
      
      So fix this by making apply_dir_move() issue utimes operations only for
      parents that still exist in the send snapshot. In a separate patch we
      will make send_utimes() return an error (-ENOENT) if the given inode
      does not exist in the send snapshot.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed and better organized]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      764433a1
    • Robbie Ko's avatar
      Btrfs: send, fix warning due to late freeing of orphan_dir_info structures · 443f9d26
      Robbie Ko authored
      Under certain situations, when doing an incremental send, we can end up
      not freeing orphan_dir_info structures as soon as they are no longer
      needed. Instead we end up freeing them only after finishing the send
      stream, which causes a warning to be emitted:
      
      [282735.229200] ------------[ cut here ]------------
      [282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
      [282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [282735.240160]  0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000
      [282735.240160]  0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8
      [282735.240160]  ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000
      [282735.240160] Call Trace:
      [282735.240160]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [282735.240160]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [282735.240160]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [282735.240160]  [<ffffffffa03c99d5>] btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
      [282735.240160]  [<ffffffffa0398358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [282735.240160]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [282735.240160]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [282735.240160]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [282735.240160]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [282735.240160]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [282735.240160]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [282735.240160]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [282735.240160]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [282735.240160]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      [282735.256343] ---[ end trace a4539270c8056f93 ]---
      
      Consider the following example:
      
        Parent snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 260)
        |
        |--- del/                                                     (ino 259)
              |--- tmp/                                               (ino 258)
              |--- x/                                                 (ino 261)
              |--- y/                                                 (ino 262)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- x/                                                  (ino 261)
        |    |--- y/                                                  (ino 262)
        |
        |--- c/                                                       (ino 260)
             |--- tmp/                                                (ino 258)
      
      1) When processing inode 258, we end up delaying its rename operation
         because it has an ancestor (in the send snapshot) that has a higher
         inode number (inode 260) which was also renamed in the send snapshot,
         therefore we delay the rename of inode 258 so that it happens after
         inode 260 is renamed;
      
      2) When processing inode 259, we end up delaying its deletion (rmdir
         operation) because it has a child inode (258) that has its rename
         operation delayed. At this point we allocate an orphan_dir_info
         structure and tag inode 258 so that we later attempt to see if we
         can delete (rmdir) inode 259 once inode 258 is renamed;
      
      3) When we process inode 260, after renaming it we finally do the rename
         operation for inode 258. Once we issue the rename operation for inode
         258 we notice that this inode was tagged so that we attempt to see
         if at this point we can delete (rmdir) inode 259. But at this point
         we still cannot delete inode 259 because it has 2 children, inodes
         261 and 262, that were not yet processed and therefore not yet
         moved (renamed) away from inode 259. We end up not freeing the
         orphan_dir_info structure allocated in step 2;
      
      4) We process inodes 261 and 262, and once we move/rename inode 262
         we issue the rmdir operation for inode 259;
      
      5) We finish the send stream and notice that the red-black tree that
         contains orphan_dir_info structures is not empty, so we emit
         a warning and then free any orphan_dir_info structures left.
      
      So fix this by freeing an orphan_dir_info structure once we try to
      apply a pending rename operation if we cannot yet delete the tagged
      directory.
      
      A test case for fstests follows soon.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [Modified changelog to be more detailed and easier to understand]
      443f9d26
    • Robbie Ko's avatar
      Btrfs: incremental send, fix premature rmdir operations · 99ea42dd
      Robbie Ko authored
      Under certain situations, an incremental send operation can contain
      a rmdir operation that will make the receiving end fail when attempting
      to execute it, because the target directory is not yet empty.
      
      Consider the following example:
      
        Parent snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 260)
        |
        |--- del/                                                     (ino 259)
              |--- tmp/                                               (ino 258)
              |--- x/                                                 (ino 261)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- x/                                                  (ino 261)
        |
        |--- c/                                                       (ino 260)
             |--- tmp/                                                (ino 258)
      
      1) When processing inode 258, we delay its rename operation because inode
         260 is its new parent in the send snapshot and it was not yet renamed
         (since 260 > 258, that is, beyond the current progress);
      
      2) When processing inode 259, we realize we can not yet send an rmdir
         operation (against inode 259) because inode 258 was still not yet
         renamed/moved away from inode 259. Therefore we update data structures
         so that after inode 258 is renamed, we try again to see if we can
         finally send an rmdir operation for inode 259;
      
      3) When we process inode 260, we send a rename operation for it followed
         by a rename operation for inode 258. Once we send the rename operation
         for inode 258 we then check if we can finally issue an rmdir for its
         previous parent, inode 259, by calling the can_rmdir() function with
         a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress"
         argument. This makes can_rmdir() return true (value 1) because even
         though there's still a child inode of inode 259 that was not yet
         renamed/moved, which is inode 261, the given value of progress (261)
         is not lower than 261 (that is, not lower than the inode number of
         some child of inode 259). So we end up sending a rmdir operation for
         inode 259 before its child inode 261 is processed and renamed.
      
      So fix this by passing the correct progress value to the call to
      can_rmdir() from within apply_dir_move() (where we issue delayed rename
      operations), which should match sctx->cur_ino (the number of the inode
      currently being processed) and not sctx->cur_ino + 1.
      
      A test case for fstests follows soon.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed, clear and well formatted]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      99ea42dd
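The off-by-one in the progress value passed to can_rmdir() can be checked with a toy model of the rule the commit message describes: a directory may be removed only if none of its remaining children has an inode number above the given progress. `can_rmdir_model()` is a hypothetical simplification (the real function inspects directory items in the parent snapshot and does more than this comparison); the numbers are taken from the example above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy version of the progress check: return 1 if the directory can be
 * removed, 0 if some remaining child has not been processed yet.
 */
static int can_rmdir_model(const unsigned long *children, size_t n,
			   unsigned long progress)
{
	assert(children != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (children[i] > progress)
			return 0;	/* child still pending, rmdir must wait */
	}
	return 1;
}

/* Scenario from the example: inode 259 still contains inode 261. */
static int rmdir_allowed_for_259(unsigned long progress)
{
	static const unsigned long remaining[] = { 261 };

	return can_rmdir_model(remaining, 1, progress);
}
```

While inode 260 is being processed, calling `rmdir_allowed_for_259(260 + 1)` reproduces the premature rmdir (the check passes), whereas `rmdir_allowed_for_259(260)` correctly reports that inode 261 is still pending.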
    • Filipe Manana's avatar
      Btrfs: incremental send, fix invalid paths for rename operations · 4122ea64
      Filipe Manana authored
      Example scenario:
      
        Parent snapshot:
      
        .                                                       (ino 277)
        |---- tmp/                                              (ino 278)
        |---- pre/                                              (ino 280)
        |      |---- wait_dir/                                  (ino 281)
        |
        |---- desc/                                             (ino 282)
        |---- ance/                                             (ino 283)
        |       |---- below_ance/                               (ino 279)
        |
        |---- other_dir/                                        (ino 284)
      
        Send snapshot:
      
        .                                                       (ino 277)
        |---- tmp/                                              (ino 278)
               |---- other_dir/                                 (ino 284)
                         |---- below_ance/                      (ino 279)
                         |            |---- pre/                (ino 280)
                         |
                         |---- wait_dir/                        (ino 281)
                                    |---- desc/                 (ino 282)
                                            |---- ance/         (ino 283)
      
      While computing the send stream the following steps happen:
      
      1) While processing inode 279 we end up delaying its rename operation
         because its new parent in the send snapshot, inode 284, was not
         yet processed and therefore not yet renamed;
      
      2) Later when processing inode 280 we end up renaming it immediately to
         "ance/below_ance/pre" and not delay its rename operation because its
         new parent (inode 279 in the send snapshot) has its rename operation
         delayed and inode 280 is not an ancestor of inode 279 (its parent in
         the send snapshot) in the parent snapshot;
      
      3) When processing inode 281 we end up delaying its rename operation
         because its new parent in the send snapshot, inode 284, was not yet
         processed and therefore not yet renamed;
      
      4) When processing inode 282 we do not delay its rename operation because
         its parent in the send snapshot, inode 281, already has its own rename
         operation delayed and our current inode (282) is not an ancestor of
         inode 281 in the parent snapshot. Therefore inode 282 is renamed to
         "ance/below_ance/pre/wait_dir";
      
      5) When processing inode 283 we realize that we can rename it because one
         of its ancestors in the send snapshot, inode 281, has its rename
         operation delayed and inode 283 is not an ancestor of inode 281 in the
         parent snapshot. So a rename operation to rename inode 283 to
         "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
         invalid due to a missing path building loop that was undetected by
         the incremental send implementation, as inode 283 ends up getting
         included twice in the path (once with its path in the parent snapshot).
         Therefore its rename operation must wait until the ancestor inode 284
         is renamed.
      
      Fix this by not terminating the rename dependency checks when we find an
      ancestor, in the send snapshot, that has its rename operation delayed. So
      that we continue doing the same checks if the current inode is not an
      ancestor, in the parent snapshot, of an ancestor in the send snapshot we
      are processing in the loop.
      
      The problem and reproducer were reported by Robbie Ko, as part of a patch
      titled "Btrfs: incremental send, avoid ancestor rename to descendant".
      However the fix was unnecessarily complicated and can be addressed with
      much less code and effort.
      Reported-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      4122ea64
    • Filipe Manana's avatar
      Btrfs: send, add missing error check for calls to path_loop() · 7969e77a
      Filipe Manana authored
      The function path_loop() can return a negative integer, signaling an
      error, 0 if there's no path loop and 1 if there's a path loop. We were
      treating any non zero values as meaning that a path loop exists. Fix
      this by explicitly checking for errors and gracefully returning them to
      user space.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      7969e77a
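The three-way return contract described for path_loop() (negative for an error, 0 for no loop, 1 for a loop) is easy to mishandle with a plain truth test. The following sketch contrasts the two patterns. `path_loop_model()`, `decide_buggy()`, `decide_checked()` and the `loop_action` enum are hypothetical names used only for illustration; what the real caller does once a loop is detected is not modelled here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper with the same contract: <0 error, 0 no loop, 1 loop. */
static int path_loop_model(int simulated_result)
{
	return simulated_result;
}

enum loop_action { ACT_NO_LOOP, ACT_LOOP };

/* Buggy pattern: any non-zero value, including an error, means "loop". */
static int decide_buggy(int simulated_result, enum loop_action *act)
{
	int ret = path_loop_model(simulated_result);

	*act = ret ? ACT_LOOP : ACT_NO_LOOP;
	return 0;
}

/* Checked pattern: propagate errors first, then interpret 0 and 1. */
static int decide_checked(int simulated_result, enum loop_action *act)
{
	int ret = path_loop_model(simulated_result);

	assert(act != 0);
	if (ret < 0)
		return ret;	/* hand the error back to the caller */
	*act = ret ? ACT_LOOP : ACT_NO_LOOP;
	return 0;
}
```

With a simulated -ENOMEM, the buggy variant reports success and claims a path loop exists, while the checked variant returns -ENOMEM and leaves the action untouched.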
    • Robbie Ko's avatar
      Btrfs: send, fix failure to move directories with the same name around · 801bec36
      Robbie Ko authored
      When doing an incremental send we can end up not moving directories that
      have the same name. This happens when the same parent directory has
      different child directories with the same name in the parent and send
      snapshots.
      
      For example, consider the following scenario:
      
        Parent snapshot:
      
        .                   (ino 256)
        |---- d/            (ino 257)
        |     |--- p1/      (ino 258)
        |
        |---- p1/           (ino 259)
      
        Send snapshot:
      
        .                    (ino 256)
        |--- d/              (ino 257)
             |--- p1/        (ino 259)
                   |--- p1/  (ino 258)
      
      The directory named "d" (inode 257) has in both snapshots an entry with
      the name "p1" but it refers to different inodes in both snapshots (inode
      258 in the parent snapshot and inode 259 in the send snapshot). When
      attempting to move inode 258, the operation is delayed because its new
      parent, inode 259, was not yet moved/renamed (as the stream is currently
      processing inode 258). Then when processing inode 259, we also end up
      delaying its move/rename operation so that it happens after inode 258 is
      moved/renamed. This decision to delay the move/rename operation
      of inode 259 is due to the fact that the new parent inode (257) still
      has inode 258 as its child, which has the same name as inode 259. So
      we end up with inode 258's move/rename operation waiting for inode 259's
      move/rename operation, which in turn is waiting for inode 258's
      move/rename. This results in ending the send stream without issuing
      move/rename operations for inodes 258 and 259 and generating the
      following warnings in syslog/dmesg:
      
      [148402.979747] ------------[ cut here ]------------
      [148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
      [148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [148402.988136]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
      [148402.988136]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
      [148402.988136]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
      [148402.988136] Call Trace:
      [148402.988136]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [148402.988136]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [148402.988136]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [148402.988136]  [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
      [148402.988136]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [148402.988136]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [148402.988136]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
      [148402.988136]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [148402.988136]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [148402.988136]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [148402.988136]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [148402.988136]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [148402.988136]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [148402.988136]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
      [148403.011373] ---[ end trace a4539270c8056f8b ]---
      [148403.012296] ------------[ cut here ]------------
      [148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
      [148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [148403.020104]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
      [148403.020104]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
      [148403.020104]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
      [148403.020104] Call Trace:
      [148403.020104]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [148403.020104]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [148403.020104]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [148403.020104]  [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
      [148403.020104]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [148403.020104]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [148403.020104]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
      [148403.020104]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [148403.020104]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [148403.020104]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [148403.020104]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [148403.020104]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [148403.020104]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [148403.020104]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
      [148403.038981] ---[ end trace a4539270c8056f8c ]---
      
      There's another issue caused by similar (but more complex) changes in the
      directory hierarchy that makes move/rename operations fail, described with
      the following example:
      
        Parent snapshot:
      
        .
        |---- a/                                                   (ino 262)
        |     |---- c/                                             (ino 268)
        |
        |---- d/                                                   (ino 263)
              |---- ance/                                          (ino 267)
                      |---- e/                                     (ino 264)
                      |---- f/                                     (ino 265)
                      |---- ance/                                  (ino 266)
      
        Send snapshot:
      
        .
        |---- a/                                                   (ino 262)
        |---- c/                                                   (ino 268)
        |     |---- ance/                                          (ino 267)
        |
        |---- d/                                                   (ino 263)
        |     |---- ance/                                          (ino 266)
        |
        |---- f/                                                   (ino 265)
              |---- e/                                             (ino 264)
      
      When the inode 265 is processed, the path for inode 267 is computed, which
      at that time corresponds to "d/ance", and it's stored in the names cache.
      Later on when processing inode 266, we end up orphanizing (renaming to a
      name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
      the same name as inode 266 and it's currently a child of the new parent
      directory (inode 263) for inode 266. After the orphanization and while we
      are still processing inode 266, a rename operation for inode 266 is
      generated. However, the source path for that rename operation is incorrect
      because it ends up using the old, pre-orphanization name of inode 267.
      The no longer valid name for inode 267 was previously cached when
      processing inode 265 and it remains usable and considered valid until
      the inode currently being processed has a number greater than 267.
      This resulted in the receiving side failing with the following error:
      
        ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
      
      So fix these issues by detecting such circular dependencies for rename
      operations and by clearing the cached name of an inode once the inode
      is orphanized.
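
      The second part of the fix, dropping the cached name once an inode is
      orphanized, can be illustrated with a minimal user-space sketch. This is
      not the code in fs/btrfs/send.c: the structure and the helpers
      cache_store(), cache_lookup() and orphanize() are hypothetical names
      used only to show why a cached path must be invalidated at that point.

```c
#include <stdio.h>

#define MAX_INO 512
#define PATH_LEN 64

/* Toy name cache: one cached path per inode number (hypothetical layout). */
struct name_cache {
	char path[MAX_INO][PATH_LEN];
	int valid[MAX_INO];
};

/* Remember the path computed for an inode. */
static void cache_store(struct name_cache *nc, unsigned int ino,
			const char *path)
{
	if (ino >= MAX_INO)
		return;
	snprintf(nc->path[ino], PATH_LEN, "%s", path);
	nc->valid[ino] = 1;
}

/* Return the cached path, or NULL when the caller must recompute it. */
static const char *cache_lookup(const struct name_cache *nc, unsigned int ino)
{
	if (ino >= MAX_INO || !nc->valid[ino])
		return NULL;
	return nc->path[ino];
}

/*
 * Orphanize an inode: build a name matching o<ino>-<gen>-<seq> and drop
 * the cached path, so later lookups cannot return the old name.
 */
static void orphanize(struct name_cache *nc, unsigned int ino,
		      unsigned int gen, unsigned int seq,
		      char *out, size_t out_len)
{
	snprintf(out, out_len, "o%u-%u-%u", ino, gen, seq);
	if (ino < MAX_INO)
		nc->valid[ino] = 0;
}
```

      In terms of the example above, the path "d/ance" would be stored for
      inode 267 while processing inode 265. Once inode 267 is orphanized, a
      lookup returns NULL, which forces the path to be recomputed instead of
      reusing the stale name as the source of the rename for inode 266.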
      
      A test case for fstests will follow soon.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed and organized, and improved
       comments]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      801bec36
    • Btrfs: add missing check for writeback errors on fsync · 0596a904
      Filipe Manana authored
      When we start an fsync we start ordered extents for all delalloc ranges.
      However, before attempting to log the inode, we only wait for those ordered
      extents if we are doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
      is set in the inode's flags). This means that if an ordered extent
      completes with an IO error before we check if we can skip logging the
      inode, we will not catch and report the IO error to user space. This is
      because on an IO error, when the ordered extent completes we do not
      update the inode, so if the inode was not previously updated by the
      current transaction we end up not logging it through calls to fsync and
      therefore do not check its mapping flags for the presence of IO errors.
      
      Fix this by checking for errors in the flags of the inode's mapping when
      we notice we can skip logging the inode.
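
      The idea can be shown with a small user-space model. It is a sketch
      under stated assumptions, not the change made in the kernel: struct
      toy_inode, the TOY_AS_* bits, toy_check_mapping_errors() and toy_fsync()
      are hypothetical stand-ins, and the skip condition is reduced to a
      single transaction id comparison.

```c
#include <errno.h>

/* Hypothetical stand-ins for the writeback error bits of a mapping. */
#define TOY_AS_EIO    (1u << 0)
#define TOY_AS_ENOSPC (1u << 1)

struct toy_inode {
	unsigned long long last_trans;	/* last transaction that updated it */
	unsigned int mapping_flags;	/* error bits set by failed writeback */
};

/* Report and clear any writeback error recorded on the mapping. */
static int toy_check_mapping_errors(struct toy_inode *inode)
{
	int ret = 0;

	if (inode->mapping_flags & TOY_AS_ENOSPC)
		ret = -ENOSPC;
	if (inode->mapping_flags & TOY_AS_EIO)
		ret = -EIO;
	inode->mapping_flags = 0;
	return ret;
}

/*
 * Toy fsync: when the inode was not updated in the current transaction,
 * logging is skipped, but the mapping is still checked for IO errors so
 * that a failed ordered extent is reported to the caller.
 */
static int toy_fsync(struct toy_inode *inode, unsigned long long cur_trans)
{
	if (inode->last_trans < cur_trans)
		return toy_check_mapping_errors(inode);

	/* A real implementation would log the inode here. */
	return 0;
}
```

      With an inode last updated in transaction 5 and TOY_AS_EIO set, calling
      toy_fsync() for transaction 6 returns -EIO rather than 0, even though
      the inode itself is not logged.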
      
      This caused sporadic failures in the test generic/331 (which explicitly
      tests for IO errors during an fsync call).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      0596a904