1. 19 Jun, 2014 5 commits
    • Qu Wenruo's avatar
      btrfs: Skip scrubbing removed chunks to avoid -ENOENT. · ced96edc
      Qu Wenruo authored
      When run scrub with balance, sometimes -ENOENT will be returned, since
      in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
      btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
      chunk is removed but not committed, -ENOENT will be returned.
      
      However, there is no need to stop scrubbing since other chunks may be
      scrubbed without problem.
      
      So this patch changes the behavior to skip removed chunks and continue
      to scrub the rest.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ced96edc
    • Miao Xie's avatar
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie authored
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in extent
      tree) until the file data was written into the disk. During this time, there was
      no information about the allocated spaces in either the extent tree nor the
      free space cache. when we wrote out the free space cache at this time (commit
      transaction), those spaces were lost. In fact, only the free space that is
      used to store the file data had this problem, the others didn't because
      the metadata of them is updated in the same transaction context.
      
      There are many methods which can fix the above problem
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may make the performance drop down.
      This patch chose the second method, we use a per-block-group variant to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write out.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e570fd27
    • Miao Xie's avatar
      Btrfs: make free space cache write out functions more readable · 5349d6c3
      Miao Xie authored
      This patch makes the free space cache write out functions more readable,
      and beisdes that, it also reduces the stack space that the function --
      __btrfs_write_out_cache uses from 194bytes to 144bytes.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5349d6c3
    • Filipe Manana's avatar
      Btrfs: remove unused wait queue in struct extent_buffer · 46fefe41
      Filipe Manana authored
      The lock_wq wait queue is not used anywhere, therefore just remove it.
      On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
      bytes down to 296 bytes, which means a 4Kb page can now be used for
      13 extent buffers instead of 12.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      46fefe41
    • Chris Mason's avatar
      Btrfs: fix deadlocks with trylock on tree nodes · ea4ebde0
      Chris Mason authored
      The Btrfs tree trylock function is poorly named.  It always takes
      the spinlock and backs off if the blocking lock is held.  This
      can lead to surprising lockups because people expect it to really be a
      trylock.
      
      This commit makes it a pure trylock, both for the spinlock and the
      blocking lock.  It also reworks the nested lock handling slightly to
      avoid taking the read lock while a spinning write lock might be held.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ea4ebde0
  2. 13 Jun, 2014 13 commits
    • Eric Sandeen's avatar
      btrfs: fix error handling in create_pending_snapshot · 47a306a7
      Eric Sandeen authored
      fcebe456 cut and pasted some code to a later point
      in create_pending_snapshot(), but didn't switch
      to the appropriate error handling for this stage
      of the function.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      47a306a7
    • Eric Sandeen's avatar
      btrfs: fix use of uninit "ret" in end_extent_writepage() · 3e2426bd
      Eric Sandeen authored
      If this condition in end_extent_writepage() is false:
      
      	if (tree->ops && tree->ops->writepage_end_io_hook)
      
      we will then test an uninitialized "ret" at:
      
      	ret = ret < 0 ? ret : -EIO;
      
      The test for ret is for the case where ->writepage_end_io_hook
      failed, and we'd choose that ret as the error; but if
      there is no ->writepage_end_io_hook, nothing sets ret.
      
      Initializing ret to 0 should be sufficient; if
      writepage_end_io_hook wasn't set, (!uptodate) means
      non-zero err was passed in, so we choose -EIO in that case.
      Signed-of-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3e2426bd
    • Eric Sandeen's avatar
      btrfs: free ulist in qgroup_shared_accounting() error path · d7372780
      Eric Sandeen authored
      If tmp = ulist_alloc(GFP_NOFS) fails, we return without
      freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
      and cause a memory leak.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d7372780
    • Filipe Manana's avatar
      Btrfs: fix qgroups sanity test crash or hang · b050f9f6
      Filipe Manana authored
      Often when running the qgroups sanity test, a crash or a hang happened.
      This is because the extent buffer the test uses for the root node doesn't
      have an header level explicitly set, making it have a random level value.
      This is a problem when it's not zero for the btrfs_search_slot() calls
      the test ends up doing, resulting in crashes or hangs such as the following:
      
      [ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
      (...)
      [ 6454.127760] BTRFS: selftest: Running qgroup tests
      [ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
      [ 6454.127966] BTRFS: selftest: Qgroup basic add
      [ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
      [ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
      [ 6480.152005] irq event stamp: 188448
      [ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
      [ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
      [ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
      [ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
      [ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
      [ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
      [ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
      [ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
      [ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
      [ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
      [ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
      [ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
      [ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
      [ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
      [ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
      [ 6480.152005] Stack:
      [ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
      [ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
      [ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
      [ 6480.152005] Call Trace:
      [ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
      [ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
      [ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
      [ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
      [ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
      [ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
      [ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
      [ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
      [ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
      [ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
      [ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
      [ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
      [ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
      [ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
      [ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
      [ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
      [ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
      (...)
      
      Therefore initialize the extent buffer as an empty leaf (level 0).
      
      Issue easy to reproduce when btrfs is built as a module via:
      
          $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b050f9f6
    • Sasha Levin's avatar
      btrfs: prevent RCU warning when dereferencing radix tree slot · f1e3c289
      Sasha Levin authored
      Mark the dereference as protected by lock. Not doing so triggers
      an RCU warning since the radix tree assumed that RCU is in use.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f1e3c289
    • Wang Shilong's avatar
      Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting · 5fbc7c59
      Wang Shilong authored
      Steps to reproduce:
      
       # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
       # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
       # mount /dev/sdb /mnt -o degraded
       # btrfs scrub start -BRd /mnt
      
      This is because readahead would skip missing device, this is not true
      for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
      mapping. If expected data locates in missing device, readahead thread
      would not call __readahead_hook() which makes event @rc->elems=0
      wait forever.
      
      Fix this problem by checking return value of btrfs_map_block(),we
      can only skip missing device safely if there are several mirrors.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5fbc7c59
    • Gerhard Heift's avatar
      btrfs: new ioctl TREE_SEARCH_V2 · cc68a8a5
      Gerhard Heift authored
      This new ioctl call allows the user to supply a buffer of varying size in which
      a tree search can store its results. This is much more flexible if you want to
      receive items which are larger than the current fixed buffer of 3992 bytes or
      if you want to fetch more items at once. Items larger than this buffer are for
      example some of the type EXTENT_CSUM.
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      cc68a8a5
    • Gerhard Heift's avatar
      btrfs: tree_search, search_ioctl: direct copy to userspace · ba346b35
      Gerhard Heift authored
      By copying each found item seperatly to userspace, we do not need extra
      buffer in the kernel.
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      ba346b35
    • Gerhard Heift's avatar
      btrfs: new function read_extent_buffer_to_user · 550ac1d8
      Gerhard Heift authored
      This new function reads the content of an extent directly to user memory.
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      550ac1d8
    • Gerhard Heift's avatar
      btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW · 9b6e817d
      Gerhard Heift authored
      If an item in tree_search is too large to be stored in the given buffer, return
      the needed size (including the header).
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      9b6e817d
    • Gerhard Heift's avatar
      btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer · 8f5f6178
      Gerhard Heift authored
      In copy_to_sk, if an item is too large for the given buffer, it now returns
      -EOVERFLOW instead of copying a search_header with len = 0. For backward
      compatibility for the first item it still copies such a header to the buffer,
      but not any other following items, which could have fitted.
      
      tree_search changes -EOVERFLOW back to 0 to behave similiar to the way it
      behaved before this patch.
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      8f5f6178
    • Gerhard Heift's avatar
      btrfs: tree_search, search_ioctl: accept varying buffer · 12544442
      Gerhard Heift authored
      rewrite search_ioctl to accept a buffer with varying size
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      12544442
    • Gerhard Heift's avatar
      btrfs: tree_search: eliminate redundant nr_items check · 25c9bc2e
      Gerhard Heift authored
      If the amount of items reached the given limit of nr_items, we can leave
      copy_to_sk without updating the key. Also by returning 1 we leave the loop in
      search_ioctl without rechecking if we reached the given limit.
      Signed-off-by: default avatarGerhard Heift <Gerhard@Heift.Name>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarDavid Sterba <dsterba@suse.cz>
      25c9bc2e
  3. 10 Jun, 2014 22 commits
    • Chris Mason's avatar
      Btrfs: convert smp_mb__{before,after}_clear_bit · c7548af6
      Chris Mason authored
      The new call is smp_mb__{before,after}_atomic.  The __ gives us extra
      protection from the atomic rays.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c7548af6
    • Liu Bo's avatar
      Btrfs: fix scrub_print_warning to handle skinny metadata extents · 6eda71d0
      Liu Bo authored
      The skinny extents are intepreted incorrectly in scrub_print_warning(),
      and end up hitting the BUG() in btrfs_extent_inline_ref_size.
      Reported-by: default avatarKonstantinos Skarlatos <k.skarlatos@gmail.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6eda71d0
    • Filipe Manana's avatar
      Btrfs: make fsync work after cloning into a file · 7ffbb598
      Filipe Manana authored
      When cloning into a file, we were correctly replacing the extent
      items in the target range and removing the extent maps. However
      we weren't replacing the extent maps with new ones that point to
      the new extents - as a consequence, an incremental fsync (when the
      inode doesn't have the full sync flag) was a NOOP, since it relies
      on the existence of extent maps in the modified list of the inode's
      extent map tree, which was empty. Therefore add new extent maps to
      reflect the target clone range.
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      7ffbb598
    • Liu Bo's avatar
      Btrfs: use right type to get real comparison · cd857dd6
      Liu Bo authored
      We want to make sure the point is still within the extent item, not to verify
      the memory it's pointing to.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cd857dd6
    • Josef Bacik's avatar
      Btrfs: don't check nodes for extent items · 8a56457f
      Josef Bacik authored
      The backref code was looking at nodes as well as leaves when we tried to
      populate extent item entries.  This is not good, and although we go away with it
      for the most part because we'd skip where disk_bytenr != random_memory,
      sometimes random_memory would match and suddenly boom.  This fixes that problem.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8a56457f
    • Filipe Manana's avatar
      Btrfs: don't release invalid page in btrfs_page_exists_in_range() · 6fdef6d4
      Filipe Manana authored
      In inode.c:btrfs_page_exists_in_range(), if the page we got from
      the radix tree is an exception entry, which can't be retried, we
      exit the loop with a non-NULL page and then call page_cache_release
      against it, which is not ok since it's not a valid page. This could
      also make us return true when we shouldn't.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6fdef6d4
    • Filipe Manana's avatar
      Btrfs: make sure we retry if page is a retriable exception · 809f9016
      Filipe Manana authored
      In inode.c:btrfs_page_exists_in_range(), if the page we get from the
      radix tree is an exception which should make us retry, set page to
      NULL in order to really retry, because otherwise we don't get another
      loop iteration executed (page != NULL makes the while loop exit).
      This also was making us call page_cache_release after exiting the loop,
      which isn't correct because page doesn't point to a valid page, and
      possibly return true from the function when we shouldn't.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      809f9016
    • Filipe Manana's avatar
      Btrfs: make sure we retry if we couldn't get the page · 91405151
      Filipe Manana authored
      In inode.c:btrfs_page_exists_in_range(), if we can't get the page
      we need to retry. However we weren't retrying because we weren't
      setting page to NULL, which makes the while loop exit immediately
      and will make us call page_cache_release after exiting the loop
      which is incorrect because our page get didn't succeed. This could
      also make us return true when we shouldn't.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      91405151
    • Gui Hecheng's avatar
      btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56 · c81d5767
      Gui Hecheng authored
      To return EOPNOTSUPP is more user friendly than to return EINVAL,
      and then user-space tool will show that the dev_replace operation
      for raid56 is not currently supported rather than showing that
      there is an invalid argument.
      Signed-off-by: default avatarGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c81d5767
    • Antonio Ospite's avatar
      trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ · 93915584
      Antonio Ospite authored
      Signed-off-by: default avatarAntonio Ospite <ao2@ao2.it>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: linux-btrfs@vger.kernel.org
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      93915584
    • Liu Bo's avatar
      Btrfs: fix leaf corruption after __btrfs_drop_extents · 0b43e04f
      Liu Bo authored
      Several reports about leaf corruption has been floating on the list, one of them
      points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
      after __btrfs_drop_extents(), it's really a rare case but it does exist.
      
      The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().
      
      So in btrfs_next_leaf(), we release the current path to re-search the last key of
      the leaf for locating next leaf, and we've taken it into account that there might
      be balance operations between leafs during this 'unlock and re-lock' dance, so
      we check the path again and advance it if there are now more items available.
      But things are a bit different if that last key happens to be removed and balance
      gets a bigger key as the last one, and btrfs_search_slot will return it with
      ret > 0, IOW, nothing change in this leaf except the new last key, then we think
      we're okay because there is no more item balanced in, fine, we thinks we can
      go to the next leaf.
      
      However, we should return that bigger key, otherwise we deserve leaf corruption,
      for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
      it has dropped all extent matched the required range and finish_ordered_io can
      safely insert a new extent, but it actually doesn't and ends up a leaf
      corruption.
      
      One may be asking that why our locking on extent io tree doesn't work as
      expected, ie. it should avoid this kind of race situation.  But in
      __btrfs_drop_extents(), we don't always find extents which are included within
      our locking range, IOW, extents can start before our searching start, in this
      case locking on extent io tree doesn't protect us from the race.
      
      This takes the special case into account.
      Reviewed-by: default avatarFilipe Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0b43e04f
    • Filipe Manana's avatar
      Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item · 337c6f68
      Filipe Manana authored
      We might have had an item with the previous key in the tree right
      before we released our path. And after we released our path, that
      item might have been pushed to the first slot (0) of the leaf we
      were holding due to a tree balance. Alternatively, an item with the
      previous key can exist as the only element of a leaf (big fat item).
      Therefore account for these 2 cases, so that our callers (like
      btrfs_previous_item) don't miss an existing item with a key matching
      the previous key we computed above.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      337c6f68
    • Filipe Manana's avatar
      Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled · f82a9901
      Filipe Manana authored
      If the NO_HOLES feature is enabled holes don't have file extent items in
      the btree that represent them anymore. This made the clone operation
      ignore the gaps that exist between consecutive file extent items and
      therefore not create the holes at the destination. When not using the
      NO_HOLES feature, the holes were created at the destination.
      
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f82a9901
    • Jeff Mahoney's avatar
      btrfs: free delayed node outside of root->inode_lock · 96493031
      Jeff Mahoney authored
      On heavy workloads, we're seeing soft lockup warnings on
      root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit
      is to reduce the size of the critical section.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      96493031
    • Gui Hecheng's avatar
      btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX · 902c68a4
      Gui Hecheng authored
      To be accurate about the error case,
      if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.
      Signed-off-by: default avatarGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      902c68a4
    • Filipe Manana's avatar
      Btrfs: fix transaction leak during fsync call · b05fd874
      Filipe Manana authored
      If btrfs_log_dentry_safe() returns an error, we set ret to 1 and
      fall through with the goal of committing the transaction. However,
      in the case where the inode doesn't need a full sync, we would call
      btrfs_wait_ordered_range() against the target range for our inode,
      and if it returned an error, we would return without commiting or
      ending the transaction.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b05fd874
    • Qu Wenruo's avatar
      btrfs: Avoid trucating page or punching hole in a already existed hole. · d7781546
      Qu Wenruo authored
      btrfs_punch_hole() will truncate unaligned pages or punch hole on a
      already existed hole.
      This will cause unneeded zero page or holes splitting the original huge
      hole.
      
      This patch will skip already existed holes before any page truncating or
      hole punching.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d7781546
    • Filipe Manana's avatar
      Btrfs: update commit root on snapshot creation after orphan cleanup · 3821f348
      Filipe Manana authored
      On snapshot creation (either writable or read-only), we do orphan cleanup
      against the root of the snapshot. If the cleanup did remove any orphans,
      then the current root node will be different from the commit root node
      until the next transaction commit happens.
      
      A send operation always uses the commit root of a snapshot - this means
      it will see the orphans if it starts computing the send stream before the
      next transaction commit happens (triggered by a timer or sync() for .e.g),
      which is when the commit root gets assigned a reference to current root,
      where the orphans are not visible anymore. The consequence of send seeing
      the orphans is explained below.
      
      For example:
      
          mkfs.btrfs -f /dev/sdd
          mount -o commit=999 /dev/sdd /mnt
      
          # open a file with O_TMPFILE and leave it open
          # write some data to the file
          btrfs subvolume snapshot -r /mnt /mnt/snap1
      
          btrfs send /mnt/snap1 -f /tmp/send.data
      
      The send operation will fail with the following error:
      
          ERROR: send ioctl failed with -116: Stale file handle
      
      What happens here is that our snapshot has an orphan inode still visible
      through the commit root, that corresponds to the tmpfile. However send
      will attempt to call inode.c:btrfs_iget(), with the goal of reading the
      file's data, which will return -ESTALE because it will use the current
      root (and not the commit root) of the snapshot.
      
      Of course, there are other cases where we can get orphans, but this
      example using a tmpfile makes it much easier to reproduce the issue.
      
      Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
      the commit root is different from the current root, just commit the
      transaction associated with the snapshot's root (if it exists), so that
      a send will not see any orphans that don't exist anymore. This also
      guarantees a send will always see the same content regardless of whether
      a transaction commit happened already before the send was requested and
      after the orphan cleanup (meaning the commit root and current roots are
      the same) or it hasn't happened yet (commit and current roots are
      different).
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3821f348
    • Filipe Manana's avatar
      Btrfs: ioctl, don't re-lock extent range when not necessary · ff5df9b8
      Filipe Manana authored
      In ioctl.c:lock_extent_range(), after locking our target range, the
      ordered extent that btrfs_lookup_first_ordered_extent() returns us
      may not overlap our target range at all. In this case we would just
      unlock our target range, wait for any new ordered extents that overlap
      the range to complete, lock again the range and repeat all these steps
      until we don't get any ordered extent and the delalloc flag isn't set
      in the io tree for our target range.
      
      Therefore just stop if we get an ordered extent that doesn't overlap
      our target range and the dealalloc flag isn't set for the range in
      the inode's io tree.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ff5df9b8
    • Filipe Manana's avatar
      Btrfs: avoid visiting all extent items when cloning a range · 2c463823
      Filipe Manana authored
      When cloning a range of a file, we were visiting all the extent items in
      the btree that belong to our source inode. We don't need to visit those
      extent items that don't overlap the range we are cloning, as doing so only
      makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
      for inodes that have a large number of file extent items in the btree.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2c463823
    • Filipe Manana's avatar
      Btrfs: set dead flag on the right root when destroying snapshot · c55bfa67
      Filipe Manana authored
      We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
      parent of our target snapshot, instead of setting it in the target
      snapshot's root.
      
      This is easy to observe by running the following scenario:
      
          mkfs.btrfs -f /dev/sdd
          mount /dev/sdd /mnt
      
          btrfs subvolume create /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
          btrfs subvolume delete /mnt/first_subvol
          btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
          btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data
      
      The send command failed because the send ioctl returned -EPERM.
      A test case for xfstests follows.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c55bfa67
    • Filipe Manana's avatar
      Btrfs: ensure readers see new data after a clone operation · c125b8bf
      Filipe Manana authored
      We were cleaning the clone target file range from the page cache before
      we did replace the file extent items in the fs tree. This was racy,
      as right after cleaning the relevant range from the page cache and before
      replacing the file extent items, a read against that range could be
      performed by another task and populate again the page cache with stale
      data (stale after the cloning finishes). This would result in reads after
      the clone operation successfully finishes to get old data (and potentially
      for a very long time). Therefore evict the pages after replacing the file
      extent items, so that subsequent reads will always get the new data.
      
      Similarly, we were prone to races while cloning the file extent items
      because we weren't locking the target range and wait for any existing
      ordered extents against that range to complete. It was possible that
      after cloning the extent items, a write operation that was performed
      before the clone operation and overlaps the same range, would end up
      undoing all or part of the work the clone operation did (a worker task
      running inode.c:btrfs_finish_ordered_io). Therefore lock the target
      range in the io tree, wait for all pending ordered extents against that
      range to finish and then safely perform the cloning.
      
      The issue of reading stale data after the clone operation is easy to
      reproduce by running the following C program in a loop until it exits
      with return value 1.
      
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <errno.h>
       #include <pthread.h>
       #include <fcntl.h>
       #include <assert.h>
       #include <asm/types.h>
       #include <linux/ioctl.h>
       #include <sys/stat.h>
       #include <sys/types.h>
       #include <sys/ioctl.h>
      
       #define SRC_FILE "/mnt/sdd/foo"
       #define DST_FILE "/mnt/sdd/bar"
       #define FILE_SIZE (16 * 1024)
       #define PATTERN_SRC 'X'
       #define PATTERN_DST 'Y'
      
      struct btrfs_ioctl_clone_range_args {
      	__s64 src_fd;
      	__u64 src_offset, src_length;
      	__u64 dest_offset;
      };
      
       #define BTRFS_IOCTL_MAGIC 0x94
       #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
      				   struct btrfs_ioctl_clone_range_args)
      
      static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
      static int clone_done = 0;
      static int reader_ready = 0;
      static int stale_data = 0;
      
      static void *reader_loop(void *arg)
      {
      	char buf[4096], want_buf[4096];
      
      	memset(want_buf, PATTERN_SRC, 4096);
      	pthread_mutex_lock(&mutex);
      	reader_ready = 1;
      	pthread_mutex_unlock(&mutex);
      
      	while (1) {
      		int done, fd, ret;
      
      		fd = open(DST_FILE, O_RDONLY);
      		assert(fd != -1);
      
      		pthread_mutex_lock(&mutex);
      		done = clone_done;
      		pthread_mutex_unlock(&mutex);
      
      		ret = read(fd, buf, 4096);
      		assert(ret == 4096);
      		close(fd);
      
      		if (done) {
      			ret = memcmp(buf, want_buf, 4096);
      			if (ret == 0) {
      				printf("Found new content\n");
      			} else {
      				printf("Found old content\n");
      				pthread_mutex_lock(&mutex);
      				stale_data = 1;
      				pthread_mutex_unlock(&mutex);
      			}
      			break;
      		}
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	pthread_t reader;
      	int ret, i, fd;
      	struct btrfs_ioctl_clone_range_args clone_args;
      	int fd1, fd2;
      
      	ret = remove(SRC_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
      		return 1;
      	}
      	ret = remove(DST_FILE);
      	if (ret == -1 && errno != ENOENT) {
      		fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
      		return 1;
      	}
      
      	fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_SRC;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
      	fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
      	assert(fd != -1);
      	for (i = 0; i < FILE_SIZE; i++) {
      		char c = PATTERN_DST;
      		ret = write(fd, &c, 1);
      		assert(ret == 1);
      	}
      	close(fd);
              sync();
      
      	ret = pthread_create(&reader, NULL, reader_loop, NULL);
      	assert(ret == 0);
      	while (1) {
      		int r;
      		pthread_mutex_lock(&mutex);
      		r = reader_ready;
      		pthread_mutex_unlock(&mutex);
      		if (r) break;
      	}
      
      	fd1 = open(SRC_FILE, O_RDONLY);
      	if (fd1 < 0) {
      		fprintf(stderr, "Error open src file: %s\n", strerror(errno));
      		return 1;
      	}
      	fd2 = open(DST_FILE, O_RDWR);
      	if (fd2 < 0) {
      		fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
      		return 1;
      	}
      	clone_args.src_fd = fd1;
      	clone_args.src_offset = 0;
      	clone_args.src_length = 4096;
      	clone_args.dest_offset = 0;
      	ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
      	assert(ret == 0);
      	close(fd1);
      	close(fd2);
      
      	pthread_mutex_lock(&mutex);
      	clone_done = 1;
      	pthread_mutex_unlock(&mutex);
      	ret = pthread_join(reader, NULL);
      	assert(ret == 0);
      
      	pthread_mutex_lock(&mutex);
      	ret = stale_data ? 1 : 0;
      	pthread_mutex_unlock(&mutex);
      	return ret;
      }
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c125b8bf