1. 11 Nov, 2015 5 commits
  2. 09 Nov, 2015 2 commits
    • Filipe Manana's avatar
      Btrfs: fix race when listing an inode's xattrs · f1cd1f0b
      Filipe Manana authored
      When listing a inode's xattrs we have a time window where we race against
      a concurrent operation for adding a new hard link for our inode that makes
      us not return any xattr to user space. In order for this to happen, the
      first xattr of our inode needs to be at slot 0 of a leaf and the previous
      leaf must still have room for an inode ref (or extref) item, and this can
      happen because an inode's listxattrs callback does not lock the inode's
      i_mutex (nor does the VFS does it for us), but adding a hard link to an
      inode makes the VFS lock the inode's i_mutex before calling the inode's
      link callback.
      
      If we have the following leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
                 slot N - 2         slot N - 1              slot 0
      
      The race illustrated by the following sequence diagram is possible:
      
             CPU 1                                               CPU 2
      
        btrfs_listxattr()
      
          searches for key (257 XATTR_ITEM 0)
      
          gets path with path->nodes[0] == leaf X
          and path->slots[0] == N
      
          because path->slots[0] is >=
          btrfs_header_nritems(leaf X), it calls
          btrfs_next_leaf()
      
          btrfs_next_leaf()
            releases the path
      
                                                         adds key (257 INODE_REF 666)
                                                         to the end of leaf X (slot N),
                                                         and leaf X now has N + 1 items
      
            searches for the key (257 INODE_REF 256),
            with path->keep_locks == 1, because that
            is the last key it saw in leaf X before
            releasing the path
      
            ends up at leaf X again and it verifies
            that the key (257 INODE_REF 256) is no
            longer the last key in leaf X, so it
            returns with path->nodes[0] == leaf X
            and path->slots[0] == N, pointing to
            the new item with key (257 INODE_REF 666)
      
          btrfs_listxattr's loop iteration sees that
          the type of the key pointed by the path is
          different from the type BTRFS_XATTR_ITEM_KEY
          and so it breaks the loop and stops looking
          for more xattr items
            --> the application doesn't get any xattr
                listed for our inode
      
      So fix this by breaking the loop only if the key's type is greater than
      BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      f1cd1f0b
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 1d512cb7
      Filipe Manana authored
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      1d512cb7
  3. 08 Nov, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix race leading to incorrect item deletion when dropping extents · aeafbf84
      Filipe Manana authored
      While running a stress test I got the following warning triggered:
      
        [191627.672810] ------------[ cut here ]------------
        [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
        (...)
        [191627.701485] Call Trace:
        [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
        [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
        [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
        [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
        [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
        [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
        [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
        [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
        [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
        [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
        [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
        [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
        [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
        [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
      
        $ cat -n fs/btrfs/file.c
        691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
        (...)
        758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
        759                  if (key.objectid > ino ||
        760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
        761                          break;
        762
        763                  fi = btrfs_item_ptr(leaf, path->slots[0],
        764                                      struct btrfs_file_extent_item);
        765                  extent_type = btrfs_file_extent_type(leaf, fi);
        766
        767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
        768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
        (...)
        774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        (...)
        778                  } else {
        779                          WARN_ON(1);
        780                          extent_end = search_start;
        781                  }
        (...)
      
      This happened because the item we were processing did not match a file
      extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
      case we cast the item to a struct btrfs_file_extent_item pointer and
      then find a type field value that does not match any of the expected
      values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
      due to a tiny time window where a race can happen as exemplified below.
      For example, consider the following scenario where we're using the
      NO_HOLES feature and we have the following two neighbour leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
      [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                slot N - 2         slot N - 1              slot 0
      
      Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
      than explicit because NO_HOLES is enabled). Now if our inode has an
      ordered extent for the range [4K, 8K[ that is finishing, the following
      can happen:
      
                CPU 1                                       CPU 2
      
        btrfs_finish_ordered_io()
          insert_reserved_file_extent()
            __btrfs_drop_extents()
               Searches for the key
                (257 EXTENT_DATA 4096) through
                btrfs_lookup_file_extent()
      
               Key not found and we get a path where
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
               Because path->slots[0] is >=
               btrfs_header_nritems(leaf X), we call
               btrfs_next_leaf()
      
               btrfs_next_leaf() releases the path
      
                                                        inserts key
                                                        (257 INODE_REF 4096)
                                                        at the end of leaf X,
                                                        leaf X now has N + 1 keys,
                                                        and the new key is at
                                                        slot N
      
               btrfs_next_leaf() searches for
               key (257 INODE_REF 256), with
               path->keep_locks set to 1,
               because it was the last key it
               saw in leaf X
      
                 finds it in leaf X again and
                 notices it's no longer the last
                 key of the leaf, so it returns 0
                 with path->nodes[0] == leaf X and
                 path->slots[0] == N (which is now
                 < btrfs_header_nritems(leaf X)),
                 pointing to the new key
                 (257 INODE_REF 4096)
      
               __btrfs_drop_extents() casts the
               item at path->nodes[0], slot
               path->slots[0], to a struct
               btrfs_file_extent_item - it does
               not skip keys for the target
               inode with a type less than
               BTRFS_EXTENT_DATA_KEY
               (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
      
               sees a bogus value for the type
               field triggering the WARN_ON in
               the trace shown above, and sets
               extent_end = search_start (4096)
      
               does the if-then-else logic to
               fixup 0 length extent items created
               by a past bug from hole punching:
      
                 if (extent_end == key.offset &&
                     extent_end >= search_start)
                     goto delete_extent_item;
      
               that evaluates to true and it ends
               up deleting the key pointed to by
               path->slots[0], (257 INODE_REF 4096),
               from leaf X
      
      The same could happen for example for a xattr that ends up having a key
      with an offset value that matches search_start (very unlikely but not
      impossible).
      
      So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
      skipped, never casted to struct btrfs_file_extent_item and never deleted
      by accident. Also protect against the unexpected case of getting a key
      for a lower inode number by skipping that key and issuing a warning.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      aeafbf84
  4. 05 Nov, 2015 4 commits
    • Filipe Manana's avatar
      Btrfs: fix sleeping inside atomic context in qgroup rescan worker · 3b2ba7b3
      Filipe Manana authored
      We are holding a btree path with spinning locks and then we attempt to
      clone an extent buffer, which calls kmem_cache_alloc() and this function
      can sleep, causing the following trace to be reported on a debug kernel:
      
      [107118.218536] BUG: sleeping function called from invalid context at mm/slab.c:2871
      [107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: kworker/u32:3
      [107118.226120] INFO: lockdep is turned off.
      [107118.226843] Preemption disabled at:[<ffffffffa05ffa22>] btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs]
      
      [107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs]
      [107118.236835]  0000000000000000 ffff880424bf3b78 ffffffff812566f4 0000000000000000
      [107118.238369]  ffff880424bf3ba0 ffffffff81070664 ffffffff817f1cd5 0000000000000b37
      [107118.239769]  0000000000000000 ffff880424bf3bc8 ffffffff8107070a 0000000000008850
      [107118.241244] Call Trace:
      [107118.241729]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [107118.242602]  [<ffffffff81070664>] ___might_sleep+0x23a/0x241
      [107118.243586]  [<ffffffff8107070a>] __might_sleep+0x9f/0xa6
      [107118.244532]  [<ffffffff8115af70>] cache_alloc_debugcheck_before+0x25/0x36
      [107118.245939]  [<ffffffff8115d52b>] kmem_cache_alloc+0x50/0x215
      [107118.246930]  [<ffffffffa05e627e>] __alloc_extent_buffer+0x2a/0x11f [btrfs]
      [107118.248121]  [<ffffffffa05ecb1a>] btrfs_clone_extent_buffer+0x3d/0xdd [btrfs]
      [107118.249451]  [<ffffffffa06239ea>] btrfs_qgroup_rescan_worker+0x16d/0x434 [btrfs]
      [107118.250755]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [107118.251754]  [<ffffffffa05f7952>] normal_work_helper+0x14c/0x32a [btrfs]
      [107118.252899]  [<ffffffffa05f7952>] ? normal_work_helper+0x14c/0x32a [btrfs]
      [107118.254195]  [<ffffffffa05f7c82>] btrfs_qgroup_rescan_helper+0x12/0x14 [btrfs]
      [107118.255436]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [107118.263690]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [107118.264888]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [107118.267413]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [107118.268417]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [107118.269505]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [107118.270491]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      
      So just use blocking locks for our path to solve this.
      This fixes the patch titled:
        "btrfs: qgroup: Don't copy extent buffer to do qgroup rescan"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      3b2ba7b3
    • Filipe Manana's avatar
      Btrfs: fix race waiting for qgroup rescan worker · 190631f1
      Filipe Manana authored
      We were initializing the completion (fs_info->qgroup_rescan_completion)
      object after releasing the qgroup rescan lock, which gives a small time
      window for a rescan waiter to not actually wait for the rescan worker
      to finish. Example:
      
               CPU 1                                                     CPU 2
      
       fs_info->qgroup_rescan_completion->done is 0
      
       btrfs_qgroup_rescan_worker()
         complete_all(&fs_info->qgroup_rescan_completion)
           sets fs_info->qgroup_rescan_completion->done
           to UINT_MAX / 2
      
       ... do some other stuff ....
      
       qgroup_rescan_init()
         mutex_lock(&fs_info->qgroup_rescan_lock)
         set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
           in fs_info->qgroup_flags
         mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                             btrfs_qgroup_wait_for_completion()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
                                                               sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
                                                                 in fs_info->qgroup_flags
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               wait_for_completion_interruptible(
                                                                 &fs_info->qgroup_rescan_completion)
      
                                                                 fs_info->qgroup_rescan_completion->done
                                                                 is > 0 so it returns immediately
      
        init_completion(&fs_info->qgroup_rescan_completion)
          sets fs_info->qgroup_rescan_completion->done to 0
      
      So fix this by initializing the completion object while holding the mutex
      fs_info->qgroup_rescan_lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      190631f1
    • Justin Maggard's avatar
      btrfs: qgroup: exit the rescan worker during umount · 7343dd61
      Justin Maggard authored
      I was hitting a consistent NULL pointer dereference during shutdown that
      showed the trace running through end_workqueue_bio().  I traced it back to
      the endio_meta_workers workqueue being poked after it had already been
      destroyed.
      
      Eventually I found that the root cause was a qgroup rescan that was still
      in progress while we were stopping all the btrfs workers.
      
      Currently we explicitly pause balance and scrub operations in
      close_ctree(), but we do nothing to stop the qgroup rescan.  We should
      probably be doing the same for qgroup rescan, but that's a much larger
      change.  This small change is good enough to allow me to unmount without
      crashing.
      Signed-off-by: default avatarJustin Maggard <jmaggard@netgear.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      7343dd61
    • Filipe Manana's avatar
      Btrfs: fix extent accounting for partial direct IO writes · 9c9464cc
      Filipe Manana authored
      When doing a write using direct IO we can end up not doing the whole write
      operation using the direct IO path, in that case we fallback to a buffered
      write to do the remaining IO. This happens for example if the range we are
      writing to contains a compressed extent.
      When we do a partial write and fallback to buffered IO, due to the
      existence of a compressed extent for example, we end up not adjusting the
      outstanding extents counter of our inode which ends up getting decremented
      twice, once by the DIO ordered extent for the partial write and once again
      by btrfs_direct_IO(), resulting in an arithmetic underflow at
      extent-tree.c:drop_outstanding_extent(). For example if we have:
      
        extents        [ prealloc extent ] [ compressed extent ]
        offsets        A        B          C       D           E
      
      and at the moment our inode's outstanding extents counter is 0, if we do a
      direct IO write against the range [B, D[ (which has a length smaller than
      128Mb), we end up bumping our inode's outstanding extents counter to 1, we
      create a DIO ordered extent for the range [B, C[ and then fallback to a
      buffered write for the range [C, D[. The direct IO handler
      (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
      1, leaving it with a value of 0, through a call to
      btrfs_delalloc_release_space() and then shortly after the DIO ordered
      extent finishes and calls btrfs_delalloc_release_metadata() which ends
      up to attempt to decrement the inode's outstanding extents counter by 1,
      resulting in an assertion failure at drop_outstanding_extent() because
      the operation would result in an arithmetic underflow (0 - 1). This
      produces the following trace:
      
        [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
        [125471.338844] ------------[ cut here ]------------
        [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
        [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
        [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
        [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
        [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
        [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
        [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
        [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
        [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
        [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
        [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
        [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
        [125471.340745] Stack:
        [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
        [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
        [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
        [125471.340745] Call Trace:
        [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
        [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
        [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
        [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
        [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
        [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
        [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
        [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745]  RSP <ffff88040a11bc78>
        [125471.539620] ---[ end trace 144259f7838b4aa4 ]---
      
      So fix this by ensuring we adjust the outstanding extents counter when we
      do the fallback just like we do for the case where the whole write can be
      done through the direct IO path.
      
      We were also adjusting the outstanding extents counter by a constant value
      of 1, which is incorrect because we were ignorning that we account extents
      in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.
      
      The following test case for fstests reproduces this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "falloc"
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create a compressed extent covering the range [700K, 800K[.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Create prealloc extent covering the range [600K, 700K[.
        $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
      
        # Write 80K of data to the range [640K, 720K[ using direct IO. This
        # range covers both the prealloc extent and the compressed extent.
        # Because there's a compressed extent in the range we are writing to,
        # the DIO write code path ends up only writing the first 60k of data,
        # which goes to the prealloc extent, and then falls back to buffered IO
        # for writing the remaining 20K of data - because that remaining data
        # maps to a file range containing a compressed extent.
        # When falling back to buffered IO, we used to trigger an assertion when
        # releasing reserved space due to bad accounting of the inode's
        # outstanding extents counter, which was set to 1 but we ended up
        # decrementing it by 1 twice, once through the ordered extent for the
        # 60K of data we wrote using direct IO, and once through the main direct
        # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
        # wrote less than 80K of data (60K).
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now similar test as above but for very large write operations. This
        # triggers special cases for an inode's outstanding extents accounting,
        # as internally btrfs logically splits extents into 128Mb units.
        $XFS_IO_PROG -f -s \
            -c "pwrite -S 0xaa -b 128M 258M 128M" \
            -c "falloc 0 258M" \
            $SCRATCH_MNT/bar | _filter_xfs_io
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
            | _filter_xfs_io
      
        # Now verify the file contents are correct and that they are the same
        # even after unmounting and mounting the fs again (or evicting the page
        # cache).
        #
        # For file foo, all bytes in the range [0, 640K[ must have a value of
        # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
        # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
        #
        # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
        # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
        # bytes in the range [259M, 386M[ must have a value of 0xaa.
        #
        echo "File digests before remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
        _scratch_remount
        echo "File digests after remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
      
        status=0
        exit
      
      Fixes: e1cbbfa5 ("Btrfs: fix outstanding_extents accounting in DIO")
      Fixes: 3e05bde8 ("Btrfs: only adjust outstanding_extents when we do a short write")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      9c9464cc
  5. 03 Nov, 2015 3 commits
    • Filipe Manana's avatar
      Btrfs: fix hole punching when using the no-holes feature · 2959a32a
      Filipe Manana authored
      When we are using the no-holes feature, if we punch a hole into a file
      range that already contains a hole which overlaps the range we are passing
      to fallocate(), we end up removing the extent map that represents the
      existing hole without adding a new one. This happens because with the
      no-holes feature we do not have explicit extent items to represent holes
      and therefore the call to __btrfs_drop_extents(), made from
      btrfs_punch_hole(), returns an end offset to the variable drop_end that
      is smaller than the end of the range passed to fallocate(), while it
      drops all existing extent maps in that range.
      Normally having a missing extent map is not a problem, for example for
      a readpages() operation we just end up building the extent map by
      looking at the fs/subvol tree for a matching extent item (or a lack of
      one for implicit holes). However for an fsync that uses the fast path,
      which needs to look at the list of modified extent maps, this means
      the fsync will not record information about the complete hole we had
      before the fallocate() call into the log tree, resulting in a file with
      content/layout that does not match what we had neither before nor after
      the hole punch operation.
      
      The following test case for fstests reproduces the issue. It fails without
      this change because we get a file with a different digest after the fsync
      log replay and also with a different extent/hole layout.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
           _cleanup_flakey
           rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/punch
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "fpunch"
        _require_xfs_io_command "fiemap"
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create out test file with some data and then fsync it.
        # We do the fsync only to make sure the last fsync we do in this test
        # triggers the fast code path of btrfs' fsync implementation, a
        # condition necessary to trigger the bug btrfs had.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
                        -c "fsync"                  \
                        $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now punch a hole against the range [96K, 128K[.
        $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the previous range
        # and ends beyond eof.
        $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the first range
        # ([96K, 128K[) and ends at eof.
        $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
      
        # Fsync our file. We want to verify that, after a power failure and
        # mounting the filesystem again, the file content reflects all the hole
        # punch operations.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap before power failure:"
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        # Silently drop all writes and umount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest we got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap after log replay:"
        # Must match the same extent listing we got before the power failure.
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2959a32a
    • chandan's avatar
      Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state · 13a0db5a
      chandan authored
      When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
      and nodesize set to 64k), the following call trace is observed,
      
      WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
      Modules linked in:
      CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d9 #54
      task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
      NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
      REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d9)
      MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
      CFAR: c0000000004a3b30 SOFTE: 1
      GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
      GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
      GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
      GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
      GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
      GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
      GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
      GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
      NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
      LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      Call Trace:
      [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
      [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
      [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
      [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
      [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
      [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
      [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
      [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
      [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
      [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
      [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
      [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
      [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
      [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
      [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
      [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
      [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
      [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
      [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
      [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
      [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
      [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
      [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
      [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
      Instruction dump:
      fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
      812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
      
      The above call trace is seen even on x86_64; albeit very rarely and that too
      with nodesize set to 64k and with nospace_cache mount option being used.
      
      The reason for the above call trace is,
      btrfs_remove_chunk
        check_system_chunk
          Allocate chunk if required
        For each physical stripe on underlying device,
          btrfs_free_dev_extent
            ...
            Take lock on Device tree's root node
            btrfs_cow_block("dev tree's root node");
              btrfs_reserve_extent
                find_free_extent
      	    index = BTRFS_RAID_DUP;
      	    have_caching_bg = false;
      
                  When in LOOP_CACHING_NOWAIT state, Assume we find a block group
      	    which is being cached; Hence have_caching_bg is set to true
      
                  When repeating the search for the next RAID index, we set
      	    have_caching_bg to false.
      
      Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
      skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
      allocate a chunk and try to add entries corresponding to the chunk's physical
      stripe into the device tree. When doing so the task deadlocks itself waiting
      for the blocking lock on the root node of the device tree.
      
      This commit fixes the issue by introducing a new local variable to help
      indicate as to whether a block group of any RAID type is being cached.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      13a0db5a
    • Qu Wenruo's avatar
      btrfs: Fix a data space underflow warning · 485290a7
      Qu Wenruo authored
      Even with quota disabled, generic/127 will trigger a kernel warning by
      underflow data space info.
      
      The bug is caused by buffered write, which in case of short copy, the
      start parameter for btrfs_delalloc_release_space() is wrong, and
      round_up/down() in btrfs_delalloc_release() extents the range to page
      aligned, decreasing one more page than expected.
      
      This patch will fix it by passing correct start.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      485290a7
  6. 27 Oct, 2015 10 commits
  7. 25 Oct, 2015 2 commits
    • Filipe Manana's avatar
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarElias Probst <mail@eliasprobst.eu>
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • Filipe Manana's avatar
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  8. 22 Oct, 2015 13 commits
    • Chris Mason's avatar
      Merge branch 'allocator-fixes' into for-linus-4.4 · a9e6d153
      Chris Mason authored
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a9e6d153
    • Josef Bacik's avatar
      Btrfs: don't do extra bitmap search in one bit case · 0584f718
      Josef Bacik authored
      When we make ctl->unit allocations from a bitmap there is no point in searching
      for the next 0 in the bitmap.  If we've found a bit we're done and can just exit
      the loop.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0584f718
    • Josef Bacik's avatar
      Btrfs: keep track of largest extent in bitmaps · cef40483
      Josef Bacik authored
      We can waste a lot of time searching through bitmaps when we are heavily
      fragmented trying to find large contiguous areas that don't exist in the bitmap.
      So keep track of the max extent size when we do a full search of a bitmap so
      that next time around we can just skip the expensive searching if our max size
      is less than what we are looking for.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cef40483
    • Josef Bacik's avatar
      Btrfs: don't keep trying to build clusters if we are fragmented · c759c4e1
      Josef Bacik authored
      If we are extremely fragmented then we won't be able to create a free_cluster.
      So if this happens set last_ptr->fragmented so that all future allcations will
      give up trying to create a cluster.  When we unpin extents we will unset
      ->fragmented if we free up a sufficient amount of space in a block group.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c759c4e1
    • Josef Bacik's avatar
      Btrfs: cut down on loops through the allocator · a5e681d9
      Josef Bacik authored
      We try really really hard to make allocations, but sometimes it is just not
      going to happen, especially when free space is extremely fragmented.  So add a
      few short cuts through the looping states.  For example if we couldn't allocate
      a chunk, just go straight to the NO_EMPTY_SIZE loop.  If there are no uncached
      block groups and we've done a full search, go straight to the ALLOC_CHUNK stage.
      And finally if we already have empty_size and empty_cluster set to 0 go ahead
      and return -ENOSPC.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a5e681d9
    • Josef Bacik's avatar
      Btrfs: don't continue setting up space cache when enospc · 2968b1f4
      Josef Bacik authored
      If we hit ENOSPC when setting up a space cache don't bother setting up any of
      the other space cache's in this transaction, it'll just induce unnecessary
      latency.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2968b1f4
    • Josef Bacik's avatar
      Btrfs: keep track of max_extent_size per space_info · 4f4db217
      Josef Bacik authored
      When we are heavily fragmented we can induce a lot of latency trying to make an
      allocation happen that is simply not going to happen.  Thankfully we keep track
      of our max_extent_size when going through the allocator, so if we get to the
      point where we are exiting find_free_extent with ENOSPC then set our
      space_info->max_extent_size so we can keep future allocations from having to pay
      this cost.  We reset the max_extent_size whenever we release pinned bytes back
      into this space info so we can redo all the work.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4f4db217
    • Josef Bacik's avatar
      Btrfs: don't loop in allocator for space cache · 36af4e07
      Josef Bacik authored
      The space cache needs to have contiguous allocations, and the allocator tries to
      make allocations by reducing the amount of bytes requested and re-searching.
      But this just makes us waste time when we are very fragmented, so if we can't
      find our space just exit, don't bother trying to search again.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      36af4e07
    • Josef Bacik's avatar
      Btrfs: add a flags field to btrfs_transaction · 3204d33c
      Josef Bacik authored
      I want to set some per transaction flags, so instead of adding yet another int
      lets just convert the current two int indicators to flags and add a flags field
      for future use.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3204d33c
    • Josef Bacik's avatar
      Btrfs: fix prealloc under heavy fragmentation conditions · 0b670dc4
      Josef Bacik authored
      If we are heavily fragmented we will continually try to prealloc the largest
      extent size we can every time we call btrfs_reserve_extent.  This can be very
      expensive when we are heavily fragmented, burning lots of CPU cycles and loops
      through the allocator.  So instead notice when we get a smaller chunk from the
      allocator than what we specified and use this as the new maximum size we try to
      allocate.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0b670dc4
    • Josef Bacik's avatar
      Btrfs: add fragment=* debug mount option · d0bd4560
      Josef Bacik authored
      In tracking down these weird bitmap problems it was helpful to artificially
      create an extremely fragmented file system.  These mount options let us either
      fragment data or metadata or both.  With these options I could reproduce all
      sorts of weird latencies and hangs that occur under extreme fragmentation and
      get them fixed.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d0bd4560
    • Josef Bacik's avatar
      Btrfs: fix qgroup sanity tests · d9ee522b
      Josef Bacik authored
      With my changes to allow us to find old roots when resolving indirect refs I
      introduced a regression to the sanity tests.  Since we don't really care to go
      down into the fs roots we just need to have the old behavior of returning ENOENT
      for dummy roots for the sanity tests.  In the future if we want to get fancy we
      can populate the test fs trees with the references as well.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d9ee522b
    • Josef Bacik's avatar
      Btrfs: change how we wait for pending ordered extents · 161c3549
      Josef Bacik authored
      We have a mechanism to make sure we don't lose updates for ordered extents that
      were logged in the transaction that is currently running.  We add the ordered
      extent to a transaction list and then the transaction waits on all the ordered
      extents in that list.  However are substantially large file systems this list
      can be extremely large, and can give us soft lockups, since the ordered extents
      don't remove themselves from the list when they do complete.
      
      To fix this we simply add a counter to the transaction that is incremented any
      time we have a logged extent that needs to be completed in the current
      transaction.  Then when the ordered extent finally completes it decrements the
      per transaction counter and wakes up the transaction if we are the last ones.
      This will eliminate the softlockup.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      161c3549