1. 17 Mar, 2015 5 commits
    • Josef Bacik's avatar
      Btrfs: fix outstanding_extents accounting in DIO · e1cbbfa5
      Josef Bacik authored
      We are keeping track of how many extents we need to reserve properly based on
      the amount we want to write, but we were still incrementing outstanding_extents
      if we wrote less than what we requested.  This isn't quite right since we will
      be limited to our max extent size.  So instead lets do something horrible!  Keep
      track of how many outstanding_extents we reserved, and decrement each time we
      allocate an extent.  If we use our entire reserve make sure to jack up
      outstanding_extents on the inode so the accounting works out properly.  Thanks,
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      e1cbbfa5
    • Josef Bacik's avatar
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Josef Bacik authored
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      6a3891c5
    • Josef Bacik's avatar
      Btrfs: just free dummy extent buffers · bcb7e449
      Josef Bacik authored
      If we fail during our sanity tests we could get NULL deref's because we unload
      the module before the dummy extent buffers are free'd via RCU.  So check for
      this case and just free the things directly.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      bcb7e449
    • Josef Bacik's avatar
      Btrfs: account merges/splits properly · ba117213
      Josef Bacik authored
      My fix
      
      Btrfs: fix merge delalloc logic
      
      only fixed half of the problems, it didn't fix the case where we have two large
      extents on either side and then join them together with a new small extent.  We
      need to instead keep track of how many extents we have accounted for with each
      side of the new extent, and then see how many extents we need for the new large
      extent.  If they match then we know we need to keep our reservation, otherwise
      we need to drop our reservation.  This shows up with a case like this
      
      [BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
      
      Previously the logic would have said that the number extents required for the
      new size (3) is larger than the number of extents required for the largest side
      (2) therefore we need to keep our reservation.  But this isn't the case, since
      both sides require a reservation of 2 which leads to 4 for the whole range
      currently reserved, but we only need 3, so we need to drop one of the
      reservations.  The same problem existed for splits, we'd think we only need 3
      extents when creating the hole but in reality we need 4.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      ba117213
    • Josef Bacik's avatar
      Btrfs: prepare block group cache before writing · dcdf7f6d
      Josef Bacik authored
      Writing the block group cache will modify the extent tree quite a bit because it
      truncates the old space cache and pre-allocates new stuff.  To try and cut down
      on the churn lets do the setup dance first, then later on hopefully we can avoid
      looping with newly dirtied roots.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      dcdf7f6d
  2. 13 Mar, 2015 6 commits
  3. 06 Mar, 2015 3 commits
    • Quentin Casasnovas's avatar
      Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · dd9ef135
      Quentin Casasnovas authored
      Improper arithmetics when calculting the address of the extended ref could
      lead to an out of bounds memory read and kernel panic.
      Signed-off-by: default avatarQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      cc: stable@vger.kernel.org # v3.7+
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      dd9ef135
    • Filipe Manana's avatar
      Btrfs: fix data loss in the fast fsync path · 3a8b36f3
      Filipe Manana authored
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      
      2. do a buffered write;
      
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
         btrfs_set_inode_last_trans);
      
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
         have:
      
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
      
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodicaly commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
      
        # Make sure everything is durably persisted.
        sync
      
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
      
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
      
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0040000
      
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
      
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
      
        1. write to file
      
        2. fsync file
      
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
      
        4. write to file
             |--> btrfs_file_write_iter() no longers sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      
      A test case for xfstests will be sent soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3a8b36f3
    • Josef Bacik's avatar
      Btrfs: remove extra run_delayed_refs in update_cowonly_root · f5c0a122
      Josef Bacik authored
      This got added with my dirty_bgs patch, it's not needed.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f5c0a122
  4. 02 Mar, 2015 7 commits
    • Filipe Manana's avatar
      Btrfs: incremental send, don't rename a directory too soon · 84471e24
      Filipe Manana authored
      There's one more case where we can't issue a rename operation for a
      directory as soon as we process it. We used to delay directory renames
      only if they have some ancestor directory with a higher inode number
      that got renamed too, but there's another case where we need to delay
      the rename too - when a directory A is renamed to the old name of a
      directory B but that directory B has its rename delayed because it
      has now (in the send root) an ancestor with a higher inode number that
      was renamed. If we don't delay the directory rename in this case, the
      receiving end of the send stream will attempt to rename A to the old
      name of B before B got renamed to its new name, which results in a
      "directory not empty" error. So fix this by delaying directory renames
      for this case too.
      
      Steps to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/a
        $ mkdir /mnt/b
        $ mkdir /mnt/c
        $ touch /mnt/a/file
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ mv /mnt/c /mnt/x
        $ mv /mnt/a /mnt/x/y
        $ mv /mnt/b /mnt/a
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 -f /tmp/1.send
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt2
        $ btrfs receive /mnt2 -f /tmp/1.send
        $ btrfs receive /mnt2 -f /tmp/2.send
        ERROR: rename b -> a failed. Directory not empty
      
      A test case for xfstests follows soon.
      Reported-by: default avatarAmes Cornish <ames@cornishes.net>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      84471e24
    • David Sterba's avatar
      btrfs: fix lost return value due to variable shadowing · 1932b7be
      David Sterba authored
      A block-local variable stores error code but btrfs_get_blocks_direct may
      not return it in the end as there's a ret defined in the function scope.
      
      CC: <stable@vger.kernel.org>	# 3.6+
      Fixes: d187663e ("Btrfs: lock extents as we map them in DIO")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1932b7be
    • Filipe Manana's avatar
      Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr · 5cdf83ed
      Filipe Manana authored
      The return value from btrfs_lookup_xattr() can be a pointer encoding an
      error, therefore deal with it. This fixes commit 5f5bc6b1
      ("Btrfs: make xattr replace operations atomic").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5cdf83ed
    • Filipe Manana's avatar
      Btrfs: fix off-by-one logic error in btrfs_realloc_node · 5dfe2be7
      Filipe Manana authored
      The end_slot variable actually matches the number of pointers in the
      node and not the last slot (which is 'nritems - 1'). Therefore in order
      to check that the current slot in the for loop doesn't match the last
      one, the correct logic is to check if 'i' is less than 'end_slot - 1'
      and not 'end_slot - 2'.
      
      Fix this and set end_slot to be 'nritems - 1', as it's less confusing
      since the variable name implies it's inclusive rather then exclusive.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5dfe2be7
    • Filipe Manana's avatar
      Btrfs: add missing inode update when punching hole · e8c1c76e
      Filipe Manana authored
      When punching a file hole if we endup only zeroing parts of a page,
      because the start offset isn't a multiple of the sector size or the
      start offset and length fall within the same page, we were not updating
      the inode item. This prevented an fsync from doing anything, if no other
      file changes happened in the current transaction, because the fields
      in btrfs_inode used to check if the inode needs to be fsync'ed weren't
      updated.
      
      This issue is easy to reproduce and the following excerpt from the
      xfstest case I made shows how to trigger it:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file.
        $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, this makes btrfs update some btrfs inode specific fields
        # that are used to track if the inode needs to be written/updated to the fsync
        # log or not. After this fsync, the new values for those fields indicate that
        # a subsequent fsync does not need to touch the fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Force a commit of the current transaction. After this point, any operation
        # that modifies the data or metadata of our file, should update those fields in
        # the btrfs inode with values that make the next fsync operation write to the
        # fsync log.
        sync
      
        # Punch a hole in our file. This small range affects only 1 page.
        # This made the btrfs hole punching implementation write only some zeroes in
        # one page, but it did not update the btrfs inode fields used to determine if
        # the next fsync needs to write to the fsync log.
        $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo
      
        # Another variation of the previously mentioned case.
        $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo
      
        # Now fsync the file. This was a no-operation because the previous hole punch
        # operation didn't update the inode's fields mentioned before, so they remained
        # with the values they had after the first fsync - that is, they indicate that
        # it is not needed to write to fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Enable writes and mount the fs. This makes the fsync log replay code run.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Because the last fsync didn't do anything, here the file content matched what
        # it was after the first fsync, before the holes were punched, and not what it
        # was after the holes were punched.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This issue has been around since 2012, when the punch hole implementation
      was added, commit 2aaa6655 ("Btrfs: add hole punching").
      
      A test case for xfstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e8c1c76e
    • Josef Bacik's avatar
      Btrfs: abort the transaction if we fail to update the free space cache inode · 0c0ef4bc
      Josef Bacik authored
      Our gluster boxes were hitting a problem where they'd run out of space when
      updating the block group cache and therefore wouldn't be able to update the free
      space inode.  This is a problem because this is how we invalidate the cache and
      protect ourselves from errors further down the stack, so if this fails we have
      to abort the transaction so we make sure we don't end up with stale free space
      cache.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0c0ef4bc
    • Filipe Manana's avatar
      Btrfs: fix fsync race leading to ordered extent memory leaks · 4d884fce
      Filipe Manana authored
      We can have multiple fsync operations against the same file during the
      same transaction and they can collect the same ordered extents while they
      don't complete (still accessible from the inode's ordered tree). If this
      happens, those ordered extents will never get their reference counts
      decremented to 0, leading to memory leaks and inode leaks (an iput for an
      ordered extent's inode is scheduled only when the ordered extent's refcount
      drops to 0). The following sequence diagram explains this race:
      
               CPU 1                                         CPU 2
      
      btrfs_sync_file()
      
                                                       btrfs_sync_file()
      
        mutex_lock(inode->i_mutex)
        btrfs_log_inode()
          btrfs_get_logged_extents()
            --> collects ordered extent X
            --> increments ordered
                extent X's refcount
          btrfs_submit_logged_extents()
        mutex_unlock(inode->i_mutex)
      
                                                         mutex_lock(inode->i_mutex)
        btrfs_sync_log()
           btrfs_wait_logged_extents()
             --> list_del_init(&ordered->log_list)
                                                           btrfs_log_inode()
                                                             btrfs_get_logged_extents()
                                                               --> Adds ordered extent X
                                                                   to logged_list because
                                                                   at this point:
                                                                   list_empty(&ordered->log_list)
                                                                   && test_bit(BTRFS_ORDERED_LOGGED,
                                                                               &ordered->flags) == 0
                                                               --> Increments ordered extent
                                                                   X's refcount
             --> check if ordered extent's io is
                 finished or not, start it if
                 necessary and wait for it to finish
             --> sets bit BTRFS_ORDERED_LOGGED
                 on ordered extent X's flags
                 and adds it to trans->ordered
        btrfs_sync_log() finishes
      
                                                             btrfs_submit_logged_extents()
                                                           btrfs_log_inode() finishes
                                                         mutex_unlock(inode->i_mutex)
      
      btrfs_sync_file() finishes
      
                                                         btrfs_sync_log()
                                                            btrfs_wait_logged_extents()
                                                              --> Sees ordered extent X has the
                                                                  bit BTRFS_ORDERED_LOGGED set in
                                                                  its flags
                                                              --> X's refcount is untouched
                                                         btrfs_sync_log() finishes
      
                                                       btrfs_sync_file() finishes
      
      btrfs_commit_transaction()
        --> called by transaction kthread for e.g.
        btrfs_wait_pending_ordered()
          --> waits for ordered extent X to
              complete
          --> decrements ordered extent X's
              refcount by 1 only, corresponding
              to the increment done by the fsync
              task ran by CPU 1
      
      In the scenario of the above diagram, after the transaction commit,
      the ordered extent will remain with a refcount of 1 forever, leaking
      the ordered extent structure and preventing the i_count of its inode
      from ever decreasing to 0, since the delayed iput is scheduled only
      when the ordered extent's refcount drops to 0, preventing the inode
      from ever being evicted by the VFS.
      
      Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
      mean that an ordered extent is already being processed by an fsync call,
      which will attach it to the current transaction, preventing it from being
      collected by subsequent fsync operations against the same inode.
      
      This race was introduced with the following change (added in 3.19 and
      backported to stable 3.18 and 3.17):
      
        Btrfs: make sure logged extents complete in the current transaction V3
        commit 50d9aa99
      
      I ran into this issue while running xfstests/generic/113 in a loop, which
      failed about 1 out of 10 runs with the following warning in dmesg:
      
      [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
      [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
      l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
      ppy e1000 scsi_mod [last unloaded: btrfs]
      [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
      [ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
      [ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
      [ 2612.463163] Call Trace:
      [ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
      [ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
      [ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
      [ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
      [ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
      [ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
      [ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
      [ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
      [ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
      [ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
      [ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
      [ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
      [ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
      [ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
      [ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
      [ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
      [ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
      [ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
      [ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
      [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
      [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...
      
      Kmemleak confirmed the ordered extent leak (and btrfs inode specific
      structures such as delayed nodes):
      
      $ cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff880154290db0 (size 576):
        comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
        hex dump (first 32 bytes):
          01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
          00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
        backtrace:
          [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
          [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
          [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
          [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
          [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
          [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
          [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
          [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
          [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
          [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
          [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
          [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff88014ef11db0 (size 576):
        comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
        hex dump (first 32 bytes):
          02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
        backtrace:
          [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
          [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
          [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
          [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
          [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
          [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
          [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
          [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
          [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
          [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
          [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
          [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff8800336feda8 (size 584):
        comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
        hex dump (first 32 bytes):
          00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
          00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8114eb34>] create_object+0x172/0x29a
          [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
          [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
          [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
          [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
          [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
          [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
          [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
          [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
          [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
          [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
          [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
          [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
          [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
          [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
      
      CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4d884fce
  5. 20 Feb, 2015 1 commit
  6. 14 Feb, 2015 9 commits
    • Filipe Manana's avatar
      Btrfs: don't remove extents and xattrs when logging new names · a742994a
      Filipe Manana authored
      If we are recording in the tree log that an inode has new names (new hard
      links were added), we would drop items, belonging to the inode, that we
      shouldn't:
      
      1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
         flags, we ended up dropping all the extent and xattr items that were
         previously logged. This was done only in memory, since logging a new
         name doesn't imply syncing the log;
      
      2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
         flags, we ended up dropping all the xattr items that were previously
         logged. Like the case before, this was done only in memory because
         logging a new name doesn't imply syncing the log.
      
      This led to some surprises in scenarios such as the following:
      
      1) write some extents to an inode;
      2) fsync the inode;
      3) truncate the inode or delete/modify some of its xattrs
      4) add a new hard link for that inode
      5) fsync some other file, to force the log tree to be durably persisted
      6) power failure happens
      
      The next time the fs is mounted, the fsync log replay code is executed,
      and the resulting file doesn't have the content it had when the last fsync
      against it was performed, instead if has a content matching what it had
      when the last transaction commit happened.
      
      So change the behaviour such that when a new name is logged, only the inode
      item and reference items are processed.
      
      This is easy to reproduce with the test I just made for xfstests, whose
      main body is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with some data.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Make sure the file is durably persisted.
        sync
      
        # Append some data to our file, to increase its size.
        $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, so from this point on if a crash/power failure happens, our
        # new data is guaranteed to be there next time the fs is mounted.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Now shrink our file to 5000 bytes.
        $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo
      
        # Now do an expanding truncate to a size larger than what we had when we last
        # fsync'ed our file. This is just to verify that after power failure and
        # replaying the fsync log, our file matches what it was when we last fsync'ed
        # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
        # data had a value of 0xcc.
        $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo
      
        # Add one hard link to our file. This made btrfs drop all of our file's
        # metadata from the fsync log, including the metadata relative to the
        # extent we just wrote and fsync'ed. This change was made only to the fsync
        # log in memory, so adding the hard link alone doesn't change the persisted
        # fsync log. This happened because the previous truncates set the runtime
        # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now make sure the in memory fsync log is durably persisted.
        # Creating and fsync'ing another file will do it.
        # After this our persisted fsync log will no longer have metadata for our file
        # foo that points to the extent we wrote and fsync'ed before.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # As expected, before the crash/power failure, we should be able to see a file
        # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
        # the remaining bytes with value 0x00.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After mounting the fs again, the fsync log was replayed.
        # The expected result is to see a file with a size of 12Kb, with its first 8Kb
        # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
        # The btrfs bug used to leave the file as it used te be as of the last
        # transaction commit - that is, with a size of 8Kb with all bytes having a
        # value of 0xaa.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      The test case for xfstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a742994a
    • Filipe Manana's avatar
      Btrfs: fix fsync data loss after adding hard link to inode · 1a4bcf47
      Filipe Manana authored
      We have a scenario where after the fsync log replay we can lose file data
      that had been previously fsync'ed if we added an hard link for our inode
      and after that we sync'ed the fsync log (for example by fsync'ing some
      other file or directory).
      
      This is because when adding an hard link we updated the inode item in the
      log tree with an i_size value of 0. At that point the new inode item was
      in memory only and a subsequent fsync log replay would not make us lose
      the file data. However if after adding the hard link we sync the log tree
      to disk, by fsync'ing some other file or directory for example, we ended
      up losing the file data after log replay, because the inode item in the
      persisted log tree had an an i_size of zero.
      
      This is easy to reproduce, and the following excerpt from my test for
      xfstests shows this:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create one file with data and fsync it.
        # This made the btrfs fsync log persist the data and the inode metadata with
        # a correct inode->i_size (4096 bytes).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now add one hard link to our file. This made the btrfs code update the fsync
        # log, in memory only, with an inode metadata having a size of 0.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now force persistence of the fsync log to disk, for example, by fsyncing some
        # other file.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # Before a power loss or crash, we could read the 4Kb of data from our file as
        # expected.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After the fsync log replay, because the fsync log had a value of 0 for our
        # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
        # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      Another alternative test, that doesn't need to fsync an inode in the same
      transaction it was created, is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with some data.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Make sure the file is durably persisted.
        sync
      
        # Append some data to our file, to increase its size.
        $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, so from this point on if a crash/power failure happens, our
        # new data is guaranteed to be there next time the fs is mounted.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Add one hard link to our file. This made btrfs write into the in memory fsync
        # log a special inode with generation 0 and an i_size of 0 too. Note that this
        # didn't update the inode in the fsync log on disk.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now make sure the in memory fsync log is durably persisted.
        # Creating and fsync'ing another file will do it.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # As expected, before the crash/power failure, we should be able to read the
        # 12Kb of file data.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After mounting the fs again, the fsync log was replayed.
        # The btrfs fsync log replay code didn't update the i_size of the persisted
        # inode because the inode item in the log had a special generation with a
        # value of 0 (and it couldn't know the correct i_size, since that inode item
        # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
        # effectively lost.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This isn't a new issue/regression. This problem has been around since the
      log tree code was added in 2008:
      
        Btrfs: Add a write ahead tree log to optimize synchronous operations
        (commit e02119d5)
      
      Test cases for xfstests follow soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1a4bcf47
    • Forrest Liu's avatar
      Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group · 3d84be79
      Forrest Liu authored
      Removing large amount of block group in a transaction may encounters
      BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
      will grab metadata reservation from transaction handle, and
      btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when
      delete unused block group.
      
      The problem can be reproduce by following script
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 1000g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          for j in `seq 1 1 1000`; do
              fallocate -l 1g $mntpath/$j
          done
          # wait cleaner thread remove unused block group
          sleep 300
      
      The call trace that results from the BUG_ON() is:
      
      [  613.093084] ------------[ cut here ]------------
      [  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
      [  613.105855] invalid opcode: 0000 [#1] SMP
      [  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
      [  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
      [  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
      [  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
      [  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
      [  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
      [  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
      [  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
      [  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
      [  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
      [  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
      [  613.177554] Stack:
      [  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
      [  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
      [  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
      [  613.186171] Call Trace:
      [  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
      [  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
      [  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
      [  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
      [  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
      [  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
      [  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
      [  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
      [  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
      [  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.218828]  RSP <ffff880039cf7c48>
      [  613.220382] ---[ end trace 71073106deb8a457 ]---
      
      This patch replace btrfs_join_transaction() with btrfs_start_transaction() in
      btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add()
      Signed-off-by: default avatarForrest Liu <forrestl@synology.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3d84be79
    • Josef Bacik's avatar
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik authored
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      dcab6a3b
    • Josef Bacik's avatar
      Btrfs: don't set and clear delalloc for O_DIRECT writes · 3266789f
      Josef Bacik authored
      We do this to get the space accounting, but this is just needless churn on the
      io_tree, so just drop setting/clearing delalloc and just drop the reserved data
      space when we have a successfull allocation.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3266789f
    • Josef Bacik's avatar
      Btrfs: only adjust outstanding_extents when we do a short write · 3e05bde8
      Josef Bacik authored
      We have this weird dance where we always inc outstanding_extents when we do a
      O_DIRECT write, even if we allocate the entire range.  To get around this we
      also drop the metadata space if we successfully write.  This is an unnecessary
      dance, we only need to jack up outstanding_extents if we don't satisfy the
      entire range request in get_blocks_direct, otherwise we are good using our
      original reservation.  So drop the unconditional inc and the drop of the
      metadata space that we have for the unconditional inc.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3e05bde8
    • Zhao Lei's avatar
      btrfs: Fix out-of-space bug · 13212b54
      Zhao Lei authored
      Btrfs will report NO_SPACE when we create and remove files for several times,
      and we can't write to filesystem until mount it again.
      
      Steps to reproduce:
       1: Create a single-dev btrfs fs with default option
       2: Write a file into it to take up most fs space
       3: Delete above file
       4: Wait about 100s to let chunk removed
       5: goto 2
      
      Script is like following:
       #!/bin/bash
      
       # Recommend 1.2G space, too large disk will make test slow
       DEV="/dev/sda16"
       MNT="/mnt/tmp"
      
       dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
       file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
      
       echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
      
       for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
       echo "mkfs $DEV"
       mkfs.btrfs -f "$DEV" >/dev/null || exit 2
       echo "mount $DEV $MNT"
       mount "$DEV" "$MNT" || exit 2
      
       for ((loop_i = 0; loop_i < 20; loop_i++)); do
           echo
           echo "loop $loop_i"
      
           echo "dd file..."
           cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
           "${cmd[@]}" 2>/dev/null || {
               # NO_SPACE error triggered
               echo "dd failed: ${cmd[*]}"
               exit 1
           }
      
           echo "rm file..."
           rm -f "$MNT"/file0 || exit 2
      
           for ((i = 0; i < 10; i++)); do
               df "$MNT" | tail -1
               sleep 10
           done
       done
      
      Reason:
       It is triggered by commit: 47ab2a6c
       which is used to remove empty block groups automatically, but the
       reason is not in that patch. Code before works well because btrfs
       don't need to create and delete chunks so many times with high
       complexity.
       Above bug is caused by many reason, any of them can trigger it.
      
      Reason1:
       When we remove some continuous chunks but leave other chunks after,
       these disk space should be used by chunk-recreating, but in current
       code, only first create will successed.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason2:
       contains_pending_extent() return wrong value in calculation.
       Fixed by Forrest Liu <forrestl@synology.com> in:
       Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
      
      Reason3:
       btrfs_check_data_free_space() try to commit transaction and retry
       allocating chunk when the first allocating failed, but space_info->full
       is set in first allocating, and prevent second allocating in retry.
       Fixed in this patch by clear space_info->full in commit transaction.
      
       Tested for severial times by above script.
      
      Changelog v3->v4:
       use light weight int instead of atomic_t to record have_remove_bgs in
       transaction, suggested by:
       Josef Bacik <jbacik@fb.com>
      
      Changelog v2->v3:
       v2 fixed the bug by adding more commit-transaction, but we
       only need to reclaim space when we are really have no space for
       new chunk, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
      
       Actually, our code already have this type of commit-and-retry,
       we only need to make it working with removed-bgs.
       v3 fixed the bug with above way.
      
      Changelog v1->v2:
       v1 will introduce a new bug when delete and create chunk in same disk
       space in same transaction, noticed by:
       Filipe David Manana <fdmanana@gmail.com>
       V2 fix this bug by commit transaction after remove block grops.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Suggested-by: default avatarFilipe David Manana <fdmanana@gmail.com>
      Suggested-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      13212b54
    • Filipe Manana's avatar
      Btrfs: scrub, fix sleep in atomic context · f55985f4
      Filipe Manana authored
      My previous patch "Btrfs: fix scrub race leading to use-after-free"
      introduced the possibility to sleep in an atomic context, which happens
      when the scrub_lock mutex is held at the time scrub_pending_bio_dec()
      is called - this function can be called under an atomic context.
      Chris ran into this in a debug kernel which gave the following trace:
      
      [ 1928.950319] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:621
      [ 1928.967334] in_atomic(): 1, irqs_disabled(): 0, pid: 149670, name: fsstress
      [ 1928.981324] INFO: lockdep is turned off.
      [ 1928.989244] CPU: 24 PID: 149670 Comm: fsstress Tainted: G        W     3.19.0-rc7-mason+ #41
      [ 1929.006418] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 1.07 05/10/2012
      [ 1929.022207]  ffffffff81a22cf8 ffff881076e03b78 ffffffff816b8dd9 ffff881076e03b78
      [ 1929.037267]  ffff880d8e828710 ffff881076e03ba8 ffffffff810856c4 ffff881076e03bc8
      [ 1929.052315]  0000000000000000 000000000000026d ffffffff81a22cf8 ffff881076e03bd8
      [ 1929.067381] Call Trace:
      [ 1929.072344]  <IRQ>  [<ffffffff816b8dd9>] dump_stack+0x4f/0x6e
      [ 1929.083968]  [<ffffffff810856c4>] ___might_sleep+0x174/0x230
      [ 1929.095352]  [<ffffffff810857d2>] __might_sleep+0x52/0x90
      [ 1929.106223]  [<ffffffff816bb68f>] mutex_lock_nested+0x2f/0x3b0
      [ 1929.117951]  [<ffffffff810ab37d>] ? trace_hardirqs_on+0xd/0x10
      [ 1929.129708]  [<ffffffffa05dc838>] scrub_pending_bio_dec+0x38/0x70 [btrfs]
      [ 1929.143370]  [<ffffffffa05dd0e0>] scrub_parity_bio_endio+0x50/0x70 [btrfs]
      [ 1929.157191]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
      [ 1929.167382]  [<ffffffffa05f96bc>] rbio_orig_end_io+0x7c/0xa0 [btrfs]
      [ 1929.180161]  [<ffffffffa05f97ba>] raid_write_parity_end_io+0x5a/0x80 [btrfs]
      [ 1929.194318]  [<ffffffff812fa603>] bio_endio+0x53/0xa0
      [ 1929.204496]  [<ffffffff8130401b>] blk_update_request+0x1eb/0x450
      [ 1929.216569]  [<ffffffff81096e58>] ? trigger_load_balance+0x78/0x500
      [ 1929.229176]  [<ffffffff8144c74d>] scsi_end_request+0x3d/0x1f0
      [ 1929.240740]  [<ffffffff8144ccac>] scsi_io_completion+0xac/0x5b0
      [ 1929.252654]  [<ffffffff81441c50>] scsi_finish_command+0xf0/0x150
      [ 1929.264725]  [<ffffffff8144d317>] scsi_softirq_done+0x147/0x170
      [ 1929.276635]  [<ffffffff8130ace6>] blk_done_softirq+0x86/0xa0
      [ 1929.288014]  [<ffffffff8105d92e>] __do_softirq+0xde/0x600
      [ 1929.298885]  [<ffffffff8105df6d>] irq_exit+0xbd/0xd0
      (...)
      
      Fix this by using a reference count on the scrub context structure
      instead of locking the scrub_lock mutex.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f55985f4
    • Filipe Manana's avatar
      Btrfs: fix scheduler warning when syncing log · 575849ec
      Filipe Manana authored
      We try to lock a mutex while the current task state is not TASK_RUNNING,
      which results in the following warning when CONFIG_DEBUG_LOCK_ALLOC=y:
      
      [30736.772501] ------------[ cut here ]------------
      [30736.774545] WARNING: CPU: 9 PID: 19972 at kernel/sched/core.c:7300 __might_sleep+0x8b/0xa8()
      [30736.783453] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff8107499b>] prepare_to_wait+0x43/0x89
      [30736.786261] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc psmouse parport pcspkr microcode serio_raw evdev processor thermal_sys i2c_piix4 i2c_core button ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod
      [30736.794323] CPU: 9 PID: 19972 Comm: fsstress Not tainted 3.19.0-rc7-btrfs-next-5+ #1
      [30736.795821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [30736.798788]  0000000000000009 ffff88042743fbd8 ffffffff814248ed ffff88043d32f2d8
      [30736.800504]  ffff88042743fc28 ffff88042743fc18 ffffffff81045338 0000000000000001
      [30736.802131]  ffffffff81064514 ffffffff817c52d1 000000000000026d 0000000000000000
      [30736.803676] Call Trace:
      [30736.804256]  [<ffffffff814248ed>] dump_stack+0x4c/0x65
      [30736.805245]  [<ffffffff81045338>] warn_slowpath_common+0xa1/0xbb
      [30736.806360]  [<ffffffff81064514>] ? __might_sleep+0x8b/0xa8
      [30736.807391]  [<ffffffff81045398>] warn_slowpath_fmt+0x46/0x48
      [30736.808511]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
      [30736.809620]  [<ffffffff8107499b>] ? prepare_to_wait+0x43/0x89
      [30736.810691]  [<ffffffff81064514>] __might_sleep+0x8b/0xa8
      [30736.811703]  [<ffffffff81426eaf>] mutex_lock_nested+0x2f/0x3a0
      [30736.812889]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [30736.814138]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
      [30736.819878]  [<ffffffffa038cfff>] wait_for_writer.isra.12+0x91/0xaa [btrfs]
      [30736.821260]  [<ffffffff810748bd>] ? signal_pending_state+0x31/0x31
      [30736.822410]  [<ffffffffa0391f0a>] btrfs_sync_log+0x160/0x947 [btrfs]
      [30736.823574]  [<ffffffff8107bfa1>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [30736.824847]  [<ffffffff8107bfca>] ? trace_hardirqs_on+0xd/0xf
      [30736.825972]  [<ffffffffa036e555>] btrfs_sync_file+0x2b0/0x319 [btrfs]
      [30736.827684]  [<ffffffff8117901a>] vfs_fsync_range+0x21/0x23
      [30736.828932]  [<ffffffff81179038>] vfs_fsync+0x1c/0x1e
      [30736.829917]  [<ffffffff8117928b>] do_fsync+0x34/0x4e
      [30736.830862]  [<ffffffff811794b3>] SyS_fsync+0x10/0x14
      [30736.831819]  [<ffffffff8142a512>] system_call_fastpath+0x12/0x17
      [30736.832982] ---[ end trace c0b57df60d32ae5c ]---
      
      Fix this my acquiring the mutex after calling finish_wait(), which sets the
      task's state to TASK_RUNNING.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      575849ec
  7. 03 Feb, 2015 9 commits
    • Satoru Takeuchi's avatar
      Btrfs: Remove unnecessary placeholder in btrfs_err_code · eb710b15
      Satoru Takeuchi authored
      "notused" is not necessary. Set 1 to the first entry is enough.
      
      Signed-off-by: Takeuchi Satoru <takeuchi_satoru@jp.fujitsu.com
      Cc: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      eb710b15
    • Gui Hecheng's avatar
      btrfs: cleanup init for list in free-space-cache · b76808fc
      Gui Hecheng authored
      o removed an unecessary INIT_LIST_HEAD after LIST_HEAD
      
      o merge a declare & INIT_LIST_HEAD pair into one LIST_HEAD
      Signed-off-by: default avatarGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b76808fc
    • Shaohua Li's avatar
      btrfs: delete chunk allocation attemp when setting block group ro · 2f081088
      Shaohua Li authored
      Below test will fail currently:
            mkfs.ext4 -F /dev/sda
            btrfs-convert /dev/sda
            mount /dev/sda /mnt
            btrfs device add -f /dev/sdb /mnt
            btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
      
      The reason is there are some block groups with usage 0, but the whole
      disk hasn't free space to allocate new chunk, so we even can't set such
      block group readonly. This patch deletes the chunk allocation when
      setting block group ro. For META, we already have reserve. But for
      SYSTEM, we don't have, so the check_system_chunk is still required.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2f081088
    • Naohiro Aota's avatar
      btrfs: clear bio reference after submit_one_bio() · 289454ad
      Naohiro Aota authored
      After submit_one_bio(), `bio' can go away. However submit_extent_page()
      leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
      It will cause invalid paging request when submit_extent_page() is called
      next time.
      
      I reproduced ENOMEM case with the following script (need
      CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).
      
        #!/bin/bash
      
        dmesgout=dmesg.txt
        start=100000
        end=300000
        step=1000
      
        # btrfs options
        device=/dev/vdb1
        directory=/mnt/btrfs
      
        # fault-injection options
        percent=100
        times=3
      
        mkdir -p $directory || exit 1
        mount -o compress $device $directory || exit 1
      
        rm -f $directory/file || exit 1
        dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1
      
        for interval in `seq $start $step $end`; do
                dmesg -C
                echo 1 > /proc/sys/vm/drop_caches
                sync
                export FAILCMD_TYPE=fail_page_alloc
                ./failcmd.sh -p $percent -t $times -i $interval \
                        --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                        -- \
                        cat $directory/file > /dev/null
                dmesg > ${dmesgout}
                if grep -q BUG: ${dmesgout}; then
                        cat ${dmesgout}
                        exit 1
                fi
        done
      
        umount $directory
        exit 0
      Signed-off-by: default avatarNaohiro Aota <naota@elisp.net>
      Tested-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      289454ad
    • Filipe Manana's avatar
      Btrfs: fix scrub race leading to use-after-free · de554a4f
      Filipe Manana authored
      While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
      the following trace:
      
      [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
      [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
      [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
      [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
      [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
      [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
      [68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
      [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
      [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
      [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
      [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
      [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
      [68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
      [68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
      [68127.807663] Stack:
      [68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
      [68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
      [68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
      [68127.807663] Call Trace:
      [68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
      [68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
      [68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
      [68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
      [68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
      [68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
      [68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
      [68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
      [68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
      [68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
      [68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
      [68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
      [68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663]  RSP <ffff8803f097fbb8>
      [68127.807663] CR2: ffff8803f8947a50
      [68127.807663] ---[ end trace d7045aac00a66cd8 ]---
      
      This is due to a race that can happen in a very tiny time window and is
      illustrated by the following sequence diagram:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             atomic_dec(&sctx->bios_in_flight)
                                                                         wait sctx->bios_in_flight == 0
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
             wake_up(&sctx->list_wait)
                __wake_up()
                    spin_lock_irqsave(&sctx->list_wait->lock, flags)
      
      Another variation of this scenario that results in the same use-after-free
      issue is:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
                                                                         wait sctx->bios_in_flight == 0
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             __wake_up(&sctx->list_wait)
                spin_lock_irqsave(&sctx->list_wait->lock, flags)
                default_wake_function()
                    wake up task at CPU 2
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
                spin_unlock_irqrestore(&sctx->list_wait->lock, flags)
      
      Fix this by holding the scrub lock while doing the wakeup.
      
      This isn't a recent regression, the issue as been around since the scrub
      feature was added (2011, commit a2de733c).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      de554a4f
    • Filipe Manana's avatar
      Btrfs: add missing cleanup on sysfs init failure · 001a648d
      Filipe Manana authored
      If we failed during initialization of sysfs, we weren't unregistering the
      top level btrfs sysfs entry nor the debugfs stuff.
      Not unregistering the top level sysfs entry makes future attempts to reload
      the btrfs module impossible and the following is reported in dmesg:
      
      [ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0()
      [ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs'
      [ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs]
      [ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G        W    3.13.0-fdm-btrfs-next-24+ #7
      [ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 2246.451312]  0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410
      [ 2246.451314]  ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290
      [ 2246.451316]  ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80
      [ 2246.451318] Call Trace:
      [ 2246.451322]  [<ffffffff816f1da6>] dump_stack+0x4e/0x68
      [ 2246.451324]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
      [ 2246.451325]  [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50
      [ 2246.451328]  [<ffffffff81367dc5>] ? strlcat+0x65/0x90
      (....)
      
      This fixes the following change:
      
          btrfs: add simple debugfs interface
          commit 1bae3098Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      001a648d
    • Filipe Manana's avatar
      Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana authored
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpining the block group's range in
      btrfs_finish_extent_commit() is done in a synchronized fashion
      with removing the block group's range from freed_extents[]
      in btrfs_delete_unused_bgs()
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6cSigned-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d4b450cd
    • David Sterba's avatar
      btrfs: add more checks to btrfs_read_sys_array · e3540eab
      David Sterba authored
      Verify that the sys_array has enough bytes to read the next item.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e3540eab
    • David Sterba's avatar
      btrfs: cleanup, rename a few variables in btrfs_read_sys_array · 1ffb22cf
      David Sterba authored
      There's a pointer to buffer, integer offset and offset passed as
      pointer, try to find matching names for them.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1ffb22cf