1. 03 Jun, 2016 1 commit
    • Chris Mason's avatar
      Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason authored
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8dff9c85
  2. 02 Jun, 2016 1 commit
  3. 31 May, 2016 1 commit
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and read repair · b5de8d0d
      Filipe Manana authored
      While we are finishing a device replace operation we can have a concurrent
      task trying to do a read repair operation, in which case it will call
      btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
      points to the source device of the device replace operation. This allows
      for the read repair task to dereference the stripe's device pointer after
      the device replace operation has freed the source device, resulting in
      an invalid memory access. This is similar to the problem solved by my
      previous patch in the same series and named "Btrfs: fix race between
      device replace and discard".
      
      So fix this by surrounding the call to btrfs_map_block() and the code
      that uses the returned struct btrfs_bio with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
      proper serialization with the finishing phase of the device replace
      operation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      b5de8d0d
  4. 30 May, 2016 7 commits
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and discard · 2999241d
      Filipe Manana authored
      While we are finishing a device replace operation, we can make a discard
      operation (fs mounted with -o discard) do an invalid memory access like
      the one reported by the following trace:
      
      [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
      [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
      processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
      virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
      [ 3206.388595] RIP: 0010:[<ffffffff8124d233>]  [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
      [ 3206.388595] RSP: 0018:ffff880171b9bb80  EFLAGS: 00010246
      [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
      [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
      [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
      [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
      [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
      [ 3206.388595] FS:  00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
      [ 3206.388595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
      [ 3206.388595] Stack:
      [ 3206.388595]  0000000000004878 0000000000000000 0000000002400040 0000000000000000
      [ 3206.388595]  0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
      [ 3206.388595]  ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
      [ 3206.388595] Call Trace:
      [ 3206.388595]  [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
      [ 3206.388595]  [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
      [ 3206.388595]  [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
      [ 3206.388595]  [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
      [ 3206.388595]  [<ffffffff811a8417>] do_fsync+0x31/0x4a
      [ 3206.388595]  [<ffffffff811a8637>] SyS_fsync+0x10/0x14
      [ 3206.388595]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 3206.388595]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [ 3206.388595]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      
      This happens because when we call btrfs_map_block() from
      btrfs_discard_extent() to get a btrfs_bio structure, the device replace
      operation has not finished yet, but before we use the device of one of the
      stripes from the returned btrfs_bio structure, the device object is freed.
      
      This is illustrated by the following diagram.
      
                  CPU 1                                                  CPU 2
      
       btrfs_dev_replace_start()
      
       (...)
      
       btrfs_dev_replace_finishing()
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
         (...)
      
                                                                  btrfs_sync_file()
                                                                    btrfs_start_transaction()
      
                                                                    (...)
      
                                                                    btrfs_commit_transaction()
                                                                      btrfs_finish_extent_commit()
                                                                        btrfs_discard_extent()
                                                                          btrfs_map_block()
                                                                            --> returns a struct btrfs_bio
                                                                                with a stripe that has a
                                                                                device field pointing to
                                                                                source device of the replace
                                                                                operation (the device that
                                                                                is being replaced)
      
         mutex_lock(&uuid_mutex)
         mutex_lock(&fs_info->fs_devices->device_list_mutex)
         mutex_lock(&fs_info->chunk_mutex)
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates the mapping tree and for each
               extent map that has a stripe pointing to
               the source device, it updates the stripe
               to point to the target device instead
      
         btrfs_rm_dev_replace_blocked()
           --> waits for fs_info->bio_counter to go down to 0
      
         btrfs_rm_dev_replace_remove_srcdev()
           --> removes source device from the list of devices
      
         mutex_unlock(&fs_info->chunk_mutex)
         mutex_unlock(&fs_info->fs_devices->device_list_mutex)
         mutex_unlock(&uuid_mutex)
      
         btrfs_rm_dev_replace_free_srcdev()
           --> frees the source device
      
                                                                          --> iterates over all stripes
                                                                              of the returned struct
                                                                              btrfs_bio
                                                                          --> for each stripe it
                                                                              dereferences its device
                                                                              pointer
                                                                              --> it ends up finding a
                                                                                  pointer to the device
                                                                                  used as the source
                                                                                  device for the replace
                                                                                  operation and that was
                                                                                  already freed
      
      So fix this by surrounding the call to btrfs_map_block(), and the code
      that uses the returned struct btrfs_bio, with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
      the finishing phase of the device replace operation blocks until the
      the bio counter decreases to zero before it frees the source device.
      This is the same approach we do at btrfs_map_bio() for example.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      2999241d
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and chunk allocation · 22ab04e8
      Filipe Manana authored
      While iterating and copying extents from the source device, the device
      replace code keeps adjusting a left cursor that is used to make sure that
      once we finish processing a device extent, any future writes to extents
      from the corresponding block group will get into both the source and
      target devices. This left cursor is also used for resuming the device
      replace operation at mount time.
      
      However using this left cursor to decide whether writes go into both
      devices or only the source device is not enough to guarantee we don't
      miss copying extents into the target device. There are two cases where
      the current approach fails. The first one is related to when there are
      holes in the device and they get allocated for new block groups while
      the device replace operation is iterating the device extents (more on
      this explained below). The second one is that when that loop over the
      device extents finishes, we start dellaloc, wait for all ordered extents
      and then commit the current transaction, we might have got new block
      groups allocated that are now using a device extent that has an offset
      greater then or equals to the value of the left cursor, in which case
      writes to extents belonging to these new block groups will get issued
      only to the source device.
      
      For the first case where the current approach of using a left cursor
      fails, consider the source device currently has the following layout:
      
        [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
        3Gb             4Gb                         5Gb
      
      While we are iterating the device extents from the source device using
      the commit root of the device tree, the following happens:
      
              CPU 1                                            CPU 2
      
                            <we are at transaction N>
      
        scrub_enumerate_chunks()
          --> searches the device tree for
              extents belonging to the source
              device using the device tree's
              commit root
          --> 1st iteration finds extent belonging to
              block group A
      
              --> sets block group A to RO mode
                  (btrfs_inc_block_group_ro)
      
              --> sets cursor left to found_key.offset
                  which is 3Gb
      
              --> scrub_chunk() starts
                  copies all allocated extents from
                  block group's A stripe at source
                  device into target device
      
                                                                 btrfs_alloc_chunk()
                                                                   --> allocates device extent
                                                                       in the range [4Gb, 5Gb[
                                                                       from the source device for
                                                                       a new block group C
      
                                                                 extent allocated from block
                                                                 group C for a direct IO,
                                                                 buffered write or btree node/leaf
      
                                                                 extent is written to, perhaps
                                                                 in response to a writepages()
                                                                 call from the VM or directly
                                                                 through direct IO
      
                                                                 the write is made only against
                                                                 the source device and not against
                                                                 the target device because the
                                                                 extent's offset is in the interval
                                                                 [4Gb, 5Gb[ which is larger then
                                                                 the value of cursor_left (3Gb)
      
              --> scrub_chunks() finishes
      
              --> updates left cursor from 3Gb to
                  4Gb
      
              --> btrfs_dec_block_group_ro() sets
                  block group A back to RW mode
      
                                   <we are still at transaction N>
      
          --> 2nd iteration finds extent belonging to
              block group B - it did not find the new
              extent in the range [4Gb, 5Gb[ for block
              group C because we are using the device
              tree's commit root or even because the
              block group's items are not all yet
              inserted in the respective btrees, that is,
              the block group is still attached to some
              transaction handle's new_bgs list and
              btrfs_create_pending_block_groups() was
              not called yet against that transaction
              handle, so the device extent items were
              not yet inserted into the devices tree
      
                                   <we are still at transaction N>
      
              --> so we end not copying anything from the newly
                  allocated device extent from the source device
                  to the target device
      
      So fix this by making __btrfs_map_block() always redirect writes to the
      target device as well, independently of the left cursor's value. With
      this change the left cursor is now used only for the purpose of tracking
      progress and allow a mount operation to resume a device replace.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      22ab04e8
    • Filipe Manana's avatar
      Btrfs: fix race setting block group back to RW mode during device replace · 1a1a8b73
      Filipe Manana authored
      After it finishes processing a device extent, the device replace code sets
      back the block group to RW mode and then after that it sets the left cursor
      to match the logical end address of the block group, so that future writes
      into extents belonging to the block group go both the source (old) and
      target (new) devices. However from the moment we turn the block group
      back to RW mode we have a short time window, that lasts until we update
      the left cursor's value, where extents can be allocated from the block
      group and written to, in which case they will not be copied/written to
      the target (new) device. Fix this by updating the left cursor's value
      before turning the block group back to RW mode.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      1a1a8b73
    • Filipe Manana's avatar
      Btrfs: fix unprotected assignment of the left cursor for device replace · 81e87a73
      Filipe Manana authored
      We were assigning new values to fields of the device replace object
      without holding the respective lock after processing each device extent.
      This is important for the left cursor field which can be accessed by a
      concurrent task running __btrfs_map_block (which, correctly, takes the
      device replace lock).
      So change these fields while holding the device replace lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      81e87a73
    • Filipe Manana's avatar
      Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
      Filipe Manana authored
      When we do a device replace, for each device extent we find from the
      source device, we set the corresponding block group to readonly mode to
      prevent writes into it from happening while we are copying the device
      extent from the source to the target device. However just before we set
      the block group to readonly mode some concurrent task might have already
      allocated an extent from it or decided it could perform a nocow write
      into one of its extents, which can make the device replace process to
      miss copying an extent since it uses the extent tree's commit root to
      search for extents and only once it finishes searching for all extents
      belonging to the block group it does set the left cursor to the logical
      end address of the block group - this is a problem if the respective
      ordered extents finish while we are searching for extents using the
      extent tree's commit root and no transaction commit happens while we
      are iterating the tree, since it's the delayed references created by the
      ordered extents (when they complete) that insert the extent items into
      the extent tree (using the non-commit root of course).
      Example:
      
                CPU 1                                            CPU 2
      
       btrfs_dev_replace_start()
         btrfs_scrub_dev()
           scrub_enumerate_chunks()
             --> finds device extent belonging
                 to block group X
      
                                     <transaction N starts>
      
                                                            starts buffered write
                                                            against some inode
      
                                                            writepages is run against
                                                            that inode forcing dellaloc
                                                            to run
      
                                                            btrfs_writepages()
                                                              extent_writepages()
                                                                extent_write_cache_pages()
                                                                  __extent_writepage()
                                                                    writepage_delalloc()
                                                                      run_delalloc_range()
                                                                        cow_file_range()
                                                                          btrfs_reserve_extent()
                                                                            --> allocates an extent
                                                                                from block group X
                                                                                (which is not yet
                                                                                 in RO mode)
                                                                          btrfs_add_ordered_extent()
                                                                            --> creates ordered extent Y
                                                              flush_epd_write_bio()
                                                                --> bio against the extent from
                                                                    block group X is submitted
      
             btrfs_inc_block_group_ro(bg X)
               --> sets block group X to readonly
      
             scrub_chunk(bg X)
               scrub_stripe(device extent from srcdev)
                 --> keeps searching for extent items
                     belonging to the block group using
                     the extent tree's commit root
                 --> it never blocks due to
                     fs_info->scrub_pause_req as no
                     one tries to commit transaction N
                 --> copies all extents found from the
                     source device into the target device
                 --> finishes search loop
      
                                                              bio completes
      
                                                              ordered extent Y completes
                                                              and creates delayed data
                                                              reference which will add an
                                                              extent item to the extent
                                                              tree when run (typically
                                                              at transaction commit time)
      
                                                                --> so the task doing the
                                                                    scrub/device replace
                                                                    at CPU 1 misses this
                                                                    and does not copy this
                                                                    extent into the new/target
                                                                    device
      
             btrfs_dec_block_group_ro(bg X)
               --> turns block group X back to RW mode
      
             dev_replace->cursor_left is set to the
             logical end offset of block group X
      
      So fix this by waiting for all cow and nocow writes after setting a block
      group to readonly mode.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      f0e9b7d6
    • Filipe Manana's avatar
      Btrfs: fix race between device replace and block group removal · 57ba4cb8
      Filipe Manana authored
      When it's finishing, the device replace code iterates all extent maps
      representing block group and for each one that has a stripe that refers
      to the source device, it replaces its device with the target device.
      However when it replaces the source device with the target device it,
      the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
      only after its ID is changed to match the one from the source device.
      This leads to races with the chunk removal code that can temporarly see
      a device with an ID of 0ULL and then attempt to use that ID to remove
      items from the device tree and fail, causing a transaction abort:
      
      [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
      [ 9238.594377] ------------[ cut here ]------------
      [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
      [ 9238.594403] BTRFS: Transaction aborted (error 1)
      [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
      rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
      oppy [last unloaded: btrfs]
      [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 9238.594421]  0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
      [ 9238.594422]  0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
      [ 9238.594423]  0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
      [ 9238.594424] Call Trace:
      [ 9238.594428]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [ 9238.594430]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [ 9238.594432]  [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
      [ 9238.594434]  [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
      [ 9238.594450]  [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
      [ 9238.594452]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [ 9238.594464]  [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
      [ 9238.594476]  [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
      [ 9238.594489]  [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
      [ 9238.594490]  [<ffffffff8106f403>] kthread+0xd4/0xdc
      [ 9238.594494]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
      [ 9238.594495]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
      [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
      
      The sequence of steps leading to this is like the following:
      
                    CPU 1                                           CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
         dev_replace->tgtdev->devid ==
         BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                           btrfs_delete_unused_bgs()
                                                             btrfs_remove_chunk()
      
                                                               looks up for the extent map
                                                               corresponding to the chunk
      
                                                               lock_chunks() (chunk_mutex)
                                                               check_system_chunk()
                                                               unlock_chunks() (chunk_mutex)
      
         locks fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                               iterates over all stripes from
                                                               the extent map
      
                                                                 --> calls btrfs_free_dev_extent()
                                                                     passing it the target device
                                                                     that still has an ID of 0ULL
      
                                                                 --> btrfs_free_dev_extent() fails
                                                                   --> aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid (which is necessarily > 0)
      
         frees the srcdev
      
         unlocks fs_info->chunk_mutex
      
      So fix this by taking the device list mutex while processing the stripes
      for the chunk's extent map. This is similar to the race between device
      replace and block group creation that was fixed by commit 50460e37
      ("Btrfs: fix race when finishing dev replace leading to transaction abort").
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      57ba4cb8
    • Filipe Manana's avatar
      Btrfs: fix race between readahead and device replace/removal · ce7791ff
      Filipe Manana authored
      The list of devices is protected by the device_list_mutex and the device
      replace code, in its finishing phase correctly takes that mutex before
      removing the source device from that list. However the readahead code was
      iterating that list without acquiring the respective mutex leading to
      crashes later on due to invalid memory accesses:
      
      [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
      [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport
      processor ser
      [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G        W       4.6.0-rc7-btrfs-next-29+ #1
      [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
      [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000
      [125671.834973] RIP: 0010:[<ffffffff81270479>]  [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
      [125671.834973] RSP: 0018:ffff8801ac91bc28  EFLAGS: 00010206
      [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000
      [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8
      [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000
      [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8
      [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff
      [125671.834973] FS:  0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
      [125671.834973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0
      [125671.834973] Stack:
      [125671.834973]  0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
      [125671.834973]  0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
      [125671.834973]  ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
      [125671.834973] Call Trace:
      [125671.834973]  [<ffffffff81270541>] radix_tree_lookup+0xd/0xf
      [125671.834973]  [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
      [125671.834973]  [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
      [125671.834973]  [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
      [125671.834973]  [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
      [125671.834973]  [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
      [125671.834973]  [<ffffffff81069691>] process_one_work+0x271/0x4e9
      [125671.834973]  [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9
      [125671.834973]  [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
      [125671.834973]  [<ffffffff8106f403>] kthread+0xd4/0xdc
      [125671.834973]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
      [125671.834973]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
      
      So fix this by taking the device_list_mutex in the readahead code. We
      can't use here the lighter approach of using a rcu_read_lock() and
      rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
      because we end up doing calls to sleeping functions (kzalloc()) in the
      respective code path.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      ce7791ff
  5. 26 May, 2016 2 commits
    • Chris Mason's avatar
      Btrfs: fix handling of faults from btrfs_copy_from_user · 56244ef1
      Chris Mason authored
      When btrfs_copy_from_user isn't able to copy all of the pages, we need
      to adjust our accounting to reflect the work that was actually done.
      
      Commit 2e78c927 changed around the decisions a little and we ended up
      skipping the accounting adjustments some of the time.  This commit makes
      sure that when we don't copy anything at all, we still hop into
      the adjustments, and switches to release_bytes instead of write_bytes,
      since write_bytes isn't aligned.
      
      The accounting errors led to warnings during btrfs_destroy_inode:
      
      [   70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0
      [   70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
      [   70.847538] CPU: 10 PID: 514 Comm: umount Tainted: G        W 4.6.0-rc6_00062_g2997da1-dirty #23
      [   70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
      [   70.847542]  0000000000000000 ffff880ff5cafab8 ffffffff8149d5e9 0000000000000202
      [   70.847543]  0000000000000000 0000000000000000 0000000000000000 ffff880ff5cafb08
      [   70.847547]  ffffffff8107bdfd ffff880ff5cafaf8 000024868120013d ffff880ff5cafb28
      [   70.847547] Call Trace:
      [   70.847550]  [<ffffffff8149d5e9>] dump_stack+0x51/0x78
      [   70.847551]  [<ffffffff8107bdfd>] __warn+0xfd/0x120
      [   70.847553]  [<ffffffff8107be3d>] warn_slowpath_null+0x1d/0x20
      [   70.847555]  [<ffffffff8139c9e3>] btrfs_destroy_inode+0x2b3/0x2c0
      [   70.847556]  [<ffffffff812003a1>] ? __destroy_inode+0x71/0x140
      [   70.847558]  [<ffffffff812004b3>] destroy_inode+0x43/0x70
      [   70.847559]  [<ffffffff810b7b5f>] ? wake_up_bit+0x2f/0x40
      [   70.847560]  [<ffffffff81200c68>] evict+0x148/0x1d0
      [   70.847562]  [<ffffffff81398ade>] ? start_transaction+0x3de/0x460
      [   70.847564]  [<ffffffff81200d49>] dispose_list+0x59/0x80
      [   70.847565]  [<ffffffff81201ba0>] evict_inodes+0x180/0x190
      [   70.847566]  [<ffffffff812191ff>] ? __sync_filesystem+0x3f/0x50
      [   70.847568]  [<ffffffff811e95f8>] generic_shutdown_super+0x48/0x100
      [   70.847569]  [<ffffffff810b75c0>] ? woken_wake_function+0x20/0x20
      [   70.847571]  [<ffffffff811e9796>] kill_anon_super+0x16/0x30
      [   70.847573]  [<ffffffff81365cde>] btrfs_kill_super+0x1e/0x130
      [   70.847574]  [<ffffffff811e99be>] deactivate_locked_super+0x4e/0x90
      [   70.847576]  [<ffffffff811e9e61>] deactivate_super+0x51/0x70
      [   70.847577]  [<ffffffff8120536f>] cleanup_mnt+0x3f/0x80
      [   70.847579]  [<ffffffff81205402>] __cleanup_mnt+0x12/0x20
      [   70.847581]  [<ffffffff81098358>] task_work_run+0x68/0xa0
      [   70.847582]  [<ffffffff810022b6>] exit_to_usermode_loop+0xd6/0xe0
      [   70.847583]  [<ffffffff81002e1d>] do_syscall_64+0xbd/0x170
      [   70.847586]  [<ffffffff817d4dbc>] entry_SYSCALL64_slow_path+0x25/0x25
      
      This is the test program I used to force short returns from
      btrfs_copy_from_user
      
      void *dontneed(void *arg)
      {
      	char *p = arg;
      	int ret;
      
      	while(1) {
      		ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
      		if (ret) {
      			perror("madvise");
      			exit(1);
      		}
      	}
      }
      
      int main(int ac, char **av) {
      	int ret;
      	int fd;
      	char *filename;
      	unsigned long offset;
      	char *buf;
      	int i;
      	pthread_t tid;
      
      	if (ac != 2) {
      		fprintf(stderr, "usage: dammitdave filename\n");
      		exit(1);
      	}
      
      	buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
      		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	if (buf == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	memset(buf, 'a', BUFSIZE);
      	filename = av[1];
      
      	ret = pthread_create(&tid, NULL, dontneed, buf);
      	if (ret) {
      		fprintf(stderr, "error %d from pthread_create\n", ret);
      		exit(1);
      	}
      
      	ret = pthread_detach(tid);
      	if (ret) {
      		fprintf(stderr, "pthread detach failed %d\n", ret);
      		exit(1);
      	}
      
      	while (1) {
      		fd = open(filename, O_RDWR | O_CREAT, 0600);
      		if (fd < 0) {
      			perror("open");
      			exit(1);
      		}
      
      		for (i = 0; i < ROUNDS; i++) {
      			int this_write = BUFSIZE;
      
      			offset = rand() % MAXSIZE;
      			ret = pwrite(fd, buf, this_write, offset);
      			if (ret < 0) {
      				perror("pwrite");
      				exit(1);
      			} else if (ret != this_write) {
      				fprintf(stderr, "short write to %s offset %lu ret %d\n",
      					filename, offset, ret);
      				exit(1);
      			}
      			if (i == ROUNDS - 1) {
      				ret = sync_file_range(fd, offset, 4096,
      				    SYNC_FILE_RANGE_WRITE);
      				if (ret < 0) {
      					perror("sync_file_range");
      					exit(1);
      				}
      			}
      		}
      		ret = ftruncate(fd, 0);
      		if (ret < 0) {
      			perror("ftruncate");
      			exit(1);
      		}
      		ret = close(fd);
      		if (ret) {
      			perror("close");
      			exit(1);
      		}
      		ret = unlink(filename);
      		if (ret) {
      			perror("unlink");
      			exit(1);
      		}
      
      	}
      	return 0;
      }
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reported-by: default avatarDave Jones <dsj@fb.com>
      Fixes: 2e78c927
      cc: stable@vger.kernel.org # v4.6
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      56244ef1
    • Chris Mason's avatar
      Merge branch 'for-chris-4.7' of... · 9257b4ca
      Chris Mason authored
      Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
      9257b4ca
  6. 25 May, 2016 7 commits
  7. 17 May, 2016 2 commits
  8. 16 May, 2016 5 commits
  9. 13 May, 2016 14 commits
    • Filipe Manana's avatar
      Btrfs: add semaphore to synchronize direct IO writes with fsync · 5f9a8a51
      Filipe Manana authored
      Due to the optimization of lockless direct IO writes (the inode's i_mutex
      is not held) introduced in commit 38851cc1 ("Btrfs: implement unlocked
      dio write"), we started having races between such writes with concurrent
      fsync operations that use the fast fsync path. These races were addressed
      in the patches titled "Btrfs: fix race between fsync and lockless direct
      IO writes" and "Btrfs: fix race between fsync and direct IO writes for
      prealloc extents". The races happened because the direct IO path, like
      every other write path, does create extent maps followed by the
      corresponding ordered extents while the fast fsync path collected first
      ordered extents and then it collected extent maps. This made it possible
      to log file extent items (based on the collected extent maps) without
      waiting for the corresponding ordered extents to complete (get their IO
      done). The two fixes mentioned before added a solution that consists of
      making the direct IO path create first the ordered extents and then the
      extent maps, while the fsync path attempts to collect any new ordered
      extents once it collects the extent maps. This was simple and did not
      require adding any synchonization primitive to any data structure (struct
      btrfs_inode for example) but it makes things more fragile for future
      development endeavours and adds an exceptional approach compared to the
      other write paths.
      
      This change adds a read-write semaphore to the btrfs inode structure and
      makes the direct IO path create the extent maps and the ordered extents
      while holding read access on that semaphore, while the fast fsync path
      collects extent maps and ordered extents while holding write access on
      that semaphore. The logic for direct IO write path is encapsulated in a
      new helper function that is used both for cow and nocow direct IO writes.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      5f9a8a51
    • Filipe Manana's avatar
      Btrfs: fix race between block group relocation and nocow writes · f78c436c
      Filipe Manana authored
      Relocation of a block group waits for all existing tasks flushing
      dellaloc, starting direct IO writes and any ordered extents before
      starting the relocation process. However for direct IO writes that end
      up doing nocow (inode either has the flag nodatacow set or the write is
      against a prealloc extent) we have a short time window that allows for a
      race that makes relocation proceed without waiting for the direct IO
      write to complete first, resulting in data loss after the relocation
      finishes. This is illustrated by the following diagram:
      
                 CPU 1                                     CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                     direct IO write starts against
                                                     an extent in block group X
                                                     using nocow mode (inode has the
                                                     nodatacow flag or the write is
                                                     for a prealloc extent)
      
                                                     btrfs_direct_IO()
                                                       btrfs_get_blocks_direct()
                                                         --> can_nocow_extent() returns 1
      
         btrfs_inc_block_group_ro(bg X)
           --> turns block group into RO mode
      
         btrfs_wait_ordered_roots()
           --> returns and does not know about
               the DIO write happening at CPU 2
               (the task there has not created
                yet an ordered extent)
      
         relocate_block_group(bg X)
           --> rc->stage == MOVE_DATA_EXTENTS
      
           find_next_extent()
             --> returns extent that the DIO
                 write is going to write to
      
           relocate_data_extent()
      
             relocate_file_extent_cluster()
      
               --> reads the extent from disk into
                   pages belonging to the relocation
                   inode and dirties them
      
                                                         --> creates DIO ordered extent
      
                                                       btrfs_submit_direct()
                                                         --> submits bio against a location
                                                             on disk obtained from an extent
                                                             map before the relocation started
      
         btrfs_wait_ordered_range()
           --> writes all the pages read before
               to disk (belonging to the
               relocation inode)
      
         relocation finishes
      
                                                       bio completes and wrote new data
                                                       to the old location of the block
                                                       group
      
      So fix this by tracking the number of nocow writers for a block group and
      make sure relocation waits for that number to go down to 0 before starting
      to move the extents.
      
      The same race can also happen with buffered writes in nocow mode since the
      patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
      when relocating", because we are no longer flushing all delalloc which
      served as a synchonization mechanism (due to page locking) and ensured
      the ordered extents for nocow buffered writes were created before we
      called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
      mode existed before that patch (no pages are locked or used during direct
      IO) and that fixed only races with direct IO writes that do cow.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      f78c436c
    • Filipe Manana's avatar
      Btrfs: fix race between fsync and direct IO writes for prealloc extents · 0b901916
      Filipe Manana authored
      When we do a direct IO write against a preallocated extent (fallocate)
      that does not go beyond the i_size of the inode, we do the write operation
      without holding the inode's i_mutex (an optimization that landed in
      commit 38851cc1 ("Btrfs: implement unlocked dio write")). This allows
      for a very tiny time window where a race can happen with a concurrent
      fsync using the fast code path, as the direct IO write path creates first
      a new extent map (no longer flagged as a prealloc extent) and then it
      creates the ordered extent, while the fast fsync path first collects
      ordered extents and then it collects extent maps. This allows for the
      possibility of the fast fsync path to collect the new extent map without
      collecting the new ordered extent, and therefore logging an extent item
      based on the extent map without waiting for the ordered extent to be
      created and complete. This can result in a situation where after a log
      replay we end up with an extent not marked anymore as prealloc but it was
      only partially written (or not written at all), exposing random, stale or
      garbage data corresponding to the unwritten pages and without any
      checksums in the csum tree covering the extent's range.
      
      This is an extension of what was done in commit de0ee0ed ("Btrfs: fix
      race between fsync and lockless direct IO writes").
      
      So fix this by creating first the ordered extent and then the extent
      map, so that this way if the fast fsync patch collects the new extent
      map it also collects the corresponding ordered extent.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      0b901916
    • Filipe Manana's avatar
      Btrfs: fix number of transaction units for renames with whiteout · 5062af35
      Filipe Manana authored
      When we do a rename with the whiteout flag, we need to create the whiteout
      inode, which in the worst case requires 5 transaction units (1 inode item,
      1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
      number of transaction units from 11 to 16 if the whiteout flag is set.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      5062af35
    • Filipe Manana's avatar
      Btrfs: pin logs earlier when doing a rename exchange operation · 376e5a57
      Filipe Manana authored
      The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
      which had a race fixed by my previous patch titled "Btrfs: pin log earlier
      when renaming", and so it suffers from the same problem.
      
      We pin the logs of the affected roots after we insert the new inode
      references, leaving a time window where concurrent tasks logging the
      inodes can end up logging both the new and old references, resulting
      in log trees that when replayed can turn the metadata into inconsistent
      states. This behaviour was added to btrfs_rename() in 2009 without any
      explanation about why not pinning the logs earlier, just leaving a
      comment about the posibility for the race. As of today it's perfectly
      safe and sane to pin the logs before we start doing any of the steps
      involved in the rename operation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      376e5a57
    • Filipe Manana's avatar
      Btrfs: unpin logs if rename exchange operation fails · 86e8aa0e
      Filipe Manana authored
      If rename exchange operations fail at some point after we pinned any of
      the logs, we end up aborting the current transaction but never unpin the
      logs, which leaves concurrent tasks that are trying to sync the logs (as
      part of an fsync request from user space) blocked forever and preventing
      the filesystem from being unmountable.
      
      Fix this by safely unpinning the log.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      86e8aa0e
    • Filipe Manana's avatar
      Btrfs: fix inode leak on failure to setup whiteout inode in rename · c9901618
      Filipe Manana authored
      If we failed to fully setup the whiteout inode during a rename operation
      with the whiteout flag, we ended up leaking the inode, not decrementing
      its link count nor removing all its items from the fs/subvol tree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      c9901618
    • Dan Fuhry's avatar
      btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT · cdd1fedf
      Dan Fuhry authored
      Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
      behavior in the renameat2() syscall. This behavior is primarily used by
      overlayfs. This patch adds support for these flags to btrfs, enabling it to
      be used as a fully functional upper layer for overlayfs.
      
      RENAME_EXCHANGE support was written by Davide Italiano originally
      submitted on 2 April 2015.
      Signed-off-by: default avatarDavide Italiano <dccitaliano@gmail.com>
      Signed-off-by: default avatarDan Fuhry <dfuhry@datto.com>
      [ remove unlikely ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      cdd1fedf
    • Filipe Manana's avatar
      Btrfs: pin log earlier when renaming · c4aba954
      Filipe Manana authored
      We were pinning the log right after the first step in the rename operation
      (inserting inode ref for the new name in the destination directory)
      instead of doing it before. This behaviour was introduced in 2009 for some
      reason that was not mentioned neither on the changelog nor any comment,
      with the drawback of a small time window where concurrent log writers can
      end up logging the new inode reference for the inode we are renaming while
      the rename operation is in progress (so that we can end up with a log
      containing both the new and old references). As of today there's no reason
      to not pin the log before that first step anymore, so just fix this.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      c4aba954
    • Filipe Manana's avatar
      Btrfs: unpin log if rename operation fails · 3dc9e8f7
      Filipe Manana authored
      If rename operations fail at some point after we pinned the log, we end
      up aborting the current transaction but never unpin the log, which leaves
      concurrent tasks that are trying to sync the log (as part of an fsync
      request from user space) blocked forever and preventing the filesystem
      from being unmountable.
      
      Fix this by safely unpinning the log.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      3dc9e8f7
    • Filipe Manana's avatar
      Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Filipe Manana authored
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no dealloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now a ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      9cfa3e34
    • Filipe Manana's avatar
      Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Filipe Manana authored
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destinated at extents allocated in other block
      groups, making us waste unecessary time.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      578def7c
    • Filipe Manana's avatar
      Btrfs: fix empty symlink after creating symlink and fsync parent dir · 3f9749f6
      Filipe Manana authored
      If we create a symlink, fsync its parent directory, crash/power fail and
      mount the filesystem, we end up with an empty symlink, which not only is
      useless it's also not allowed in linux (the man page symlink(2) is well
      explicit about that).  So we just need to make sure to fully log an inode
      if it's a symlink, to ensure its inline extent gets logged, ensuring the
      same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/testdir
        $ sync
        $ ln -s /mnt/foo /mnt/testdir/bar
        $ xfs_io -c fsync /mnt/testdir
        <power fail>
        $ mount /dev/sdb /mnt
        $ readlink /mnt/testdir/bar
        <empty string>
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      3f9749f6
    • Filipe Manana's avatar
      Btrfs: fix for incorrect directory entries after fsync log replay · 657ed1aa
      Filipe Manana authored
      If we move a directory to a new parent and later log that parent and don't
      explicitly log the old parent, when we replay the log we can end up with
      entries for the moved directory in both the old and new parent directories.
      Besides being ilegal to have directories with multiple hard links in linux,
      it also resulted in the leaving the inode item with a link count of 1.
      A similar issue also happens if we move a regular file - after the log tree
      is replayed the file has a link in both the old and new parent directories,
      when it should be only at the new directory.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/x
        $ mkdir /mnt/y
        $ touch /mnt/x/foo
        $ mkdir /mnt/y/z
        $ sync
        $ ln /mnt/x/foo /mnt/x/bar
        $ mv /mnt/y/z /mnt/x/z
        < power fail >
        $ mount /dev/sdc /mnt
        $ ls -1Ri /mnt
        /mnt:
        257 x
        258 y
      
        /mnt/x:
        259 bar
        259 foo
        260 z
      
        /mnt/x/z:
      
        /mnt/y:
        260 z
      
        /mnt/y/z:
      
        $ umount /dev/sdc
        $ btrfs check /dev/sdc
        Checking filesystem on /dev/sdc
        UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 260 errors 2000, link count wrong
              unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
              unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
        (...)
      
      Attempting to remove the directory becomes impossible:
      
        $ mount /dev/sdc /mnt
        $ rmdir /mnt/y/z
        $ ls -lh /mnt/y
        ls: cannot access /mnt/y/z: No such file or directory
        total 0
        d????????? ? ? ? ?            ? z
        $ rmdir /mnt/x/z
        rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
        $ ls -lh /mnt/x
        ls: cannot access /mnt/x/z: Stale file handle
        total 0
        -rw-r--r-- 2 root root 0 Apr  6 18:06 bar
        -rw-r--r-- 2 root root 0 Apr  6 18:06 foo
        d????????? ? ?    ?    ?            ? z
      
      So make sure that on rename we set the last_unlink_trans value for our
      inode, even if it's a directory, to the value of the current transaction's
      ID and that if the new parent directory is logged that we fallback to a
      transaction commit.
      
      A test case for fstests is being submitted as well.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      657ed1aa