1. 21 Jun, 2017 2 commits
    • Filipe Manana's avatar
      Btrfs: fix invalid extent maps due to hole punching · 609805d8
      Filipe Manana authored
      While punching a hole in a range that is not aligned with the sector size
      (currently the same as the page size) we can end up leaving an extent map
      in memory with a length that is smaller then the sector size or with a
      start offset that is not aligned to the sector size. Both cases are not
      expected and can lead to problems. This issue is easily detected
      after the patch from commit a7e3b975 ("Btrfs: fix reported number of
      inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
      following for example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
        $ xfs_io -c "fpunch 60K 90K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
        $ umount /mnt
      
      After the unmount operation we can see several warnings emmitted due to
      underflows related to space reservation counters:
      
      [ 2837.443299] ------------[ cut here ]------------
      [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
      rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
      ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.462379] Call Trace:
      [ 2837.462379]  dump_stack+0x68/0x92
      [ 2837.462379]  __warn+0xc2/0xdd
      [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
      [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.462379]  destroy_inode+0x3d/0x55
      [ 2837.462379]  evict+0x177/0x17e
      [ 2837.462379]  dispose_list+0x50/0x71
      [ 2837.462379]  evict_inodes+0x132/0x141
      [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
      [ 2837.462379]  kill_anon_super+0x12/0x1c
      [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.462379]  deactivate_locked_super+0x30/0x68
      [ 2837.462379]  deactivate_super+0x36/0x39
      [ 2837.462379]  cleanup_mnt+0x58/0x76
      [ 2837.462379]  __cleanup_mnt+0x12/0x14
      [ 2837.462379]  task_work_run+0x77/0x9b
      [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
      [ 2837.596256] ------------[ cut here ]------------
      [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.663359] Call Trace:
      [ 2837.663359]  dump_stack+0x68/0x92
      [ 2837.663359]  __warn+0xc2/0xdd
      [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
      [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.663359]  ? evict_inodes+0x132/0x141
      [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
      [ 2837.663359]  kill_anon_super+0x12/0x1c
      [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.663359]  deactivate_locked_super+0x30/0x68
      [ 2837.663359]  deactivate_super+0x36/0x39
      [ 2837.663359]  cleanup_mnt+0x58/0x76
      [ 2837.663359]  __cleanup_mnt+0x12/0x14
      [ 2837.663359]  task_work_run+0x77/0x9b
      [ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
      [ 2837.745595] ------------[ cut here ]------------
      [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.758526] Call Trace:
      [ 2837.758925]  dump_stack+0x68/0x92
      [ 2837.759383]  __warn+0xc2/0xdd
      [ 2837.759383]  warn_slowpath_null+0x1d/0x1f
      [ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.759383]  ? evict_inodes+0x132/0x141
      [ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.759383]  generic_shutdown_super+0x6a/0xeb
      [ 2837.759383]  kill_anon_super+0x12/0x1c
      [ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.759383]  deactivate_locked_super+0x30/0x68
      [ 2837.759383]  deactivate_super+0x36/0x39
      [ 2837.759383]  cleanup_mnt+0x58/0x76
      [ 2837.759383]  __cleanup_mnt+0x12/0x14
      [ 2837.759383]  task_work_run+0x77/0x9b
      [ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
      [ 2837.778235] ------------[ cut here ]------------
      [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.800118] Call Trace:
      [ 2837.800515]  dump_stack+0x68/0x92
      [ 2837.801015]  __warn+0xc2/0xdd
      [ 2837.801471]  warn_slowpath_null+0x1d/0x1f
      [ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.801698]  ? evict_inodes+0x132/0x141
      [ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.801698]  generic_shutdown_super+0x6a/0xeb
      [ 2837.801698]  kill_anon_super+0x12/0x1c
      [ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.801698]  deactivate_locked_super+0x30/0x68
      [ 2837.801698]  deactivate_super+0x36/0x39
      [ 2837.801698]  cleanup_mnt+0x58/0x76
      [ 2837.801698]  __cleanup_mnt+0x12/0x14
      [ 2837.801698]  task_work_run+0x77/0x9b
      [ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
      [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
      [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
      
      What happens in the above example is the following:
      
      1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
         is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
         This results in the creation of an extent map with a length of 2Kb
         starting at file offset 148Kb, through find_first_non_hole() ->
         btrfs_get_extent().
      
      2) The second write (first write after the hole punch operation), sets
         the range [50Kb, 152Kb[ to delalloc.
      
      3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
         map covering the range [148Kb, 150Kb[ and ends up calling
         set_extent_bit() for the same range, which results in splitting an
         existing extent state record, covering the range [148Kb, 152Kb[ into
         two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
         [150Kb, 152Kb[.
      
      4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
         btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
         range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
         callback being invoked against the two 2Kb extent state records that
         cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
         the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
         with a length argument of 2048 bytes. That function rounds up the length
         to a sector size aligned length, so it ends up considering a length of
         4096 bytes, and then calls calc_csum_metadata_size() which results in
         decrementing the inode's csum_bytes counter by 4096 bytes, so after
         it stays a value of 0 bytes. Then the same happens when
         btrfs_clear_bit_hook() is called against the second extent state that
         has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
         rounded up to 4096 and calc_csum_metadata_size() ends up being called
         to decrement 4096 bytes from the inode's csum_bytes counter, which
         at that time has a value of 0, leading to an underflow, which is
         exactly what triggers the first warning, at btrfs_destroy_inode().
         All the other warnings relate to several space accounting counters
         that underflow as well due to similar reasons.
      
      A similar case but where the hole punching operation creates an extent map
      with a start offset not aligned to the sector size is the following:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
        $ umount /mnt
      
      During the unmount operation we get similar traces for the same reasons as
      in the first example.
      
      So fix the hole punching operation to make sure it never creates extent
      maps with a length that is not aligned to the sector size nor with a start
      offset that is not aligned to the sector size, as this breaks all
      assumptions and it's a land mine.
      
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      609805d8
    • Jeff Mahoney's avatar
      btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items · cddf3b2c
      Jeff Mahoney authored
      On an uncontended system, we can end up hitting soft lockups while
      doing replace_path.  At the core, and frequently called is
      btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
      there.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cddf3b2c
  2. 20 Jun, 2017 5 commits
    • Nikolay Borisov's avatar
      btrfs: Round down values which are written for total_bytes_size · 7dfb8be1
      Nikolay Borisov authored
      We got an internal report about a file system not wanting to mount
      following 99e3ecfc ("Btrfs: add more validation checks for
      superblock").
      
      BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
      fs_devices total_rw_bytes 1000203820544
      
      Subtracting the numbers we get a difference of less than a 4kb. Upon
      closer inspection it became apparent that mkfs actually rounds down the
      size of the device to a multiple of sector size. However, the same
      cannot be said for various functions which modify the total size and are
      called from btrfs_balance as well as when adding a new device. So this
      patch ensures that values being saved into on-disk data structures are
      always rounded down to a multiple of sectorsize.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7dfb8be1
    • Nikolay Borisov's avatar
      btrfs: Manually implement device_total_bytes getter/setter · eca152ed
      Nikolay Borisov authored
      The device->total_bytes member needs to always be rounded down to sectorsize
      so that it corresponds to the value of super->total_bytes. However, there are
      multiple places where the setter is fed a value which is not rounded which
      can cause a fs to be unmountable due to the check introduced in
      99e3ecfc ("Btrfs: add more validation checks for superblock"). This patch
      implements the getter/setter manually so that in a later patch I can add
      necessary code to catch offenders.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eca152ed
    • David Sterba's avatar
      btrfs: obsolete and remove mount option alloc_start · 0d0c71b3
      David Sterba authored
      The mount option alloc_start was used in the past for debugging and
      stressing the chunk allocator. Not meant to be used by users, so we're
      not breaking anybody's setup.
      
      There was some added complexity handling changes of the value and when
      it was not same as default. Such code has likely been untested and I
      think it's better to remove it.
      
      This patch kills all use of alloc_start, and by doing that also fixes
      a bug when alloc_size is set, potentially called from statfs:
      
      in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
      protection is temporarily dropped so btrfs_account_dev_extents_size can
      be called and then RCU is locked again! Doing that inside
      list_for_each_entry_rcu is just asking for trouble, but unlikely to be
      observed in practice.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d0c71b3
    • David Sterba's avatar
      btrfs: move fs_info::fs_frozen to the flags · fac03c8d
      David Sterba authored
      We can keep the state among the other fs_info flags, there's no reason
      why fs_frozen would need to be separate.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fac03c8d
    • David Sterba's avatar
      btrfs: cleanup duplicate return value in insert_inline_extent · 79b4f4c6
      David Sterba authored
      The pattern when err is used for function exit and ret is used for
      return values of callees is not used here.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79b4f4c6
  3. 19 Jun, 2017 33 commits