1. 21 Aug, 2017 19 commits
    • Btrfs: remove BUG_ON in __add_tree_block · cdccee99
      Liu Bo authored
      The BUG_ON() can be triggered when the caller is processing an invalid
      extent inline ref, e.g.
      
      a shared data ref is offered instead of an extent data ref, such that
      it tries to find a non-existent tree block and then btrfs_search_slot
      returns 1 for no such item.
      
      This replaces the BUG_ON() with a WARN() followed by calling
      btrfs_print_leaf() to show more details about what's going on and
      returning -EINVAL to upper callers.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: remove BUG() in add_data_reference · b14c55a1
      Liu Bo authored
      Now that we have a helper to report invalid value of extent inline ref
      type, we need to quit gracefully instead of throwing out a kernel panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: remove BUG() in print_extent_item · 07638ea5
      Liu Bo authored
      btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so
      here we really want to print the invalid value of ref type instead of
      causing a kernel panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: remove BUG() in btrfs_extent_inline_ref_size · 4335958d
      Liu Bo authored
      Now that btrfs_get_extent_inline_ref_type() can report if type is a
      valid one and all callers can gracefully deal with that, we don't need
      to crash here.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: convert to use btrfs_get_extent_inline_ref_type · 3de28d57
      Liu Bo authored
      Since we have a helper which can do sanity check, this converts all
      btrfs_extent_inline_ref_type to it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: add a helper to retrieve extent inline ref type · 167ce953
      Liu Bo authored
      An invalid value of extent inline ref type may be read from a
      malicious image which may force btrfs to crash.
      
      This adds a helper that sanity-checks the ref type: if the type is
      sane, it returns the type; otherwise it returns an error.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minimal tweak const types, causing warnings due to other cleanup patches ]
      Signed-off-by: David Sterba <dsterba@suse.com>
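
The kind of check such a helper performs can be sketched in user-space C. This is an illustration only, not the kernel function: the name checked_inline_ref_type, the enum ref_owner and the REF_TYPE_INVALID sentinel are hypothetical, and the sketch ignores any further checks the real helper may apply. The four key values are the btrfs on-disk key types for extent back references.

```c
#include <stdint.h>

/* On-disk key type values for extent back references. */
#define BTRFS_TREE_BLOCK_REF_KEY    176
#define BTRFS_EXTENT_DATA_REF_KEY   178
#define BTRFS_SHARED_BLOCK_REF_KEY  182
#define BTRFS_SHARED_DATA_REF_KEY   184

/* Hypothetical sentinel meaning "not a valid inline ref type". */
#define REF_TYPE_INVALID 0

/* Hypothetical: which class of extent the caller expects to be parsing. */
enum ref_owner { REF_OWNER_BLOCK, REF_OWNER_DATA, REF_OWNER_ANY };

/*
 * Return raw_type if it is a known inline ref type that matches the
 * expected owner class, REF_TYPE_INVALID otherwise.
 */
int checked_inline_ref_type(uint8_t raw_type, enum ref_owner owner)
{
	switch (raw_type) {
	case BTRFS_TREE_BLOCK_REF_KEY:
	case BTRFS_SHARED_BLOCK_REF_KEY:
		/* Tree block refs are not valid inside a data extent item. */
		return owner == REF_OWNER_DATA ? REF_TYPE_INVALID : raw_type;
	case BTRFS_EXTENT_DATA_REF_KEY:
	case BTRFS_SHARED_DATA_REF_KEY:
		/* Data refs are not valid inside a tree block extent item. */
		return owner == REF_OWNER_BLOCK ? REF_TYPE_INVALID : raw_type;
	default:
		/* Anything else came from a corrupted or crafted image. */
		return REF_TYPE_INVALID;
	}
}
```

A caller would test the result against the sentinel and bail out with an error instead of hitting a BUG().
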
    • btrfs: scrub: simplify scrub worker initialization · af1cbe0a
      David Sterba authored
      Minor simplification, merge calls to one.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up division in scrub_find_csum · 1d1bf92d
      David Sterba authored
      Use proper helpers for 64bit division.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: clean up division in __scrub_mark_bitmap · 7736b0a4
      David Sterba authored
      Use proper helpers for 64bit division and then cast to narrower type.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use bool for flush_all_writes · 2073c4c2
      David Sterba authored
      flush_all_writes is an atomic but does not use the atomic semantics
      at all; it is just an on/off indicator, so we can use bool.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: preserve i_mode if __btrfs_set_acl() fails · d7d82496
      Ernesto A. Fernández authored
      When changing a file's acl mask, btrfs_set_acl() will first set the
      group bits of i_mode to the value of the mask, and only then set the
      actual extended attribute representing the new acl.
      
      If the second part fails (due to lack of space, for example) and the
      file had no acl attribute to begin with, the system will from now on
      assume that the mask permission bits are actual group permission bits,
      potentially granting access to the wrong users.
      
      Prevent this by restoring the original mode bits if __btrfs_set_acl
      fails.
      Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
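
The save-and-restore pattern described above can be sketched as follows. Everything here is a toy model under stated assumptions: struct toy_inode, toy_store_acl and toy_set_acl are hypothetical names, and the failure is simulated with a flag rather than a real out-of-space condition.

```c
#include <errno.h>

/* Hypothetical, much-reduced stand-in for an inode. */
struct toy_inode {
	unsigned int mode;   /* permission bits, like i_mode */
	int has_acl_xattr;   /* 1 once the ACL attribute is stored */
};

/* Hypothetical xattr writer; fails with -ENOSPC when asked to. */
int toy_store_acl(struct toy_inode *inode, int simulate_enospc)
{
	if (simulate_enospc)
		return -ENOSPC;
	inode->has_acl_xattr = 1;
	return 0;
}

/*
 * Apply the mask to the group bits first, then store the ACL. If the
 * store fails, put the original mode back so the mask bits are never
 * mistaken for real group permissions.
 */
int toy_set_acl(struct toy_inode *inode, unsigned int mask_bits,
		int simulate_enospc)
{
	unsigned int old_mode = inode->mode;
	int ret;

	inode->mode = (inode->mode & ~070u) | ((mask_bits & 07u) << 3);
	ret = toy_store_acl(inode, simulate_enospc);
	if (ret)
		inode->mode = old_mode; /* undo the first step on failure */
	return ret;
}
```
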
    • btrfs: Remove extraneous chunk_objectid variable · 408fbf19
      Nikolay Borisov authored
      BTRFS_FIRST_CHUNK_TREE_OBJECTID is the only objectid being used in the
      chunk_tree. So remove a variable which is always set to that value and collapse
      its usage in callees which are passed this variable. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove chunk_objectid argument from btrfs_make_block_group · 0174484d
      Nikolay Borisov authored
      btrfs_make_block_group is always called with chunk_objectid set to
      BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
      change anytime soon, so let's remove the argument and decrease the cognitive
      load when reading the code path. No functional change.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove extra parentheses from condition in copy_items() · 0dde10be
      Matthias Kaehlcke authored
      There is no need for the extra pair of parentheses, remove it. This
      fixes the following warning when building with clang:
      
      fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous
        parentheses [-Wparentheses-equality]
                      if ((i == (nr - 1)))
                           ~~^~~~~~~~~~~
      
      Also remove the unnecessary parentheses around the subtraction.
      Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
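
For illustration, the cleaned-up condition might look like this in isolation. The wrapper function is_last_item is hypothetical; only the condition itself comes from the warning quoted above. Subtraction binds tighter than ==, so neither pair of parentheses is needed.

```c
/* Returns 1 when i is the index of the last of nr items, 0 otherwise. */
int is_last_item(int i, int nr)
{
	/* Before: if ((i == (nr - 1))), which clang flags with
	 * -Wparentheses-equality. */
	if (i == nr - 1)
		return 1;
	return 0;
}
```
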
    • btrfs: Remove redundant setting of uuid in btrfs_block_header · 0ce1dd2a
      Nikolay Borisov authored
      btrfs_alloc_dev_extent currently unconditionally sets the uuid in the
      leaf block header the function is working with. This is unnecessary
      since this operation is performed by the core btree handling code
      (splitting a node, allocating a new btree block etc). So let's remove
      it.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Do not use data_alloc_cluster in ssd mode · 583b7231
      Hans van Kranenburg authored
          This patch provides a band aid to improve the 'out of the box'
      behaviour of btrfs for disks that are detected as being an ssd.  In a
      general purpose mixed workload scenario, the current ssd mode causes
      overallocation of available raw disk space for data, while leaving
      behind increasing amounts of unused fragmented free space. This
      situation leads to early ENOSPC problems which are harming user
      experience and adoption of btrfs as a general purpose filesystem.
      
      This patch modifies the data extent allocation behaviour of the ssd mode
      to make it behave identical to nossd mode.  The metadata behaviour and
      additional ssd_spread option stay untouched so far.
      
      Recommendations for future development are to reconsider the current
      oversimplified nossd / ssd distinction and the broken detection
      mechanism based on the rotational attribute in sysfs and provide
      experienced users with a more flexible way to choose allocator behaviour
      for data and metadata, optimized for certain use cases, while keeping
      sane 'out of the box' default settings.  The internals of the current
      btrfs code have more potential than what currently gets exposed to the
      user to choose from.
      
          The SSD story...
      
          In the first year of btrfs development, around early 2008, btrfs
      gained a mount option which enables specific functionality for
      filesystems on solid state devices. The first occurrence of this
      functionality is in commit e18e4809, labeled "Add mount -o ssd, which
      includes optimizations for seek free storage".
      
      The effect on allocating free space for doing (data) writes is to
      'cluster' writes together, writing them out in contiguous space, as
      opposed to a 'tetris' way of putting all separate writes into any free
      space fragment that fits (which is what the -o nossd behaviour does).
      
      A somewhat simplified explanation of what happens is that, when for
      example, the 'cluster' size is set to 2MiB, when we do some writes, the
      data allocator will search for a free space block that is 2MiB big, and
      put the writes in there. The ssd mode itself might allow a 2MiB cluster
      to be composed of multiple free space extents with some existing data in
      between, while the additional ssd_spread mount option kills off this
      option and requires fully free space.
      
      The idea behind this is (commit 536ac8ae): "The [...] clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
      block. So, effectively this means applying a "locality based algorithm"
      and trying to outsmart the actual ssd.
      
      Since then, various changes have been made to the involved code, but the
      basic idea is still present, and gets activated whenever the ssd mount
      option is active. This also happens by default, when the rotational flag
      as seen at /sys/block/<device>/queue/rotational is set to 0.
      
          However, there's a number of problems with this approach.
      
          First, what the optimization is trying to do is outsmart the ssd by
      assuming there is a relation between the physical address space of the
      block device as seen by btrfs and the actual physical storage of the
      ssd, and then adjusting data placement. However, since the introduction
      of the Flash Translation Layer (FTL) which is a part of the internal
      controller of an ssd, these attempts are futile. The use of good quality
      FTL in consumer ssd products might have been limited in 2008, but this
      situation has changed drastically soon after that time. Today, even the
      flash memory in your automatic cat feeding machine or your grandma's
      wheelchair has a full featured one.
      
      Second, the behaviour as described above results in the filesystem being
      filled up with badly fragmented free space extents because of relatively
      small pieces of space that are freed up by deletes, but not selected
      again as part of a 'cluster'. Since the algorithm prefers allocating a
      new chunk over going back to tetris mode, the end result is a filesystem
      in which all raw space is allocated, but which is composed of
      underutilized chunks with a 'shotgun blast' pattern of fragmented free
      space. Usually, the next problematic thing that happens is the
      filesystem wanting to allocate new space for metadata, which causes the
      filesystem to fail in spectacular ways.
      
      Third, the default mount options you get for an ssd ('ssd' mode enabled,
      'discard' not enabled), in combination with spreading out writes over
      the full address space and ignoring freed up space leads to worst case
      behaviour in providing information to the ssd itself, since it will
      never learn that all the free space left behind is actually free.  There
      are two ways to let an ssd know previously written data does not have to
      be preserved, which are sending explicit signals using discard or
      fstrim, or by simply overwriting the space with new data.  The worst
      case behaviour is the btrfs ssd_spread mount option in combination with
      not having discard enabled. It has a side effect of minimizing the reuse
      of free space previously written in.
      
      Fourth, the rotational flag in /sys/ does not reliably indicate if the
      device is a locally attached ssd. For example, iSCSI or NBD displays as
      non-rotational, while a loop device on an ssd shows up as rotational.
      
      The combination of the second and third problem effectively means that
      despite all the good intentions, the btrfs ssd mode reliably causes the
      ssd hardware and the filesystem structures and performance to be choked
      to death. The clickbait version of the title of this story would have
      been "Btrfs ssd optimizations considered harmful for ssds".
      
      The current nossd 'tetris' mode (even still without discard) allows a
      pattern of overwriting much more previously used space, causing many
      more implicit discards to happen because of the overwrite information
      the ssd gets. The actual location in the physical address space, as seen
      from the point of view of btrfs is irrelevant, because the actual writes
      to the low level flash are reordered anyway thanks to the FTL.
      
          Changes made in the code
      
      1. Make ssd mode data allocation identical to tetris mode, like nossd.
      2. Adjust and clean up filesystem mount messages so that we can easily
      identify if a kernel has this patch applied or not, when providing
      support to end users. Also, make better use of the *_and_info helpers to
      only trigger messages on actual state changes.
      
          Backporting notes
      
      Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
      * First apply commit 951e7966 "btrfs: drop the nossd flag when
        remounting with -o ssd", or fixup the differences manually.
      * The rest of the conflicts are because of the fs_info refactoring. So,
        for example, instead of using fs_info, it's root->fs_info in
        extent-tree.c
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
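
The difference between the 'tetris' and 'cluster' placement described in this commit message can be illustrated with a deliberately tiny toy model. It is not the btrfs allocator: pick_tetris and pick_cluster are hypothetical names, free space is reduced to an array of fragment sizes, and the cluster variant requires one fully free fragment of cluster size (closer to the ssd_spread behaviour described above than to plain ssd mode).

```c
#include <stddef.h>

/*
 * Toy model: free_kib[] holds the sizes of the free space fragments in
 * one chunk. Both functions return the index of the fragment chosen
 * for the write, or -1 when none qualifies (a real allocator would
 * then look elsewhere, possibly allocating a new chunk).
 */

/* 'tetris' (nossd): take the first fragment the write fits into. */
int pick_tetris(const unsigned int *free_kib, size_t n, unsigned int write_kib)
{
	for (size_t i = 0; i < n; i++)
		if (free_kib[i] >= write_kib)
			return (int)i;
	return -1;
}

/* 'cluster': only consider fragments at least one cluster large. */
int pick_cluster(const unsigned int *free_kib, size_t n,
		 unsigned int write_kib, unsigned int cluster_kib)
{
	unsigned int need = write_kib > cluster_kib ? write_kib : cluster_kib;

	for (size_t i = 0; i < n; i++)
		if (free_kib[i] >= need)
			return (int)i;
	return -1;
}
```

With fragments of 64, 512 and 128 KiB, a 100 KiB write lands in the 512 KiB fragment in tetris mode, while a 2 MiB cluster requirement finds nothing and leaves all three fragments unused, which is the fragmentation pattern the message complains about.
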
    • btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flush · 43a01111
      Lu Fengqi authored
      Although this bio has no data attached, it will reach this condition
      (bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
      in __btrfsic_submit_bio. So we should still submit it through integrity
      checker. Otherwise, the integrity checker will throw the following warning
      when I mount a newly created btrfs filesystem.
      
      [10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
      [10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: incremental send, fix emission of invalid clone operations · 72610b1b
      Filipe Manana authored
      When doing an incremental send it's possible that the computed send stream
      contains clone operations that will fail on the receiver if the receiver
      has compression enabled and the clone operations target a sector sized
      extent that starts at a zero file offset, is not compressed on the source
      filesystem but ends up being compressed and inlined at the destination
      filesystem.
      
      Example scenario:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
      
        # By doing a direct IO write, the data is not compressed.
        $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
        $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
        $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount -o compress /dev/sdc /mnt
        $ btrfs receive -f /tmp/1.snap /mnt
        $ btrfs receive -f /tmp/2.snap /mnt
        ERROR: failed to clone extents to foobar
        Operation not supported
      
      The same could be achieved by mounting the source filesystem without
      compression and doing a buffered IO write instead of a direct IO one,
      and mounting the destination filesystem with compression enabled.
      
      So fix this by issuing regular write operations in the send stream
      instead of clone operations when the source offset is zero and the
      range has a length matching the sector size.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
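
The decision rule stated in the last paragraph of this message can be written as a small predicate. This is a sketch only: choose_send_op and enum send_op are hypothetical names, and the real send code has more context than these three parameters.

```c
#include <stdint.h>

enum send_op { SEND_OP_CLONE, SEND_OP_WRITE };

/*
 * Decide how to emit a range that could be cloned. A sector-sized
 * range cloned from source offset 0 may end up compressed and inlined
 * on the receiver, where cloning it fails, so send plain data instead.
 */
enum send_op choose_send_op(uint64_t clone_src_offset, uint64_t len,
			    uint32_t sectorsize)
{
	if (clone_src_offset == 0 && len == sectorsize)
		return SEND_OP_WRITE;
	return SEND_OP_CLONE;
}
```

In the example scenario, the reflink copies 4K from source offset 0 with a 4K sector size, so the predicate selects a regular write.
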
    • Btrfs: fix out of bounds array access while reading extent buffer · f716abd5
      Liu Bo authored
      There is a corner case that slips through the checkers in functions
      reading extent buffer, ie.
      
      if (start < eb->len) and (start + len > eb->len),
      then
      
      a) map_private_extent_buffer() returns immediately because
      it's thinking the range spans across two pages,
      
      b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
      and WARN_ON(start + len > eb->start + eb->len), both are OK in this
      corner case, but it'd actually try to access the eb->pages out of
      bounds because of (start + len > eb->len).
      
      The case is found by switching extent inline ref type from shared data
      ref to non-shared data ref, which is a kind of metadata corruption.
      
      It'd use the wrong helper to access the eb,
      eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
      here is "struct btrfs_shared_data_ref".  And if the extent item
      happens to be the first item in the eb, then offset/length will get
      over eb->len which ends up an invalid memory access.
      
      This is adding proper checks in order to avoid invalid memory access,
      ie. 'general protection fault', before it's too late.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
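
A range check that catches the corner case above (start < eb->len but start + len > eb->len) could look like the following sketch. The function name check_eb_range is hypothetical, offsets are taken relative to the start of the buffer, and the comparison is arranged so that start + len cannot wrap around.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate that [start, start + len) lies inside a buffer of eb_len
 * bytes. Returns 0 when the range is fine, -EINVAL otherwise.
 */
int check_eb_range(uint64_t eb_len, uint64_t start, uint64_t len)
{
	if (start > eb_len)
		return -EINVAL;
	/* Equivalent to start + len > eb_len, without overflow. */
	if (len > eb_len - start)
		return -EINVAL;
	return 0;
}
```
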
  2. 18 Aug, 2017 12 commits
    • btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2 · c59efa7e
      Nikolay Borisov authored
      The buffer passed to btrfs_ioctl_tree_search* functions has to be at least
      sizeof(struct btrfs_ioctl_search_header). If this is not the case then the
      ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum
      required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW
      error with ->buf_size being set to the value passed by user space. Fix this by
      removing the size check and relying on search_ioctl, which already includes it
      and correctly sets buf_size.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
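
The intended contract (fail with -EOVERFLOW and report the minimum size back to the caller) can be sketched in user space as below. The struct and function names are hypothetical stand-ins, and the struct layout is only assumed to resemble struct btrfs_ioctl_search_header.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_ioctl_search_header. */
struct toy_search_header {
	uint64_t transid, objectid, offset;
	uint32_t type, len;
};

/*
 * If the caller's buffer cannot hold even one header, report the
 * minimum required size back through *buf_size and fail with
 * -EOVERFLOW; otherwise leave *buf_size alone.
 */
int toy_check_search_buf(size_t *buf_size)
{
	if (*buf_size < sizeof(struct toy_search_header)) {
		*buf_size = sizeof(struct toy_search_header);
		return -EOVERFLOW;
	}
	return 0;
}
```

The bug described above corresponds to returning -EOVERFLOW without updating the size, which leaves the caller with no hint about how large a buffer to retry with.
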
    • btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio · e6961cac
      Nikolay Borisov authored
      Currently the code checks whether we should do data checksumming in
      btrfs_submit_direct and the boolean result of this check is passed to
      btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
      actually consumes it. The last function actually has all the necessary context
      to figure out whether to skip the check or not, so let's move the check closer
      to where it's being consumed. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix assertion failure during fsync in no-holes mode · 6399fb5a
      Filipe Manana authored
      When logging an inode in full mode that has an inline compressed extent
      that represents a range with a size matching the sector size (currently
      the same as the page size), has a trailing hole and the no-holes feature
      is enabled, we end up failing an assertion leading to a trace like the
      following:
      
      [141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, line: 4453
      [141812.033069] ------------[ cut here ]------------
      [141812.034330] kernel BUG at fs/btrfs/ctree.h:3452!
      [141812.035137] invalid opcode: 0000 [#1] PREEMPT SMP
      [141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs]
      [141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: G    B   W       4.12.3-btrfs-next-52+ #1
      [141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
      [141812.036790] task: ffff8801e6694180 task.stack: ffffc90009004000
      [141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs]
      [141812.036790] RSP: 0018:ffffc90009007bc0 EFLAGS: 00010282
      [141812.036790] RAX: 0000000000000046 RBX: ffff88017512c008 RCX: 0000000000000001
      [141812.036790] RDX: ffff88023fd95201 RSI: ffffffff8182264c RDI: 00000000ffffffff
      [141812.036790] RBP: ffffc90009007bc0 R08: 0000000000000001 R09: 0000000000000001
      [141812.036790] R10: 0000000000001000 R11: ffffffff82f5a0c9 R12: ffff88014e5947e8
      [141812.036790] R13: 00000000000b4000 R14: ffff8801b234d008 R15: 0000000000000000
      [141812.036790] FS:  00007fdba6ffd700(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
      [141812.036790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [141812.036790] CR2: 00007fdb9c000010 CR3: 000000016efa2000 CR4: 00000000001406e0
      [141812.036790] Call Trace:
      [141812.036790]  btrfs_log_inode+0x9f0/0xd3d [btrfs]
      [141812.036790]  ? __mutex_lock+0x120/0x3ce
      [141812.036790]  btrfs_log_inode_parent+0x224/0x685 [btrfs]
      [141812.036790]  ? lock_acquire+0x16b/0x1af
      [141812.036790]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
      [141812.036790]  btrfs_sync_file+0x32e/0x3f8 [btrfs]
      [141812.036790]  vfs_fsync_range+0x8a/0x9d
      [141812.036790]  vfs_fsync+0x1c/0x1e
      [141812.036790]  do_fsync+0x31/0x4a
      [141812.036790]  SyS_fdatasync+0x13/0x17
      [141812.036790]  entry_SYSCALL_64_fastpath+0x18/0xad
      [141812.036790] RIP: 0033:0x7fdbac41a47d
      [141812.036790] RSP: 002b:00007fdba6ffce30 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
      [141812.036790] RAX: ffffffffffffffda RBX: ffffffff81092c9f RCX: 00007fdbac41a47d
      [141812.036790] RDX: 0000004cf0160a40 RSI: 0000000000000000 RDI: 0000000000000006
      [141812.036790] RBP: ffffc90009007f98 R08: 0000000000000000 R09: 0000000000000010
      [141812.036790] R10: 00000000000002e8 R11: 0000000000000293 R12: ffffffff8110cd90
      [141812.036790] R13: ffffc90009007f78 R14: 0000000000000000 R15: 0000000000000000
      [141812.036790]  ? time_hardirqs_off+0x9/0x14
      [141812.036790]  ? trace_hardirqs_off_caller+0x1f/0xa3
      [141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89
      [141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: ffffc90009007bc0
      [141812.084448] ---[ end trace 44e472684c7a32cc ]---
      
      Which happens because the code that logs a trailing hole when the no-holes
      feature is enabled, did not consider that a compressed inline extent can
      represent a range with a size matching the sector size, in which case
      expanding the inode's i_size, through a truncate operation, won't lead
      to padding with zeroes the page that represents the inline extent, and
      therefore the inline extent remains after the truncation.
      
      Fix this by adapting the assertion to accept inline extents representing
      data with a sector size length if, and only if, the inline extents are
      compressed.
      
      A sample and trivial reproducer (for systems with a 4K page size) for this
      issue:
      
        mkfs.btrfs -O no-holes -f /dev/sdc
        mount -o compress /dev/sdc /mnt
        xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar
        sync
        xfs_io -c "truncate 32K" /mnt/foobar
        xfs_io -c "fsync" /mnt/foobar
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: avoid unnecessarily locking inode when clearing a range · 4a4b964f
      Filipe Manana authored
      If the range being cleared was not marked for defrag and we are not
      about to clear the range from the defrag status, we don't need to
      lock and unlock the inode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Chris Mason <clm@fb.com>
      Reviewed-by: Wang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant check on ret being non-zero · 938e1c77
      Colin Ian King authored
      The error return variable ret is initialized to zero and then is
      checked to see if it is non-zero in the if-block that follows it.
      It is therefore impossible for ret to be non-zero after the if-block
      hence the check is redundant and can be removed.
      
      Detected by CoverityScan, CID#1021040 ("Logically dead code")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: expose internal free space tree routine only if sanity tests are enabled · 2d77ab3c
      Nikolay Borisov authored
      The internal free space tree management routines are always exposed for
      testing purposes. Make them dependent on SANITY_TESTS being on so that
      they are exposed only when they really have to.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove unused sectorsize variable from struct map_lookup · db7c942c
      Nikolay Borisov authored
      This variable was added in 1abe9b8a ("Btrfs: add initial tracepoint
      support for btrfs"), yet it never really got used, only assigned to. So
      let's remove it.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove never-reached WARN_ON · 92ac58ec
      Nikolay Borisov authored
      We have a WARN_ON(!var) inside an if branch which is executed (among
      others) only when var is true.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused BTRFS_COMPRESS_LAST · dc2f2921
      Anand Jain authored
      We aren't using this define, so remove it.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use BTRFS_FSID_SIZE for fsid · b94417ea
      Anand Jain authored
      We have a define for the FSID size, so use it.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use appropriate define for the fsid · 44880fdc
      Anand Jain authored
      Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we
      should use the matching constant for the fsid buffer.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: increase ctx->pos for delayed dir index · 42e9cc46
      Josef Bacik authored
      Our dir_context->pos is supposed to hold the next position we should
      look at.  If we successfully insert a delayed dir index we
      could end up with a duplicate entry because we don't increase ctx->pos
      after doing the dir_emit.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42e9cc46
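The effect of the missing increment can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `toy_dir_context`, `toy_entry`, `toy_dir_emit` and `toy_readdir` are hypothetical stand-ins, not the kernel's VFS types or the btrfs delayed-inode code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the VFS dir_context; only pos is modelled. */
struct toy_dir_context {
	long long pos;   /* next directory index to look at */
	int emitted;     /* how many entries were handed to the caller */
};

struct toy_entry {
	long long index;
	const char *name;
};

/* Pretend emit: counts the call and always reports success. */
static int toy_dir_emit(struct toy_dir_context *ctx, const struct toy_entry *e)
{
	(void)e;
	ctx->emitted++;
	return 1;
}

/* Emit every entry at or after ctx->pos, advancing pos past each one. */
static void toy_readdir(struct toy_dir_context *ctx,
			const struct toy_entry *ents, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (ents[i].index < ctx->pos)
			continue;	/* already returned by an earlier call */
		ctx->pos = ents[i].index;
		if (!toy_dir_emit(ctx, &ents[i]))
			return;
		ctx->pos++;	/* without this, the last entry is emitted again */
	}
}
```

In this model, calling `toy_readdir` a second time emits nothing new because `pos` already points past the last entry; dropping the `ctx->pos++` line would make the second call return the final entry again.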
  3. 16 Aug, 2017 9 commits
    • Josef Bacik's avatar
      btrfs: fix readdir deadlock with pagefault · 23b5ec74
      Josef Bacik authored
      Readdir does dir_emit while under the btree lock.  dir_emit can trigger
      the page fault which means we can deadlock.  Fix this by allocating a
      buffer on opening a directory and copying the readdir into this buffer
      and doing dir_emit from outside of the tree lock.
      
      Thread A
      readdir  <holding tree lock>
        dir_emit
          <page fault>
            down_read(mmap_sem)
      
      Thread B
      mmap write
        down_write(mmap_sem)
          page_mkwrite
            wait_ordered_extents
      
      Process C
      finish_ordered_extent
        insert_reserved_file_extent
         try to lock leaf <hang>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ copy the deadlock scenario to changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      23b5ec74
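The two-phase approach described in this changelog (copy entries while locked, emit after unlocking) might look roughly like the following user-space sketch. All names here (`toy_tree`, `toy_dir_file`, `toy_emit`, `toy_readdir`) are hypothetical, the lock is modelled as a plain flag, and the buffer lives in a per-open-file struct to mirror "allocating a buffer on opening a directory"; none of this is the actual btrfs code.

```c
#include <stddef.h>
#include <stdio.h>

#define TOY_BUF_ENTRIES 16
#define TOY_NAME_LEN    32

/* Hypothetical tree: a lock flag plus a few directory entry names. */
struct toy_tree {
	int locked;
	const char *names[4];
	size_t count;
};

/* Hypothetical per-open-file state holding the readdir buffer. */
struct toy_dir_file {
	char buf[TOY_BUF_ENTRIES][TOY_NAME_LEN];
	size_t used;
};

/* Counts emits that happened while the tree was still locked. */
static int emit_under_lock;

/* Stand-in for dir_emit, which may page-fault in the real kernel. */
static void toy_emit(const struct toy_tree *t, const char *name)
{
	(void)name;
	if (t->locked)
		emit_under_lock++;
}

static size_t toy_readdir(struct toy_tree *t, struct toy_dir_file *f)
{
	size_t i;

	/* Phase 1: copy entries into the private buffer under the lock. */
	t->locked = 1;
	f->used = 0;
	for (i = 0; i < t->count && f->used < TOY_BUF_ENTRIES; i++) {
		snprintf(f->buf[f->used], TOY_NAME_LEN, "%s", t->names[i]);
		f->used++;
	}
	t->locked = 0;

	/* Phase 2: emit from the buffer with the lock released. */
	for (i = 0; i < f->used; i++)
		toy_emit(t, f->buf[i]);

	return f->used;
}
```

Because the emit step never runs with the lock held in this model, a page fault inside it cannot end up waiting on a task that needs the same tree lock.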
    • Nikolay Borisov's avatar
      btrfs: Simplify math in should_alloc_chunk · 8d8aafee
      Nikolay Borisov authored
      Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to
      signify the total amount of bytes in this space info. However, given
      Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation
      of num_allocated it becomes a lot more clear to just eliminate num_bytes
      altogether and add the bytes_readonly to the amount of used space. That
      way we don't change the results of the following statements. In the
      process also start using btrfs_space_info_used.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d8aafee
    • Jeff Mahoney's avatar
      btrfs: account for pinned bytes in should_alloc_chunk · f44d2287
      Jeff Mahoney authored
      In a heavy write scenario, we can end up with a large number of pinned bytes.
      This can translate into (very) premature ENOSPC because pinned bytes
      must be accounted for when allowing a reservation but aren't accounted for
      when deciding whether to create a new chunk.
      
      This patch adds the accounting to should_alloc_chunk so that we can
      create the chunk.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f44d2287
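To see why leaving pinned bytes out of the chunk decision can matter, consider this simplified model. It is a sketch only: the struct, the helper names, and especially the 80% threshold are illustrative assumptions and do not reproduce the kernel's should_alloc_chunk logic.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a space info's counters. */
struct toy_space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_may_use;
	uint64_t bytes_readonly;
};

/* Sum the counters; pinned and may_use are optional to compare both views. */
static uint64_t toy_space_used(const struct toy_space_info *s, int count_pinned)
{
	uint64_t used = s->bytes_used + s->bytes_reserved + s->bytes_readonly;

	if (count_pinned)
		used += s->bytes_pinned + s->bytes_may_use;
	return used;
}

/* Assumed rule for illustration: allocate once 80% of the space is taken. */
static int toy_should_alloc_chunk(const struct toy_space_info *s,
				  int count_pinned)
{
	return toy_space_used(s, count_pinned) >= s->total_bytes / 10 * 8;
}
```

With 300 bytes used and 600 pinned out of 1000, the model declines to allocate a chunk when pinned bytes are ignored, even though a reservation that does count them would already see the space as nearly full.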
    • David Sterba's avatar
      btrfs: prepare for extensions in compression options · a7164fa4
      David Sterba authored
      This is a minimal patch intended to be backported to older kernels.
      We're going to extend the string specifying the compression method and
      this would fail on kernels before that change (the string is compared
      exactly).
      
      Relax the string matching only to the prefix, ie. ignoring anything that
      goes after "zlib" or "lzo", regardless of the format extension we decide
      to use. This applies to the mount options and properties.
      
      That way, patched old kernels could be booted on systems already
      utilizing the new compression spec.
      
      Applicable since commit 63541927, v3.14.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a7164fa4
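Prefix matching of the kind described here can be sketched with `strncmp`. The enum and the function name `parse_compress_type` are hypothetical, and the suffixes used when calling it are made up; the sketch only shows that anything following the known prefix is ignored.

```c
#include <string.h>

/* Hypothetical stand-ins for the compression type constants. */
enum compress_type { COMPRESS_NONE = 0, COMPRESS_ZLIB, COMPRESS_LZO };

/*
 * Compare only the leading characters, so a later format extension
 * appended after "zlib" or "lzo" is ignored instead of rejected.
 */
static enum compress_type parse_compress_type(const char *s)
{
	if (strncmp(s, "zlib", 4) == 0)
		return COMPRESS_ZLIB;
	if (strncmp(s, "lzo", 3) == 0)
		return COMPRESS_LZO;
	return COMPRESS_NONE;
}
```

An exact comparison such as `strcmp(s, "zlib") == 0` would reject a string with any trailing extension, which is the failure mode this backportable patch is meant to avoid.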
    • David Sterba's avatar
      btrfs: allow defrag compress to override NOCOMPRESS attribute · 1e20d1c4
      David Sterba authored
      Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a
      given file, except when the mount is force-compress. As users have
      reported on IRC, this will also prevent compression when requested by
      defrag (btrfs fi defrag -c file).
      
      The nocompress flag is set automatically by filesystem when the ratios
      are bad and the user would have to manually drop the bit in order to
      make defrag -c work. This is not good from the usability perspective.
      
      This patch will raise priority for the defrag -c over nocompress, ie.
      any file with NOCOMPRESS bit set will get defragmented. The bit will
      remain untouched.
      
      An alternate option was to also drop the nocompress bit and keep the
      decision logic as is, but I think this is not the right solution.
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e20d1c4
    • David Sterba's avatar
      1e2ef46d
    • David Sterba's avatar
      btrfs: separate defrag and property compression · eec63c65
      David Sterba authored
      Add new value for compression to distinguish between defrag and
      property. Previously, a single variable was used and this caused clashes
      when the per-file 'compression' was set and a defrag -c was called.
      
      The property-compression is loaded when the file is opened; defrag will
      overwrite the same variable and reset it to 0 (ie. NONE) when the file
      defragmentation is finished. That's considered a usability bug.
      
      Now we won't touch the property value and use the defrag-compression
      instead. The precedence of defrag is higher than for property (and
      whole-filesystem).
      Signed-off-by: David Sterba <dsterba@suse.com>
      eec63c65
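The precedence described in these changelogs (defrag over NOCOMPRESS, and defrag over property over whole-filesystem) can be sketched as a single decision function. The names below are hypothetical, and the relative order of the force-compress mount option and the per-file property is an assumption, since the changelogs do not spell it out.

```c
/* Hypothetical stand-ins for the compression types. */
enum toy_compress { TOY_NONE = 0, TOY_ZLIB, TOY_LZO };

struct toy_inode {
	int nocompress;                    /* models BTRFS_INODE_NOCOMPRESS */
	enum toy_compress prop_compress;   /* set via the per-file property */
	enum toy_compress defrag_compress; /* set only while defrag -c runs */
};

static enum toy_compress toy_pick_compress(const struct toy_inode *ino,
					   enum toy_compress mount_compress,
					   int force_compress)
{
	/* A defrag request wins over everything, including NOCOMPRESS. */
	if (ino->defrag_compress != TOY_NONE)
		return ino->defrag_compress;
	/* NOCOMPRESS blocks compression unless the mount forces it. */
	if (ino->nocompress && !force_compress)
		return TOY_NONE;
	/* Assumed order: the property takes precedence over the mount default. */
	if (ino->prop_compress != TOY_NONE)
		return ino->prop_compress;
	return mount_compress;
}
```

Keeping `defrag_compress` separate means that clearing it after a defrag run leaves `prop_compress` intact, which is the usability fix the changelog describes.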
    • David Sterba's avatar
      btrfs: rename variable holding per-inode compression type · b52aa8c9
      David Sterba authored
      This is preparatory for separating inode compression requested by defrag
      and set via properties. This will fix a usability bug when defrag will
      reset compression type to NONE. If the file has compression set via
      property, it will not apply anymore (until next mount or reset through
      command line).
      
      We're going to fix that by adding another variable just for the defrag
      call and won't touch the property. The defrag will have higher priority
      when deciding whether to compress the data.
      Signed-off-by: David Sterba <dsterba@suse.com>
      b52aa8c9
    • Timofey Titovets's avatar
      Btrfs: add skeleton code for compression heuristic · c2fcdcdf
      Timofey Titovets authored
      Add skeleton code for compression heuristics. Now it iterates over all
      the pages, but in the end always says "yes, compress please", ie it does
      not change the current behaviour.
      
      In the future we're going to add various heuristics to analyze the data.
      This patch can be used as a baseline for measuring the effectiveness
      and performance.
      Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ enhanced changelog, modified comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2fcdcdf
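A skeleton of this shape, one that walks the data page by page but always answers "compress", might look like the following user-space sketch. The function name, the page-size constant, and the flat byte buffer are assumptions made for illustration; the kernel code works on page cache pages instead.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096

/*
 * Walk the range one page at a time, as a future heuristic would,
 * but always return 1 so current behaviour is unchanged.
 */
static int toy_compress_heuristic(const unsigned char *data, size_t len)
{
	unsigned long touched = 0;
	size_t off;

	for (off = 0; off < len; off += TOY_PAGE_SIZE)
		touched += data[off];	/* placeholder for real sampling */

	(void)touched;
	return 1;	/* "yes, compress please" */
}
```

Since the result is constant, timing this loop gives a baseline cost for the page walk against which later heuristics can be compared.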