1. 03 Jun, 2015 25 commits
    • Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana authored
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abort that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
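
      The idea of reserving up front can be sketched in user-space C. This is a
      minimal single-threaded model, not the kernel code: struct sys_space,
      reserve_chunk_space() and release_chunk_space() are hypothetical names, and
      a real implementation would serialize access with a lock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the system space_info; not a kernel structure. */
struct sys_space {
    uint64_t free_bytes;     /* space not yet promised to any task */
    uint64_t reserved_bytes; /* space promised to in-flight chunk allocations */
};

/*
 * Phase 1: reserve the bytes a task may need to COW chunk tree nodes.
 * Returns false when the pool is exhausted, so the caller learns about the
 * shortage before phase 2 instead of failing in the middle of it.
 */
static bool reserve_chunk_space(struct sys_space *s, uint64_t bytes)
{
    if (s->free_bytes < bytes)
        return false; /* a caller would allocate a new system chunk here */
    s->free_bytes -= bytes;
    s->reserved_bytes += bytes;
    return true;
}

/* Phase 2 finished: drop the reservation that was consumed by the update. */
static void release_chunk_space(struct sys_space *s, uint64_t bytes)
{
    s->reserved_bytes -= bytes;
}
```

      With room for two reservations, a third concurrent task is turned away in
      phase 1, which is the point where it can still allocate a new system chunk.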
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap · 0d2b2372
      Josef Bacik authored
      fiemap should flag prealloc'ed extents as UNWRITTEN; until now it did not.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: show subvol= and subvolid= in /proc/mounts · c8d3fe02
      Omar Sandoval authored
      Now that we're guaranteed to have a meaningful root dentry, we can just
      export seq_dentry() and use it in btrfs_show_options(). The subvolume ID
      is easy to get and can also be useful, so put that in there, too.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: unify subvol= and subvolid= mounting · 05dbe683
      Omar Sandoval authored
      Currently, mounting a subvolume with subvolid= takes a different code
      path than mounting with subvol=. This isn't really a big deal except for
      the fact that mounts done with subvolid= or the default subvolume don't
      have a dentry that's connected to the dentry tree like in the subvol=
      case. To unify the code paths, when given subvolid= or using the default
      subvolume ID, translate it into a subvolume name by walking
      ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fail on mismatched subvol and subvolid mount options · bb289b7b
      Omar Sandoval authored
      There's nothing to stop a user from passing both subvol= and subvolid=
      to mount, but if they don't refer to the same subvolume, someone is
      going to be surprised at some point. Error out on this case, but allow
      users to pass in both if they do match (which they could, for example,
      get out of /proc/mounts).
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: clean up error handling in mount_subvol() · fa330659
      Omar Sandoval authored
      In preparation for new functionality in mount_subvol(), give it
      ownership of subvol_name and tidy up the error paths.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove all subvol options before mounting top-level · e6e4dbe8
      Omar Sandoval authored
      Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'.
      But, this means that if the user passes both a subvol and subvolid for
      some reason, we won't actually mount the top-level when we recursively
      mount. For example, consider:
      
      mkfs.btrfs -f /dev/sdb
      mount /dev/sdb /mnt
      btrfs subvol create /mnt/subvol1 # subvolid=257
      btrfs subvol create /mnt/subvol2 # subvolid=258
      umount /mnt
      mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt
      
      In the final mount, subvol=/subvol1,subvolid=258 becomes
      subvolid=0,subvolid=258, and the last option takes precedence, so we
      mount subvol2 and try to look up subvol1 inside of it, which fails.
      
      So, instead, do a thorough scan through the argument list and remove any
      subvol= and subvolid= options, then append subvolid=0 to the end. This
      implicitly makes subvol= take precedence over subvolid=, but we're about
      to add a stricter check for that. This also makes setup_root_args() more
      generic, which we'll need soon.
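
      The scan-and-append step can be illustrated with a small user-space C
      function. This is a sketch of the described behaviour, not the kernel's
      setup_root_args(); strip_subvol_opts() and its buffer-based interface are
      assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the comma-separated options in `args` to `out`, dropping every
 * subvol= and subvolid= entry, then append subvolid=0 at the end.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int strip_subvol_opts(const char *args, char *out, size_t outsz)
{
    size_t used = 0;
    const char *p = args;
    int n;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*p) {
        size_t len = strcspn(p, ","); /* length of the current option */
        int is_subvol = (len >= 7 && strncmp(p, "subvol=", 7) == 0) ||
                        (len >= 9 && strncmp(p, "subvolid=", 9) == 0);

        if (len > 0 && !is_subvol) {
            /* room for a separator, the option, and the trailing NUL */
            if (used + len + 2 > outsz)
                return -1;
            if (used)
                out[used++] = ',';
            memcpy(out + used, p, len);
            used += len;
            out[used] = '\0';
        }
        p += len;
        if (*p == ',')
            p++;
    }

    /* the top-level subvolume ID goes last, so nothing can override it */
    n = snprintf(out + used, outsz - used, "%ssubvolid=0", used ? "," : "");
    if (n < 0 || (size_t)n >= outsz - used)
        return -1;
    return 0;
}
```

      Given the options from the example above, subvol=/subvol1,subvolid=258,
      this produces just subvolid=0; unrelated options are kept in order.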
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: lock superblock before remounting for rw subvol · 773cd04e
      Omar Sandoval authored
      Since commit 0723a047 ("btrfs: allow mounting btrfs subvolumes with
      different ro/rw options"), when mounting a subvolume read/write when
      another subvolume has previously been mounted read-only, we first do a
      remount. However, this should be done with the superblock locked, as per
      sync_filesystem():
      
      	/*
      	 * We need to be protected against the filesystem going from
      	 * r/o to r/w or vice versa.
      	 */
      	WARN_ON(!rwsem_is_locked(&sb->s_umount));
      
      This WARN_ON can easily be hit with:
      
      mkfs.btrfs -f /dev/vdb
      mount /dev/vdb /mnt
      btrfs subvol create /mnt/vol1
      btrfs subvol create /mnt/vol2
      umount /mnt
      mount -oro,subvol=/vol1 /dev/vdb /mnt
      mount -orw,subvol=/vol2 /dev/vdb /mnt2
      
      Fixes: 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options")
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: wake up extent state waiters on unlock through clear_extent_bits · 0f31871f
      Filipe Manana authored
      When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
      through free_io_failure(), we weren't waking up any tasks waiting for the
      extent state's EXTENT_LOCKED bit, leading to a hang.
      
      So make sure clear_extent_bits() ends up waking up any waiters if the
      bit EXTENT_LOCKED is supplied by its callers.
      
      Zygo Blaxell was experiencing such hangs at inode eviction time after
      file unlinks. Thanks to him for a set of scripts to reproduce the issue.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix chunk allocation regression leading to transaction abort · c152b63e
      Filipe Manana authored
      With commit 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction
      in case device tree has hole") introduced in the kernel 4.1 merge window,
      we end up using part of a device hole for which there are already pending
      chunks or pinned chunks. Before that commit we didn't use the hole and
      would just move on to the next hole in the device.
      
      However when we adjust the start offset for the chunk allocation and we
      have pinned chunks, we set it blindly to the end offset of the pinned
      chunk we are currently processing, which is dangerous because we can
      have a pending chunk that has a start offset that matches the end offset
      of our pinned chunk - leading us to a case where we end up getting two
      pending chunks that start at the same physical device offset, which makes
      us later abort the current transaction with -EEXIST when finishing the
      chunk allocation at btrfs_create_pending_block_groups():
      
      [194737.659017] ------------[ cut here ]------------
      [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [194737.662209] BTRFS: Transaction aborted (error -17)
      [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
      [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
      [194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
      [194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
      [194737.687509] Call Trace:
      [194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
      [194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
      [194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      [194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
      [194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
      [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists
      
      The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
      btrfs_create_pending_block_groups(), when it attempts to insert a
      duplicated device extent item via btrfs_alloc_dev_extent().
      
      This issue was reproducible with fstests generic/038 running in a loop for
      several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
      Applying Jeff's recent patch titled "btrfs: add missing discards when
      unpinning extents with -o discard" makes the issue much easier to reproduce
      (usually within 4 to 5 hours), since it pins chunks for longer periods of
      time when an unused block group is deleted by the cleaner kthread.
      
      Fix this by making sure that we never adjust the start offset to a lower
      value than it currently has.
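
      The "never move backwards" rule amounts to taking a maximum. A minimal
      sketch, assuming a hypothetical helper advance_search_start() rather than
      the actual kernel function:

```c
#include <stdint.h>

/*
 * Move the search start past a pinned or pending chunk, but never to a
 * lower offset than the one already reached.
 */
static uint64_t advance_search_start(uint64_t search_start, uint64_t chunk_end)
{
    return chunk_end > search_start ? chunk_end : search_start;
}
```

      If the start was already pushed to offset 100 and the pinned chunk being
      processed ends at 80, the start stays at 100 instead of dropping to 80.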
      
      Fixes: 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use after free when closing devices · 2037a093
      Sasha Levin authored
      __btrfs_close_devices() would call_rcu to free the device, which is racy with
      list_for_each_entry() accessing the memory to retrieve the next device on the
      list.
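
      The general hazard, reading a node's link after the node has been handed
      off for freeing, can be shown in user-space C. This sketch illustrates the
      usual remedy of saving the next pointer first (the idea behind the kernel's
      list_for_each_entry_safe()); it is not the patch itself, and struct dev,
      push_dev() and free_all_devs() are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical singly linked device list, for illustration only. */
struct dev {
    int id;
    struct dev *next;
};

/* Prepend a node; returns the new head (or the old one if malloc fails). */
static struct dev *push_dev(struct dev *head, int id)
{
    struct dev *d = malloc(sizeof(*d));

    if (!d)
        return head;
    d->id = id;
    d->next = head;
    return d;
}

/* Free every node, returning how many were freed. */
static int free_all_devs(struct dev *head)
{
    struct dev *cur = head;
    int freed = 0;

    while (cur) {
        struct dev *next = cur->next; /* read the link before freeing */

        free(cur);
        cur = next; /* never touches the freed node again */
        freed++;
    }
    return freed;
}
```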
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: make root id query unprivileged · 01b810b8
      David Sterba authored
      The INO_LOOKUP ioctl can look up the path for a given inode number and is
      thus restricted. As a side effect it can find the root id of the
      containing subvolume and we're using this in the 'btrfs inspect rootid'
      command.
      
      The restriction is unnecessary in case we set the ioctl args
       args::treeid    = 0
       args::objectid  = 256 (BTRFS_FIRST_FREE_OBJECTID)
      
      Then the path will be empty and the treeid is filled with the root id of
      the inode on which the ioctl is called. This behaviour is unchanged
      after the root restriction is removed.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix block group ->space_info null pointer dereference · 2e6e5183
      Filipe Manana authored
      When we create a block group we add it to the rbtree of block groups
      before setting its ->space_info field (while it's NULL). This is
      problematic since other tasks can access the block group from the
      rbtree and attempt to use its ->space_info before it is set by
      btrfs_make_block_group().
      
      This can happen for example when a concurrent fitrim ioctl operation
      is ongoing, which produces a trace like the following when
      CONFIG_DEBUG_PAGEALLOC is set.
      
      [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
      [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
      [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
      [11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
      [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
      [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
      [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
      [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
      [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
      [11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
      [11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
      [11509.608179] Stack:
      [11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
      [11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
      [11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
      [11509.608179] Call Trace:
      [11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
      [11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
      [11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
      [11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
      [11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
      [11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
      [11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
      [11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
      [11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
      [11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
      [11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
      [11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
      [11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
      [11509.608179]  RSP <ffff8801b3edf9e8>
      [11509.608179] CR2: 0000000000000018
      [11509.608179] ---[ end trace 570a5c6769f0e49a ]---
      
      Which corresponds to the following access in fs/btrfs/free-space-cache.c:
      
        static int do_trimming(struct btrfs_block_group_cache *block_group,
                               u64 *total_trimmed, u64 start, u64 bytes,
                               u64 reserved_start, u64 reserved_bytes,
                               struct btrfs_trim_range *trim_entry)
        {
             struct btrfs_space_info *space_info = block_group->space_info;
        (...)
             spin_lock(&space_info->lock);
             ^^^^^ - block_group->space_info is NULL...
      
      Fix this by ensuring the block group's ->space_info is set before adding
      the block group to the rbtree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check error before reporting missing device and add uuid · 33b97e43
      Anand Jain authored
      Report the missing device only when adding it succeeds;
      otherwise the function exits with ENOMEM. Also add the uuid
      to the report.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix superblock csum type check. · 1f6e4b3f
      Qu Wenruo authored
      The old csum type check is wrong and can't catch csum_type 1 (not supported).

      Fix it to avoid a division by zero on a hostile image.
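
      Why an unchecked type can end in a division by zero is easy to show in
      isolation. The following is a hedged user-space sketch: the table
      csum_sizes[], csum_size_for_type() and csums_per_block() are hypothetical
      and only assume that type 0 is the single supported checksum type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size table: only checksum type 0 (4 bytes) is supported. */
static const uint32_t csum_sizes[] = { 4 };
#define N_CSUM_TYPES (sizeof(csum_sizes) / sizeof(csum_sizes[0]))

/* Returns the checksum size, or 0 when the type is not supported. */
static uint32_t csum_size_for_type(uint16_t type)
{
    if (type >= N_CSUM_TYPES)
        return 0;
    return csum_sizes[type];
}

/*
 * Number of checksums that fit in `bytes`, or -1 for an unsupported type.
 * The type is rejected before the size is ever used as a divisor.
 */
static long csums_per_block(uint16_t type, uint32_t bytes)
{
    uint32_t size = csum_size_for_type(type);

    if (size == 0)
        return -1;
    return (long)(bytes / size);
}
```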
      Reported-by: Lukas Lueg <lukas.lueg@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: incremental send, fix clone operations for compressed extents · 619d8c4e
      Filipe Manana authored
      Marc reported a problem where the receiving end of an incremental send
      was performing clone operations that failed with -EINVAL. This happened
      because, unlike for uncompressed extents, we were not checking if the
      source clone offset and length, after summing the data offset, fall
      within the source file's boundaries.
      
      So make sure we do such checks when attempting to issue clone operations
      for compressed extents.
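
      The kind of boundary check described here can be sketched as a pure
      function. This is an illustration under assumptions, not the send code:
      clone_range_in_file() and its parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the range [clone_offset + data_offset, + len) lies
 * entirely inside a source file of src_size bytes. Written so that none of
 * the additions can wrap around.
 */
static bool clone_range_in_file(uint64_t clone_offset, uint64_t data_offset,
                                uint64_t len, uint64_t src_size)
{
    uint64_t start;

    if (clone_offset > UINT64_MAX - data_offset)
        return false; /* the sum itself would overflow */
    start = clone_offset + data_offset;
    if (start > src_size)
        return false;
    return len <= src_size - start;
}
```

      A sender would only issue the clone operation when such a check passes,
      and fall back to sending the data otherwise.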
      
      Problem reproducible with the following steps:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount -o compress /dev/sdc /mnt2
      
        # Create the file with a single extent of 128K. This creates a metadata file
        # extent item with a data start offset of 0 and a logical length of 128K.
        $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
      
        # Now rewrite 112K of our file starting at offset 64K (the range 64K to 176K).
        # This will make the inode's metadata continue to point to the 128K extent we
        # created before, but now with an extent item that points to the extent with a
        # data start offset of 112K and a logical length of 16K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 192K.
        $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
      
        # Now rewrite 12K starting at offset 180K (the range 180K to 192K). This will
        # make the inode's metadata continue to point to the 128K extent we created
        # earlier, with a single extent item that points to it with a start offset of
        # 112K and a logical length of 4K.
        # That metadata file extent item is associated with the logical file offset
        # at 176K and covers the logical file range 176K to 180K.
        $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ touch /mnt/bar
        # Calls the btrfs clone ioctl.
        $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
          -l $((4 * 1024)) /mnt/foo /mnt/bar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 | btrfs receive /mnt2
        At subvol /mnt/snap1
        At subvol snap1
      
        $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
        At subvol /mnt/snap2
        At snapshot snap2
        ERROR: failed to clone extents to bar
        Invalid argument
      
      A test case for fstests follows soon.
      Reported-by: Marc MERLIN <marc@merlins.org>
      Tested-by: Marc MERLIN <marc@merlins.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation() · ab3680dd
      Christian Engelmayer authored
      Commit 9c8b35b1 ("btrfs: quota: Automatically update related qgroups or
      mark INCONSISTENT flags when assigning/deleting a qgroup relations.")
      introduced the allocation of a temporary ulist in function
      btrfs_add_qgroup_relation() and added the corresponding cleanup to the out
      path. However, the allocation was introduced before the src/dst level check
      that directly returns. Fix the possible leakage of the ulist by moving the
      allocation after the input validation. Detected by Coverity CID 1295988.
      Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix mutex unlock without prior lock on space cache truncation · 35c76642
      Filipe Manana authored
      If the call to btrfs_truncate_inode_items() failed and we don't have a block
      group, we were unlocking the cache_write_mutex without having locked it (we
      do it only if we have a block group).
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout outside critical section in commit")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: log when missing device is created · 816fcebe
      Anand Jain authored
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix warnings after changes in btrfs_abort_transaction · 6d13f549
      David Sterba authored
      fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
      fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, tree_root,
         ^
        CC [M]  fs/btrfs/ioctl.o
      fs/btrfs/ioctl.c: In function ‘create_subvol’:
      fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
         btrfs_abort_transaction(trans, root, PTR_ERR(new_root));
      
      PTR_ERR returns long, but we're really using 'int' for the error codes
      everywhere so just set and use the local variable.
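
      The pattern of narrowing the long to a local int once can be sketched in
      user space. The helpers below are simplified re-creations for illustration
      (err_ptr(), ptr_err(), abort_transaction() and handle_new_root() are
      hypothetical names, not the kernel's macros or functions).

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified user-space stand-ins for the kernel's error-pointer helpers. */
static void *err_ptr(long error) { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

static int last_abort_errno;

/* Takes an int, so the %d format matches the argument type. */
static void abort_transaction(int errno_val)
{
    last_abort_errno = errno_val;
    fprintf(stderr, "Transaction aborted (error %d)\n", errno_val);
}

/* Assumes new_root holds an error pointer, as on the failure path. */
static int handle_new_root(void *new_root)
{
    int ret = (int)ptr_err(new_root); /* narrow long to int in one place */

    abort_transaction(ret);
    return ret;
}
```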
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: add 'cold' compiler annotations to all error handling functions · c0d19e2b
      David Sterba authored
      The annotated functions will be placed into the .text.unlikely section. The
      annotation also hints the compiler to move the code out of the hot paths,
      and may implicitly mark the if-statement leading to that block as unlikely.

      This is a heuristic; the impact on the generated code is not
      significant.
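
      Outside the kernel, the same hint is available through the GCC/Clang
      function attribute. A minimal sketch, assuming a GCC-compatible compiler;
      report_error() and checked_div() are hypothetical names.

```c
#include <stdio.h>

/* Error reporter marked cold: the compiler may place it away from hot code. */
__attribute__((cold)) static int report_error(const char *what, int err)
{
    fprintf(stderr, "error in %s: %d\n", what, err);
    return err;
}

/* The branch that calls a cold function may be treated as unlikely. */
static int checked_div(int a, int b, int *out)
{
    if (b == 0)
        return report_error("checked_div", -1);
    *out = a / b;
    return 0;
}
```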
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: report exact callsite where transaction abort occurs · 1a9a8a71
      David Sterba authored
      WARN is called from a single location and all bug reports say that's in
      super.c __btrfs_abort_transaction. This is slightly confusing as we'd
      rather want to know the exact callsite. Although this information is
      printed in the syslog below the stacktrace, finding it requires a closer
      look and we usually see only the headline from WARNING.
      
      Moving the WARN into the macro has to inline some code and increases
      code by a few kilobytes:
      
        text    data     bss     dec     hex filename
      835481   20305   14120  869906   d4612 btrfs.ko.before
      842883   20305   14120  877308   d62fc btrfs.ko.after
      
      The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
      The increase is not small and could lead to worse icache use. The code
      is on error/exit paths that can be recognized by compiler as cold and
      moved out of the way so the impact is speculated to be low, if
      measurable at all.
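
      How a macro captures the caller's location, while a plain function only
      knows its own, can be sketched in user-space C. ABORT_TXN() and
      record_abort() are hypothetical names used only for this illustration.

```c
#include <stdio.h>

static int last_line; /* line number of the most recent abort site */

/* Receives the location from the macro instead of using its own. */
static void record_abort(const char *file, int line, int err)
{
    last_line = line;
    fprintf(stderr, "%s:%d: transaction aborted (error %d)\n", file, line, err);
}

/* __FILE__ and __LINE__ expand where the macro is used, not where it is defined. */
#define ABORT_TXN(err) record_abort(__FILE__, __LINE__, (err))
```

      The cost noted above follows from the same mechanism: whatever sits in the
      macro body is duplicated at every one of the call sites.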
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: let tree defrag work in SSD mode · 13028901
      David Sterba authored
      A long time ago (2008) the defrag was automatic for new b-tree writes but
      it was disabled after performance problems. There was a leftover in
      tree-defrag.c that effectively stops any defragmentation on b-trees in
      SSD mode. This is a bit unexpected and IMHO undesired. The SSD mode is an
      optimization and defrag is supposed to work if the user asks for it.
      
      Related commits:
      
      6702ed49
      Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
      
      e18e4809
      Btrfs: Add mount -o ssd, which includes optimizations for seek free
      storage
      
      b3236e68
      Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there
      
      9afbb0b7
      Btrfs: Disable tree defrag in SSD mode
      
      The last three commits switch the defrag+ssd off/on/off and the last one
      
      3f157a2f
      Btrfs: Online btree defragmentation fixes
      
      misses the bits from tree-defrag.c to revert to the behaviour introduced
      in e18e4809.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check pending chunks when shrinking fs to avoid corruption · 53e489bc
      Filipe Manana authored
      When we shrink the usable size of a device (its total_bytes), we go over
      all the device extent items in the device tree and attempt to relocate
      the chunk of any device extent that goes beyond the new usable size for
      the device. We do that after setting the new usable size (total_bytes) in
      the device object, so that all new allocations (and reallocations) don't
      use areas of the device that go beyond the new (shorter) size. However we
      were not considering that before setting the new size in the device,
      pending chunks might have been created that use device extents that go
      beyond the new size, and those device extents are not yet in the device
      tree after we search the device tree - they are still attached to the
      list of new block groups for some ongoing transaction handle, and they are
      only added to the device tree when the transaction handle is ended (via
      btrfs_create_pending_block_groups()).
      
      So check for pending chunks with device extents that go beyond the new
      size and if any exists, commit the current transaction and repeat the
      search in the device tree.
      
      Not doing this would mean we would return success to user space while
      still having extents that go beyond the new size, and later user space
      could overwrite those locations on the device while the fs still references
      them, causing all sorts of corruption and unexpected events.
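
      The extra check on pending chunks can be sketched as a scan over their
      device extents. This is a user-space model under assumptions: struct
      dev_extent and pending_beyond_size() are hypothetical, and the list of
      pending extents is represented as a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device extent of a pending chunk: [start, start + len). */
struct dev_extent {
    uint64_t start;
    uint64_t len;
};

/* True if any pending extent reaches past the new, smaller device size. */
static bool pending_beyond_size(const struct dev_extent *pending, size_t n,
                                uint64_t new_size)
{
    for (size_t i = 0; i < n; i++) {
        if (pending[i].start + pending[i].len > new_size)
            return true;
    }
    return false;
}
```

      When such a check reports true, the shrink path described above commits
      the current transaction and repeats the device tree search.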
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: don't invalidate root dentry when subvolume deletion fails · 64ad6c48
      Omar Sandoval authored
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate"),
      mounted subvolumes can be deleted because d_invalidate() won't fail.
      However, we run into problems when we attempt to delete the default
      subvolume while it is mounted as the root filesystem:
      
      	# btrfs subvol list /
      	ID 257 gen 306 top level 5 path rootvol
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs subvol get-default /
      	ID 267 gen 334 top level 5 path snap1
      	# btrfs inspect-internal rootid /
      	267
      	# mount -o subvol=/ /dev/vda1 /mnt
      	# btrfs subvol del /mnt/snap1
      	Delete subvolume (no-commit): '/mnt/snap1'
      	ERROR: cannot delete '/mnt/snap1' - Operation not permitted
      	# findmnt /
      	findmnt: can't read /proc/mounts: No such file or directory
      	# ls /proc
      	#
      
      Markus reported that this same scenario simply led to a kernel oops.
      
      This happens because in btrfs_ioctl_snap_destroy(), we call
      d_invalidate() before we check may_destroy_subvol(), which means that we
      detach the submounts and drop the dentry before erroring out. Instead,
      we should only invalidate the dentry once the deletion has succeeded.
      Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
      will prune the dcache for the deleted subvolume.
      
      Cc: <stable@vger.kernel.org>
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Reported-by: Markus Schauler <mschauler@gmail.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 01 Jun, 2015 1 commit
  3. 31 May, 2015 13 commits
  4. 30 May, 2015 1 commit