1. 17 Dec, 2018 40 commits
    • Filipe Manana's avatar
      Btrfs: fix access to available allocation bits when starting balance · 5a8067c0
      Filipe Manana authored
      The available allocation bits members from struct btrfs_fs_info are
      protected by a sequence lock, and when starting balance we access them
      incorrectly in two different ways:
      
      1) In the read sequence lock loop at btrfs_balance() we use the values we
         read from fs_info->avail_*_alloc_bits and we can immediately do actions
         that have side effects and can not be undone (printing a message and
         jumping to a label). This is wrong because a retry might be needed, so
         our actions must not have side effects and must be repeatable as long
         as read_seqretry() returns a non-zero value. In other words, we were
         essentially ignoring the sequence lock;
      
      2) Right below the read sequence lock loop, we were reading the values
         from avail_metadata_alloc_bits and avail_data_alloc_bits without any
         protection from concurrent writers, that is, reading them outside of
         the read sequence lock critical section.
      
      So fix this by making sure we only read the available allocation bits
      while in a read sequence lock critical section and that what we do in the
      critical section is repeatable (has nothing that can not be undone) so
      that any eventual retry that is needed is handled properly.
      
      Fixes: de98ced9 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
      Fixes: 14506127 ("btrfs: fix a bogus warning when converting only data or metadata")
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a8067c0
    • Filipe Manana's avatar
      Btrfs: allow clear_extent_dirty() to receive a cached extent state record · 0e6ec385
      Filipe Manana authored
      We can have a lot freed extents during the life span of transaction, so
      the red black tree that keeps track of the ranges of each freed extent
      (fs_info->freed_extents[]) can get quite big. When finishing a
      transaction commit we find each range, process it (discard the extents,
      unpin them) and then remove it from the red black tree.
      
      We can use an extent state record as a cache when searching for a range,
      so that when we clean the range we can use the cached extent state we
      passed to the search function instead of iterating the red black tree
      again. Doing things as fast as possible when finishing a transaction (in
      state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
      block another task that wants to commit the next transaction.
      
      So change clear_extent_dirty() to allow an optional extent state record to
      be passed as an argument, which will be passed down to __clear_extent_bit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e6ec385
    • Nikolay Borisov's avatar
      btrfs: Handle final split-brain possibility during fsid change · cc5de4e7
      Nikolay Borisov authored
      This patch lands the last case which needs to be handled by the fsid
      change code. Namely, this is the case where a multidisk filesystem has
      already undergone at least one successful fsid change i.e all disks
      have the METADATA_UUID incompat bit and power failure occurs as another
      fsid change is in progress. When such an event occurs, disks could be
      split in 2 groups. One of the groups will have both METADATA_UUID and
      CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
      The other group of disks will have only METADATA_UUID bit set and their
      fsid will be different than the one in disks in the first group. Here
      we look at the following cases:
      
        a) A disk from the first group is scanned first, so fs_devices is
        created with stale fsid/metdata_uuid. Then when a disk from the
        second group is scanned it needs to first check whether there exists
        such an fs_devices that has fsid_change set to true (because it was
        created with a disk having the CHANGING_FSID_V2 flag), the
        metadata_uuid and fsid of the fs_devices will be different (since it was
        created by a disk which already has had at least 1 successful fsid change)
        and finally the metadata_uuid of the fs_devices will equal that of the
        currently scanned disk (because metadata_uuid never really changes).
        When the correct fs_devices is found the information from the scanned
        disk will replace the current one in fs_devices since the scanned disk
        will have higher generation number.
      
        b) A disk from the second group is scanned so fs_devices is created
        as usual with differing fsid/metdata_uid. Then when a disk from the
        first group is scanned the code detects that it has both
        CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
        fs_devices that has differing metadata_uuid/fsid and whose
        metadata_uuid is the same as that of the scanned device.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc5de4e7
    • Nikolay Borisov's avatar
      btrfs: Handle one more split-brain scenario during fsid change · 7a62d0f0
      Nikolay Borisov authored
      This commit continues hardening the scanning code to handle cases where
      power loss could have caused disks in a multi-disk filesystem to be
      in inconsistent state. Namely handle the situation that can occur when
      some of the disks in multi-disk fs have completed their fsid change i.e
      they have METADATA_UUID incompat flag set, have cleared the
      CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
      the same time the other half of the disks will have their
      fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
      
      This is handled by introducing code in the scan path which:
      
       a) Handles the case when a device with CHANGING_FSID_V2 flag is
       scanned and as a result btrfs_fs_devices is created with matching
       fsid/metdata_uuid. Subsequently, when a device with completed fsid
       change is scanned it will detect this via the new code in find_fsid
       i.e that such an fs_devices exist that fsid_change flag is set to true,
       it's metadata_uuid/fsid match and the metadata_uuid of the scanned
       device matches that of the fs_devices. In this case, it's important to
       note that the devices which has its fsid change completed will have a
       higher generation number than the device with FSID_CHANGING_V2 flag
       set, so its superblock block will be used during mount. To prevent an
       assertion triggering because the sb used for mounting will have
       differing fsid/metadata_uuid than the ones in the fs_devices struct
       also add code in device_list_add which overwrites the values in
       fs_devices.
      
       b) Alternatively we can end up with a device that completed its
       fsid change be scanned first which will create the respective
       btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
       case when a device with FSID_CHANGING_V2 flag set is scanned it will
       call the newly added find_fsid_inprogress function which will return
       the correct fs_devices.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a62d0f0
    • Nikolay Borisov's avatar
      btrfs: add members to fs_devices to track fsid changes · d1a63002
      Nikolay Borisov authored
      In order to gracefully handle split-brain scenario during fsid change
      (which are very unlikely, yet possible), two more pieces of information
      will be necessary:
      
      1. The highest generation number among all devices registered to a
         particular btrfs_fs_devices
      
      2. A boolean flag whether a given btrfs_fs_devices was created by a
         device which had the FSID_CHANGING_V2 flag set.
      
      This is a preparatory patch and just introduces the variables as well
      as code which sets them, their actual use is going to happen in a later
      patch.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1a63002
    • Nikolay Borisov's avatar
      btrfs: Add handling for disk split-brain scenario during fsid change · fbc6feae
      Nikolay Borisov authored
      Even though fsid change without rewrite is a very quick operation it's
      still possible to experience a split-brain scenario if power loss occurs
      at the most inconvenient time. This patch handles the case where power
      failure occurs while the first transaction (the one setting
      CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
      btrfs_fs_devices of this filesystem to be created by a device which:
      
       a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
      
       b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
          fsid value is intact
      
      This situation is trivially handled by the current find_fsid code since
      in both cases the devices are going to be treated like ordinary devices.
      Since btrfs is always mounted using the superblock of the latest
      device (the one with highest generation number), meaning it will have
      the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
      the first transaction commit following mount all disks will have it
      cleared.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbc6feae
    • Nikolay Borisov's avatar
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov authored
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de37aa51
    • Nikolay Borisov's avatar
      btrfs: Add sysfs support for metadata_uuid feature · 56f20f40
      Nikolay Borisov authored
      Since the metadata_uuid is a new incompat feature it requires the
      respective sysfs hooks. This patch adds the 'metdata_uuid' feature to
      be shown if it supported by the kernel. Additionally it adds
      /sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
      the current metadata_uuid.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56f20f40
    • Nikolay Borisov's avatar
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov authored
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metdata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7239ff4b
    • Johannes Thumshirn's avatar
      btrfs: use EXPORT_FOR_TESTS for conditionally exported functions · ce9f967f
      Johannes Thumshirn authored
      Several functions in BTRFS are only used inside the source file they are
      declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
      with the unit tests code.
      
      Before the introduction of the EXPORT_FOR_TESTS macro, these functions
      could not be declared as static and the compiler had a harder task when
      optimizing and inlining them.
      
      As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
      compiler.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce9f967f
    • Johannes Thumshirn's avatar
      btrfs: introduce EXPORT_FOR_TESTS macro · f8f591df
      Johannes Thumshirn authored
      Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
      functions are either local to the file they are implemented in and thus
      should be declared static or are called from within the test
      implementation defined in a different file.
      
      Introduce an EXPORT_FOR_TESTS macro which depending on
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
      function or not.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f8f591df
    • Johannes Thumshirn's avatar
      btrfs: remove unused drop_on_err in btrfs_mkdir · e9a05cf3
      Johannes Thumshirn authored
      Up to commit 32955c54 ("btrfs: switch to discard_new_inode()") the
      drop_on_err variable in btrfs_mkdir() was used to check whether the
      inode had to be dropped via iput().
      
      After commit 32955c54 ("btrfs: switch to discard_new_inode()")
      discard_new_inode() is called when err is set and inode is non NULL.
      Therefore drop_on_err is not used anymore and thus causes a warning when
      building with -Wunused-but-set-variable.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9a05cf3
    • Nikolay Borisov's avatar
      btrfs: Replace BUG_ON with ASSERT in find_lock_delalloc_range · 9bfd61d9
      Nikolay Borisov authored
      lock_delalloc_pages should only return 2 values - 0 in case of success
      and -EAGAIN if the range of pages to be locked should be shrunk due to
      some of gone. Manual inspections confirms that this is indeed the case
      since __process_pages_contig is where lock_delalloc_pages gets its
      return value. The latter always returns 0  or -EAGAIN so the invariant
      holds. No functional changes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9bfd61d9
    • Nikolay Borisov's avatar
      btrfs: Sink find_lock_delalloc_range's 'max_bytes' argument · 917aacec
      Nikolay Borisov authored
      All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
      reduce the argument count and make that a local variable. No functional
      changes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      917aacec
    • Nikolay Borisov's avatar
      btrfs: Remove superfluous check form btrfs_remove_chunk · 64bc6c2a
      Nikolay Borisov authored
      It's unnecessary to check map->stripes[i].dev for NULL given its value
      is already set and dereferenced above the the check. No functional
      changes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64bc6c2a
    • Anand Jain's avatar
      btrfs: don't report user-requested cancel as an error · f9085abf
      Anand Jain authored
      As of now only user requested replace cancel can cancel the
      replace-scrub so no need to log the error.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9085abf
    • Anand Jain's avatar
      btrfs: silence warning if replace is canceled · 49365e69
      Anand Jain authored
      When we successfully cancel the device replace, its scrub worker returns
      -ECANCELED, which is then passed to btrfs_dev_replace_finishing.
      
      It cleans up based on the returned status and propagates the same
      -ECANCELED back the parent function. As of now only user can cancel the
      replace-scrub, so its ok to silence the warning here.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49365e69
    • Anand Jain's avatar
      btrfs: dev-replace: add explicit check for replace result "no error" · 53e62fb5
      Anand Jain authored
      We recast the replace return status
      BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no
      error.
      And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0,
      which is also declared as 0, so we just return. Instead add it to the if
      statement so that there is enough clarity while reading the code.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53e62fb5
    • Anand Jain's avatar
      btrfs: dev-replace: replace's scrub must not be running in suspended state · fe97e2e1
      Anand Jain authored
      When the replace state is in the suspended state, btrfs_scrub_cancel()
      should fail with -ENOTCONN as there is no scrub running. As a safety
      catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it
      doesn't.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe97e2e1
    • Anand Jain's avatar
      btrfs: dev-replace: set result code of cancel by status of scrub · b47dda2e
      Anand Jain authored
      The device-replace needs to check the result code of the scrub workers
      in btrfs_dev_replace_cancel and distinguish if successful cancel
      operation and when the there was no operation running.
      
      If btrfs_scrub_cancel() fails, return
      BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try
      to cancel the replace again.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b47dda2e
    • Anand Jain's avatar
      btrfs: fix use-after-free due to race between replace start and cancel · d189dd70
      Anand Jain authored
      The device replace cancel thread can race with the replace start thread
      and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
      fail to stop the scrub thread.
      
      The scrub thread continues with the scrub for replace which then will
      try to write to the target device and which is already freed by the
      cancel thread.
      
      scrub_setup_ctx() warns as tgtdev is NULL.
      
        struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
        {
        ...
      	  if (is_dev_replace) {
      		  WARN_ON(!fs_info->dev_replace.tgtdev);  <===
      		  sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
      		  sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
      		  sctx->flush_all_writes = false;
      	  }
      
        [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
        [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
        [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
        ...
        [ 6852.432970] Call Trace:
        [ 6852.433202]  btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
        [ 6852.433471]  btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
        [ 6852.433800]  btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
        [ 6852.434097]  btrfs_ioctl+0x2476/0x2d20 [btrfs]
        [ 6852.434365]  ? do_sigaction+0x7d/0x1e0
        [ 6852.434623]  do_vfs_ioctl+0xa9/0x6c0
        [ 6852.434865]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435124]  ? syscall_trace_enter+0x1c8/0x310
        [ 6852.435387]  ksys_ioctl+0x60/0x90
        [ 6852.435663]  __x64_sys_ioctl+0x16/0x20
        [ 6852.435907]  do_syscall_64+0x50/0x180
        [ 6852.436150]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Further, as the replace thread enters scrub_write_page_to_dev_replace()
      without the target device it panics:
      
        static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
      				      struct scrub_page *spage)
        {
        ...
      	bio_set_dev(bio, sbio->dev->bdev); <======
      
        [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
        ..
        [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
        [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
        [btrfs]
        ..
        [ 6929.721430] Call Trace:
        [ 6929.721663]  scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
        [ 6929.721975]  scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
        [ 6929.722277]  normal_work_helper+0xf0/0x4c0 [btrfs]
        [ 6929.722552]  process_one_work+0x1f4/0x520
        [ 6929.722805]  ? process_one_work+0x16e/0x520
        [ 6929.723063]  worker_thread+0x46/0x3d0
        [ 6929.723313]  kthread+0xf8/0x130
        [ 6929.723544]  ? process_one_work+0x520/0x520
        [ 6929.723800]  ? kthread_delayed_work_timer_fn+0x80/0x80
        [ 6929.724081]  ret_from_fork+0x3a/0x50
      
      Fix this by letting the btrfs_dev_replace_finishing() to do the job of
      cleaning after the cancel, including freeing of the target device.
      btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
      along with the scrub return status.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d189dd70
    • Anand Jain's avatar
      btrfs: dev-replace: go back to suspend state if another EXCL_OP is running · 05c49e6b
      Anand Jain authored
      In a secnario where balance and replace co-exists as below,
      
        - start balance
        - pause balance
        - start replace
        - reboot
      
      and when system restarts, balance resumes first. Then the replace is
      attempted to restart but will fail as the EXCL_OP lock is already held
      by the balance. If so place the replace state back to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
      
      Fixes: 010a47bd ("btrfs: add proper safety check before resuming dev-replace")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05c49e6b
    • Anand Jain's avatar
      btrfs: dev-replace: go back to suspended state if target device is missing · 0d228ece
      Anand Jain authored
      At the time of forced unmount we place the running replace to
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
      back and expect the target device is missing.
      
      Then let the replace state continue to be in
      BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
      BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
      running as part of replace.
      
      Fixes: e93c89c1 ("Btrfs: add new sources for device replace code")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d228ece
    • Anand Jain's avatar
      btrfs: mark btrfs_dev_replace_start as static · 54862d6d
      Anand Jain authored
      There isn't any other consumer other than in its own file dev-replace.c.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54862d6d
    • Anand Jain's avatar
      btrfs: harden agaist duplicate fsid on scanned devices · a9261d41
      Anand Jain authored
      It's not that impossible to imagine that a device OR a btrfs image is
      copied just by using the dd or the cp command. Which in case both the
      copies of the btrfs will have the same fsid. If on the system with
      automount enabled, the copied FS gets scanned.
      
      We have a known bug in btrfs, that we let the device path be changed
      after the device has been mounted. So using this loop hole the new
      copied device would appears as if its mounted immediately after it's
      been copied.
      
      For example:
      
      Initially.. /dev/mmcblk0p4 is mounted as /
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part /
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show
           Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
           Total devices 1 FS bytes used 1.40GiB
           devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
      
      Copy mmcblk0 to sda
      
        $ dd if=/dev/mmcblk0 of=/dev/sda
      
      And immediately after the copy completes the change in the device
      superblock is notified which the automount scans using btrfs device scan
      and the new device sda becomes the mounted root device.
      
        $ lsblk
        NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
        sda           8:0    1 14.9G  0 disk
        |-sda4        8:4    1    4G  0 part /
        |-sda2        8:2    1  500M  0 part
        |-sda3        8:3    1  256M  0 part
        `-sda1        8:1    1  256M  0 part
        mmcblk0     179:0    0 29.2G  0 disk
        |-mmcblk0p4 179:4    0    4G  0 part
        |-mmcblk0p2 179:2    0  500M  0 part /boot
        |-mmcblk0p3 179:3    0  256M  0 part [SWAP]
        `-mmcblk0p1 179:1    0  256M  0 part /boot/efi
      
        $ btrfs fi show /
          Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
          Total devices 1 FS bytes used 1.40GiB
          devid    1 size 4.00GiB used 3.00GiB path /dev/sda4
      
      The bug is quite nasty that you can't either unmount /dev/sda4 or
      /dev/mmcblk0p4. And the problem does not get solved until you take sda
      out of the system on to another system to change its fsid using the
      'btrfstune -u' command.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a9261d41
    • Hans van Kranenburg's avatar
      btrfs: introduce nparity raid_attr · b50836ed
      Hans van Kranenburg authored
      Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
      nparity field in raid_attr.
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b50836ed
    • Hans van Kranenburg's avatar
      btrfs: fix ncopies raid_attr for RAID56 · da612e31
      Hans van Kranenburg authored
      RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These
      values are not yet used anywhere so there's no change.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da612e31
    • Hans van Kranenburg's avatar
      btrfs: alloc_chunk: fix more DUP stripe size handling · baf92114
      Hans van Kranenburg authored
      Commit 92e222df "btrfs: alloc_chunk: fix DUP stripe size handling"
      fixed calculating the stripe_size for a new DUP chunk.
      
      However, the same calculation reappears a bit later, and that one was
      not changed yet. The resulting bug that is exposed is that the newly
      allocated device extents ('stripes') can have a few MiB overlap with the
      next thing stored after them, which is another device extent or the end
      of the disk.
      
      The scenario in which this can happen is:
      * The block device for the filesystem is less than 10GiB in size.
      * The amount of contiguous free unallocated disk space chosen to use for
        chunk allocation is 20% of the total device size, or a few MiB more or
        less.
      
      An example:
      - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
      - There's 1578MiB unallocated raw disk space left in one contiguous
        piece.
      
      In this case stripe_size is first calculated as 789MiB, (half of
      1578MiB).
      
      Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
      enter the if block. Now stripe_size value is immediately overwritten
      while calculating an adjusted value based on max_chunk_size, which ends
      up as 788MiB.
      
      Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
      actually more than the value we had before. However, the last comparison
      fails to detect this, because it's comparing the value with the total
      amount of free space, which is about twice the size of stripe_size.
      
      In the example above, this means that the resulting raw disk space being
      allocated is 1600MiB, while only a gap of 1578MiB has been found. The
      second device extent object for this DUP chunk will overlap for 22MiB
      with whatever comes next.
      
      The underlying problem here is that the stripe_size is reused all the
      time for different things. So, when entering the code in the if block,
      stripe_size is immediately overwritten with something else. If later we
      decide we want to have the previous value back, then the logic to
      compute it was copy pasted in again.
      
      With this change, the value in stripe_size is not unnecessarily
      destroyed, so the duplicated calculation is not needed any more.
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      baf92114
    • Hans van Kranenburg's avatar
      btrfs: alloc_chunk: improve chunk size variable name · 23f0ff1e
      Hans van Kranenburg authored
      The variable num_bytes is really a way too generic name for a variable
      in this function. There are a dozen other variables that hold a number
      of bytes as value.
      
      Give it a name that actually describes what it does, which is holding
      the size of the chunk that we're allocating.
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      23f0ff1e
    • Hans van Kranenburg's avatar
      btrfs: alloc_chunk: do not refurbish num_bytes · 2f29df4f
      Hans van Kranenburg authored
      The variable num_bytes is used to store the chunk length of the chunk
      that we're allocating. Do not reuse it for something really different in
      the same function.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f29df4f
    • Ethan Lien's avatar
      btrfs: use tagged writepage to mitigate livelock of snapshot · 3cd24c69
      Ethan Lien authored
      Snapshot is expected to be fast. But if there are writers steadily
      creating dirty pages in our subvolume, the snapshot may take a very long
      time to complete. To fix the problem, we use tagged writepage for
      snapshot flusher as we do in the generic write_cache_pages(), so we can
      omit pages dirtied after the snapshot command.
      
      This does not change the semantics regarding which data get to the
      snapshot, if there are pages being dirtied during the snapshotting
      operation.  There's a sync called before snapshot is taken in old/new
      case, any IO in flight just after that may be in the snapshot but this
      depends on other system effects that might still sync the IO.
      
      We do a simple snapshot speed test on a Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch since for a sequential write case,
      we omit nearly all pages dirtied after the snapshot command.
      
      For a multi writers, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we omit only half of the pages dirtied after snapshot command.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3cd24c69
    • Nikolay Borisov's avatar
      btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered · c629732d
      Nikolay Borisov authored
      This parameter was never used, yet was part of the interface of the
      function ever since its introduction as extent_io_ops::writepage_end_io_hook
      in e6dcd2dc ("Btrfs: New data=ordered implementation"). Now that
      NULL is passed everywhere as a value for this parameter let's remove it
      for good. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c629732d
    • Nikolay Borisov's avatar
      btrfs: Remove extent_page_data argument from writepage_delalloc · 8cc0237a
      Nikolay Borisov authored
      The only remaining use of the 'epd' argument in writepage_delalloc is
      to reference the extent_io_tree which was set in extent_writepages. Since
      it is guaranteed that page->mapping of any page passed to
      writepage_delalloc (and __extent_writepage as the sole caller) to be
      equal to that passed in extent_writepages we can directly get the
      io_tree via the already passed inode (which is also taken from
      page->mapping->host). No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cc0237a
    • Nikolay Borisov's avatar
      btrfs: Move epd::extent_locked check to writepage_delalloc's caller · 7789a55a
      Nikolay Borisov authored
      If epd::extent_locked is set then writepage_delalloc terminates. Make
      this a bit more apparent in the caller by simply bubbling the check up.
      This enables to remove epd as an argument to writepage_delalloc in a
      future patch. No functional change.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7789a55a
    • Nikolay Borisov's avatar
      btrfs: Check for missing device before bio submission in btrfs_map_bio · fc8a168a
      Nikolay Borisov authored
      Before btrfs_map_bio submits all stripe bios it does a number of checks
      to ensure the device for every stripe is present. However, it doesn't do
      a DEV_STATE_MISSING check, instead this is relegated to the lower level
      btrfs_schedule_bio (in the async submission case, sync submission
      doesn't check DEV_STATE_MISSING at all). Additionally
      btrfs_schedule_bios does the duplicate device->bdev check which has
      already been performed in btrfs_map_bio.
      
      This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
      removes the duplicate device->bdev check. Doing so ensures that no bio
      cloning/submission happens for both async/sync requests in the face of
      missing device. This makes the async io submission path slightly shorter
      in terms of instruction count. No functional changes.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc8a168a
    • Anand Jain's avatar
      btrfs: remove redundant replace_state init · ab457246
      Anand Jain authored
      dev_replace::replace_state has been set to
      BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function,
      So delete the line which sets replace_state = 0;
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab457246
    • Filipe Manana's avatar
      Btrfs: remove no longer used io_err from btrfs_log_ctx · 6d4cbf79
      Filipe Manana authored
      The io_err field of struct btrfs_log_ctx is no longer used after the
      recent simplification of the fast fsync path, where we now wait for
      ordered extents to complete before logging the inode. We did this in
      commit b5e6c3e1 ("btrfs: always wait on ordered extents at fsync
      time") and commit a2120a47 ("btrfs: clean up the left over
      logged_list usage") removed its last use.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6d4cbf79
    • Filipe Manana's avatar
      Btrfs: simpler and more efficient cleanup of a log tree's extent io tree · 59b0713a
      Filipe Manana authored
      We currently are in a loop finding each range (corresponding to a btree
      node/leaf) in a log root's extent io tree and then clean it up. This is a
      waste of time since we are traversing the extent io tree's rb_tree more
      times then needed (one for a range lookup and another for cleaning it up)
      without any good reason.
      
      We free the log trees when we are in the critical section of a transaction
      commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
      of great convenience to do everything as fast as possible in order to
      reduce the time we block other tasks from starting a new transaction.
      
      So fix this by traversing the extent io tree once and cleaning up all its
      records in one go while traversing it.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59b0713a
    • Nikolay Borisov's avatar
      btrfs: Adjust loop in free_extent_buffer · 46cc775e
      Nikolay Borisov authored
      The loop construct in free_extent_buffer was added in
      242e18c7 ("Btrfs: reduce lock contention on extent buffer locks")
      as means of reducing the times the eb lock is taken, the non-last ref
      count is decremented and lock is released. As the special handling
      of UNMAPPED extent buffers was removed now there is only one decrement
      op which is happening for EXTENT_BUFFER_UNMAPPED case.
      
      This commit modifies the loop condition so that in case of UNMAPPED
      buffers the eb's lock is taken only if we are 100% sure the eb is going
      to be freed by the current executor of the code. Additionally, remove
      superfluous ref count ops in btrfs test.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      46cc775e
    • Nikolay Borisov's avatar
      btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeing · 9cfc8ba7
      Nikolay Borisov authored
      Now that the whole of btrfs code has been audited for eb reference count
      management it's time to remove the hunk in free_extent_buffer that
      essentially considered the condition
      
        "eb->ref == 2 && EXTENT_BUFFER_DUMMY"
      
      to equal "eb->ref = 1". Also remove the last location
      which takes an extra reference count in alloc_test_extent_buffer.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9cfc8ba7