1. 04 Jun, 2021 20 commits
    • dm: introduce zone append emulation · bb37d772
      Damien Le Moal authored
      For zoned targets that cannot support zone append operations, implement
      an emulation using regular write operations. If the original BIO
      submitted by the user is a zone append operation, change its clone into
      a regular write operation directed at the target zone write pointer
      position.
      
      To do so, an array of write pointer offsets (write pointer position
      relative to the start of a zone) is added to struct mapped_device. All
      operations that modify a sequential zone write pointer (writes, zone
      reset, zone finish and zone append) are intercepted in __map_bio() and
      processed using the new function dm_zone_map_bio().
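      A minimal user-space sketch of the idea, assuming a toy device with four
      1024-sector zones: a per-zone write pointer offset array (modeled on
      zwp_offset) lets a zone append be rewritten into a regular write at the
      write pointer, while writes, resets and finishes keep the offsets in
      sync. All names here (emulate_map_bio(), struct sketch_bio) are
      hypothetical. The real code also handles zone locking, conventional
      zones and error recovery, and may advance the offset at a different
      point than this sketch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy device: NR_ZONES sequential zones of ZONE_SECTORS each. */
#define NR_ZONES     4
#define ZONE_SECTORS 1024u

enum sketch_op { OP_WRITE, OP_ZONE_APPEND, OP_ZONE_RESET, OP_ZONE_FINISH };

struct sketch_bio {
	enum sketch_op op;
	uint64_t sector;     /* for an append: start sector of the target zone */
	uint32_t nr_sectors;
};

/* Write pointer offset of each zone, relative to the zone start. */
static uint32_t zwp_offset[NR_ZONES];

/*
 * Returns false if the BIO must be failed. An append is turned into a
 * regular write at the write pointer; writes and appends advance the
 * tracked offset, reset and finish move it to the zone start or end.
 */
static bool emulate_map_bio(struct sketch_bio *bio)
{
	uint64_t zno = bio->sector / ZONE_SECTORS;
	uint64_t wp;

	assert(zno < NR_ZONES);
	wp = zno * ZONE_SECTORS + zwp_offset[zno];

	switch (bio->op) {
	case OP_ZONE_RESET:
		zwp_offset[zno] = 0;
		return true;
	case OP_ZONE_FINISH:
		zwp_offset[zno] = ZONE_SECTORS;
		return true;
	case OP_WRITE:
		if (bio->sector != wp)
			return false;	/* unaligned write */
		break;
	case OP_ZONE_APPEND:
		break;
	}

	if (zwp_offset[zno] + bio->nr_sectors > ZONE_SECTORS)
		return false;		/* would cross the zone end */

	if (bio->op == OP_ZONE_APPEND) {
		/* The emulation proper: append becomes a write at the wp. */
		bio->op = OP_WRITE;
		bio->sector = wp;
	}
	zwp_offset[zno] += bio->nr_sectors;
	return true;
}
```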
      
      Detection of the target's ability to natively support zone append
      operations is done from dm_table_set_restrictions() by calling the
      function dm_set_zones_restrictions(). A target that does not support
      zone append operations, either by explicitly declaring it using the new
      struct dm_target field zone_append_not_supported, or because the device
      table contains a non-zoned device, has its mapped device marked with the
      new flag DMF_ZONE_APPEND_EMULATED. The helper function
      dm_emulate_zone_append() is introduced to test a mapped device for this
      new flag.
      
      Atomicity of the zone write pointer tracking and updates is ensured using
      a zone write locking mechanism based on a bitmap. This is similar to
      the block layer method but based on BIOs rather than struct request.
      A zone write lock is taken in dm_zone_map_bio() for any clone BIO with
      an operation type that changes the BIO target zone write pointer
      position. The zone write lock is released if the clone BIO is failed
      before submission, or in dm_zone_endio() when the clone BIO
      completes.
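      The bitmap-based zone write lock can be sketched in user space with C11
      atomics standing in for the kernel bit operations: one bit per zone, a
      test-and-set to acquire and a clear to release. The names
      zone_write_trylock() and zone_write_unlock() are hypothetical, and what
      a caller does when the lock is busy is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_ZONES      128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_ZONES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per zone: set while a write-pointer-changing clone BIO owns the zone. */
static _Atomic unsigned long zwlock_bitmap[NR_WORDS];

/* Atomically test-and-set the zone bit; false means the lock is already held. */
static bool zone_write_trylock(unsigned int zno)
{
	unsigned long mask = 1UL << (zno % BITS_PER_WORD);

	assert(zno < NR_ZONES);
	return !(atomic_fetch_or(&zwlock_bitmap[zno / BITS_PER_WORD], mask) & mask);
}

/* Clear the zone bit: on failure before submission or at BIO completion. */
static void zone_write_unlock(unsigned int zno)
{
	unsigned long mask = 1UL << (zno % BITS_PER_WORD);

	assert(zno < NR_ZONES);
	atomic_fetch_and(&zwlock_bitmap[zno / BITS_PER_WORD], ~mask);
}
```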
      
      The zone write lock bitmap of the mapped device, together with a bitmap
      indicating zone types (conv_zones_bitmap) and the write pointer offset
      array (zwp_offset) are allocated and initialized with a full device zone
      report in dm_set_zones_restrictions() using the function
      dm_revalidate_zones().
      
      For failed operations that may have modified a zone write pointer, the
      zone write pointer offset is marked as invalid in dm_zone_endio().
      Zones with an invalid write pointer offset are checked and the write
      pointer updated using an internal report zone operation when the
      faulty zone is accessed again by the user.
      
      All functions added for this emulation have a minimal overhead for
      zoned targets natively supporting zone append operations. Regular
      device targets are also not affected. The added code also does not
      impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all
      dm zone related functions.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: rearrange core declarations for extended use from dm-zone.c · e2118b3c
      Damien Le Moal authored
      Move the definitions of struct dm_target_io, struct dm_io and the bits
      of the flags field of struct mapped_device from dm.c to dm-core.h to
      make them usable from dm-zone.c. For the same reason, declare
      dec_pending() in dm-core.h after renaming it to dm_io_dec_pending().
      And for symmetry of the function names, introduce the inline helper
      dm_io_inc_pending() instead of directly using atomic_inc() calls.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • block: introduce BIO_ZONE_WRITE_LOCKED bio flag · 9ffbbb43
      Damien Le Moal authored
      Introduce the BIO flag BIO_ZONE_WRITE_LOCKED to indicate that a BIO owns
      the write lock of the zone it is targeting. This is the counterpart of
      the struct request flag RQF_ZONE_WRITE_LOCKED.
      
      For now, this new BIO flag is reserved for zone write locking control
      in device mapper targets exposing a zoned block device. Since in this
      case the lock flag must not be propagated to the struct request that
      will be used to process the BIO, a BIO private flag is used rather than
      changing the RQF_ZONE_WRITE_LOCKED request flag into a common REQ_XXX
      flag that could be used for both BIO and request. This avoids conflicts
      down the stack with the block IO scheduler zone write locking
      (in mq-deadline).
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • block: introduce bio zone helpers · d0ea6bde
      Damien Le Moal authored
      Introduce the helper functions bio_zone_no() and bio_zone_is_seq().
      Both are the BIO counterparts of the request helpers blk_rq_zone_no()
      and blk_rq_zone_is_seq(), respectively returning the number of the
      target zone of a bio and true if the BIO target zone is sequential.
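      A sketch of what such helpers compute, assuming a power-of-two zone size
      (so the zone number is a right shift of the start sector) and a bitmap
      marking conventional zones. The sketch_* names and struct sketch_queue
      are hypothetical stand-ins for the request queue state, not the block
      layer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue model: power-of-two zone size and a conventional-zone bitmap. */
struct sketch_queue {
	uint64_t zone_sectors;             /* must be a power of two */
	unsigned int nr_zones;
	const uint8_t *conv_zones_bitmap;  /* bit set => conventional zone */
};

static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int l = 0;

	while (v >>= 1)
		l++;
	return l;
}

/* Number of the zone containing the BIO start sector. */
static unsigned int sketch_bio_zone_no(const struct sketch_queue *q, uint64_t bi_sector)
{
	assert(q->zone_sectors && !(q->zone_sectors & (q->zone_sectors - 1)));
	return (unsigned int)(bi_sector >> ilog2_u64(q->zone_sectors));
}

/* True if the BIO target zone is sequential, i.e. not marked conventional. */
static bool sketch_bio_zone_is_seq(const struct sketch_queue *q, uint64_t bi_sector)
{
	unsigned int zno = sketch_bio_zone_no(q, bi_sector);

	if (!q->conv_zones_bitmap)
		return true;
	return !(q->conv_zones_bitmap[zno / 8] & (1u << (zno % 8)));
}
```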
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • block: improve handling of all zones reset operation · 1ee533ec
      Damien Le Moal authored
      SCSI, ZNS and null_blk zoned devices support resetting all zones using
      a single command (REQ_OP_ZONE_RESET_ALL), as indicated by the device
      request queue flag QUEUE_FLAG_ZONE_RESETALL. This flag is not set for
      device mapper targets creating zoned devices. In this case, a user
      request for resetting all zones of a device is processed in
      blkdev_zone_mgmt() by issuing a REQ_OP_ZONE_RESET operation for each
      zone of the device. This leads to different behaviors of the
      BLKRESETZONE ioctl() depending on the target device support for the
      reset all operation. E.g.
      
      blkzone reset /dev/sdX
      
      will reset all zones of a SCSI device using a single command that will
      ignore conventional, read-only or offline zones.
      
      But a dm-linear device including conventional, read-only or offline
      zones cannot be reset in the same manner, because some of the single
      zone reset operations issued by blkdev_zone_mgmt() will fail. E.g.:
      
      blkzone reset /dev/dm-Y
      blkzone: /dev/dm-0: BLKRESETZONE ioctl failed: Remote I/O error
      
      To simplify application and tool development, unify the behavior of
      the all-zone reset operation by modifying blkdev_zone_mgmt() to not
      issue a zone reset operation for conventional, read-only and offline
      zones, thus mimicking what an actual reset-all device command does on a
      device supporting REQ_OP_ZONE_RESET_ALL. This emulation is done using
      the new function blkdev_zone_reset_all_emulated(). The zones needing a
      reset are identified using a bitmap that is initialized using a zone
      report. Since empty zones do not need a reset, also ignore these zones.
      The function blkdev_zone_reset_all() is introduced for block devices
      natively supporting reset all operations. blkdev_zone_mgmt() is modified
      to call either function to execute an all zone reset request.
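      The emulated reset-all can be sketched as two passes over a zone report:
      build a bitmap of the zones that need a reset (sequential zones that are
      not empty, read-only or offline), then issue one single-zone reset per
      set bit. The enums, struct zone_info and the record_reset() callback
      below are hypothetical stand-ins for the real zone report types and for
      REQ_OP_ZONE_RESET submission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the zone type/condition values of a zone report. */
enum ztype { ZT_CONVENTIONAL, ZT_SEQ_REQUIRED };
enum zcond { ZC_NOT_WP, ZC_EMPTY, ZC_IMP_OPEN, ZC_CLOSED, ZC_FULL, ZC_READONLY, ZC_OFFLINE };

struct zone_info {
	enum ztype type;
	enum zcond cond;
};

/* Skip the zones a native reset-all command ignores, plus empty zones. */
static bool zone_needs_reset(const struct zone_info *z)
{
	if (z->type == ZT_CONVENTIONAL)
		return false;
	switch (z->cond) {
	case ZC_EMPTY:
	case ZC_READONLY:
	case ZC_OFFLINE:
		return false;
	default:
		return true;
	}
}

/* Pass 1 fills need_reset (nr_zones bits, zeroed); pass 2 resets each set zone. */
static size_t reset_all_emulated(const struct zone_info *report, size_t nr_zones,
				 unsigned char *need_reset,
				 void (*reset_zone)(size_t zno))
{
	size_t zno, nr_resets = 0;

	assert(report && need_reset && reset_zone);
	for (zno = 0; zno < nr_zones; zno++)
		if (zone_needs_reset(&report[zno]))
			need_reset[zno / 8] |= 1u << (zno % 8);

	for (zno = 0; zno < nr_zones; zno++) {
		if (need_reset[zno / 8] & (1u << (zno % 8))) {
			reset_zone(zno);
			nr_resets++;
		}
	}
	return nr_resets;
}

/* Demo callback: record which zones would receive a single-zone reset. */
static size_t reset_log[64];
static size_t reset_log_len;

static void record_reset(size_t zno)
{
	assert(reset_log_len < 64);
	reset_log[reset_log_len++] = zno;
}
```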
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      [hch: split into multiple functions]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: Forbid requeue of writes to zones · bf14e2b2
      Damien Le Moal authored
      A target map method requesting the requeue of a bio with
      DM_MAPIO_REQUEUE or completing it with DM_ENDIO_REQUEUE can cause
      unaligned write errors if the bio is a write operation targeting a
      sequential zone. If a zoned target requests such a requeue, warn about
      it and kill the IO.
      
      The function dm_is_zone_write() is introduced to detect write operations
      to zoned targets.
      
      This change does not affect the target drivers supporting zoned devices
      and exposing a zoned device, namely dm-crypt, dm-linear and dm-flakey,
      as none of these targets ever requests a requeue.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: Introduce dm_report_zones() · 912e8875
      Damien Le Moal authored
      To simplify the implementation of the report_zones operation of a zoned
      target, introduce the function dm_report_zones() to set a target
      mapping start sector in struct dm_report_zones_args and call
      blkdev_report_zones(). This new function is exported and the report
      zones callback function dm_report_zones_cb() is not.
      
      dm-linear, dm-flakey and dm-crypt are modified to use dm_report_zones().
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: move zone related code to dm-zone.c · 7fc18728
      Damien Le Moal authored
      Move core and table code used for zoned targets and conditionally
      defined with #ifdef CONFIG_BLK_DEV_ZONED to the new file dm-zone.c.
      This file is conditionally compiled depending on CONFIG_BLK_DEV_ZONED.
      The small helper dm_set_zones_restrictions() is introduced to
      initialize the zone attributes of a mapped device request queue in
      dm_table_set_restrictions().
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: cleanup device_area_is_invalid() · dd73c320
      Damien Le Moal authored
      In device_area_is_invalid(), use bdev_is_zoned() instead of open
      coding the test on the zoned model returned by bdev_zoned_model().
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: Fix dm_accept_partial_bio() relative to zone management commands · 6842d264
      Damien Le Moal authored
      Fix dm_accept_partial_bio() to actually check that zone management
      commands are not passed as explained in the function documentation
      comment. Also, since a zone append operation cannot be split, add
      REQ_OP_ZONE_APPEND as a forbidden command.
      
      Blank lines are added around the group of BUG_ON() calls to make the
      code more legible.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm zoned: check zone capacity · bab68499
      Damien Le Moal authored
      The dm-zoned target cannot support zoned block devices with zones that
      have a capacity smaller than the zone size (e.g. NVMe zoned namespaces)
      because the current chunk zone mapping implementation assumes that
      zones and chunks have the same size, with all blocks usable.
      If a zoned drive is found to have zones with a capacity different from
      the zone size, fail the target initialization.
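      The check amounts to comparing, for every zone in a zone report, the
      usable capacity against the zone size. A sketch with a hypothetical
      struct zone_desc (values in 512-byte sectors); zones_fully_usable() is
      an illustrative name, not the dm-zoned function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of a zone report entry (512-byte sectors). */
struct zone_desc {
	uint64_t start;
	uint64_t len;       /* zone size */
	uint64_t capacity;  /* usable sectors, may be smaller than len */
};

/* True if every zone is fully usable; otherwise initialization should fail. */
static bool zones_fully_usable(const struct zone_desc *zones, size_t nr_zones)
{
	assert(zones || nr_zones == 0);
	for (size_t i = 0; i < nr_zones; i++)
		if (zones[i].capacity != zones[i].len)
			return false;
	return true;
}
```

      For example, a zone with len 524288 and capacity 491520 makes the
      function return false.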
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: stable@vger.kernel.org # v5.9+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm table: Constify static struct blk_ksm_ll_ops · ccde2cbf
      Rikard Falkeborn authored
      The only usage of dm_ksm_ll_ops is to make a copy of it to the ksm_ll_ops
      field in the blk_keyslot_manager struct. Make it const to allow the
      compiler to put it in read-only memory.
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: interrupt writeback if suspended · af4f6cab
      Mikulas Patocka authored
      If the DM device is suspended, interrupt the writeback sequence so
      that there is no excessive suspend delay.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm writecache: don't split bios when overwriting contiguous cache content · ee50cc19
      Mikulas Patocka authored
      If dm-writecache overwrites existing cached data, it splits the
      incoming bio into many block-sized bios. The I/O scheduler does merge
      these bios into one large request, but this needless splitting and
      merging degrades performance.
      
      Fix this by avoiding bio splitting if the cache target area that is
      being overwritten is contiguous.
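      The contiguity test can be sketched as follows, assuming a hypothetical
      array that gives, for each block-sized piece of the incoming bio, the
      cache block already holding that data: the longest leading run of
      consecutive cache blocks can be remapped as a single bio.
      contiguous_cache_run() is an illustrative name, not a dm-writecache
      function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * cache_block[i] is the cache block holding the i-th block-sized piece of
 * an overwrite. Returns how many leading pieces sit in consecutive cache
 * blocks and can therefore be handled as one bio instead of one per block.
 */
static size_t contiguous_cache_run(const uint64_t *cache_block, size_t nr)
{
	size_t n = 1;

	assert(cache_block && nr > 0);
	while (n < nr && cache_block[n] == cache_block[n - 1] + 1)
		n++;
	return n;
}
```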
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm kcopyd: avoid spin_lock_irqsave from process context · 6bcd658f
      Mikulas Patocka authored
      The functions "pop", "push_head", "do_work" can only be called from
      process context. Therefore, replace spin_lock_irq{save,restore} with
      spin_{lock,unlock}_irq.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm kcopyd: avoid useless atomic operations · db2351eb
      Mikulas Patocka authored
      The functions set_bit and clear_bit are atomic. We don't need
      atomicity when setting up dm-kcopyd flags, so replace these calls
      with direct manipulation of the flags.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm space map disk: cache a small number of index entries · 6b06dd5a
      Joe Thornber authored
      The disk space map stores its index entries in a btree. These are
      accessed very frequently, so caching a few of them makes a big
      difference to performance.
      
      With this change, provisioning a new block takes roughly 20% less CPU.
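      A sketch of the general technique, a small direct-mapped cache placed in
      front of an expensive btree lookup, assuming a hypothetical read-only
      index entry type. The cache size, field names and the btree_lookup()
      stand-in are illustrative; a cache of mutable entries would also need
      dirty tracking and writeback, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IE_CACHE_SIZE 64  /* hypothetical: small, power of two */

/* Hypothetical index entry: location of a bitmap block and its free count. */
struct index_entry {
	uint64_t blocknr;
	uint32_t nr_free;
};

struct ie_cache_slot {
	bool valid;
	uint64_t index;
	struct index_entry ie;
};

static struct ie_cache_slot ie_cache[IE_CACHE_SIZE];
static unsigned int slow_lookups; /* counts calls to the expensive path */

/* Stand-in for the btree lookup of index entry `index`. */
static struct index_entry btree_lookup(uint64_t index)
{
	struct index_entry ie = { .blocknr = 1000 + index, .nr_free = 0 };

	slow_lookups++;
	return ie;
}

/* Direct-mapped: entry `index` can only live in slot index % IE_CACHE_SIZE. */
static struct index_entry get_index_entry(uint64_t index)
{
	struct ie_cache_slot *slot;

	assert((IE_CACHE_SIZE & (IE_CACHE_SIZE - 1)) == 0);
	slot = &ie_cache[index & (IE_CACHE_SIZE - 1)];
	if (!slot->valid || slot->index != index) {
		/* Miss: fetch from the btree and evict the previous occupant. */
		slot->ie = btree_lookup(index);
		slot->index = index;
		slot->valid = true;
	}
	return slot->ie;
}
```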
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm space maps: improve performance with inc/dec on ranges of blocks · be500ed7
      Joe Thornber authored
      When we break sharing on btree nodes we typically need to increment
      the reference counts of every value held in the node.  This can
      cause a lot of repeated calls to the space maps.  Fix this by changing
      the interface to the space map inc/dec methods to take ranges of
      adjacent blocks to be operated on.
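      A toy illustration of why a range interface helps: when a node's values
      contain runs of adjacent block numbers, each run costs one space map
      call instead of one call per block. struct toy_sm and its functions are
      hypothetical, and the half-open range [b, e) is this sketch's own
      choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BLOCKS 1024

/* Toy in-memory space map: one reference count per block. */
struct toy_sm {
	uint32_t ref_count[NR_BLOCKS];
	unsigned int calls; /* number of calls into the space map */
};

/* Range interface: increment every block in [b, e) with a single call. */
static void toy_sm_inc_blocks(struct toy_sm *sm, uint64_t b, uint64_t e)
{
	assert(b <= e && e <= NR_BLOCKS);
	sm->calls++;
	for (uint64_t i = b; i < e; i++)
		sm->ref_count[i]++;
}

/* Breaking sharing on a node: group adjacent values into runs, one call each. */
static void inc_node_values(struct toy_sm *sm, const uint64_t *values, size_t nr)
{
	size_t i = 0;

	while (i < nr) {
		size_t j = i + 1;

		while (j < nr && values[j] == values[j - 1] + 1)
			j++;
		toy_sm_inc_blocks(sm, values[i], values[j - 1] + 1);
		i = j;
	}
}
```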
      
      For installations using many snapshots, this reduces the CPU overhead
      of fundamental operations, such as provisioning a new block or
      deleting a snapshot, by as much as 10 times.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm space maps: don't reset space map allocation cursor when committing · 5faafc77
      Joe Thornber authored
      The current commit code resets the place where the search for free blocks
      will begin back to the start of the metadata device.  There are a couple
      of repercussions to this:
      
      - The first allocation after the commit is likely to take longer than
        normal as it searches for a free block in an area that is likely to
        have very few free blocks (if any).
      
      - Any free blocks it finds will have been recently freed.  Reusing them
        means we have fewer old copies of the metadata to aid recovery from
        hardware error.
      
      Fix these issues by leaving the cursor alone, only resetting when the
      search hits the end of the metadata device.
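      A toy next-fit allocator shows the behavior described: the search starts
      at a cursor, wraps to the beginning only after reaching the end, and a
      commit no longer moves the cursor back to block 0, so recently freed
      blocks are not immediately reused. The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_META_BLOCKS 16

/* Toy metadata allocator: 'used' marks allocated blocks, 'cursor' starts the search. */
struct toy_alloc {
	bool used[NR_META_BLOCKS];
	uint32_t cursor;
};

/* Next-fit: scan from the cursor to the end, then wrap to block 0. */
static int64_t toy_alloc_block(struct toy_alloc *a)
{
	assert(a->cursor < NR_META_BLOCKS);
	for (uint32_t n = 0; n < NR_META_BLOCKS; n++) {
		uint32_t b = (a->cursor + n) % NR_META_BLOCKS;

		if (!a->used[b]) {
			a->used[b] = true;
			a->cursor = (b + 1) % NR_META_BLOCKS;
			return b;
		}
	}
	return -1; /* no free block */
}
```

      Because committing leaves the cursor untouched, a block freed behind the
      cursor is only reconsidered after the search wraps around.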
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm btree: improve btree residency · 4eafdb15
      Joe Thornber authored
      This commit improves the residency of btrees built in the metadata for
      dm-thin and dm-cache.
      
      When inserting a new entry into a full btree node the current code
      splits the node into two.  This can result in very many half full nodes,
      particularly if the insertions are occurring in an ascending order (as
      happens in dm-thin with large writes).
      
      With this commit, when we insert into a full node we first try to move
      some entries to a neighbouring node that has space; failing that, we
      try to split two neighbouring nodes into three.
      
      Results are given below.  'Residency' is the average fullness of the
      nodes, as a percentage.  Average instruction counts for the operations
      are given to show that the extra processing has little overhead.
      
                               +--------------------------+--------------------------+
                               |         Before           |         After            |
      +------------+-----------+-----------+--------------+-----------+--------------+
      |    Test    |   Phase   | Residency | Instructions | Residency | Instructions |
      +------------+-----------+-----------+--------------+-----------+--------------+
      | Ascending  | insert    |        50 |         1876 |        96 |         1930 |
      |            | overwrite |        50 |         1789 |        96 |         1746 |
      |            | lookup    |        50 |          778 |        96 |          778 |
      | Descending | insert    |        50 |         3024 |        96 |         3181 |
      |            | overwrite |        50 |         1789 |        96 |         1746 |
      |            | lookup    |        50 |          778 |        96 |          778 |
      | Random     | insert    |        68 |         3800 |        84 |         3736 |
      |            | overwrite |        68 |         4254 |        84 |         3911 |
      |            | lookup    |        68 |          779 |        84 |          779 |
      | Runs       | insert    |        63 |         2546 |        82 |         2815 |
      |            | overwrite |        63 |         2013 |        82 |         1986 |
      |            | lookup    |        63 |          778 |        82 |          779 |
      +------------+-----------+-----------+--------------+-----------+--------------+
      
         Ascending - keys are inserted in ascending order.
         Descending - keys are inserted in descending order.
         Random - keys are inserted in random order.
         Runs - keys are split into ascending runs of ~20 length.  Then
                the runs are shuffled.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 25 May, 2021 3 commits
  3. 23 May, 2021 17 commits