1. 10 Jul, 2024 11 commits
  2. 03 Jul, 2024 8 commits
    • dm verity: add support for signature verification with platform keyring · 6fce1f40
      Luca Boccassi authored
      Add a new configuration CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_PLATFORM_KEYRING
      that enables verifying dm-verity signatures using the platform keyring,
      which is populated using the UEFI DB certificates. This is useful for
      self-enrolled systems that do not use MOK, as the secondary keyring which
      is already used for verification, if the relevant kconfig is enabled, is
      linked to the machine keyring, which gets its certificates loaded from MOK.
      On datacenter/virtual/cloud deployments it is more common to deploy one's
      own certificate chain directly in DB on first boot in unattended mode,
      rather than relying on MOK, as the latter typically requires interactive
      authentication to enroll, and is more suited for personal machines.
      
      Default to the same value as DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING
      if not otherwise specified, as it is likely that if one wants to use
      MOK certificates to verify dm-verity volumes, DB certificates are
      going to be used too. Keys in DB are allowed to load a full kernel
      already anyway, so they are already highly privileged.
      Signed-off-by: Luca Boccassi <bluca@debian.org>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      6fce1f40
    • dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume · 3199a34b
      Benjamin Marzinski authored
      dm-raid devices will occasionally trigger the following warning when
      being resumed after a table load because MD_RECOVERY_RUNNING is set:
      
      WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid]
      
      The failing check is:
      WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
      
      This check is designed to make sure that the sync thread isn't
      registered, but md_check_recovery can set MD_RECOVERY_RUNNING without
      the sync_thread ever getting registered. Instead of checking if
      MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL.
      
      Fixes: 16c4770c ("dm-raid: really frozen sync_thread during suspend")
      Suggested-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      3199a34b
    • dm-verity: hash blocks with shash import+finup when possible · b76ad884
      Eric Biggers authored
      Currently dm-verity computes the hash of each block by using multiple
      calls to the "ahash" crypto API.  While the exact sequence depends on
      the chosen dm-verity settings, in the vast majority of cases it is:
      
          1. crypto_ahash_init()
          2. crypto_ahash_update() [salt]
          3. crypto_ahash_update() [data]
          4. crypto_ahash_final()
      
      This is inefficient for two main reasons:
      
      - It makes multiple indirect calls, which is expensive on modern CPUs
        especially when mitigations for CPU vulnerabilities are enabled.
      
        Since the salt is the same across all blocks on a given dm-verity
        device, a much more efficient sequence would be to do an import of the
        pre-salted state, then a finup.
      
      - It uses the ahash (asynchronous hash) API, despite the fact that
        CPU-based hashing is almost always used in practice, and therefore it
        experiences the overhead of the ahash-based wrapper for shash.
      
        Because dm-verity was intentionally converted to ahash to support
        off-CPU crypto accelerators, a full reversion to shash might not be
        acceptable.  Yet, we should still provide a fast path for shash with
        the most common dm-verity settings.
      
        Another reason for shash over ahash is that the upcoming multibuffer
        hashing support, which is specific to CPU-based hashing, is much
        better suited for shash than for ahash.  Supporting it via ahash would
        add significant complexity and overhead.  And it's not possible for
        the "same" code to properly support both multibuffer hashing and HW
        accelerators at the same time anyway, given the different computation
        models.  Unfortunately there will always be code specific to each
        model needed (for users who want to support both).
      
      Therefore, this patch adds a new shash import+finup based fast path to
      dm-verity.  It is used automatically when appropriate.  This makes
      dm-verity optimized for what the vast majority of users want: CPU-based
      hashing with the most common settings, while still retaining support for
      rarer settings and off-CPU crypto accelerators.
      
      In benchmarks with veritysetup's default parameters (SHA-256, 4K data
      and hash block sizes, 32-byte salt), which also match the parameters
      that Android currently uses, this patch improves block hashing
      performance by about 15% on x86_64 using the SHA-NI instructions, or by
      about 5% on arm64 using the ARMv8 SHA2 instructions.  On x86_64 roughly
      two-thirds of the improvement comes from the use of import and finup,
      while the remaining third comes from the switch from ahash to shash.
      
      Note that another benefit of using "import" to handle the salt is that
      if the salt size is equal to the input size of the hash algorithm's
      compression function, e.g. 64 bytes for SHA-256, then the performance is
      exactly the same as no salt.  This doesn't seem to be much better than
      veritysetup's current default of 32-byte salts, due to the way SHA-256's
      finalization padding works, but it should be marginally better.
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      b76ad884
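The difference between the two call sequences described in this commit can be sketched in userspace C. This is only a toy model under stated assumptions: a 64-bit FNV-1a hash stands in for SHA-256, "import" is modeled as a plain struct copy, and every name (toy_hash_state, presalt, hash_block_slow, hash_block_fast) is hypothetical rather than part of the kernel crypto API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy streaming hash (64-bit FNV-1a). It only illustrates the call
 * pattern; it is NOT SHA-256 and not the kernel shash API. */
struct toy_hash_state {
    uint64_t h;
};

void toy_init(struct toy_hash_state *s)
{
    s->h = 0xcbf29ce484222325ULL; /* FNV-1a offset basis */
}

void toy_update(struct toy_hash_state *s, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        s->h ^= p[i];
        s->h *= 0x100000001b3ULL; /* FNV-1a prime */
    }
}

uint64_t toy_final(const struct toy_hash_state *s)
{
    return s->h;
}

/* Slow path: init, update(salt), update(data), final for every block. */
uint64_t hash_block_slow(const uint8_t *salt, size_t salt_len,
                         const uint8_t *data, size_t len)
{
    struct toy_hash_state s;

    toy_init(&s);
    toy_update(&s, salt, salt_len);
    toy_update(&s, data, len);
    return toy_final(&s);
}

/* Done once per device: absorb the salt and keep the resulting state. */
struct toy_hash_state presalt(const uint8_t *salt, size_t salt_len)
{
    struct toy_hash_state s;

    toy_init(&s);
    toy_update(&s, salt, salt_len);
    return s;
}

/* Fast path: "import" the pre-salted state (a struct copy here), then
 * do update and final back to back, which models "finup". */
uint64_t hash_block_fast(const struct toy_hash_state *salted,
                         const uint8_t *data, size_t len)
{
    assert(salted != NULL);
    struct toy_hash_state s = *salted;

    toy_update(&s, data, len);
    return toy_final(&s);
}
```

Both paths produce the same digest for the same salt and data; the fast path simply skips re-hashing the salt for every block.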
    • dm-verity: make verity_hash() take dm_verity_io instead of ahash_request · e8f5e933
      Eric Biggers authored
      In preparation for adding shash support to dm-verity, change
      verity_hash() to take a pointer to a struct dm_verity_io instead of a
      pointer to the ahash_request embedded inside it.
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      e8f5e933
    • dm-verity: always "map" the data blocks · cf715f4b
      Eric Biggers authored
      dm-verity needs to access data blocks by virtual address in three
      different cases (zeroization, recheck, and forward error correction),
      and one more case (shash support) is coming.  Since it's guaranteed that
      dm-verity data blocks never cross pages, and kmap_local_page and
      kunmap_local are no-ops on modern platforms anyway, just unconditionally
      "map" every data block's page and work with the virtual buffer directly.
      This simplifies the code and eliminates unnecessary overhead.
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      cf715f4b
    • dm-verity: provide dma_alignment limit in io_hints · 09d14308
      Eric Biggers authored
      Since Linux v6.1, some filesystems support submitting direct I/O that is
      aligned to only dma_alignment instead of the logical_block_size
      alignment that was required before.  I/O that is not aligned to the
      logical_block_size is difficult to handle in device-mapper targets that
      do cryptographic processing of data, as it makes the units of data that
      are hashed or encrypted possibly be split across pages, creating rarely
      used and rarely tested edge cases.
      
      As such, dm-crypt and dm-integrity have already opted out of this by
      setting dma_alignment to 'logical_block_size - 1'.
      
      Although dm-verity does have code that handles these cases (or at least
      is intended to do so), supporting direct I/O with such a low amount of
      alignment is not really useful on dm-verity devices.  So, opt dm-verity
      out of it too so that it's not necessary to handle these edge cases.
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      09d14308
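To make the effect of setting dma_alignment to 'logical_block_size - 1' concrete, here is a small userspace sketch. It assumes dma_alignment is used as a bit mask applied to both the buffer address and the length; the helper names (toy_dio_aligned, toy_verity_dma_alignment) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* dma_alignment is treated as a mask: the low bits that must be zero in
 * both the buffer address and the I/O length. */
bool toy_dio_aligned(uintptr_t addr, size_t len, unsigned int dma_alignment)
{
    return ((addr | len) & dma_alignment) == 0;
}

/* The opt-out described above: require full logical-block alignment. */
unsigned int toy_verity_dma_alignment(unsigned int logical_block_size)
{
    /* Logical block sizes are powers of two. */
    assert(logical_block_size != 0 &&
           (logical_block_size & (logical_block_size - 1)) == 0);
    return logical_block_size - 1;
}
```

With a 4096-byte logical block size the mask becomes 4095, so a buffer that is only 512-byte aligned no longer qualifies for direct I/O on the device.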
    • dm-verity: make real_digest and want_digest fixed-length · a7ddb3d4
      Eric Biggers authored
      Change the digest fields in struct dm_verity_io from variable-length to
      fixed-length, since their maximum length is fixed at
      HASH_MAX_DIGESTSIZE, i.e. 64 bytes, which is not too big.  This is
      simpler and makes the fields a bit faster to access.
      
      (HASH_MAX_DIGESTSIZE did not exist when this code was written, which may
      explain why it wasn't used.)
      
      This makes the verity_io_real_digest() and verity_io_want_digest()
      functions trivial, but this patch leaves them in place temporarily since
      most of their callers will go away in a later patch anyway.
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      a7ddb3d4
    • dm-verity: move data hash mismatch handling into its own function · e41e52e5
      Eric Biggers authored
      Move the code that handles mismatches of data block hashes into its own
      function so that it doesn't clutter up verity_verify_io().
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      e41e52e5
  3. 02 Jul, 2024 6 commits
  4. 26 Jun, 2024 2 commits
    • dm: optimize flushes · aaa53168
      Mikulas Patocka authored
      Device mapper sends flush bios to all the targets, and each target sends
      them to the underlying device. That may be inefficient: for example, if a
      table contains 10 linear targets pointing to the same physical device,
      device mapper would send 10 flush bios to that device, even though one
      bio would be sufficient.
      
      This commit optimizes the flush behavior. It introduces a per-target
      variable flush_bypasses_map, which is set when the target supports flush
      optimization; currently, the dm-linear and dm-stripe targets support it.
      When all the targets in a table have flush_bypasses_map set,
      flush_bypasses_map is also set on the table. __send_empty_flush tests
      whether the table has flush_bypasses_map set. If it does, no flush bios
      are sent to the targets via the "map" method; instead, the list
      dm_table->devices is iterated and a flush bio is sent to each member of
      the list.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@kernel.org>
      Suggested-by: Yang Yang <yang.yang@vivo.com>
      aaa53168
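The flush-count reduction described in this commit can be modeled with a short userspace sketch. It is a simplification under stated assumptions: each target references exactly one device, each target sends a single flush bio, and the struct and function names (toy_target, toy_count_flush_bios) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dm target that references a single device. */
struct toy_target {
    int dev_id;              /* hypothetical identifier of the backing device */
    bool flush_bypasses_map; /* target supports the flush optimization */
};

/* Count the flush bios issued for one empty flush on a table.
 * Without the optimization: one per target, via the "map" method.
 * With it: one per distinct device, as if iterating dm_table->devices. */
size_t toy_count_flush_bios(const struct toy_target *t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!t[i].flush_bypasses_map)
            return n; /* a single non-supporting target disables the bypass */
    }

    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        bool seen = false;
        for (size_t j = 0; j < i; j++) {
            if (t[j].dev_id == t[i].dev_id)
                seen = true;
        }
        if (!seen)
            unique++;
    }
    return unique;
}
```

For the example in the commit message, ten linear targets on the same device produce one flush bio instead of ten.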
    • block: change rq_integrity_vec to respect the iterator · cf546dd2
      Mikulas Patocka authored
      If we allocate a bio that is larger than NVMe maximum request size,
      attach integrity metadata to it and send it to the NVMe subsystem, the
      integrity metadata will be corrupted.
      
      Splitting the bio works correctly. The function bio_split will clone the
      bio, trim the iterator of the first bio and advance the iterator of the
      second bio.
      
      However, the function rq_integrity_vec has a bug - it returns the first
      vector of the bio's metadata and completely disregards the metadata
      iterator that was advanced when the bio was split. Thus, the second bio
      uses the same metadata as the first bio and this leads to metadata
      corruption.
      
      This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec
      instead of returning the first vector. mp_bvec_iter_bvec reads the
      iterator and uses it to build a bvec for the current position in the
      iterator.
      
      The "queue_max_integrity_segments(rq->q) > 1" check was removed, because
      the updated rq_integrity_vec function works correctly with multiple
      segments.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
      Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cf546dd2
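The bug and the fix described in this commit can be illustrated with a toy vector iterator. This is a sketch, not the kernel code: toy_vec and toy_iter are hypothetical simplifications of struct bio_vec and struct bvec_iter (no page pointers), and toy_current_vec only approximates what mp_bvec_iter_bvec computes.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: a vector is an (offset, len) pair, and the
 * iterator tracks its position inside an array of vectors. */
struct toy_vec {
    size_t offset;
    size_t len;
};

struct toy_iter {
    size_t idx;  /* index of the current vector */
    size_t done; /* bytes already consumed in the current vector */
    size_t size; /* bytes remaining overall */
};

/* Buggy behavior: always return the first vector, ignoring the iterator. */
struct toy_vec toy_first_vec(const struct toy_vec *v, const struct toy_iter *it)
{
    (void)it;
    return v[0];
}

/* Fixed behavior: build a vector for the iterator's current position.
 * Precondition: it->size > 0, so v[it->idx] is a valid element. */
struct toy_vec toy_current_vec(const struct toy_vec *v, const struct toy_iter *it)
{
    assert(it->size > 0);
    struct toy_vec out;
    size_t left = v[it->idx].len - it->done;

    out.offset = v[it->idx].offset + it->done;
    out.len = left < it->size ? left : it->size;
    return out;
}

/* Advance the iterator, as happens to the second half of a split bio.
 * Precondition: bytes <= it->size. */
void toy_iter_advance(const struct toy_vec *v, struct toy_iter *it, size_t bytes)
{
    assert(bytes <= it->size);
    it->size -= bytes;
    bytes += it->done;
    while (bytes && bytes >= v[it->idx].len) {
        bytes -= v[it->idx].len;
        it->idx++;
    }
    it->done = bytes;
}
```

After advancing by half of a 4096-byte vector, the buggy helper still reports offset 0, while the iterator-aware helper reports offset 2048, which is why the second bio of a split previously reused the first bio's metadata.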
  5. 24 Jun, 2024 2 commits
  6. 23 Jun, 2024 1 commit
  7. 21 Jun, 2024 4 commits
  8. 20 Jun, 2024 6 commits
    • nvme: Atomic write support · 5f9bbea0
      Alan Adamson authored
      Add support to set block layer request_queue atomic write limits. The
      limits will be derived from either the namespace or controller atomic
      parameters.
      
      NVMe atomic-related parameters are grouped into "normal" and "power-fail"
      (or PF) classes of parameter. For atomic write support, only PF parameters
      are of interest. The "normal" parameters are concerned with racing reads
      and writes (which also applies to PF). See NVM Command Set Specification
      Revision 1.0d section 2.1.4 for reference.
      
      Whether to use per namespace or controller atomic parameters is decided by
      NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
      Structure, NVM Command Set.
      
      NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
      are provided for a write which straddles this per-lba space boundary. The
      block layer merging policy is such that no merges may occur in which the
      resultant request would straddle such a boundary.
      
      Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
      atomic boundary rule. In addition, again unlike SCSI, there is no
      dedicated atomic write command - a write which adheres to the atomic size
      limit and boundary is implicitly atomic.
      
      If NSFEAT bit 1 is set, the following parameters are of interest:
      - NAWUPF (Namespace Atomic Write Unit Power Fail)
      - NABSPF (Namespace Atomic Boundary Size Power Fail)
      - NABO (Namespace Atomic Boundary Offset)
      
      and we set request_queue limits as follows:
      - atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
      - atomic_write_max_bytes = NAWUPF
      - atomic_write_boundary = NABSPF
      
      In the unlikely scenario that NABO is non-zero, atomic writes will not
      be supported at all, as dealing with this adds extra complexity. This
      policy may change in the future.
      
      In all cases, atomic_write_unit_min is set to the logical block size.
      
      If NSFEAT bit 1 is unset, the following parameter is of interest:
      - AWUPF (Atomic Write Unit Power Fail)
      
      and we set request_queue limits as follows:
      - atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
      - atomic_write_max_bytes = AWUPF
      - atomic_write_boundary = 0
      
      A new function, nvme_valid_atomic_write(), is also called from the
      submission path to verify that a request which has been submitted to the
      driver will actually be executed atomically. As mentioned, there is no
      dedicated NVMe atomic write command (which may error for a command which
      exceeds the controller atomic write limits).
      
      Note on NABSPF:
      There seems to be some vagueness in the spec as to whether NABSPF applies
      for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
      and how it is affected by bit 1. However Figure 4 does tell to check Figure
      97 for info about per-namespace parameters, which NABSPF is, so it is
      implied. However currently nvme_update_disk_info() does check namespace
      parameter NABO regardless of this bit.
      Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      jpg: total rewrite
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5f9bbea0
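The limit derivation listed in this commit message can be summarized in a userspace sketch. It makes several assumptions: all inputs are already converted to bytes, the NABO check applies only when NSFEAT bit 1 is set (as the message groups it), and the struct and function names (toy_atomic_limits, toy_nvme_atomic_limits, toy_rounddown_pow2) are hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_atomic_limits {
    uint32_t unit_min;  /* atomic_write_unit_min */
    uint32_t unit_max;  /* atomic_write_unit_max */
    uint32_t max_bytes; /* atomic_write_max_bytes */
    uint32_t boundary;  /* atomic_write_boundary */
};

/* Largest power of two that is <= x (0 for x == 0). */
uint32_t toy_rounddown_pow2(uint32_t x)
{
    if (x == 0)
        return 0;
    uint32_t p = 1;
    while (p <= x / 2)
        p <<= 1;
    return p;
}

/* All sizes are assumed to be in bytes already. */
struct toy_atomic_limits toy_nvme_atomic_limits(bool nsfeat_bit1, uint32_t lbs,
                                                uint32_t nawupf, uint32_t nabspf,
                                                uint32_t nabo, uint32_t awupf)
{
    struct toy_atomic_limits l = {0, 0, 0, 0};

    if (nsfeat_bit1) {
        if (nabo != 0)
            return l; /* non-zero NABO: no atomic write support at all */
        l.unit_max = toy_rounddown_pow2(nawupf);
        l.max_bytes = nawupf;
        l.boundary = nabspf;
    } else {
        l.unit_max = toy_rounddown_pow2(awupf);
        l.max_bytes = awupf;
        l.boundary = 0;
    }
    l.unit_min = lbs; /* in all supported cases: the logical block size */
    return l;
}
```

Note how a non-power-of-two NAWUPF such as 24576 bytes yields a unit maximum of 16384 bytes while the overall maximum stays at 24576 bytes.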
    • scsi: scsi_debug: Atomic write support · 84f3a3c0
      John Garry authored
      Add initial support for atomic writes.
      
      As is the standard method, feed device properties via module parameters,
      those being:
      - atomic_max_size_blks
      - atomic_alignment_blks
      - atomic_granularity_blks
      - atomic_max_size_with_boundary_blks
      - atomic_max_boundary_blks
      
      These just match sbc4r22 section 6.6.4 - Block limits VPD page.
      
      We just support ATOMIC WRITE (16).
      
      The major change in the driver is how we lock the device for RW accesses.
      
      Currently the driver uses a per-device lock for accessing device metadata
      and "media" data (calls to do_device_access()) atomically for the duration
      of the whole read/write command.
      
      This does not suit verifying atomic writes, because currently all
      reads/writes are atomic, so using atomic writes does not prove anything.
      
      Change the device access model so that regular writes are only atomic on
      a per-sector basis, while reads and atomic writes are fully atomic.
      
      As mentioned, since accessing metadata and device media is atomic,
      continue to have regular writes involving metadata - like discard or PI -
      as atomic. We can improve this later.
      
      Currently we only support the model where we wait for overlapping ongoing
      reads or writes to complete before commencing an atomic write. This is
      described in section 4.29.3.2 of the SBC. However, we simplify things and
      wait for all accesses to complete (when issuing an atomic write).
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-10-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      84f3a3c0
    • scsi: sd: Atomic write support · bf4ae8f2
      John Garry authored
      Support is divided into two main areas:
      - reading VPD pages and setting sdev request_queue limits
      - support WRITE ATOMIC (16) command and tracing
      
      The relevant block limits VPD page needs to be read to allow the block layer
      request_queue atomic write limits to be set. These VPD page limits are
      described in sbc4r22 section 6.6.4 - Block limits VPD page.
      
      There are five limits of interest:
      - MAXIMUM ATOMIC TRANSFER LENGTH
      - ATOMIC ALIGNMENT
      - ATOMIC TRANSFER LENGTH GRANULARITY
      - MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
      - MAXIMUM ATOMIC BOUNDARY SIZE
      
      MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
      (16) command. It will not be greater than the device MAXIMUM TRANSFER
      LENGTH.
      
      ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
      alignment and length values for an atomic write in terms of logical blocks.
      
      Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
      a per-IO boundary granularity. The maximum boundary size is specified in
      MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
      WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
      command can be found in sbc4r22 section 5.48. This boundary value is the
      granularity size at which the device may atomically write the data. A value
      of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
      be atomically written together.
      
      MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
      length if a non-zero boundary value is set.
      
      For atomic write support, the WRITE ATOMIC (16) boundary is not of much
      interest, as the block layer expects each request submitted to be executed
      atomically. However, the SCSI spec does leave itself open to a quirky
      scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
      TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
      non-zero. This case will be supported.
      
      To set the block layer request_queue atomic write capabilities, sanitize
      the VPD page limits and set limits as follows:
      - atomic_write_unit_min is derived from granularity and alignment values.
        If no granularity value is set, use the physical block size
      - atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
        the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
        limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
        atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
        set for this scenario.
      - atomic_write_boundary_bytes is set to zero always
      
      SCSI also supports a WRITE ATOMIC (32) command, which is for type 2
      protection enabled. This is not going to be supported now, so check for
      T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
      
      To handle an atomic write request, add support for WRITE ATOMIC (16)
      command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
      is checked here for encoding ATOMIC BOUNDARY field.
      
      Trace info is also added for WRITE_ATOMIC_16 command.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-9-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bf4ae8f2
    • block: Add fops atomic write support · caf336f8
      John Garry authored
      Support atomic writes by submitting a single BIO with REQ_ATOMIC set.
      
      It must be ensured that the atomic write adheres to its rules, like
      naturally aligned offset, so call blkdev_dio_invalid() ->
      blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to
      blkdev_dio_invalid()] for this purpose. The BIO submission path currently
      checks for atomic writes which are too large, so no need to check here.
      
      In blkdev_direct_IO(), if nr_pages exceeds BIO_MAX_VECS, then we cannot
      produce a single BIO, so return an error in this case.
      
      Finally set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes
      and the associated file flag is for O_DIRECT.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-8-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      caf336f8
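The kind of rule check this commit refers to can be sketched as follows. The commit message only names "naturally aligned offset" explicitly; the power-of-two length and the unit min/max bounds are assumptions drawn from the limits described in the core atomic write commit, and the function names (toy_is_pow2, toy_atomic_write_valid) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

bool toy_is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Returns true if a write of len bytes at byte offset pos could be
 * accepted as an atomic write under the assumed rules. */
bool toy_atomic_write_valid(uint64_t pos, uint64_t len,
                            uint64_t unit_min, uint64_t unit_max)
{
    if (!toy_is_pow2(len))
        return false;             /* assumed: length is a power of two */
    if (pos & (len - 1))
        return false;             /* offset must be naturally aligned */
    if (len < unit_min || len > unit_max)
        return false;             /* assumed: within the unit limits */
    return true;
}
```

For example, with limits of 4096 and 16384 bytes, an 8192-byte write at offset 8192 passes, while the same write at offset 4096 fails the natural alignment rule.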
    • block: Add atomic write support for statx · 9abcfbd2
      Prasad Singamsetty authored
      Extend the statx system call to return additional info for atomic write
      support if the specified file is a block device.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-7-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9abcfbd2
    • block: Add core atomic write support · 9da3d1e9
      John Garry authored
      Add atomic write support, as follows:
      - add helper functions to get request_queue atomic write limits
      - report request_queue atomic write support limits to sysfs and update Doc
      - support to safely merge atomic writes
      - deal with splitting atomic writes
      - misc helper functions
      - add a per-request atomic write flag
      
      New request_queue limits are added, as follows:
      - atomic_write_hw_max is set by the block driver and is the maximum length
        of an atomic write which the device may support. It is not
        necessarily a power-of-2.
      - atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
        max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
        and atomic_write_max_sectors would be the limit on a merged atomic write
        request size. This value is not capped at max_sectors, as the value in
        max_sectors can be controlled from userspace, and it would only cause
        trouble if userspace could limit atomic_write_unit_max_bytes and the
        other atomic write limits.
      - atomic_write_hw_unit_{min,max} are set by the block driver and are the
        min/max length of an atomic write unit which the device may support. They
        both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
        the same value as atomic_write_hw_max.
      - atomic_write_unit_{min,max} are derived from
        atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
        Both min and max values must be a power-of-2.
      - atomic_write_hw_boundary is set by the block driver. If non-zero, it
        indicates an LBA space boundary; an atomic write which straddles it is
        no longer executed atomically by the disk. The value must be a
        power-of-2. Note that it would be acceptable to enforce a rule that
        atomic_write_hw_boundary_sectors is a multiple of
        atomic_write_hw_unit_max, but the resultant code would be more
        complicated.
      
      All atomic write limits are by default set to 0 to indicate no atomic write
      support. Even though it is assumed by Linux that a logical block can always
      be atomically written, we ignore this as it is not of particular interest.
      Stacked devices are just not supported either for now.
      
      An atomic write must always be submitted to the block driver as part of a
      single request. As such, only a single BIO must be submitted to the block
      layer for an atomic write. When a single atomic write BIO is submitted, it
      cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
      by the maximum guaranteed BIO size which will not be required to be split.
      This max size is calculated by request_queue max segments and the number
      of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
      issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
      segment containing PAGE_SIZE of data, apart from the first+last, which each
      can fit logical block size of data. The first+last will be LBS
      length/aligned as we rely on direct IO alignment rules also.
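
One reading of the size calculation in the paragraph above, as a userspace sketch. It assumes 4 KiB pages and a BIO_MAX_VECS value of 256; the function name toy_max_unsplit_bio_bytes and the constants are hypothetical, and the real block layer code may differ in detail (for instance, in how the result is later rounded to a power-of-2).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE    4096u /* assumption: 4 KiB pages */
#define TOY_BIO_MAX_VECS 256u  /* assumption: mirrors the kernel's BIO_MAX_VECS */

/* Largest BIO size (in bytes) that is guaranteed not to need splitting:
 * the first and last segments are only assumed to hold one logical block
 * each, and every segment in between holds a full page. */
uint64_t toy_max_unsplit_bio_bytes(uint32_t max_segments, uint32_t lbs)
{
    uint32_t segs = max_segments < TOY_BIO_MAX_VECS ? max_segments
                                                    : TOY_BIO_MAX_VECS;
    uint32_t edge = segs < 2 ? segs : 2;
    uint64_t bytes = (uint64_t)edge * lbs;

    if (segs > 2)
        bytes += (uint64_t)(segs - 2) * TOY_PAGE_SIZE;
    return bytes;
}
```

With 128 segments and 512-byte logical blocks this gives 2 * 512 + 126 * 4096 = 517120 bytes, which would then bound atomic_write_unit_max_bytes.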
      
      New sysfs files are added to report the following atomic write limits:
      - atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
      				bytes
      - atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
      				bytes
      - atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
      				bytes
      - atomic_write_max_bytes      - same as atomic_write_max_sectors in bytes
      
      Atomic writes may only be merged with other atomic writes and only under
      the following conditions:
      - total resultant request length <= atomic_write_max_bytes
      - the merged write does not straddle a boundary
      
      Helper function bdev_can_atomic_write() is added to indicate whether
      atomic writes may be issued to a bdev. If a bdev is a partition, the
      partition start must be aligned with both atomic_write_unit_min_sectors
      and atomic_write_hw_boundary_sectors.
      
      FSes will rely on the block layer to validate that an atomic write BIO
      submitted will be of valid size, so add blk_validate_atomic_write_op_size()
      for this purpose. Userspace expects an atomic write which is of invalid
      size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
      BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
      invalid size BIO.
      
      Flag REQ_ATOMIC is used for indicating an atomic write.
      Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: John Garry <john.g.garry@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9da3d1e9
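The merge conditions listed in this commit message can be expressed as a small predicate. This is a sketch with hypothetical names (toy_straddles_boundary, toy_can_merge_atomic); offsets and lengths are in bytes, and a boundary of 0 is taken to mean that no boundary applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the range [start, start + len) crosses a multiple of boundary. */
bool toy_straddles_boundary(uint64_t start, uint64_t len, uint64_t boundary)
{
    if (boundary == 0 || len == 0)
        return false;
    return (start / boundary) != ((start + len - 1) / boundary);
}

/* Atomic writes may only merge with other atomic writes, and only if the
 * merged request stays within atomic_write_max_bytes and does not
 * straddle a boundary. */
bool toy_can_merge_atomic(bool a_atomic, bool b_atomic,
                          uint64_t merged_start, uint64_t merged_len,
                          uint64_t atomic_write_max_bytes, uint64_t boundary)
{
    if (!a_atomic || !b_atomic)
        return false;
    if (merged_len > atomic_write_max_bytes)
        return false;
    if (toy_straddles_boundary(merged_start, merged_len, boundary))
        return false;
    return true;
}
```

For instance, two adjacent 4096-byte atomic writes starting at offset 4096 cannot be merged when the boundary is 8192 bytes, because the merged 8192-byte request would cross the boundary at offset 8192.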