1. 07 May, 2024 5 commits
    • block: add a blk_alloc_discard_bio helper · e8b4869b
      Christoph Hellwig authored
      Factor out a helper from __blkdev_issue_discard that chews off as much as
      possible from a discard range and allocates a bio for it.
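
      The "chew off as much as possible" step can be pictured with a small
      userspace model. This is only a sketch under stated assumptions:
      next_discard_chunk() and its max_sects parameter are hypothetical names,
      and the real helper works on kernel structures and applies device limits
      that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not the kernel helper: carve the next chunk
 * off the front of a discard range. Returns the chunk length in sectors
 * (0 once the range is empty), advances *sector past the chunk and shrinks
 * *nr_sects by the same amount.
 */
static uint64_t next_discard_chunk(uint64_t *sector, uint64_t *nr_sects,
                                   uint64_t max_sects)
{
    uint64_t chunk;

    assert(sector != NULL && nr_sects != NULL);

    /* Take as much as allowed, but never more than what is left. */
    chunk = *nr_sects < max_sects ? *nr_sects : max_sects;
    *sector += chunk;
    *nr_sects -= chunk;
    return chunk;
}
```

      A caller would invoke this in a loop until it returns 0, building one
      request per returned chunk.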
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add a bio_chain_and_submit helper · 81c2168c
      Christoph Hellwig authored
      This is basically blk_next_bio just with the bio allocation moved
      to the caller to allow for more flexible bio handling in the caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: move discard checks into the ioctl handler · 30f1e724
      Christoph Hellwig authored
      Most bio operations get basic sanity checking in submit_bio and anything
      more complicated than that is done in the callers.  Discards are a bit
      different from that in that a lot of checking is done in
      __blkdev_issue_discard, and the specific errnos for that are returned
      to userspace.  Move the checks that require specific errnos to the ioctl
      handler instead, and just leave the basic sanity checking in submit_bio
      for the other handlers.  This introduces two changes in behavior:
      
       1) the logical block size alignment check of the start and len is lost
          for non-ioctl callers.
          This matches what is done for other operations including reads and
          writes.  We should probably verify this for all bios, but for now
          make discards match the normal flow.
       2) for non-ioctl callers all errors are reported on I/O completion now
          instead of synchronously.  Callers in general mostly ignore or log
          errors, so this will actually simplify the code once cleaned up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove the discard_granularity check in __blkdev_issue_discard · 09425920
      Christoph Hellwig authored
      We now set a default granularity in the queue limits API, so don't
      bother with this extra check.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/ioctl: prefer different overflow check · ccb326b5
      Justin Stitt authored
      Running syzkaller with the newly reintroduced signed integer overflow
      sanitizer shows this report:
      
      [   62.982337] ------------[ cut here ]------------
      [   62.985692] cgroup: Invalid name
      [   62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46
      [   62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1
      [   62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long'
      [   62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1
      [   62.999369] random: crng reseeded on system resumption
      [   63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000)
      [   63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1
      [   63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
      [   63.000682] Call Trace:
      [   63.000686]  <TASK>
      [   63.000731]  dump_stack_lvl+0x93/0xd0
      [   63.000919]  __get_user_pages+0x903/0xd30
      [   63.001030]  __gup_longterm_locked+0x153e/0x1ba0
      [   63.001041]  ? _raw_read_unlock_irqrestore+0x17/0x50
      [   63.001072]  ? try_get_folio+0x29c/0x2d0
      [   63.001083]  internal_get_user_pages_fast+0x1119/0x1530
      [   63.001109]  iov_iter_extract_pages+0x23b/0x580
      [   63.001206]  bio_iov_iter_get_pages+0x4de/0x1220
      [   63.001235]  iomap_dio_bio_iter+0x9b6/0x1410
      [   63.001297]  __iomap_dio_rw+0xab4/0x1810
      [   63.001316]  iomap_dio_rw+0x45/0xa0
      [   63.001328]  ext4_file_write_iter+0xdde/0x1390
      [   63.001372]  vfs_write+0x599/0xbd0
      [   63.001394]  ksys_write+0xc8/0x190
      [   63.001403]  do_syscall_64+0xd4/0x1b0
      [   63.001421]  ? arch_exit_to_user_mode_prepare+0x3a/0x60
      [   63.001479]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
      [   63.001535] RIP: 0033:0x7f7fd3ebf539
      [   63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      [   63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539
      [   63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004
      [   63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000
      [   63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [   63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8
      ...
      [   63.018142] ---[ end trace ]---
      
      Historically, the signed integer overflow sanitizer did not work in the
      kernel due to its interaction with `-fwrapv`, but this has since been
      changed [1] in the newest version of Clang. It was re-enabled in the
      kernel with commit 557f8c58 ("ubsan: Reintroduce signed overflow
      sanitizer").
      
      Let's rework this overflow checking logic to not actually perform an
      overflow during the check itself, thus avoiding the UBSAN splat.
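
      The general technique is to rearrange the comparison so that the
      addition is only evaluated once it is known to fit. The following is a
      minimal userspace sketch, not the code from the patch: range_end() is a
      hypothetical helper, and the checks on start and length are assumptions
      chosen for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel function: validate a (start, length)
 * pair and compute its end without ever evaluating a signed addition that
 * could overflow.
 */
static bool range_end(long long start, long long length, long long *end)
{
    assert(end != NULL);

    if (start < 0 || length <= 0)
        return false;

    /*
     * Equivalent to "start + length > LLONG_MAX", rearranged so that the
     * subtraction cannot overflow because length is known to be positive.
     */
    if (start > LLONG_MAX - length)
        return false;

    *end = start + length; /* safe: proven to fit above */
    return true;
}
```

      With the values from the report above (9223372036854775807 and 4095),
      this sketch rejects the range before any addition takes place.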
      
      [1]: https://github.com/llvm/llvm-project/pull/82432
      Signed-off-by: Justin Stitt <justinstitt@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 06 May, 2024 1 commit
  3. 03 May, 2024 4 commits
    • block: fix and simplify blkdevparts= cmdline parsing · bc2e07df
      INAGAKI Hiroshi authored
      Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(),
      which makes the code simpler.
      
      Before commit 146afeb2 ("block: use strscpy() to instead of
      strncpy()"), we used a strncpy() to copy a block device name and partition
      names. The commit simply replaced a strncpy() and NULL termination with
      a strscpy(). It did not update calculations of length passed to strscpy().
      While the length passed to strncpy() is just a length of valid characters
      without NULL termination ('\0'), strscpy() takes it as a length of the
      destination buffer, including a NULL termination.
      
      Since the source buffer is not necessarily NULL terminated, the current
      code copies "length - 1" characters and puts a NULL character in the
      destination buffer. It replaces the last character with NULL and breaks
      the parsing.
      
      As an example, that buffer will be passed to parse_parts() and breaks
      parsing sub-partitions due to the missing ')' at the end, like the
      following.
      
      example (Check Point V-80 & OpenWrt):
      
      - Linux Kernel 6.6
      
        [    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4
        ...
        [    0.884016] mmc1: new HS200 MMC card at address 0001
        [    0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB
        [    0.895043] cmdline partition format is invalid.
        [    0.895704]  mmcblk1: p1
        [    0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB
        [    0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB
        [    0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0)
      
        1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17
           from parse_parts()
        2. strscpy() returns -E2BIG and the destination buffer has
           "48M@10M(kernel-1\0"
        3. "48M@10M(kernel-1\0" is passed to parse_subpart()
        4. parse_subpart() fails to find ')' when parsing a partition name,
           and returns error
      
      - Linux Kernel 6.1
      
        [    0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4
        ...
        [    0.953142] mmc1: new HS200 MMC card at address 0001
        [    0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB
        [    0.964259]  mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage)
        [    0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB
        [    0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB
        [    0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0)
      
      By the way, strscpy() takes a length of destination buffer and it is
      often confusing when copying characters with a specified length. Using
      strsep() helps to separate the string by the specified character. Then,
      we can use strscpy() naturally with the size of the destination buffer.
      
      Separating the string on the fly also avoids the redundant string copy,
      which reduces memory usage and improves code readability.
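
      The idea can be tried out in userspace. This is a sketch under stated
      assumptions, not the patched kernel code: split_parts() and NAME_LEN are
      hypothetical, and snprintf() stands in for strscpy(), which is a
      kernel-only function.

```c
#define _DEFAULT_SOURCE /* exposes strsep() on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 32 /* hypothetical destination buffer size */

/*
 * Split a writable spec such as "48M@10M(kernel-1),1M(dtb-1)" on ',' and
 * copy each piece into names[]. Returns the number of pieces stored.
 */
static size_t split_parts(char *spec, char names[][NAME_LEN], size_t max)
{
    size_t n = 0;
    char *piece;

    assert(spec != NULL);

    while (n < max && (piece = strsep(&spec, ",")) != NULL) {
        /*
         * strsep() has already terminated the piece in place, so the copy
         * is bounded by the destination size rather than by a hand-computed
         * source length.
         */
        snprintf(names[n], NAME_LEN, "%s", piece);
        n++;
    }
    return n;
}
```

      Because each piece is terminated before it is copied, the closing ')'
      of the partition name is no longer cut off by the length calculation.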
      
      Fixes: 146afeb2 ("block: use strscpy() to instead of strncpy()")
      Suggested-by: Naohiro Aota <naota@elisp.net>
      Signed-off-by: INAGAKI Hiroshi <musashino.open@gmail.com>
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: refine the EOF check in blkdev_iomap_begin · 0c12028a
      Christoph Hellwig authored
      blkdev_iomap_begin rounds down the offset to the logical block size
      before stashing it in iomap->offset and checking that it still is
      inside the inode size.
      
      Apply the i_size check to the raw pos value so that we don't try a
      zero size write if iter->pos is unaligned.
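
      The difference between the two checks shows up only for an unaligned
      position at or beyond the end of the device. The sketch below is a
      userspace illustration with hypothetical function names and made-up
      sizes, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round pos down to a power-of-two block size. */
static uint64_t align_down(uint64_t pos, uint64_t bs)
{
    assert(bs != 0 && (bs & (bs - 1)) == 0);
    return pos & ~(bs - 1);
}

/* Check on the rounded-down offset: can accept a position past the end. */
static bool in_range_aligned(uint64_t pos, uint64_t isize, uint64_t bs)
{
    return align_down(pos, bs) < isize;
}

/* Check on the raw position: rejects anything at or past the end. */
static bool in_range_raw(uint64_t pos, uint64_t isize)
{
    return pos < isize;
}
```

      For example, with a 4096-byte block size, a size of 6144 bytes and a
      position of 7000, the rounded-down offset is 4096 and passes the first
      check, while the raw position fails the second.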
      
      Fixes: 487c607d ("block: use iomap for writes to block devices")
      Reported-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add a partscan sysfs attribute for disks · a4217c67
      Christoph Hellwig authored
      Userspace had been unknowingly relying on a non-stable interface of
      kernel internals to determine if partition scanning is enabled for a
      given disk. Provide a stable interface for this purpose instead.
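
      A userspace consumer could read the attribute roughly as follows. This
      is a sketch that assumes the attribute is exposed as
      /sys/block/<disk>/partscan and contains "1" or "0"; neither detail is
      stated in this commit message, and both function names are
      hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Interpret the attribute text: 1 = enabled, 0 = disabled, -1 = unknown. */
static int parse_partscan(const char *text)
{
    assert(text != NULL);

    if (text[0] == '1')
        return 1;
    if (text[0] == '0')
        return 0;
    return -1;
}

/*
 * Returns 1 or 0 as above, or -1 if the attribute cannot be read, for
 * example on a kernel that does not provide it. The path is an assumption.
 */
static int disk_partscan_enabled(const char *disk)
{
    char path[256];
    char buf[8];
    FILE *f;
    int ret = -1;

    assert(disk != NULL);

    snprintf(path, sizeof(path), "/sys/block/%s/partscan", disk);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        ret = parse_partscan(buf);
    fclose(f);
    return ret;
}
```

      Treating a missing file as "unknown" rather than "disabled" lets callers
      decide how to handle kernels without the attribute.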
      
      Cc: stable@vger.kernel.org # 6.3+
      Depends-on: 140ce28d ("block: add a disk_has_partscan helper")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/
      Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de
      [axboe: add links and commit message from Keith]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add a disk_has_partscan helper · 140ce28d
      Christoph Hellwig authored
      Add a helper to check if partition scanning is enabled instead of
      open coding the check in a few places.  This now always checks for
      the hidden flag even if all but one of the callers are never reachable
      for hidden gendisks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 02 May, 2024 2 commits
  5. 01 May, 2024 14 commits
  6. 26 Apr, 2024 1 commit
  7. 25 Apr, 2024 3 commits
  8. 23 Apr, 2024 1 commit
    • block: use a per disk workqueue for zone write plugging · a8f59e5a
      Damien Le Moal authored
      A zone write plug BIO work function blk_zone_wplug_bio_work() calls
      submit_bio_noacct_nocheck() to execute the next unplugged BIO. This
      function may block. So executing zone write plug BIO work on the block
      layer global kblockd workqueue can potentially lead to performance or
      latency issues, as the number of concurrent work items for a workqueue
      is limited to WQ_DFL_ACTIVE (256).
      1) For a system with a large number of zoned disks, issuing write
         requests to otherwise unused zones may be delayed waiting for a work
         thread to become available.
      2) Requeue operations which use kblockd but are independent of zone
         write plugging may also end up being delayed.
      
      To avoid these potential performance issues, create a workqueue per
      zoned device to execute zone plugs BIO work. The workqueue max active
      parameter is set to the maximum number of zone write plugs allocated
      with the zone write plug mempool. This limit is equal to the maximum
      number of open zones of the disk and defaults to 128 for disks that do
      not have a limit on the number of open zones.
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20240420075811.1276893-3-dlemoal@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 19 Apr, 2024 1 commit
  10. 17 Apr, 2024 8 commits