1. 15 Aug, 2019 3 commits
    • io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list · a982eeb0
      Jackie Liu authored
      This patch may fix two issues:
      
      First, when IOSQE_IO_DRAIN is set, the following IOs need to be inserted
      into the defer list to delay execution, but a linked IO is still actively
      scheduled to run by calling io_queue_sqe.
      
      Second, when multiple LINK_IOs are inserted into the defer_list together,
      the LINK_IOs no longer keep their order.
      
         |-------------|
         |   LINK_IO   |      ----> insert to defer_list  -----------
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   NORMAL_IO |      ----> insert to defer_list  ----------|
         |-------------|                                            |
                                                                    |
                                    queue_work at same time   <-----|
      
      Fixes: 9e645e11 ("io_uring: add support for sqe links")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove REQ_NOWAIT_INLINE · 7b6620d7
      Jens Axboe authored
      We had a few issues with this code, and there's still a problem around
      how we deal with error handling for chained/split bios. For now, just
      revert the code and we'll try again with a thorough solution. This
      reverts commits:
      
      e15c2ffa ("block: fix O_DIRECT error handling for bio fragments")
      0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      893a1c97 ("blk-mq: allow REQ_NOWAIT to return an error inline")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix manual setup of iov_iter for fixed buffers · 99c79f66
      Aleix Roca Nonell authored
      Commit bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed
      buffers") introduced an optimization to avoid using the slow
      iov_iter_advance by manually populating the iov_iter iterator in some
      cases.
      
      However, the computation of the iterator count field was erroneous: The
      first bvec was always accounted for an extent of page size even if the
      bvec length was smaller.
      
      In consequence, some I/O operations on fixed buffers were unable to
      operate on the full extent of the buffer, consistently skipping some
      bytes at the end of it.
      
      Fixes: bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
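The accounting error described in the io_uring fixed-buffer commit can be illustrated with a small user-space model. This is a sketch, not the kernel code: `remaining_count`, `buggy_count`, `fixed_count`, and the 4096-byte `PAGE_SIZE` are hypothetical names and values chosen for illustration. It assumes the iterator initially covers `offset + length` bytes and that `offset` bytes must be skipped, with only the first segment allowed to be shorter than a page.

```python
PAGE_SIZE = 4096  # assumed page size for this model


def remaining_count(first_len, offset, length, first_extent):
    """Bytes left in the iterator after skipping `offset` bytes.

    `first_extent` is how many bytes the first segment is *accounted* as.
    Charging PAGE_SIZE here instead of the real first_len reproduces the
    behaviour described in the commit message.
    """
    if offset <= first_len:
        # The offset falls inside the first segment; nothing is skipped
        # manually, so the full length stays available.
        return length
    count = offset + length          # initial extent of the iterator
    past_first = offset - first_len  # bytes skipped beyond the first segment
    return count - first_extent - past_first


def buggy_count(first_len, offset, length):
    # First segment always charged as a full page.
    return remaining_count(first_len, offset, length, PAGE_SIZE)


def fixed_count(first_len, offset, length):
    # First segment charged with its real length.
    return remaining_count(first_len, offset, length, first_len)
```

With a first segment of 1024 bytes, the model's buggy variant comes up `PAGE_SIZE - 1024` bytes short, which matches the "skipping some bytes at the end" symptom; when the first segment is a full page both variants agree.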
  2. 12 Aug, 2019 2 commits
  3. 11 Aug, 2019 1 commit
    • Merge branch 'nvme-5.3-rc' of git://git.infradead.org/nvme into for-linus · 0c9c8304
      Jens Axboe authored
      Pull NVMe fixes from Sagi:
      
      "Few nvme fixes for the next rc round.
      - detect capacity changes on the mpath disk from Anthony
      - probe/remove fix from Keith
      - various fixes to pass blktests from Logan
      - deadlock in reset/scan race fix
      - nvme-rdma use-after-free fix
      - deadlock fix when passthru commands race mpath disk info update"
      
      * 'nvme-5.3-rc' of git://git.infradead.org/nvme:
        nvme-pci: Fix async probe remove race
        nvme: fix controller removal race with scan work
        nvme-rdma: fix possible use-after-free in connect error flow
        nvme: fix a possible deadlock when passthru commands sent to a multipath device
        nvme-core: Fix extra device_put() call on error path
        nvmet-file: fix nvmet_file_flush() always returning an error
        nvmet-loop: Flush nvme_delete_wq when removing the port
        nvmet: Fix use-after-free bug when a port is removed
        nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns
  4. 09 Aug, 2019 1 commit
    • bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()" · 20621fed
      Coly Li authored
      This reverts commit 89e0341a.
      
      In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), the NULL pointer at
      the end of the list is necessary. Removing the NULL from the last element
      of each list causes the following panic:
      
      [ 4340.455652] bcache: register_cache() registered cache device nvme0n1
      [ 4340.464603] bcache: register_bdev() registered backing device sdk
      [ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already
      [ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6
      [ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI
      [ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4
      [ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
      [ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000
      [ 5139.248021] RIP: 0010:string+0x21/0x70
      [ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286
      [ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04
      [ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff
      [ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a
      [ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48
      [ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855
      [ 5139.248119] FS:  00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000
      [ 5139.248137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0
      [ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 5139.248193] PKRU: 55555554
      [ 5139.248200] Call Trace:
      [ 5139.248210]  vsnprintf+0x1fb/0x510
      [ 5139.248221]  snprintf+0x39/0x40
      [ 5139.248238]  bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache]
      [ 5139.248256]  __bch_cached_dev_show+0x44d/0x5f0 [bcache]
      [ 5139.248270]  ? __alloc_pages_nodemask+0xb2/0x210
      [ 5139.248284]  bch_cached_dev_show+0x2c/0x50 [bcache]
      [ 5139.248297]  sysfs_kf_seq_show+0xbb/0x190
      [ 5139.248308]  seq_read+0xfc/0x3c0
      [ 5139.248317]  __vfs_read+0x26/0x140
      [ 5139.248327]  vfs_read+0x87/0x130
      [ 5139.248336]  SyS_read+0x42/0x90
      [ 5139.248346]  do_syscall_64+0x74/0x160
      [ 5139.248358]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [ 5139.248370] RIP: 0033:0x7faf24eea370
      [ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370
      [ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003
      [ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000
      [ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000
      [ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
      [ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e
      
      The simplest fix is to revert commit 89e0341a ("bcache: use
      sysfs_match_string() instead of __sysfs_match_string()").
      
      This bug was introduced in Linux v5.2, so for stable tree maintainers it
      is enough to apply this fix to Linux v5.2 only.
      
      Fixes: 89e0341a ("bcache: use sysfs_match_string() instead of __sysfs_match_string()")
      Cc: stable@vger.kernel.org
      Cc: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Reported-by: Peifeng Lin <pflin@suse.com>
      Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 08 Aug, 2019 6 commits
    • loop: set PF_MEMALLOC_NOIO for the worker thread · d0a255e7
      Mikulas Patocka authored
      A deadlock with the stack traces below was observed.
      
      The loop thread does a GFP_KERNEL allocation, which calls into the dm-bufio
      shrinker, and the shrinker depends on I/O completion in the dm-bufio
      subsystem.
      
      In order to fix the deadlock (and other similar ones), we set the flag
      PF_MEMALLOC_NOIO at loop thread entry.
      
      PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
         #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
         #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
         #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
         #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
         #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
         #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
         #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
         #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
         #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
         #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
        #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
        #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
        #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
        #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
        #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
      
        PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
         #0 [ffff88272f5af228] __schedule at ffffffff8173f405
         #1 [ffff88272f5af280] schedule at ffffffff8173fa27
         #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
         #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
         #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
         #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
         #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
         #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
         #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
         #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
        #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
        #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
        #12 [ffff88272f5af760] new_slab at ffffffff811f4523
        #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
        #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
        #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
        #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
        #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
        #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
        #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
        #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
        #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
        #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
        #23 [ffff88272f5afec0] kthread at ffffffff810a8428
        #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdev: Fixup error handling in blkdev_get() · e91455ba
      Jan Kara authored
      Commit 89e524c0 ("loop: Fix mount(2) failure due to race with
      LOOP_SET_FD") converted blkdev_get() to use the new helpers for
      finishing claiming of a block device. However the conversion botched the
      error handling in blkdev_get() and thus the bdev has been marked as held
      even in case __blkdev_get() returned error. This led to occasional
      warnings with block/001 test from blktests like:
      
      kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0
      
      Correct the error handling.
      
      CC: stable@vger.kernel.org
      Fixes: 89e524c0 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: handle NULL return value by bfq_init_rq() · fd03177c
      Paolo Valente authored
      As reported in [1], the call bfq_init_rq(rq) may return NULL in case
      of OOM (in particular, if rq->elv.icq is NULL because memory
      allocation failed in ioc_create_icq()).
      
      This commit handles this circumstance.
      
      [1] https://lkml.org/lkml/2019/7/22/824
      
      Cc: Hsin-Yi Wang <hsinyi@google.com>
      Cc: Nicolas Boichat <drinkcat@chromium.org>
      Cc: Doug Anderson <dianders@chromium.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Reported-by: Hsin-Yi Wang <hsinyi@google.com>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: move update of waker and woken list to queue freeing · 3f758e84
      Paolo Valente authored
      Since commit 13a857a4 ("block, bfq: detect wakers and
      unconditionally inject their I/O"), every bfq_queue has a pointer to a
      waker bfq_queue and a list of the bfq_queues it may wake. In this
      respect, when a bfq_queue, say Q, remains with no I/O source attached
      to it, Q cannot be woken by any other bfq_queue, and cannot wake any
      other bfq_queue. Then Q must be removed from the woken list of its
      possible waker bfq_queue, and all bfq_queues in the woken list of Q
      must stop having a waker bfq_queue.
      
      Q remains with no I/O source in two cases: when the last process
      associated with Q exits or when such a process gets associated with a
      different bfq_queue. Unfortunately, commit 13a857a4 ("block, bfq:
      detect wakers and unconditionally inject their I/O") performed the
      above updates only in the first case.
      
      This commit fixes this bug by moving these updates to when Q gets
      freed. This is a simple and safe way to handle all cases, as both the
      above events, process exit and re-association, lead to Q being freed
      soon, and because dangling references would come out only after Q gets
      freed (if no update were performed).
      
      Fixes: 13a857a4 ("block, bfq: detect wakers and unconditionally inject their I/O")
      Reported-by: Douglas Anderson <dianders@chromium.org>
      Tested-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
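The waker/woken bookkeeping described in the bfq commit above can be modelled in a few lines. This is a minimal sketch under stated assumptions, not BFQ's implementation: `Queue`, `set_waker`, and `detach_on_free` are hypothetical names, and a Python set stands in for the kernel's woken list. It shows the invariant the commit describes: when a queue Q is freed, Q leaves its waker's woken list, and every queue in Q's own woken list loses its waker.

```python
class Queue:
    """Toy stand-in for a bfq_queue (hypothetical model, not kernel code)."""

    def __init__(self, name):
        self.name = name
        self.waker = None   # the queue that may wake this one, if any
        self.woken = set()  # queues this one may wake


def set_waker(q, waker):
    """Record `waker` as the waker of `q`, keeping both sides in sync."""
    if q.waker is not None:
        q.waker.woken.discard(q)
    q.waker = waker
    waker.woken.add(q)


def detach_on_free(q):
    """Drop every waker/woken reference to `q` at the point it is freed."""
    if q.waker is not None:
        q.waker.woken.discard(q)
        q.waker = None
    for other in list(q.woken):
        other.waker = None  # these queues no longer have a waker
    q.woken.clear()
```

Doing the cleanup at free time covers both cases named in the commit message (process exit and re-association to a different queue), because in this model no reference to the freed queue survives the call.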
    • block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed · 08d383a7
      Paolo Valente authored
      Since commit 13a857a4 ("block, bfq: detect wakers and
      unconditionally inject their I/O"), BFQ stores, in a per-device
      pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
      request completed. If some bfq_queue receives new I/O right after the
      last request of last_completed_rq_bfqq has been completed, then
      last_completed_rq_bfqq may be a waker bfq_queue.
      
      But if the bfq_queue last_completed_rq_bfqq points to is freed, then
      last_completed_rq_bfqq becomes a dangling reference. This commit
      resets last_completed_rq_bfqq if the pointed bfq_queue is freed.
      
      Fixes: 13a857a4 ("block, bfq: detect wakers and unconditionally inject their I/O")
      Reported-by: Douglas Anderson <dianders@chromium.org>
      Tested-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: aoe: Fix kernel crash due to atomic sleep when exiting · 430380b4
      He Zhe authored
      Since commit 3582dd29 ("aoe: convert aoeblk to blk-mq"), aoedev_downdev
      has had the possibility of sleeping and causing the following crash.
      
      BUG: scheduling while atomic: rmmod/2242/0x00000003
      Modules linked in: aoe
      Preemption disabled at:
      [<ffffffffc01d95e5>] flush+0x95/0x4a0 [aoe]
      CPU: 7 PID: 2242 Comm: rmmod Tainted: G          I       5.2.3 #1
      Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
      Call Trace:
       dump_stack+0x4f/0x6a
       ? flush+0x95/0x4a0 [aoe]
       __schedule_bug.cold+0x44/0x54
       __schedule+0x44f/0x680
       schedule+0x44/0xd0
       blk_mq_freeze_queue_wait+0x46/0xb0
       ? wait_woken+0x80/0x80
       blk_mq_freeze_queue+0x1b/0x20
       aoedev_downdev+0x111/0x160 [aoe]
       flush+0xff/0x4a0 [aoe]
       aoedev_exit+0x23/0x30 [aoe]
       aoe_exit+0x35/0x948 [aoe]
       __se_sys_delete_module+0x183/0x210
       __x64_sys_delete_module+0x16/0x20
       do_syscall_64+0x4d/0x130
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f24e0043b07
      Code: 73 01 c3 48 8b 0d 89 73 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f
      1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d 59 73 0b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffe18f7f1e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f24e0043b07
      RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000555c3ecf87c8
      RBP: 00007ffe18f7f1f0 R08: 0000000000000000 R09: 0000000000000000
      R10: 00007f24e00b4ac0 R11: 0000000000000206 R12: 00007ffe18f7f238
      R13: 00007ffe18f7f410 R14: 00007ffe18f80e73 R15: 0000555c3ecf8760
      
      This patch handles it the same way as pass two: it drops the locks and
      restarts pass one after aoedev_downdev is done.
      
      Fixes: 3582dd29 ("aoe: convert aoeblk to blk-mq")
      Signed-off-by: He Zhe <zhe.he@windriver.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 07 Aug, 2019 3 commits
    • libata: add SG safety checks in SFF pio transfers · 752ead44
      Jens Axboe authored
      Abort processing of a command if we run out of mapped data in the
      SG list. This should never happen, but a previous bug caused it to
      be possible. Play it safe and attempt to abort nicely if we don't
      have more SG segments left.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • libata: have ata_scsi_rw_xlat() fail invalid passthrough requests · 2d727150
      Jens Axboe authored
      For passthrough requests, libata-scsi takes what the user passes in
      as gospel. This can be problematic if the user fills in the CDB
      incorrectly. One example of that is in request sizes. For read/write
      commands, the CDB contains fields describing the transfer length of
      the request. These should match with the SG_IO header fields, but
      libata-scsi currently does no validation of that.
      
      Check that the number of blocks in the CDB for passthrough requests
      matches what was mapped into the request. If the CDB asks for more
      data than the validated SG_IO header fields describe, fail the request.
      Reported-by: Krishna Ram Prakash R <krp@gtux.in>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
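The kind of check described in the libata passthrough commit can be sketched in user space. This is an illustration under stated assumptions, not the libata code: `rw10_block_count` and `cdb_fits_request` are hypothetical helpers, only the 10-byte READ(10)/WRITE(10) CDB layout is handled (transfer length as a big-endian 16-bit field at bytes 7 and 8), and the 512-byte sector size is an assumed default.

```python
import struct

READ_10 = 0x28   # SCSI READ(10) opcode
WRITE_10 = 0x2A  # SCSI WRITE(10) opcode


def rw10_block_count(cdb):
    """Return the transfer length (in blocks) encoded in a READ/WRITE(10) CDB."""
    if len(cdb) < 10 or cdb[0] not in (READ_10, WRITE_10):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    # Bytes 7..8 hold the transfer length, big-endian.
    (n_blocks,) = struct.unpack_from(">H", cdb, 7)
    return n_blocks


def cdb_fits_request(cdb, mapped_bytes, sector_size=512):
    """True if the CDB does not ask for more data than was mapped."""
    return rw10_block_count(cdb) * sector_size <= mapped_bytes
```

A CDB that requests 8 blocks passes against a 4096-byte mapping and fails against a 512-byte one; in the second case the request would be rejected instead of being trusted.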
    • block: fix O_DIRECT error handling for bio fragments · e15c2ffa
      Jens Axboe authored
      0eb6ddfb tried to fix this up, but introduced a use-after-free
      of dio. Additionally, we still had an issue with error handling,
      as reported by Darrick:
      
      "I noticed a regression in xfs/747 (an unreleased xfstest for the
      xfs_scrub media scanning feature) on 5.3-rc3.  I'll condense that down
      to a simpler reproducer:
      
      error-test: 0 209 linear 8:48 0
      error-test: 209 1 error
      error-test: 210 6446894 linear 8:48 210
      
      Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
      for sector 209 and to pass the io to the scsi device everywhere else.
      
      On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
      (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
      EIO like you'd expect:
      
      pread64(3, 0x7f880e1c7000, 1048576, 0)  = -1 EIO (Input/output error)
      pread: Input/output error
      +++ exited with 0 +++
      
      But doing it with a larger buffer succeeds(!):
      
      pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
      read 1146880/1146880 bytes at offset 0
      1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
      +++ exited with 0 +++
      
      (Note that the part of the buffer corresponding to the dm-error area is
      uninitialized)
      
      On 5.3-rc2, both commands would fail with EIO like you'd expect.  The
      only change between rc2 and rc3 is commit 0eb6ddfb ("block: Fix
      __blkdev_direct_IO() for bio fragments").
      
      AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
      split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
      and a second one for the 96k after that."
      
      Fix this by noting that it's always safe to dereference dio if we get
      BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
      we can safely increment the dio size before calling submit_bio(), and
      then decrement it on failure (not that it really matters, as the bio
      and dio are going away).
      
      For error handling, return to the original method of just using 'ret'
      for tracking the error, and the size tracking in dio->size.
      
      Fixes: 0eb6ddfb ("block: Fix __blkdev_direct_IO() for bio fragments")
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 06 Aug, 2019 1 commit
  8. 02 Aug, 2019 1 commit
    • s390/dasd: fix endless loop after read unit address configuration · 41995342
      Stefan Haberland authored
      After getting a storage server event that causes the DASD device driver
      to update its unit address configuration during a device shutdown there is
      the possibility of an endless loop in the device driver.
      
      In the system log there will be ongoing DASD error messages with RC: -19.
      
      The reason is that the loop starting the ruac request only terminates when
      the retry counter is decreased to 0. But in the sleep_on function there are
      early exit paths that do not decrease the retry counter.
      
      Prevent an endless loop by handling those cases separately.
      
      Remove the unnecessary do..while loop since the sleep_on function takes
      care of retries by itself.
      
      Fixes: 8e09f215 ("[S390] dasd: add hyper PAV support to DASD device driver, part 1")
      Cc: stable@vger.kernel.org # 2.6.25+
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
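The loop hazard described in the DASD commit (a loop that only ends when a retry counter reaches zero, combined with exit paths that never decrement it) can be shown with a toy model. This is a sketch of the general pattern only: `start_request`, the outcome strings, and the -5 return value are hypothetical; -19 is taken from the "RC: -19" in the commit message. The actual fix removes the outer loop entirely, which this model does not reproduce.

```python
ENODEV_RC = -19  # return code quoted in the commit message
EIO_RC = -5      # assumed return code once retries are exhausted


def start_request(attempt, retries):
    """Call `attempt()` until it succeeds or the retries run out.

    `attempt` returns "done", "retry", or "early_exit". The early-exit
    outcome does not consume a retry, so it must end the loop itself;
    otherwise the loop condition could never become false.
    """
    while retries > 0:
        outcome = attempt()
        if outcome == "done":
            return 0
        if outcome == "early_exit":
            # Handled separately: bail out instead of looping again.
            return ENODEV_RC
        retries -= 1
    return EIO_RC
```

Without the explicit early-exit branch, an `attempt` that always returns "early_exit" would spin forever, which is the behaviour the commit message reports.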
  9. 01 Aug, 2019 9 commits
    • block: Fix __blkdev_direct_IO() for bio fragments · 0eb6ddfb
      Damien Le Moal authored
      The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
      (patch 6a43074e) introduced two problems with BIO fragment handling
      for direct IOs:
      1) The dio size processed is calculated by incrementing the ret variable
      by the size of the bio fragment issued for the dio. However, this size
      is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
      which may result in referencing the bi_size value after the bio
      completed, resulting in an incorrect value use.
      2) The ret variable is not incremented by the size of the last bio
      fragment issued for the bio, leading to an invalid IO size being
      returned to the user.
      
      Fix both problems by using dio->size (which is incremented before the bio
      submission) to update the value of ret after bio submissions, including
      for the last bio fragment issued.
      
      Fixes: 6a43074e ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
      Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-pci: Fix async probe remove race · bd46a906
      Keith Busch authored
      Ensure the controller is not in the NEW state when nvme_probe() exits.
      This will always allow a subsequent nvme_remove() to set the state to
      DELETING, fixing a potential race between the initial asynchronous probe
      and device removal.
      Reported-by: Li Zhong <lizhongfs@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme: fix controller removal race with scan work · 0157ec8d
      Sagi Grimberg authored
      With multipath enabled, nvme_scan_work() can read from the device
      (through nvme_mpath_add_disk()) and hang [1]. With fabrics, once
      ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
      (see nvmf_check_ready()), and the mpath stack device make_request
      will block if head->list is not empty. However, when head->list
      consists of only DELETING/DEAD controllers, we should not block,
      but rather fail immediately.
      
      In addition, before we go ahead and remove the namespaces, make sure
      to clear the current path and kick the requeue list so that the
      request will fast fail upon requeuing.
      
      [1]:
      --
        INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:3    D    0   166      2 0x80004000
        Workqueue: nvme-wq nvme_scan_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         io_schedule+0x21/0x70
         do_read_cache_page+0xa57/0x1330
         read_cache_page+0x4a/0x70
         read_dev_sector+0xbf/0x380
         amiga_partition+0xc4/0x1230
         check_partition+0x30f/0x630
         rescan_partitions+0x19a/0x980
         __blkdev_get+0x85a/0x12f0
         blkdev_get+0x2a5/0x790
         __device_add_disk+0xe25/0x1250
         device_add_disk+0x13/0x20
         nvme_mpath_set_live+0x172/0x2b0
         nvme_update_ns_ana_state+0x130/0x180
         nvme_set_ns_ana_state+0x9a/0xb0
         nvme_parse_ana_log+0x1c3/0x4a0
         nvme_mpath_add_disk+0x157/0x290
         nvme_validate_ns+0x1017/0x1bd0
         nvme_scan_work+0x44d/0x6a0
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      
         INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:1    D    0  1034      2 0x80004000
        Workqueue: nvme-delete-wq nvme_delete_ctrl_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         schedule_timeout+0x390/0x830
         wait_for_completion+0x1a7/0x310
         __flush_work+0x241/0x5d0
         flush_work+0x10/0x20
         nvme_remove_namespaces+0x85/0x3d0
         nvme_do_delete_ctrl+0xb4/0x1e0
         nvme_delete_ctrl_work+0x15/0x20
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      --
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
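The block-or-fail decision described in the nvme scan-work commit can be expressed as a small predicate. This is a sketch of the rule as stated in the commit message, not the nvme-multipath code: `CtrlState` and `should_block` are hypothetical names, and the states other than DELETING and DEAD are included only to make the example complete.

```python
from enum import Enum, auto


class CtrlState(Enum):
    """Illustrative controller states; only DELETING and DEAD matter here."""
    NEW = auto()
    LIVE = auto()
    RESETTING = auto()
    CONNECTING = auto()
    DELETING = auto()
    DEAD = auto()


def should_block(path_states):
    """True if I/O should wait for a path, False if it should fail now.

    Waiting only makes sense when at least one controller on the list
    could still become usable; a list made up solely of DELETING or DEAD
    controllers (or an empty list) means the request should fail fast.
    """
    gone = (CtrlState.DELETING, CtrlState.DEAD)
    return any(state not in gone for state in path_states)
```

In this model a head with paths in DELETING and RESETTING still blocks, because the resetting controller may come back, while a head with only DELETING and DEAD paths fails immediately.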
    • nvme-rdma: fix possible use-after-free in connect error flow · d94211b8
      Sagi Grimberg authored
      When start_queue fails, we need to make sure to drain the
      queue cq before freeing the rdma resources because we might
      still race with the completion path. Have start_queue() error
      path safely stop the queue.
      
      --
      [30371.808111] nvme nvme1: Failed reconnect attempt 11
      [30371.808113] nvme nvme1: Reconnecting in 10 seconds...
      [...]
      [30382.069315] nvme nvme1: creating 4 I/O queues.
      [30382.257058] nvme nvme1: Connect Invalid SQE Parameter, qid 4
      [30382.257061] nvme nvme1: failed to connect queue: 4 ret=386
      [30382.305001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [30382.305022] IP: qedr_poll_cq+0x8a3/0x1170 [qedr]
      [30382.305028] PGD 0 P4D 0
      [30382.305037] Oops: 0000 [#1] SMP PTI
      [...]
      [30382.305153] Call Trace:
      [30382.305166]  ? __switch_to_asm+0x34/0x70
      [30382.305187]  __ib_process_cq+0x56/0xd0 [ib_core]
      [30382.305201]  ib_poll_handler+0x26/0x70 [ib_core]
      [30382.305213]  irq_poll_softirq+0x88/0x110
      [30382.305223]  ? sort_range+0x20/0x20
      [30382.305232]  __do_softirq+0xde/0x2c6
      [30382.305241]  ? sort_range+0x20/0x20
      [30382.305249]  run_ksoftirqd+0x1c/0x60
      [30382.305258]  smpboot_thread_fn+0xef/0x160
      [30382.305265]  kthread+0x113/0x130
      [30382.305273]  ? kthread_create_worker_on_cpu+0x50/0x50
      [30382.305281]  ret_from_fork+0x35/0x40
      --
      Reported-by: Nicolas Morey-Chaisemartin <NMoreyChaisemartin@suse.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • Sagi Grimberg's avatar
      nvme: fix a possible deadlock when passthru commands sent to a multipath device · b9156dae
      Sagi Grimberg authored
      When the user issues a command with side effects, we will end up freezing
      the namespace request queue when updating disk info (and the same for
      the corresponding mpath disk node).
      
      However, we are not freezing the mpath node request queue,
      which means that mpath I/O can still come in and block on blk_queue_enter
      (called from nvme_ns_head_make_request -> direct_make_request).
      
      This is a deadlock, because blk_queue_enter will block until the inner
      namespace request queue is unfrozen, but that process is blocked because
      the namespace revalidation is trying to update the mpath disk info
      and freeze its request queue (which will never complete because
      of the I/O that is blocked on blk_queue_enter).
      
      Fix this by freezing the request queues of all the subsystem's nsheads
      before executing the passthru command. Given that these commands are
      infrequent, the temporary I/O freeze is an acceptable price for keeping
      things sane.
      
      Here are the matching hang traces:
      --
      [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
      [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.487274] systemd-udevd D 0 17994 1 0x00000000
      [ 374.493407] Call Trace:
      [ 374.496145] __schedule+0x2ef/0x620
      [ 374.500047] schedule+0x38/0xa0
      [ 374.503569] blk_queue_enter+0x139/0x220
      [ 374.507959] ? remove_wait_queue+0x60/0x60
      [ 374.512540] direct_make_request+0x60/0x130
      [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
      [ 374.523740] ? generic_make_request_checks+0x307/0x6f0
      [ 374.529484] generic_make_request+0x10d/0x2e0
      [ 374.534356] submit_bio+0x75/0x140
      [ 374.538163] ? guard_bio_eod+0x32/0xe0
      [ 374.542361] submit_bh_wbc+0x171/0x1b0
      [ 374.546553] block_read_full_page+0x1ed/0x330
      [ 374.551426] ? check_disk_change+0x70/0x70
      [ 374.556008] ? scan_shadow_nodes+0x30/0x30
      [ 374.560588] blkdev_readpage+0x18/0x20
      [ 374.564783] do_read_cache_page+0x301/0x860
      [ 374.569463] ? blkdev_writepages+0x10/0x10
      [ 374.574037] ? prep_new_page+0x88/0x130
      [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
      [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
      [ 374.588947] read_cache_page+0x12/0x20
      [ 374.593142] read_dev_sector+0x2d/0xd0
      [ 374.597337] read_lba+0x104/0x1f0
      [ 374.601046] find_valid_gpt+0xfa/0x720
      [ 374.605243] ? string_nocheck+0x58/0x70
      [ 374.609534] ? find_valid_gpt+0x720/0x720
      [ 374.614016] efi_partition+0x89/0x430
      [ 374.618113] ? string+0x48/0x60
      [ 374.621632] ? snprintf+0x49/0x70
      [ 374.625339] ? find_valid_gpt+0x720/0x720
      [ 374.629828] check_partition+0x116/0x210
      [ 374.634214] rescan_partitions+0xb6/0x360
      [ 374.638699] __blkdev_reread_part+0x64/0x70
      [ 374.643377] blkdev_reread_part+0x23/0x40
      [ 374.647860] blkdev_ioctl+0x48c/0x990
      [ 374.651956] block_ioctl+0x41/0x50
      [ 374.655766] do_vfs_ioctl+0xa7/0x600
      [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
      [ 374.664832] ksys_ioctl+0x67/0x90
      [ 374.668539] __x64_sys_ioctl+0x1a/0x20
      [ 374.672732] do_syscall_64+0x5a/0x1c0
      [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
      [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.760170] nvmeadm D 0 49141 36333 0x00004080
      [ 374.766301] Call Trace:
      [ 374.769038] __schedule+0x2ef/0x620
      [ 374.772939] schedule+0x38/0xa0
      [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
      [ 374.781614] ? remove_wait_queue+0x60/0x60
      [ 374.786192] blk_mq_freeze_queue+0x1a/0x20
      [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
      [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
      [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
      [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
      [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
      [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
      [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
      [ 374.833114] do_vfs_ioctl+0xa7/0x600
      [ 374.837117] ? __audit_syscall_entry+0xdd/0x130
      [ 374.842184] ksys_ioctl+0x67/0x90
      [ 374.845891] __x64_sys_ioctl+0x1a/0x20
      [ 374.850082] do_syscall_64+0x5a/0x1c0
      [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      --
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      b9156dae
    • Logan Gunthorpe's avatar
      nvme-core: Fix extra device_put() call on error path · 8c36e66f
      Logan Gunthorpe authored
      In the error path for nvme_init_subsystem(), nvme_put_subsystem()
      will call device_put(), but it will get called again after the
      mutex_unlock().
      
      The device_put() only needs to be called if device_add() fails.
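
      The double drop described above can be modelled in user space. The
      following is a minimal Python sketch, not kernel code: the RefCounted
      class and the two error-path functions are hypothetical names invented
      for illustration, and the model only assumes that the object starts
      with a single reference and is released when the count reaches zero.

```python
class RefCounted:
    """Toy stand-in for a refcounted device (hypothetical model)."""

    def __init__(self):
        self.refs = 1          # reference held by the creator
        self.released = False  # set once the last reference is dropped

    def put(self):
        # Dropping a reference on an already released object is the
        # user-space analogue of the use-after-free reported by KASAN.
        if self.released:
            raise RuntimeError("put on released object")
        self.refs -= 1
        if self.refs == 0:
            self.released = True


def error_path_buggy(dev):
    """Model of the faulty path: the reference is dropped twice."""
    dev.put()  # drop performed by the subsystem put helper
    dev.put()  # extra drop after the unlock: this is the bug


def error_path_fixed(dev):
    """Model of the corrected path: the reference is dropped once."""
    dev.put()
```

      In this model, error_path_fixed() leaves the object released exactly
      once, while error_path_buggy() raises on the second put().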
      
      This bug caused a KASAN use-after-free error when adding and
      removing subsystems in a loop:
      
        BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0
        Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329
      
        CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         dump_stack+0x7b/0xb5
         print_address_description+0x6f/0x280
         ? device_del+0x8d9/0x9a0
         __kasan_report+0x148/0x199
         ? device_del+0x8d9/0x9a0
         ? class_release+0x100/0x130
         ? device_del+0x8d9/0x9a0
         kasan_report+0x12/0x20
         __asan_report_load8_noabort+0x14/0x20
         device_del+0x8d9/0x9a0
         ? device_platform_notify+0x70/0x70
         nvme_destroy_subsystem+0xf9/0x150
         nvme_free_ctrl+0x280/0x3a0
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         nvme_free_ns+0xc4/0x100
         nvme_release+0xb3/0xe0
         __blkdev_put+0x549/0x6e0
         ? kasan_check_write+0x14/0x20
         ? bd_set_size+0xb0/0xb0
         ? kasan_check_write+0x14/0x20
         ? mutex_lock+0x8f/0xe0
         ? __mutex_lock_slowpath+0x20/0x20
         ? locks_remove_file+0x239/0x370
         blkdev_put+0x72/0x2c0
         blkdev_close+0x8d/0xd0
         __fput+0x256/0x770
         ? _raw_read_lock_irq+0x40/0x40
         ____fput+0xe/0x10
         task_work_run+0x10c/0x180
         ? filp_close+0xf7/0x140
         exit_to_usermode_loop+0x151/0x170
         do_syscall_64+0x240/0x2e0
         ? prepare_exit_to_usermode+0xd5/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f5a79af05d7
        Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24
        RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
        RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
        RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80
        R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240
        R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0
      
        Allocated by task 1236:
         save_stack+0x21/0x80
         __kasan_kmalloc.constprop.6+0xab/0xe0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x102/0x210
         nvme_init_identify+0x13c3/0x3820
         nvme_loop_configure_admin_queue+0x4fa/0x5e0
         nvme_loop_create_ctrl+0x469/0xf40
         nvmf_dev_write+0x19a3/0x21ab
         __vfs_write+0x66/0x120
         vfs_write+0x154/0x490
         ksys_write+0x104/0x240
         __x64_sys_write+0x73/0xb0
         do_syscall_64+0xa5/0x2e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 329:
         save_stack+0x21/0x80
         __kasan_slab_free+0x129/0x190
         kasan_slab_free+0xe/0x10
         kfree+0xa7/0x200
         nvme_release_subsystem+0x49/0x60
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         klist_class_dev_put+0x31/0x40
         klist_put+0x8f/0xf0
         klist_del+0xe/0x10
         device_del+0x3a7/0x9a0
         nvme_destroy_subsystem+0xf9/0x150
         nvme_free_ctrl+0x280/0x3a0
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         nvme_free_ns+0xc4/0x100
         nvme_release+0xb3/0xe0
         __blkdev_put+0x549/0x6e0
         blkdev_put+0x72/0x2c0
         blkdev_close+0x8d/0xd0
         __fput+0x256/0x770
         ____fput+0xe/0x10
         task_work_run+0x10c/0x180
         exit_to_usermode_loop+0x151/0x170
         do_syscall_64+0x240/0x2e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 32fd90c4 ("nvme: change locking for the per-subsystem controller list")
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      8c36e66f
    • Logan Gunthorpe's avatar
      nvmet-file: fix nvmet_file_flush() always returning an error · cfc1a1af
      Logan Gunthorpe authored
      Presently, nvmet_file_flush() always returns a call to
      errno_to_nvme_status() but that helper doesn't take into account the
      case when errno=0. So nvmet_file_flush() always returns an error code.
      
      All other callers of errno_to_nvme_status() check for success before
      calling it.
      
      To fix this, ensure errno_to_nvme_status() returns success if the
      errno is zero. This should prevent future mistakes like this from
      happening.
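
      The idea of the fix can be illustrated with a small user-space model.
      This is a hedged Python sketch, not the kernel helper: the function
      name errno_to_status and the STATUS_* values are hypothetical and do
      not correspond to real NVMe status codes. It only shows a mapping that
      treats a zero errno as success before translating real errors.

```python
import errno

# Hypothetical status values for illustration only; these are NOT the
# real NVMe status codes.
STATUS_SUCCESS = 0
STATUS_NO_SPACE = 1
STATUS_INTERNAL = 2


def errno_to_status(err: int) -> int:
    """Map a kernel-style result (0 or negative errno) to a status."""
    if err == 0:
        # Success is handled inside the helper, so callers can pass the
        # raw return value without checking it first.
        return STATUS_SUCCESS
    if err == -errno.ENOSPC:
        return STATUS_NO_SPACE
    # Anything else falls back to a generic internal error.
    return STATUS_INTERNAL


def flush(sync_result: int) -> int:
    """Model of a flush that returns the mapped status unconditionally."""
    return errno_to_status(sync_result)
```

      With the zero case handled in the mapping itself, flush(0) reports
      success even though the caller never checks the result beforehand.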
      
      Fixes: c6aa3542 ("nvmet: add error log support for file backend")
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      cfc1a1af
    • Logan Gunthorpe's avatar
      nvmet-loop: Flush nvme_delete_wq when removing the port · 86b9a63e
      Logan Gunthorpe authored
      After calling nvme_loop_delete_ctrl(), the controllers will not
      yet be deleted because nvme_delete_ctrl() only schedules work
      to do the delete.
      
      This means a race can occur if a port is removed but there
      are still active controllers trying to access that memory.
      
      To fix this, flush the nvme_delete_wq before returning from
      nvme_loop_remove_port() so that any controllers that might
      be in the process of being deleted won't access a freed port.
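
      The ordering described above (wait for scheduled deletions, then free
      the port) can be sketched in user space. This is an illustrative
      Python model, assuming a thread pool as a stand-in for the delete
      workqueue; Port, delete_ctrl and remove_port are hypothetical names,
      and concurrent.futures.wait() plays the role of the flush.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class Port:
    """Toy stand-in for a port that controllers still reference."""

    def __init__(self):
        self.freed = False


def delete_ctrl(port, errors):
    """Deferred deletion work that still touches the port."""
    time.sleep(0.01)  # the scheduled work runs some time later
    if port.freed:
        errors.append("controller touched a freed port")


def remove_port(executor, port, n_ctrls):
    """Schedule deletions, wait for them, and only then free the port."""
    errors = []
    pending = [executor.submit(delete_ctrl, port, errors)
               for _ in range(n_ctrls)]
    wait(pending)      # analogue of flushing the delete workqueue
    port.freed = True  # safe: no deletion work is still running
    return errors
```

      Without the wait() call, the port could be marked freed while
      deletion work is still running, which is the race the commit closes.
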
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      86b9a63e
    • Logan Gunthorpe's avatar
      nvmet: Fix use-after-free bug when a port is removed · 3aed8673
      Logan Gunthorpe authored
      When a port is removed through configfs, any connected controllers
      are still active and can still send commands. This causes a
      use-after-free bug which is detected by KASAN for any admin command
      that dereferences req->port (like in nvmet_execute_identify_ctrl).
      
      To fix this, disconnect all active controllers when a subsystem is
      removed from a port. This ensures there are no active controllers
      when the port is eventually removed.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      3aed8673
  10. 31 Jul, 2019 4 commits
    • Denis Efremov's avatar
      MAINTAINERS: floppy: take over maintainership · 3d0b63c5
      Denis Efremov authored
      I would like to maintain the floppy driver. After the recent fixes,
      I think I know the code pretty well. Nowadays I've got 2 physical 3.5"
      readers to test all the changes.
      Signed-off-by: Denis Efremov <efremov@linux.com>
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3d0b63c5
    • Munehisa Kamata's avatar
      nbd: replace kill_bdev() with __invalidate_device() again · 2b5c8f00
      Munehisa Kamata authored
      Commit abbbdf12 ("replace kill_bdev() with __invalidate_device()")
      once did this, but 29eaadc0 ("nbd: stop using the bdev everywhere")
      resurrected kill_bdev() and it has been there since then. So buffer_head
      mappings still get killed on a server disconnection, and we can still
      hit the BUG_ON on a filesystem on top of the nbd device.
      
        EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts: (null)
        block nbd0: Receive control failed (result -32)
        block nbd0: shutting down sockets
        print_req_error: I/O error, dev nbd0, sector 66264 flags 3000
        EXT4-fs warning (device nbd0): htree_dirblock_to_tree:979: inode #2: lblock 0: comm ls: error -5 reading directory block
        print_req_error: I/O error, dev nbd0, sector 2264 flags 3000
        EXT4-fs error (device nbd0): __ext4_get_inode_loc:4690: inode #2: block 283: comm ls: unable to read itable block
        EXT4-fs error (device nbd0) in ext4_reserve_inode_write:5894: IO failure
        ------------[ cut here ]------------
        kernel BUG at fs/buffer.c:3057!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 7 PID: 40045 Comm: jbd2/nbd0-8 Not tainted 5.1.0-rc3+ #4
        Hardware name: Amazon EC2 m5.12xlarge/, BIOS 1.0 10/16/2017
        RIP: 0010:submit_bh_wbc+0x18b/0x190
        ...
        Call Trace:
         jbd2_write_superblock+0xf1/0x230 [jbd2]
         ? account_entity_enqueue+0xc5/0xf0
         jbd2_journal_update_sb_log_tail+0x94/0xe0 [jbd2]
         jbd2_journal_commit_transaction+0x12f/0x1d20 [jbd2]
         ? __switch_to_asm+0x40/0x70
         ...
         ? lock_timer_base+0x67/0x80
         kjournald2+0x121/0x360 [jbd2]
         ? remove_wait_queue+0x60/0x60
         kthread+0xf8/0x130
         ? commit_timeout+0x10/0x10 [jbd2]
         ? kthread_bind+0x10/0x10
         ret_from_fork+0x35/0x40
      
      With __invalidate_device(), I no longer hit the BUG_ON with sync or
      unmount on the disconnected device.
      
      Fixes: 29eaadc0 ("nbd: stop using the bdev everywhere")
      Cc: linux-block@vger.kernel.org
      Cc: Ratna Manoj Bolla <manoj.br@gmail.com>
      Cc: nbd@other.debian.org
      Cc: stable@vger.kernel.org
      Cc: David Woodhouse <dwmw@amazon.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2b5c8f00
    • Miquel Raynal's avatar
      ata: libahci: do not complain in case of deferred probe · 090bb803
      Miquel Raynal authored
      Retrieving PHYs can defer the probe. Do not print an error when
      -EPROBE_DEFER is returned, as it is normal behavior.
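
      The pattern can be illustrated with a short user-space model. This is
      a hedged Python sketch rather than driver code: report_phy_error and
      the log message are hypothetical, and the only kernel fact assumed is
      that EPROBE_DEFER has the value 517.

```python
EPROBE_DEFER = 517  # value of EPROBE_DEFER in the kernel's errno definitions


def report_phy_error(rc, log):
    """Pass rc through, logging only failures that are real errors."""
    if rc < 0 and rc != -EPROBE_DEFER:
        # Hypothetical message text, for illustration only.
        log.append(f"couldn't get PHY: {rc}")
    # A deferred probe is returned silently so it can be retried later.
    return rc
```

      A deferred probe still propagates its return code to the caller; only
      the error message is suppressed.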
      
      Fixes: b1a9edbd ("ata: libahci: allow to use multiple PHYs")
      Reviewed-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      090bb803
    • Jackie Liu's avatar
      io_uring: fix KASAN use after free in io_sq_wq_submit_work · d0ee8791
      Jackie Liu authored
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC reports that:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work frees req and then reads it again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d0ee8791
  11. 30 Jul, 2019 2 commits
    • Jan Kara's avatar
      loop: Fix mount(2) failure due to race with LOOP_SET_FD · 89e524c0
      Jan Kara authored
      Commit 33ec3e53 ("loop: Don't change loop device under exclusive
      opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
      while it updates the loop device binding. However, this can make a
      perfectly valid mount(2) fail with EBUSY because a racing LOOP_SET_FD
      temporarily holds the exclusive bdev reference, in cases like this:
      
      for i in {a..z}{a..z}; do
              dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
              mkfs.ext2 $i.image
              mkdir mnt$i
      done
      
      echo "Run"
      for i in {a..z}{a..z}; do
              mount -o loop -t ext2 $i.image mnt$i &
      done
      
      Fix the problem by not getting a full exclusive bdev reference in
      LOOP_SET_FD and instead just marking the bdev as being claimed while we
      update the binding information. This just blocks new exclusive openers
      instead of failing them with EBUSY, thus fixing the problem.
      
      Fixes: 33ec3e53 ("loop: Don't change loop device under exclusive opener")
      Cc: stable@vger.kernel.org
      Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      89e524c0
    • Anthony Iliopoulos's avatar
      nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns · fab7772b
      Anthony Iliopoulos authored
      When CONFIG_NVME_MULTIPATH is set, only the hidden gendisk associated
      with the per-controller ns is run through revalidate_disk when a
      rescan is triggered, while the visible blockdev never gets its size
      (bdev->bd_inode->i_size) updated to reflect any capacity changes that
      may have occurred.
      
      This prevents online resizing of nvme block devices and, by extension,
      of any filesystems atop them, which are unable to expand while mounted,
      as userspace relies on the blockdev size for obtaining the disk capacity
      (via BLKGETSIZE/64 ioctls).
      
      Fix this by explicitly revalidating the actual namespace gendisk in
      addition to the per-controller gendisk, when multipath is enabled.
      Signed-off-by: Anthony Iliopoulos <ailiopoulos@suse.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      fab7772b
  12. 29 Jul, 2019 2 commits
  13. 28 Jul, 2019 5 commits