1. 26 Jul, 2018 1 commit
  2. 25 Jul, 2018 2 commits
  3. 24 Jul, 2018 4 commits
    • scsi: set timed out out mq requests to complete · 065990bd
      Keith Busch authored
      The scsi block layer requires that requests claimed by error handling be
      completed by the error handler. A previous commit allowed completions
      to proceed for blk-mq, breaking that assumption.
      
      This patch prevents completions that may race with the timeout handler
      by marking the state to complete, restoring the previous behavior.
      
      Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      065990bd
    • blk-mq: export setting request completion state · 0fc09f92
      Keith Busch authored
      This is in preparation for drivers that want to directly alter the state
      of their requests. No functional change here.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0fc09f92
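      As a rough, userspace-only sketch of the mechanism the two commits above
      describe (not the kernel code itself, and with made-up type and function
      names), the idea is that the timeout handler claims a request by
      atomically moving its state from in-flight to complete; a normal
      completion racing with it loses the compare-and-swap and backs off:

      #include <stdatomic.h>
      #include <stdio.h>

      enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

      struct request {
              _Atomic enum rq_state state;
      };

      /* Returns nonzero if the caller won the right to complete the request. */
      static int mark_complete(struct request *rq)
      {
              enum rq_state expected = RQ_IN_FLIGHT;

              return atomic_compare_exchange_strong(&rq->state, &expected,
                                                    RQ_COMPLETE);
      }

      int main(void)
      {
              struct request rq = { .state = RQ_IN_FLIGHT };

              if (mark_complete(&rq))
                      printf("timeout handler claimed the request\n");
              if (!mark_complete(&rq))
                      printf("racing completion path backed off\n");
              return 0;
      }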
    • nvme: if_ready checks to fail io to deleting controller · 6cdefc6e
      James Smart authored
      The revised if_ready checks skipped the case of returning an error when
      the controller is being deleted. Instead they returned BUSY, which
      caused the I/Os to be retried, which in turn caused the ns delete to
      hang waiting for the I/Os to drain.
      
      Stack trace of hang looks like:
       kworker/u64:2   D    0    74      2 0x80000000
       Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
       Call Trace:
        ? __schedule+0x26d/0x820
        schedule+0x32/0x80
        blk_mq_freeze_queue_wait+0x36/0x80
        ? remove_wait_queue+0x60/0x60
        blk_cleanup_queue+0x72/0x160
        nvme_ns_remove+0x106/0x140 [nvme_core]
        nvme_remove_namespaces+0x7e/0xa0 [nvme_core]
        nvme_delete_ctrl_work+0x4d/0x80 [nvme_core]
        process_one_work+0x160/0x350
        worker_thread+0x1c3/0x3d0
        kthread+0xf5/0x130
        ? process_one_work+0x350/0x350
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x1f/0x30
      
      Extend nvmf_fail_nonready_command() to take the controller pointer so
      that the controller state can be examined. Fail any I/O to a controller
      that is being deleted.
      
      Fixes: 3bc32bb1 ("nvme-fabrics: refactor queue ready check")
      Fixes: 35897b92 ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready")
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Ewan D. Milne <emilne@redhat.com>
      Reviewed-by: Ewan D. Milne <emilne@redhat.com>
      6cdefc6e
    • nvmet-fc: fix target sgl list on large transfers · d082dc15
      James Smart authored
      The existing code that carves up the sg list expected one sg element per
      page, which can be very wrong when an IOMMU remaps multiple memory pages
      to fewer bus addresses. Hitting this error required a large I/O payload
      (greater than 256k) and a system that maps on a per-page basis. It's
      possible that large I/Os could get by fine if the system condensed the
      sgl list into the first 64 elements.
      
      This patch corrects the sg list handling by specifically walking the
      sg list element by element and attempting to divide the transfer up
      on a per-sg element boundary. While doing so, it still tries to keep
      sequences under 256k, but will exceed that rule if a single sg element
      is larger than 256k.
      
      Fixes: 48fa362b ("nvmet-fc: simplify sg list handling")
      Cc: <stable@vger.kernel.org> # 4.14
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      d082dc15
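      As an illustrative userspace sketch of the splitting rule described above
      (not the nvmet-fc code; struct sg_elem, carve_sequences() and the sample
      lengths are invented), the walk goes element by element, keeps each
      sequence at or under 256k where possible, and lets a single oversized
      element form a sequence of its own rather than splitting it:

      #include <stdio.h>
      #include <stddef.h>

      #define SEQ_LIMIT (256 * 1024)

      struct sg_elem {
              size_t length;          /* bus-address extent of this element */
      };

      static void carve_sequences(const struct sg_elem *sg, size_t nents)
      {
              size_t i = 0;

              while (i < nents) {
                      size_t bytes = 0, first = i;

                      /* Take at least one element, even if it exceeds the limit. */
                      do {
                              bytes += sg[i++].length;
                      } while (i < nents && bytes + sg[i].length <= SEQ_LIMIT);

                      printf("sequence: elements %zu..%zu, %zu bytes\n",
                             first, i - 1, bytes);
              }
      }

      int main(void)
      {
              /* An IOMMU-coalesced list: one large element, then small ones. */
              struct sg_elem sg[] = { { 512 * 1024 }, { 64 * 1024 }, { 64 * 1024 } };

              carve_sequences(sg, sizeof(sg) / sizeof(sg[0]));
              return 0;
      }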
  4. 16 Jul, 2018 2 commits
    • nbd: handle unexpected replies better · 8f3ea359
      Josef Bacik authored
      If the server or network is misbehaving and we get an unexpected reply,
      we can sometimes miss that the request was never started, wait on a
      request and never get a response, or even double-complete the same
      request.  Fix this by replacing the send_complete completion with just a
      per-command lock.  Add a per-command cookie as well so that we can tell
      if we're getting a double completion for a previous event.  Also check
      that we don't have REQUEUED set, as that means we raced with the
      timeout handler and need to just let the retry occur.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8f3ea359
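      A minimal sketch of the cookie idea, assuming a much simplified command
      structure (none of these names are nbd's): every (re)send stores a fresh
      cookie in the command, the server echoes it back, and a reply whose
      cookie does not match the one currently stored is treated as a stale or
      duplicate completion and rejected:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct cmd {
              uint64_t cookie;        /* bumped on every (re)send */
              bool in_flight;
      };

      static uint64_t send_cmd(struct cmd *c, uint64_t *next_cookie)
      {
              c->cookie = (*next_cookie)++;
              c->in_flight = true;
              return c->cookie;       /* echoed back by the server in its reply */
      }

      static bool complete_cmd(struct cmd *c, uint64_t reply_cookie)
      {
              if (!c->in_flight || c->cookie != reply_cookie)
                      return false;   /* unexpected or double completion */
              c->in_flight = false;
              return true;
      }

      int main(void)
      {
              uint64_t next_cookie = 1;
              struct cmd c = { 0 };
              uint64_t tag = send_cmd(&c, &next_cookie);

              printf("first reply accepted: %d\n", complete_cmd(&c, tag));
              printf("duplicate reply accepted: %d\n", complete_cmd(&c, tag));
              return 0;
      }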
    • nbd: don't requeue the same request twice. · d7d94d48
      Josef Bacik authored
      We can race between the send timeout and the per-request timeout and end
      up requeuing the same request twice.  We can't use the send_complete
      completion to tell whether everything is ok, because we hold the tx_lock
      during send, so the timeout path will block waiting to mark the socket
      dead, and we could be marked complete and still requeue.  Instead add a
      flag to the socket so we know whether we've been requeued yet.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d7d94d48
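      A small sketch of the once-only guard this describes (illustrative names,
      not the nbd code): an atomic test-and-set flag ensures that only the
      first of the two racing paths actually performs the requeue:

      #include <stdatomic.h>
      #include <stdio.h>

      struct cmd {
              atomic_flag requeued;
      };

      static int requeue_once(struct cmd *c)
      {
              if (atomic_flag_test_and_set(&c->requeued))
                      return 0;       /* the other path already requeued it */
              /* ... the actual requeue of the request would happen here ... */
              return 1;
      }

      int main(void)
      {
              struct cmd c = { ATOMIC_FLAG_INIT };

              printf("first path requeues: %d\n", requeue_once(&c));
              printf("second path requeues: %d\n", requeue_once(&c));
              return 0;
      }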
  5. 11 Jul, 2018 1 commit
  6. 03 Jul, 2018 1 commit
  7. 02 Jul, 2018 1 commit
    • drbd: fix access after free · 64dafbc9
      Lars Ellenberg authored
      We have
        struct drbd_requests { ... struct bio *private_bio;  ... }
      to hold a bio clone for local submission.
      
      On local IO completion, we put that bio, and in case we want to use the
      result later, we overload that member to hold the ERR_PTR() of the
      completion result, which, before v4.3, used to be the passed-in
      "int error", so we could first bio_put(), then assign.
      
      v4.3-rc1~100^2~21 4246a0b6 block: add a bi_error field to struct bio
      changed that:
        	bio_put(req->private_bio);
       -	req->private_bio = ERR_PTR(error);
       +	req->private_bio = ERR_PTR(bio->bi_error);
      
      This introduces an access after free, because it was not obvious that
      req->private_bio == bio.
      
      The impact of that was mostly unnoticeable, because we only use that
      value in a multiple-failure case, and even then map any "unexpected"
      error code to EIO, so in the worst case we could potentially mask a more
      specific error with EIO in a multiple-failure case.
      
      Unless the pointed to memory region was unmapped, as is the case with
      CONFIG_DEBUG_PAGEALLOC, in which case this results in
      
        BUG: unable to handle kernel paging request
      
      v4.13-rc1~70^2~75 4e4cbee9 block: switch bios to blk_status_t
      changes it further to
        	bio_put(req->private_bio);
        	req->private_bio = ERR_PTR(blk_status_to_errno(bio->bi_status));
      
      And blk_status_to_errno() now contains a WARN_ON_ONCE() for unexpected
      values, which catches this "sometimes", if the memory has been reused
      quickly enough for other things.
      
      Should also go into stable since 4.3, with the trivial change around 4.13.
      
      Cc: stable@vger.kernel.org
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Reported-by: Sarah Newman <srn@prgmr.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      64dafbc9
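      The fix boils down to an ordering rule: read the completion status out of
      the bio before the bio_put() that may free it, and only then store the
      ERR_PTR(). A tiny standalone illustration of that ordering, with made-up
      structures standing in for struct bio and the drbd request:

      #include <stdio.h>
      #include <stdlib.h>

      struct clone {
              int status;
              int refcount;
      };

      static void put_clone(struct clone *c)
      {
              if (--c->refcount == 0)
                      free(c);        /* after this, c must not be touched */
      }

      int main(void)
      {
              struct clone *c = malloc(sizeof(*c));
              long saved;

              c->refcount = 1;
              c->status = -5;         /* pretend: an error from the lower layer */

              /* Buggy order: put_clone(c); saved = c->status;  (use after free) */
              saved = c->status;      /* fixed order: read the result first ... */
              put_clone(c);           /* ... then drop the reference */

              printf("saved completion status: %ld\n", saved);
              return 0;
      }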
  8. 29 Jun, 2018 4 commits
  9. 28 Jun, 2018 2 commits
  10. 26 Jun, 2018 1 commit
  11. 23 Jun, 2018 1 commit
  12. 22 Jun, 2018 3 commits
    • bdi: Fix another oops in wb_workfn() · 3ee7e869
      Jan Kara authored
      syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
      wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
      WB_shutting_down after wb->bdi->dev became NULL. This indicates that
      bdi_unregister() failed to call wb_shutdown() on one of the wb objects.
      
      The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
      drops bdi's reference to wb structures before going through the list of
      wbs again and calling wb_shutdown() on each of them. This way the loop
      iterating through all wbs can easily miss a wb if that wb has already
      passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
      from cgwb_release_workfn(), and as a result fully shut down the bdi
      although wb_workfn() for this wb structure is still running. In fact there are
      also other ways cgwb_bdi_unregister() can race with
      cgwb_release_workfn() leading e.g. to use-after-free issues:
      
      CPU1                            CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);
      
      cgwb_release()
        queue_work(cgwb_release_wq, &wb->release_work);
      cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
        wb_shutdown(wb);
        ...
        kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free
      
      We solve these issues by synchronizing writeback structure shutdown from
      cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
      way we also no longer need synchronization using WB_shutting_down, as the
      mutex provides it for the CONFIG_CGROUP_WRITEBACK case, and without
      CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once, from
      bdi_unregister().
      Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3ee7e869
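      A simplified model of the synchronization added, with pthreads and a
      cut-down stand-in for wb_shutdown() rather than the kernel's per-bdi
      mutex and wb list handling: both shutdown callers serialize on a single
      mutex, so the second caller finds the writeback structure already shut
      down and does nothing:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t shutdown_lock = PTHREAD_MUTEX_INITIALIZER;

      struct wb {
              bool on_list;
              bool shut_down;
      };

      static void wb_shutdown(struct wb *wb)
      {
              pthread_mutex_lock(&shutdown_lock);
              if (!wb->shut_down) {           /* only the first caller does the work */
                      wb->shut_down = true;
                      wb->on_list = false;    /* the remove-from-bdi-list equivalent */
                      printf("wb %p shut down\n", (void *)wb);
              }
              pthread_mutex_unlock(&shutdown_lock);
      }

      int main(void)
      {
              struct wb wb = { .on_list = true };

              wb_shutdown(&wb);       /* e.g. the unregister path */
              wb_shutdown(&wb);       /* e.g. the release work, now harmless */
              return 0;
      }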
    • lightnvm: Remove depends on HAS_DMA in case of platform dependency · 0ae52ddf
      Geert Uytterhoeven authored
      Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
      symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
      In most cases this other symbol is an architecture or platform specific
      symbol, or PCI.
      
      Generic symbols and drivers without platform dependencies keep their
      dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
      cannot work anyway.
      
      This simplifies the dependencies and allows compile-testing to be improved.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Mark Brown <broonie@kernel.org>
      Acked-by: Robin Murphy <robin.murphy@arm.com>
      Reviewed-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0ae52ddf
    • Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-linus · f9da9d07
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "Various relatively small fixes, mostly to fix error handling of various
       sorts."
      
      * 'nvme-4.18' of git://git.infradead.org/nvme:
        nvme-pci: limit max IO size and segments to avoid high order allocations
        nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
        nvme-fc: release io queues to allow fast fail
        nvmet: reset keep alive timer in controller enable
        nvme-rdma: don't override opts->queue_size
        nvme-rdma: Fix command completion race at error recovery
        nvme-rdma: fix possible free of a non-allocated async event buffer
        nvme-rdma: fix possible double free condition when failing to create a controller
      f9da9d07
  13. 21 Jun, 2018 4 commits
  14. 20 Jun, 2018 7 commits
  15. 19 Jun, 2018 2 commits
    • Revert "block: Add warning for bi_next not NULL in bio_endio()" · 9c24c10a
      Bart Van Assche authored
      Commit 0ba99ca4 ("block: Add warning for bi_next not NULL in
      bio_endio()") breaks the dm driver. end_clone_bio() detects whether
      or not a bio is the last bio associated with a request by checking
      the .bi_next field. Commit 0ba99ca4 clears that field before
      end_clone_bio() has had a chance to inspect that field. Hence revert
      commit 0ba99ca4.
      
      This patch avoids KASAN reporting the following complaint when running
      the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):
      
      ==================================================================
      BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
      Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9
      
      CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       dump_stack+0xa4/0xf5
       print_address_description+0x6f/0x270
       kasan_report+0x241/0x360
       __asan_load4+0x78/0x80
       bio_advance+0x11b/0x1d0
       blk_update_request+0xa7/0x5b0
       scsi_end_request+0x56/0x320 [scsi_mod]
       scsi_io_completion+0x7d6/0xb20 [scsi_mod]
       scsi_finish_command+0x1c0/0x280 [scsi_mod]
       scsi_softirq_done+0x19a/0x230 [scsi_mod]
       blk_mq_complete_request+0x160/0x240
       scsi_mq_done+0x50/0x1a0 [scsi_mod]
       srp_recv_done+0x515/0x1330 [ib_srp]
       __ib_process_cq+0xa0/0xf0 [ib_core]
       ib_poll_handler+0x38/0xa0 [ib_core]
       irq_poll_softirq+0xe8/0x1f0
       __do_softirq+0x128/0x60d
       run_ksoftirqd+0x3f/0x60
       smpboot_thread_fn+0x352/0x460
       kthread+0x1c1/0x1e0
       ret_from_fork+0x24/0x30
      
      Allocated by task 1918:
       save_stack+0x43/0xd0
       kasan_kmalloc+0xad/0xe0
       kasan_slab_alloc+0x11/0x20
       kmem_cache_alloc+0xfe/0x350
       mempool_alloc_slab+0x15/0x20
       mempool_alloc+0xfb/0x270
       bio_alloc_bioset+0x244/0x350
       submit_bh_wbc+0x9c/0x2f0
       __block_write_full_page+0x299/0x5a0
       block_write_full_page+0x16b/0x180
       blkdev_writepage+0x18/0x20
       __writepage+0x42/0x80
       write_cache_pages+0x376/0x8a0
       generic_writepages+0xbe/0x110
       blkdev_writepages+0xe/0x10
       do_writepages+0x9b/0x180
       __filemap_fdatawrite_range+0x178/0x1c0
       file_write_and_wait_range+0x59/0xc0
       blkdev_fsync+0x46/0x80
       vfs_fsync_range+0x66/0x100
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x21/0x30
       do_syscall_64+0x77/0x230
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 9:
       save_stack+0x43/0xd0
       __kasan_slab_free+0x137/0x190
       kasan_slab_free+0xe/0x10
       kmem_cache_free+0xd3/0x380
       mempool_free_slab+0x17/0x20
       mempool_free+0x63/0x160
       bio_free+0x81/0xa0
       bio_put+0x59/0x60
       end_bio_bh_io_sync+0x5d/0x70
       bio_endio+0x1a7/0x360
       blk_update_request+0xd0/0x5b0
       end_clone_bio+0xa3/0xd0 [dm_mod]
       bio_endio+0x1a7/0x360
       blk_update_request+0xd0/0x5b0
       scsi_end_request+0x56/0x320 [scsi_mod]
       scsi_io_completion+0x7d6/0xb20 [scsi_mod]
       scsi_finish_command+0x1c0/0x280 [scsi_mod]
       scsi_softirq_done+0x19a/0x230 [scsi_mod]
       blk_mq_complete_request+0x160/0x240
       scsi_mq_done+0x50/0x1a0 [scsi_mod]
       srp_recv_done+0x515/0x1330 [ib_srp]
       __ib_process_cq+0xa0/0xf0 [ib_core]
       ib_poll_handler+0x38/0xa0 [ib_core]
       irq_poll_softirq+0xe8/0x1f0
       __do_softirq+0x128/0x60d
      
      The buggy address belongs to the object at ffff8801300e0640
       which belongs to the cache bio-0 of size 200
      The buggy address is located 144 bytes inside of
       200-byte region [ffff8801300e0640, ffff8801300e0708)
      The buggy address belongs to the page:
      page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
      flags: 0x8000000000008100(slab|head)
      raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
      raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
       ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      >ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
       ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: 0ba99ca4 ("block: Add warning for bi_next not NULL in bio_endio()")
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9c24c10a
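      A standalone illustration of why the reverted commit broke dm (the
      structures here are invented for the example): end_clone_bio() infers
      "this was the last bio of the request" from ->bi_next being NULL, so
      clearing that field before the end_io callback runs makes every bio look
      like the last one:

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      struct bio {
              struct bio *bi_next;
      };

      static void end_clone_bio(struct bio *bio)
      {
              bool is_last = bio->bi_next == NULL;    /* dm's assumption */

              printf("bio %p: %s\n", (void *)bio,
                     is_last ? "last bio, complete the request"
                             : "more bios follow");
      }

      static void bio_endio(struct bio *bio, bool clear_next_first)
      {
              if (clear_next_first)
                      bio->bi_next = NULL;    /* what the reverted change did */
              end_clone_bio(bio);
      }

      int main(void)
      {
              struct bio b2 = { .bi_next = NULL };
              struct bio b1 = { .bi_next = &b2 };

              bio_endio(&b1, false);  /* correct: b1 is not the last bio */
              bio_endio(&b1, true);   /* broken: b1 now looks like the last bio */
              return 0;
      }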
    • block: fix timeout changes for legacy request drivers · 0cc61e64
      Christoph Hellwig authored
      blk_mq_complete_request can only be called for blk-mq drivers, but when
      removing the BLK_EH_HANDLED return value, two legacy request timeout
      methods incorrectly got switched to call blk_mq_complete_request.
      Call __blk_complete_request instead to reinstate the previous behavior.
      For that, __blk_complete_request needs to be exported.
      
      Fixes: 1fc2b62e ("scsi_transport_fc: complete requests from ->timeout")
      Fixes: 0df0bb08 ("null_blk: complete requests from ->timeout")
      Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0cc61e64
  16. 15 Jun, 2018 4 commits
    • bsg: fix race of bsg_open and bsg_unregister · d6c73964
      Anatoliy Glagolev authored
      The existing implementation allows races between the bsg_unregister and
      bsg_open paths: bsg_unregister and the request_queue cleanup and deletion
      may start and complete right after bsg_get_device (in the bsg_open path)
      retrieves the bsg_class_device and releases the mutex. The bsg_open path
      then touches freed memory of the bsg_class_device and request_queue.
      
      One possible fix is to hold the mutex all the way through bsg_get_device
      instead of releasing it after bsg_class_device retrieval.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anatoliy Glagolev <glagolig@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d6c73964
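      A rough userspace model of the fix (pthreads and made-up names rather
      than the bsg code): the lookup and the reference grab happen inside one
      critical section, so an unregister running under the same mutex can no
      longer free the device in the window between the two:

      #include <pthread.h>
      #include <stddef.h>
      #include <stdio.h>

      static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

      struct bsg_dev {
              int refcount;
              int registered;
      };

      static struct bsg_dev *table[1];

      /* Open path: look up and take a reference under one critical section. */
      static struct bsg_dev *bsg_get(int minor)
      {
              struct bsg_dev *bd;

              pthread_mutex_lock(&table_lock);
              bd = table[minor];
              if (bd && bd->registered)
                      bd->refcount++;
              else
                      bd = NULL;
              pthread_mutex_unlock(&table_lock);
              return bd;
      }

      /* Unregister path: the same lock makes it mutually exclusive with opens. */
      static void bsg_unregister(int minor)
      {
              pthread_mutex_lock(&table_lock);
              if (table[minor])
                      table[minor]->registered = 0;
              table[minor] = NULL;
              pthread_mutex_unlock(&table_lock);
      }

      int main(void)
      {
              struct bsg_dev dev = { .refcount = 0, .registered = 1 };

              table[0] = &dev;
              printf("open before unregister: %p\n", (void *)bsg_get(0));
              bsg_unregister(0);
              printf("open after unregister:  %p\n", (void *)bsg_get(0));
              return 0;
      }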
    • block: remov blk_queue_invalidate_tags · be7f99c5
      Christoph Hellwig authored
      This function is entirely unused, so remove it and the tag_queue_busy
      member of struct request_queue.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be7f99c5
    • Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-linus · 95c7c09f
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "Fix various little regressions introduced in this merge window, plus
       a rework of the fibre channel connect and reconnect path to share the
       code instead of having separate sets of bugs. Last but not least a
       trivial trace point addition from Hannes."
      
      * 'nvme-4.18' of git://git.infradead.org/nvme:
        nvme-fabrics: fix and refine state checks in __nvmf_check_ready
        nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
        nvme-fabrics: refactor queue ready check
        blk-mq: remove blk_mq_tagset_iter
        nvme: remove nvme_reinit_tagset
        nvme-fc: fix nulling of queue data on reconnect
        nvme-fc: remove reinit_request routine
        nvme-fc: change controllers first connect to use reconnect path
        nvme: don't rely on the changed namespace list log
        nvmet: free smart-log buffer after use
        nvme-rdma: fix error flow during mapping request data
        nvme: add bio remapping tracepoint
        nvme: fix NULL pointer dereference in nvme_init_subsystem
      95c7c09f
    • nvme-fabrics: fix and refine state checks in __nvmf_check_ready · 35897b92
      Christoph Hellwig authored
       - make sure we only allow internally generated commands in any non-live
         state
       - only allow connect commands on non-live queues when actually in the
         new or connecting states
       - treat all other non-live, non-dead states the same as a default
         catch-all
      
      This fixes a regression where we could not shut down a controller in an
      orderly fashion, as we didn't allow the internally generated Property Set
      command, and also ensures we don't accidentally let a Connect command
      through in the wrong state.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      35897b92
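      A hedged sketch of the decision the refined check makes, using invented
      enums and a simplified helper rather than the real __nvmf_check_ready()
      and nvmf_fail_nonready_command(); it also folds in the "fail I/O to a
      deleting controller" rule from the if_ready fix earlier in this log:

      #include <stdbool.h>
      #include <stdio.h>

      enum ctrl_state { CTRL_NEW, CTRL_CONNECTING, CTRL_LIVE, CTRL_DELETING };
      enum verdict { PROCEED, REQUEUE, FAIL };

      static enum verdict check_ready(enum ctrl_state state, bool queue_live,
                                      bool internal_cmd, bool connect_cmd)
      {
              if (queue_live)
                      return PROCEED;         /* live queue: nothing to check */
              if (state == CTRL_DELETING)
                      return FAIL;            /* don't let retries stall deletion */
              if (!internal_cmd)
                      return REQUEUE;         /* user I/O waits for a live queue */
              if (connect_cmd)
                      return (state == CTRL_NEW || state == CTRL_CONNECTING) ?
                             PROCEED : REQUEUE;   /* Connect only in new/connecting */
              return PROCEED;                 /* other internally generated commands */
      }

      int main(void)
      {
              printf("connect while connecting: %d\n",
                     check_ready(CTRL_CONNECTING, false, true, true));
              printf("user io while deleting:   %d\n",
                     check_ready(CTRL_DELETING, false, false, false));
              return 0;
      }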