1. 13 May, 2020 4 commits
    • block: re-organize fields of 'struct hd_part' · 520138c3
      Ming Lei authored
      Put all fields accessed in the IO path together at the beginning
      of the struct, so that they can all be fetched in a single cacheline.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Yufen Yu <yuyufen@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
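      A minimal sketch of the layout idea, not the actual struct hd_part:
      the struct and its field names are hypothetical, and a 64-byte cache
      line is assumed. Fields read on every IO are declared first, and a
      compile-time check confirms that they end within the first cache line.
      This only helps when the struct itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, common on x86-64 but not universal. */
#define CACHELINE_BYTES 64

/* Hypothetical stand-in for a partition descriptor; not the kernel layout. */
struct part_sketch {
	/* Hot fields: read on every IO, so they come first. */
	uint64_t start_sect;
	uint64_t nr_sects;
	void *stats;
	int partno;

	/* Cold fields: only touched on setup, teardown, or sysfs access. */
	char info[128];
	int policy;
};

/* Fail the build if the hot fields spill past the first cache line. */
_Static_assert(offsetof(struct part_sketch, partno) + sizeof(int)
	       <= CACHELINE_BYTES,
	       "hot fields must fit in one cache line");

/* Runtime view of the same property: 1 if the cold part starts in-line. */
static int hot_fields_fit(void)
{
	return offsetof(struct part_sketch, info) <= CACHELINE_BYTES;
}
```

      Keeping the check next to the struct means a later field addition that
      pushes a hot field out of the first line fails at build time.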
    • block: only define 'nr_sects_seq' in hd_part for 32bit SMP · 07c4e1e8
      Ming Lei authored
      The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP,
      so define it just for 32bit SMP.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Yufen Yu <yuyufen@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix use-after-free on cached last_lookup partition · b7d6c303
      Ming Lei authored
      delete_partition() clears the cached last_lookup partition. However, the
      .last_lookup cache may be overwritten by one IO path after
      delete_partition() has cleared it. Another IO path may then use the
      cached partition, which is being deleted, after hd_struct_free() is
      called, triggering a use-after-free on the cached partition.

      Fix the issue with the following approach:

      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup

      2) move clearing .last_lookup from delete_partition() to hd_struct_free(),
      which is the release handler of the partition's percpu-refcount, so that
      no IO path can cache a partition being deleted via .last_lookup.

      This is an alternative to Yufen's patch[1], which adds overhead in the
      fast path through an indirect lookup that may introduce one extra
      cacheline in the IO path. This patch instead relies on percpu-refcount's
      protection, and it is easier to understand and verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t
      Reported-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
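      The two steps above can be sketched in userspace C. This is a
      single-threaded model under stated assumptions, not the kernel code:
      a plain counter stands in for the percpu-refcount, and every name
      below (part_sketch, part_try_get, cache_lookup, and so on) is
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a plain int models the percpu-refcount. */
struct part_sketch {
	int refs;	/* starts at 1: the initial reference */
	bool dying;	/* set once deletion has started */
};

struct part_table_sketch {
	struct part_sketch *last_lookup;	/* one-entry lookup cache */
};

/* Take a reference unless teardown has already begun. */
static bool part_try_get(struct part_sketch *p)
{
	if (p->dying)
		return false;
	p->refs++;
	return true;
}

/*
 * Drop a reference. The final put of a dying partition acts as the
 * release handler and is the only place that clears the cache.
 * Returns true when the partition was released.
 */
static bool part_put(struct part_sketch *p, struct part_table_sketch *t)
{
	assert(p->refs > 0);
	if (--p->refs == 0 && p->dying) {
		if (t->last_lookup == p)
			t->last_lookup = NULL;
		return true;
	}
	return false;
}

/* Step 1: only cache a partition we managed to take a reference on. */
static struct part_sketch *cache_lookup(struct part_table_sketch *t,
					struct part_sketch *p)
{
	if (!part_try_get(p))
		return NULL;	/* being deleted: never cache it */
	t->last_lookup = p;
	return p;		/* caller drops the reference with part_put() */
}

/* Start deletion: mark the partition dying and drop the initial ref. */
static void part_kill(struct part_sketch *p, struct part_table_sketch *t)
{
	p->dying = true;
	part_put(p, t);
}
```

      In this model a lookup that races with deletion either gets a
      reference, which delays the release, or fails and leaves the cache
      untouched, so the cache never points at a released partition.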
    • block: reset mapping if failed to update hardware queue count · aa880ad6
      Weiping Zhang authored
      When we increase the hardware queue count, blk_mq_update_queue_map will
      reset the mapping between CPUs and hardware queues based on the hardware
      queue count (set->nr_hw_queues). The mapping is not reset if
      blk_mq_realloc_hw_ctxs encounters an error, but the fallback flow
      continues to use it, so blk_mq_map_swqueue touches invalid memory
      because the mapping points to a wrong hctx.
      
      blktest block/030:
      
      null_blk: module loaded
      Increasing nr_hw_queues to 8 fails, fallback to 1
      ==================================================================
      BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
      Read of size 8 at addr 0000000000000128 by task nproc/8541
      
      CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
      Call Trace:
      dump_stack+0xa5/0xe6
      __kasan_report.cold+0x65/0xbb
      kasan_report+0x45/0x60
      check_memory_region+0x15e/0x1c0
      __kasan_check_read+0x15/0x20
      blk_mq_map_swqueue+0x2f2/0x830
      __blk_mq_update_nr_hw_queues+0x3df/0x690
      blk_mq_update_nr_hw_queues+0x32/0x50
      nullb_device_submit_queues_store+0xde/0x160 [null_blk]
      configfs_write_file+0x1c4/0x250 [configfs]
      __vfs_write+0x4c/0x90
      vfs_write+0x14b/0x2d0
      ksys_write+0xdd/0x180
      __x64_sys_write+0x47/0x50
      do_syscall_64+0x6f/0x310
      entry_SYSCALL_64_after_hwframe+0x49/0xb3
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: Bart van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
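      A minimal sketch of the failure mode and the fix, assuming a simple
      modulo mapping and hypothetical names (tag_set_sketch, build_map,
      update_nr_hw_queues); it is not the blk-mq implementation. The point
      is that the fallback path restores the old queue count and also
      rebuilds the CPU-to-queue map, so no entry refers to a queue that was
      never allocated.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for the model */

struct tag_set_sketch {
	unsigned int nr_hw_queues;
	unsigned int cpu_to_queue[NR_CPUS_SKETCH];
};

/* Rebuild the map from the current queue count (round-robin assumption). */
static void build_map(struct tag_set_sketch *set)
{
	assert(set->nr_hw_queues > 0);
	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		set->cpu_to_queue[cpu] = cpu % set->nr_hw_queues;
}

/* Every CPU must point at a queue index that actually exists. */
static bool map_is_valid(const struct tag_set_sketch *set)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (set->cpu_to_queue[cpu] >= set->nr_hw_queues)
			return false;
	return true;
}

/* realloc_ok simulates whether the hardware contexts could be reallocated. */
static bool update_nr_hw_queues(struct tag_set_sketch *set,
				unsigned int nr, bool realloc_ok)
{
	unsigned int prev = set->nr_hw_queues;

	set->nr_hw_queues = nr;
	build_map(set);
	if (realloc_ok)
		return true;

	/* Fallback: restore the old count AND reset the mapping. */
	set->nr_hw_queues = prev;
	build_map(set);
	return false;
}
```

      Dropping the second build_map() call reproduces the bug in the model:
      the count falls back to the old value while the map still holds
      indices up to the failed new count.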
  2. 11 May, 2020 1 commit
  3. 09 May, 2020 16 commits
    • hfs: stop using ioctl_by_bdev · af00423a
      Christoph Hellwig authored
      Instead just call the CDROM layer functionality directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: remove the name field in struct backing_dev_info · 1cd925d5
      Christoph Hellwig authored
      The name is only printed for an unregistered bdi in writeback.  Use the
      device name there, as it is more useful anyway for the unlikely case
      that the warning triggers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: simplify bdi_alloc · aef33c2f
      Christoph Hellwig authored
      Merge the _node and normal versions and drop the superfluous gfp_t
      argument.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: remove bdi_register_owner · 3c5d202b
      Christoph Hellwig authored
      Split out a new bdi_set_owner helper to set the owner, and move the policy
      for creating the bdi name back into genhd.c, where it belongs.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: unexport bdi_register_va · a5a6c66d
      Christoph Hellwig authored
      bdi_register_va is only used by super.c, which can't be modular.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • driver core: remove device_create_vargs · 4c747466
      Christoph Hellwig authored
      All external users of device_create_vargs are gone, so remove it and
      open code it in the only caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: rename blk_mq_alloc_rq_maps · 79fab528
      Weiping Zhang authored
      Rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests.
      The function allocates both the map and the requests, so make the
      function name match what it does.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: rename __blk_mq_alloc_rq_map · 03b63b02
      Weiping Zhang authored
      Rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request.
      It actually allocates both the map and the requests, so make the
      function name match what it does.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: alloc map and request for new hardware queue · fd689871
      Ming Lei authored
      Allocate a new map and requests for each new hardware queue when
      increasing the hardware queue count. Before this patch, a warning was
      shown for each new hardware queue, but that is not enough: these
      hctxs have no maps or requests, so when a bio is mapped to one of these
      hardware queues, getting a request from the hctx triggers a kernel
      panic.

      Test environment:
       * An NVMe disk that supports 128 io queues
       * 96 cpus in the system

      A corner case can always trigger this panic. There are 96
      io queues allocated for the HCTX_TYPE_DEFAULT type; the corresponding
      kernel log is: nvme nvme0: 96/0/0 default/read/poll queues. Now we set
      nvme write queues to 96, so nvme allocates the other (32) queues for
      read, but blk_mq_update_nr_hw_queues does not allocate maps and requests
      for these newly added io queues. So when a process reads the nvme disk,
      getting a request from these hardware contexts triggers a kernel panic.
      
      Reproduce script:
      
      nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
      echo $nr > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
      
      [ 8040.805626] ------------[ cut here ]------------
      [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfnetlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      [ 8040.805637]  ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
      [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
      [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
      [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
      [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
      [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
      [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
      [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
      [ 8040.805645] FS:  0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
      [ 8040.805646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
      [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8040.805647] PKRU: 55555554
      [ 8040.805647] Call Trace:
      [ 8040.805649]  blk_mq_update_nr_hw_queues+0x31b/0x390
      [ 8040.805650]  nvme_reset_work+0xb4b/0xeab [nvme]
      [ 8040.805651]  process_one_work+0x1a7/0x370
      [ 8040.805652]  worker_thread+0x1c9/0x380
      [ 8040.805653]  ? max_active_store+0x80/0x80
      [ 8040.805655]  kthread+0x112/0x130
      [ 8040.805656]  ? __kthread_parkme+0x70/0x70
      [ 8040.805657]  ret_from_fork+0x35/0x40
      [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
      [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
      [ 8229.365165] #PF: supervisor read access in kernel mode
      [ 8229.365178] #PF: error_code(0x0000) - not-present page
      [ 8229.365191] PGD 0 P4D 0
      [ 8229.365201] Oops: 0000 [#1] SMP PTI
      [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
      [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
      [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
      [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
      [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
      [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
      [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
      [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
      [ 8229.365397] FS:  00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
      [ 8229.365415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
      [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8229.365476] PKRU: 55555554
      [ 8229.365484] Call Trace:
      [ 8229.365498]  ? finish_wait+0x80/0x80
      [ 8229.365512]  blk_mq_get_request+0xcb/0x3f0
      [ 8229.365525]  blk_mq_make_request+0x143/0x5d0
      [ 8229.365538]  generic_make_request+0xcf/0x310
      [ 8229.365553]  ? scan_shadow_nodes+0x30/0x30
      [ 8229.365564]  submit_bio+0x3c/0x150
      [ 8229.365576]  mpage_readpages+0x163/0x1a0
      [ 8229.365588]  ? blkdev_direct_IO+0x490/0x490
      [ 8229.365601]  read_pages+0x6b/0x190
      [ 8229.365612]  __do_page_cache_readahead+0x1c1/0x1e0
      [ 8229.365626]  ondemand_readahead+0x182/0x2f0
      [ 8229.365639]  generic_file_buffered_read+0x590/0xab0
      [ 8229.365655]  new_sync_read+0x12a/0x1c0
      [ 8229.365666]  vfs_read+0x8a/0x140
      [ 8229.365676]  ksys_read+0x59/0xd0
      [ 8229.365688]  do_syscall_64+0x55/0x1d0
      [ 8229.365700]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
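      A rough userspace sketch of the idea in fd689871, under stated
      assumptions: when the queue count grows, each newly added queue gets
      its own tag storage before the new count becomes visible, and a
      partial failure is rolled back. The names (queue_set_sketch,
      grow_queues) and the fixed allocation size are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_QUEUES_SKETCH 128	/* hypothetical upper bound */

struct queue_set_sketch {
	unsigned int nr_hw_queues;
	void *tags[MAX_QUEUES_SKETCH];	/* per-queue map and requests */
};

/*
 * Grow the set to new_nr queues. Every queue in [prev, new_nr) gets its
 * own allocation, so no queue is left without tags. Returns 0 on success
 * and -1 on failure, in which case the set is left unchanged.
 */
static int grow_queues(struct queue_set_sketch *set, unsigned int new_nr)
{
	unsigned int prev = set->nr_hw_queues;
	unsigned int i;

	if (new_nr > MAX_QUEUES_SKETCH || new_nr < prev)
		return -1;

	for (i = prev; i < new_nr; i++) {
		set->tags[i] = calloc(1, 64);
		if (!set->tags[i])
			goto undo;
	}
	set->nr_hw_queues = new_nr;	/* publish only once fully populated */
	return 0;

undo:
	/* Release what was allocated for the new queues so far. */
	while (i-- > prev) {
		free(set->tags[i]);
		set->tags[i] = NULL;
	}
	return -1;
}
```

      Publishing the new count only after all allocations succeed keeps a
      lookup from ever seeing a queue index without backing storage.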
    • block: save previous hardware queue count before udpate · a2584e43
      Weiping Zhang authored
      blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
      save the old set->nr_hw_queues before calling this function.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: free both rq_map and request · 2e194422
      Weiping Zhang authored
      Allocation:
      
      __blk_mq_alloc_rq_map
      	blk_mq_alloc_rq_map
      		blk_mq_alloc_rq_map
      			tags = blk_mq_init_tags : kzalloc_node:
      			tags->rqs = kcalloc_node
      			tags->static_rqs = kcalloc_node
      	blk_mq_alloc_rqs
      		p = alloc_pages_node
      		tags->static_rqs[i] = p + offset;
      
      Free:
      
      blk_mq_free_rq_map
      	kfree(tags->rqs);
      	kfree(tags->static_rqs);
      	blk_mq_free_tags
      		kfree(tags);
      
      blk_mq_free_rq_map does not release the pages allocated in
      blk_mq_alloc_rqs, so we should use blk_mq_free_map_and_requests here.
      
      blk_mq_free_map_and_requests
      	blk_mq_free_rqs
      		__free_pages : cleanup for blk_mq_alloc_rqs
      	blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
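      The pairing described above can be modelled in userspace. This is a
      sketch under stated assumptions, not the blk-mq code: calloc/free
      stand in for the kernel allocators, a counter tracks live
      allocations, and all names (tags_sketch, alloc_map, alloc_rqs,
      free_map_and_requests) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;	/* number of allocations not yet freed */

static void *counted_calloc(size_t n, size_t size)
{
	void *p = calloc(n, size);

	if (p)
		live_allocs++;
	return p;
}

static void counted_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

struct tags_sketch {
	void **rqs;		/* map: in-flight request pointers */
	void **static_rqs;	/* map: pointers into the request pages */
	void *pages;		/* backing storage for the requests */
};

/* Frees only the map, like the "Free:" tree above; pages are untouched. */
static void free_map(struct tags_sketch *t)
{
	counted_free(t->rqs);
	counted_free(t->static_rqs);
	counted_free(t);
}

/* Frees only the request pages. */
static void free_rqs(struct tags_sketch *t)
{
	counted_free(t->pages);
	t->pages = NULL;
}

/* Full cleanup: requests first, then the map that points into them. */
static void free_map_and_requests(struct tags_sketch *t)
{
	free_rqs(t);
	free_map(t);
}

static struct tags_sketch *alloc_map(unsigned int depth)
{
	struct tags_sketch *t = counted_calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->rqs = counted_calloc(depth, sizeof(*t->rqs));
	t->static_rqs = counted_calloc(depth, sizeof(*t->static_rqs));
	if (!t->rqs || !t->static_rqs) {
		free_map(t);
		return NULL;
	}
	return t;
}

static int alloc_rqs(struct tags_sketch *t, unsigned int depth, size_t rq_size)
{
	t->pages = counted_calloc(depth, rq_size);
	if (!t->pages)
		return -1;
	for (unsigned int i = 0; i < depth; i++)
		t->static_rqs[i] = (char *)t->pages + i * rq_size;
	return 0;
}
```

      Calling free_map() alone after alloc_rqs() would leave live_allocs
      at 1 in this model, which is the leak the commit describes.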
    • Merge branch 'block-5.7' into for-5.8/block · 873f1c8d
      Jens Axboe authored
      Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with
      the blk-iocost changes, but we also need the base of the bdi
      use-after-free fix, as we build on top of it.
      
      * block-5.7:
        nvme: fix possible hang when ns scanning fails during error recovery
        nvme-pci: fix "slimmer CQ head update"
        bdi: add a ->dev_name field to struct backing_dev_info
        bdi: use bdi_dev_name() to get device name
        bdi: move bdi_dev_name out of line
        vboxsf: don't use the source name in the bdi name
        iocost: protect iocg->abs_vdebt with iocg->waitq.lock
        block: remove the bd_openers checks in blk_drop_partitions
        nvme: prevent double free in nvme_alloc_ns() error handling
        null_blk: Cleanup zoned device initialization
        null_blk: Fix zoned command handling
        block: remove unused header
        blk-iocost: Fix error on iocost_ioc_vrate_adj
        bdev: Reduce time holding bd_mutex in sync in blkdev_close()
        buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: fix possible hang when ns scanning fails during error recovery · 59c7c3ca
      Sagi Grimberg authored
      When the controller is reconnecting, the host fails I/O and admin
      commands as the host cannot reach the controller. ns scanning may
      revalidate namespaces during that period and it is wrong to remove
      namespaces due to these failures as we may hang (see 205da243).
      
      One command that may fail is nvme_identify_ns_descs. Since we return
      success due to having ns identify descriptor list optional, we continue
      to compare ns identifiers in nvme_revalidate_disk, obviously fail and
      return -ENODEV to nvme_validate_ns, which will remove the namespace.
      
      Exactly what we don't want to happen.
      
      Fixes: 22802bf7 ("nvme: Namepace identification descriptor list is optional")
      Tested-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-pci: fix "slimmer CQ head update" · a8de6639
      Alexey Dobriyan authored
      Pre-incrementing ->cq_head can't be done in memory because OOB value
      can be observed by another context.
      
      This devalues space savings compared to original code :-\
      
      	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
      	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32)
      	Function                                     old     new   delta
      	nvme_poll_irqdisable                         464     456      -8
      	nvme_poll                                    455     447      -8
      	nvme_irq                                     388     380      -8
      	nvme_dev_disable                             955     947      -8
      
      But the code is minimal now: one read for head, one read for q_depth,
      one increment, one comparison, single instruction phase bit update and
      one write for new head.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reported-by: John Garry <john.garry@huawei.com>
      Tested-by: John Garry <john.garry@huawei.com>
      Fixes: e2a366a4 ("nvme-pci: slimmer CQ head update")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
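      A small sketch of the sequence the message describes (one read of
      the head, one read of q_depth, one increment, one comparison, a phase
      flip, one write of the new head). The struct and function names are
      hypothetical; the point is that the incremented value lives in a
      local variable, so the stored head never equals q_depth, which is
      what an in-memory pre-increment would briefly expose.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the completion-queue state. */
struct cq_sketch {
	uint16_t q_depth;	/* number of entries in the queue */
	uint16_t cq_head;	/* always in [0, q_depth) when stored */
	uint8_t cq_phase;	/* flips on every wrap-around */
};

/* Advance the head by one entry, wrapping and flipping the phase bit. */
static void cq_advance(struct cq_sketch *q)
{
	/* Increment in a local: the out-of-bounds value is never stored. */
	uint16_t next = (uint16_t)(q->cq_head + 1);

	if (next == q->q_depth) {
		q->cq_head = 0;
		q->cq_phase ^= 1;
	} else {
		q->cq_head = next;
	}
}
```

      The rejected pattern would be a comparison on ++q->cq_head, which
      writes q_depth to memory before the wrap resets it to zero.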
    • bdi: add a ->dev_name field to struct backing_dev_info · 6bd87eec
      Christoph Hellwig authored
      Cache a copy of the name for the life time of the backing_dev_info
      structure so that we can reference it even after unregistering.

      Fixes: 68f23b89 ("memcg: fix a crash in wb_workfn when a device disappears")
      Reported-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: use bdi_dev_name() to get device name · d51cfc53
      Yufen Yu authored
      Use the common interface bdi_dev_name() to get device name.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>

      Add missing <linux/backing-dev.h> include BFQ
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 07 May, 2020 2 commits
  5. 05 May, 2020 1 commit
    • iocost: protect iocg->abs_vdebt with iocg->waitq.lock · 0b80f986
      Tejun Heo authored
      abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
      is and controls the activation of use_delay mechanism. Once a cgroup goes
      over budget from forced IOs, it has to pay it back with its future budget.
      The progress guarantee on debt paying comes from the iocg being active -
      active iocgs are processed by the periodic timer, which ensures that as time
      passes the debts dissipate and the iocg returns to normal operation.
      
      However, both iocg activation and vdebt handling are asynchronous and a
      sequence like the following may happen.
      
      1. The iocg is in the process of being deactivated by the periodic timer.
      
      2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
         without doing anything because it still sees the iocg as active.
      
      3. The iocg is deactivated.
      
      4. The bio from #2 is over budget but needs to be forced. It increases
         abs_vdebt and goes over the threshold and enables use_delay.
      
      5. IO control is enabled for the iocg's subtree and now IOs are attributed
         to the descendant cgroups and the iocg itself no longer issues IOs.
      
      This leaves the iocg with stuck abs_vdebt - it has debt but is inactive
      and has no further IOs which can activate it. This can end up unduly
      punishing all the descendant cgroups.

      The usual throttling path has the same issue - the iocg must be active while
      throttled to ensure that a future event will wake it up - and solves the
      problem by synchronizing the throttling path with a spinlock. abs_vdebt
      handling is another form of overage handling and shares a lot of
      characteristics including the fact that it isn't in the hottest path.
      
      This patch fixes the above and other possible races by strictly
      synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Vlad Dmitriev <vvd@fb.com>
      Cc: stable@vger.kernel.org # v5.4+
      Fixes: e1518f63 ("blk-iocost: Don't let merges push vtime into the future")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
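      A rough userspace sketch of the synchronization idea, not the
      blk-iocost code: one lock covers activation state, debt, and the
      delay flag, so debt can never be added to an iocg that is
      concurrently deactivated. The names, the threshold value, and the
      rule "deactivate only when debt is zero" are assumptions made for
      the sketch; a C11 atomic_flag spinlock stands in for
      iocg->waitq.lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DELAY_THRESHOLD_SKETCH 1000	/* hypothetical debt threshold */

struct iocg_sketch {
	atomic_flag lock;	/* stand-in for iocg->waitq.lock */
	bool active;
	uint64_t abs_vdebt;	/* plain integer: only touched under lock */
	bool use_delay;
};

static void iocg_init(struct iocg_sketch *g)
{
	atomic_flag_clear(&g->lock);
	g->active = false;
	g->abs_vdebt = 0;
	g->use_delay = false;
}

static void iocg_lock(struct iocg_sketch *g)
{
	while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void iocg_unlock(struct iocg_sketch *g)
{
	atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/* Forced IO: activate and add debt in one critical section. */
static void charge_forced_io(struct iocg_sketch *g, uint64_t cost)
{
	iocg_lock(g);
	g->active = true;
	g->abs_vdebt += cost;
	g->use_delay = g->abs_vdebt > DELAY_THRESHOLD_SKETCH;
	iocg_unlock(g);
}

/* Periodic timer: pay back debt from the budget. */
static void pay_debt(struct iocg_sketch *g, uint64_t amount)
{
	iocg_lock(g);
	g->abs_vdebt -= amount < g->abs_vdebt ? amount : g->abs_vdebt;
	g->use_delay = g->abs_vdebt > DELAY_THRESHOLD_SKETCH;
	iocg_unlock(g);
}

/* Periodic timer: deactivate only when no debt is outstanding. */
static bool try_deactivate(struct iocg_sketch *g)
{
	bool idle;

	iocg_lock(g);
	idle = g->abs_vdebt == 0;
	if (idle)
		g->active = false;
	iocg_unlock(g);
	return idle;
}
```

      Because both paths take the same lock, the five-step interleaving
      listed in the message cannot leave the model with debt and an
      inactive iocg at the same time.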
  6. 04 May, 2020 7 commits
  7. 30 Apr, 2020 6 commits
    • iocost_monitor: drop string wrap around numbers when outputting json · 21f3cfea
      Tejun Heo authored
      Wrapping numbers in strings is used by some to work around bit-width issues in
      some environments. The problem isn't innate to json and the workaround seems to
      cause more integration problems than help. Let's drop the string wrapping.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost_monitor: exit successfully if interval is zero · f4fe3ea6
      Tejun Heo authored
      This is to help external tools to decide whether iocost_monitor has all its
      requirements met or not based on the exit status of an -i0 run.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: account for IO size when testing latencies · cd006509
      Tejun Heo authored
      On each IO completion, iocost decides whether the IO met or missed its latency
      target. Currently, the targets are fixed numbers per IO type. While this can be
      good enough for loose latency targets way higher than typical completion
      latencies, the effect of IO size makes it difficult to tighten the latency
      target - a target adequate for 4k IOs might be too tight for 512k IOs and
      vice-versa.
      
      iocost already has all the necessary information to account for different IO
      sizes when testing whether the latency target is met as iocost can calculate the
      size vtime cost of a given IO. This patch updates the completion path to
      calculate the size vtime cost of the IO, deduct the nsec equivalent from the
      observed latency and use the adjusted value to decide whether the target is met.
      
      This makes latency targets independent from IO size and enables determining
      adequate latency targets with fixed size fio runs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
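      The adjustment can be illustrated with a few lines of arithmetic.
      This is a sketch under stated assumptions: the function name is
      hypothetical, and the size cost is taken as already converted from
      vtime to nanoseconds (the conversion itself is not shown).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an IO met its latency target after deducting the part
 * of the latency attributed to its size. All values are in nanoseconds.
 */
static bool latency_target_met(uint64_t observed_ns, uint64_t size_cost_ns,
			       uint64_t target_ns)
{
	/* The size alone accounts for the whole latency: count it as met. */
	if (observed_ns <= size_cost_ns)
		return true;

	return observed_ns - size_cost_ns <= target_ns;
}
```

      For example, with a 150 us target, a large IO that took 500 us but
      carries a 400 us size cost is judged on the remaining 100 us and
      meets the target, while a small IO with the same 500 us latency and
      no size cost misses it.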
    • blk-iocost: switch to fixed non-auto-decaying use_delay · 54c52e10
      Tejun Heo authored
      The use_delay mechanism was introduced by blk-iolatency to hold memory
      allocators accountable for the reclaim and other shared IOs they cause. The
      duration of the delay is dynamically balanced between iolatency increasing the
      value on each target miss and it auto-decaying as time passes and threads get
      delayed on it.
      
      While this works well for iolatency, iocost's control model isn't compatible
      with it. There are no repeated "violation" events which can be balanced against
      auto-decaying. iocost instead knows how much a given cgroup is over budget and
      wants to prevent that cgroup from issuing IOs while over budget. Until now,
      iocost has been adding the cost of force-issued IOs. However, this doesn't
      reflect the amount which is already over budget and is simply not enough to
      counter the auto-decaying, allowing an anon-memory-leaking low-priority cgroup
      to go over its allotted share of IOs.
      
      As auto-decaying doesn't make much sense for iocost, this patch introduces a
      different mode of operation for use_delay - when blkcg_set_delay() is used
      instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until
      it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
      delay duration synchronized to the budget overage amount.

      With this change, iocost can effectively police cgroups which generate
      a significant amount of force-issued IOs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
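      A toy model of the two modes, under stated assumptions: the names
      are hypothetical, and halving per tick is only a stand-in for the
      real decay behaviour, which is not described here. In the additive
      mode the delay shrinks as time passes; once a value is set
      explicitly, it stays fixed until it is cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a per-cgroup delay value. */
struct delay_sketch {
	uint64_t delay_ns;
	bool fixed;	/* true once the value was set explicitly */
};

/* Additive mode: each target miss adds to the delay. */
static void delay_add(struct delay_sketch *d, uint64_t ns)
{
	d->delay_ns += ns;
}

/* Fixed mode: track an exact amount, e.g. the budget overage. */
static void delay_set(struct delay_sketch *d, uint64_t ns)
{
	d->fixed = true;
	d->delay_ns = ns;
}

/* Leave fixed mode and drop the delay entirely. */
static void delay_clear(struct delay_sketch *d)
{
	d->fixed = false;
	d->delay_ns = 0;
}

/* Time passes: only the additive mode decays (halving is an assumption). */
static void delay_tick(struct delay_sketch *d)
{
	if (!d->fixed)
		d->delay_ns /= 2;
}
```

      In the fixed mode the caller is responsible for updating the value as
      the overage changes, which matches the message's point that the delay
      is kept synchronized to the budget overage amount.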
    • block: remove the bd_openers checks in blk_drop_partitions · 10c70d95
      Christoph Hellwig authored
      When replacing the bd_super check with a bd_openers check I followed a
      logical conclusion, which turns out to be utterly wrong.  When a block
      device has bd_super set it has a mounted file system on it (although not
      every mounted file system sets bd_super), but that also implies it
      doesn't even have partitions to start with.
      
      So instead of trying to come up with a logical check for all openers,
      just remove the check entirely.
      
      Fixes: d3ef5536 ("block: fix busy device checking in blk_drop_partitions")
      Fixes: cb6b771b ("block: fix busy device checking in blk_drop_partitions again")
      Reported-by: Michal Koutný <mkoutny@suse.com>
      Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-5.7' of git://git.infradead.org/nvme into block-5.7 · 47ed39e0
      Jens Axboe authored
      Pull NVMe fix from Christoph.
      
      * 'nvme-5.7' of git://git.infradead.org/nvme:
        nvme: prevent double free in nvme_alloc_ns() error handling
  8. 29 Apr, 2020 3 commits