- 20 Jun, 2023 6 commits
-
-
Jingbo Xu authored
Refuse Persistent Reservation operations on partitions as reservation on partitions doesn't make sense. Besides, introduce blkdev_pr_allowed() helper, where more policies could be placed here later. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
In journal_init_dev(), if super bdev is used as 'j_dev_bd', then blkdev_get_by_dev() is called with NULL holder, otherwise, holder will be journal. However, later in release_journal_dev(), blkdev_put() is called with journal unconditionally, cause following warning: WARNING: CPU: 1 PID: 5034 at block/bdev.c:617 bd_end_claim block/bdev.c:617 [inline] WARNING: CPU: 1 PID: 5034 at block/bdev.c:617 blkdev_put+0x562/0x8a0 block/bdev.c:901 RIP: 0010:blkdev_put+0x562/0x8a0 block/bdev.c:901 Call Trace: <TASK> release_journal_dev fs/reiserfs/journal.c:2592 [inline] free_journal_ram+0x421/0x5c0 fs/reiserfs/journal.c:1896 do_journal_release fs/reiserfs/journal.c:1960 [inline] journal_release+0x276/0x630 fs/reiserfs/journal.c:1971 reiserfs_put_super+0xe4/0x5c0 fs/reiserfs/super.c:616 generic_shutdown_super+0x158/0x480 fs/super.c:499 kill_block_super+0x64/0xb0 fs/super.c:1422 deactivate_locked_super+0x98/0x160 fs/super.c:330 deactivate_super+0xb1/0xd0 fs/super.c:361 cleanup_mnt+0x2ae/0x3d0 fs/namespace.c:1247 task_work_run+0x16f/0x270 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0xadc/0x2a30 kernel/exit.c:874 do_group_exit+0xd4/0x2a0 kernel/exit.c:1024 __do_sys_exit_group kernel/exit.c:1035 [inline] __se_sys_exit_group kernel/exit.c:1033 [inline] __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1033 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fix this problem by passing in NULL holder in this case. Reported-by: syzbot+04625c80899f4555de39@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=04625c80899f4555de39 Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620111322.1014775-1-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
After commit 2736e8ee ("block: use the holder as indication for exclusive opens"), blkdev_get_by_dev() will warn if holder is NULL and mode contains 'FMODE_EXCL'. holder from blkdev_get_by_dev() from disk_scan_partitions() is always NULL, hence it should not use 'FMODE_EXCL', which is broben by the commit. For consequence, WARN_ON_ONCE() will be triggered from blkdev_get_by_dev() if user scan partitions with device opened exclusively. Fix this problem by removing 'FMODE_EXCL' from disk_scan_partitions(), as it used to be. Reported-by: syzbot+00cd27751f78817f167b@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=00cd27751f78817f167b Fixes: 2736e8ee ("block: use the holder as indication for exclusive opens") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230618140402.7556-1-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620043536.707249-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Demi Marie Obenour authored
Currently, associating a loop device with a different file descriptor does not increment its diskseq. This allows the following race condition: 1. Program X opens a loop device 2. Program X gets the diskseq of the loop device. 3. Program X associates a file with the loop device. 4. Program X passes the loop device major, minor, and diskseq to something. 5. Program X exits. 6. Program Y detaches the file from the loop device. 7. Program Y attaches a different file to the loop device. 8. The opener finally gets around to opening the loop device and checks that the diskseq is what it expects it to be. Even though the diskseq is the expected value, the result is that the opener is accessing the wrong file. From discussions with Christoph Hellwig, it appears that disk_force_media_change() was supposed to call inc_diskseq(), but in fact it does not. Adding a Fixes: tag to indicate this. Christoph's Reported-by is because he stated that disk_force_media_change() calls inc_diskseq(), which is what led me to discover that it should but does not. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Fixes: e6138dc1 ("block: add a helper to raise a media changed event") Cc: stable@vger.kernel.org # 5.15+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Fix a missing conversion to the new BLK_OPEN constant in swim. Fixes: 05bdb996 ("block: replace fmode_t with a block-specific type for block open flags") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620043051.707196-1-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 Jun, 2023 7 commits
-
-
Ming Lei authored
After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that the direct I/O helpers have switched to use iov_iter_extract_pages, these helpers are unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-5-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Check for -EFAULT instead of wrapping the check in an ret < 0 block. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
copy_splice_read calls into ->read_iter to read the data, which already calls file_accessed. Fixes: 33b3b041 ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvmeJens Axboe authored
Pull NVMe updates from Keith: "nvme updates for Linux 6.5 - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)" * tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme: (27 commits) nvme: forward port sysfs delete fix nvme: skip optional id ctrl csi if it failed nvme-core: use nvme_ns_head_multipath instead of ns->head->disk nvmet-fcloop: Do not wait on completion when unregister fails nvme-fabrics: open code __nvmf_host_find() nvme-fabrics: error out to unlock the mutex nvme: Increase block size variable size to 32-bit nvme-fcloop: no need to return from void function nvmet-auth: remove unnecessary break after goto nvmet-auth: remove some dead code nvme-core: remove redundant check from nvme_init_ns_head nvme: move sysfs code to a dedicated sysfs.c file nvme-fabrics: prevent overriding of existing host nvme-fabrics: check hostid using uuid_equal nvme-fabrics: unify common code in admin and io queue connect nvmet: reorder fields in 'struct nvmefc_fcp_req' nvmet: reorder fields in 'struct nvme_dhchap_queue_context' nvmet: reorder fields in 'struct nvmf_ctrl_options' nvme: reorder fields in 'struct nvme_ctrl' nvmet: reorder fields in 'struct nvmet_sq' ...
-
Keith Busch authored
We had a late fix that modified nvme_sysfs_delete() after the staging branch for the next merge window relocated the function to a new file. Port commit 2eb94dd5 ("nvme: do not let the user delete a ctrl before a complete") to the latest to avoid a potentially confusing merge conflict. Cc: Maurizio Lombardi <mlombard@redhat.com> Cc: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
-
- 15 Jun, 2023 9 commits
-
-
Mingzhe Zou authored
We get a kernel crash about "list_add corruption. next->prev should be prev (ffff9c801bc01210), but was ffff9c77b688237c. (next=ffffae586d8afe68)." crash> struct list_head 0xffff9c801bc01210 struct list_head { next = 0xffffae586d8afe68, prev = 0xffffae586d8afe68 } crash> struct list_head 0xffff9c77b688237c struct list_head { next = 0x0, prev = 0x0 } crash> struct list_head 0xffffae586d8afe68 struct list_head struct: invalid kernel virtual address: ffffae586d8afe68 type: "gdb_readmem_callback" Cannot access memory at address 0xffffae586d8afe68 [230469.019492] Call Trace: [230469.032041] prepare_to_wait+0x8a/0xb0 [230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache] [230469.056533] mca_cannibalize_lock+0x72/0x90 [escache] [230469.068788] mca_alloc+0x2ae/0x450 [escache] [230469.080790] bch_btree_node_get+0x136/0x2d0 [escache] [230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache] [230469.104382] ? finish_wait+0x80/0x80 [230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache] [230469.127259] kthread+0x112/0x130 [230469.138448] ? kthread_flush_work_fn+0x10/0x10 [230469.149477] ret_from_fork+0x35/0x40 bch_btree_check_thread() and bch_dirty_init_thread() may call mca_cannibalize() to cannibalize other cached btree nodes. Only one thread can do it at a time, so the op of other threads will be added to the btree_cache_wait list. We must call finish_wait() to remove op from btree_cache_wait before free it's memory address. Otherwise, the list will be damaged. Also should call bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up other waiters. Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded") Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded") Cc: stable@vger.kernel.org Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Zheng Wang authored
In some specific situations, the return value of __bch_btree_node_alloc may be NULL. This may lead to a potential NULL pointer dereference in caller function like a calling chain : btree_split->bch_btree_node_alloc->__bch_btree_node_alloc. Fix it by initializing the return value in __bch_btree_node_alloc. Fixes: cafe5635 ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Zheng Wang authored
Due to the previous fix of __bch_btree_node_alloc, the return value will never be a NULL pointer. So IS_ERR is enough to handle the failure situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check. Fixes: cafe5635 ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Andrea Tomassetti authored
The cache_readaheads stat counter is not used anymore and should be removed. Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Thomas Weißschuh authored
Since commit ee6d3dd4 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
ye xingchen authored
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.deSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Commit 99d055b4 ("block: remove per-disk debugfs files in blk_unregister_queue") moves blk_trace_shutdown() from blk_release_queue() to blk_unregister_queue(), this is safe if blktrace is created through sysfs, however, there is a regression in corner case. blktrace can still be enabled after del_gendisk() through ioctl if the disk is opened before del_gendisk(), and if blktrace is not shutdown through ioctl before closing the disk, debugfs entries will be leaked. Fix this problem by shutdown blktrace in disk_release(), this is safe because blk_trace_remove() is reentrant. Fixes: 99d055b4 ("block: remove per-disk debugfs files in blk_unregister_queue") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
sg_ioctl() support to enable blktrace, which will create debugfs entries "/sys/kernel/debug/block/sgx/", however, there is no guarantee that user will remove these entries through ioctl, and deleting sg device doesn't cleanup these blktrace entries. This problem can be fixed by cleanup blktrace while releasing request_queue, however, it's not a good idea to do this special handling in common layer just for sg device. Fix this problem by shutdown bltkrace in sg_device_destroy(), where the device is deleted and all the users close the device, also grab a scsi_device reference from sg_add_device() to prevent scsi_device to be freed before sg_device_destroy(); Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230610022003.2557284-3-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
If config is disabled, call blk_trace_remove() directly will trigger build warning, hence use inline function instead, prepare to fix blktrace debugfs entries leakage. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-2-yukuai1@huaweicloud.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 14 Jun, 2023 4 commits
-
-
Pankaj Raghav authored
The body of the loop is run without RCU lock held. Use the regular cond_resched() instead of cond_resched_rcu(). Fixes: 786bb024 ("brd: use XArray instead of radix-tree to index backing pages") Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.comSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
Ed Tsai authored
commit f168420c ("blk-mq: don't redirect completion for hctx withs only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric. (e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and ctx won't complete request on the submission cpu. In this situation, this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result, check on cpu id when there is only one ctx mapping. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com> Suggested-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Merge tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block Pull MD updates from Song: "The major changes are: 1. Protect md_thread with rcu, by Yu Kuai; 2. Various non-urgent raid5 and raid1/10 fixes, by Yu Kuai; 3. Non-urgent raid10 fixes, by Li Nan." * tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits) md/raid1-10: limit the number of plugged bio md/raid1-10: don't handle pluged bio by daemon thread md/md-bitmap: add a new helper to unplug bitmap asynchrously md/raid1-10: submit write io directly if bitmap is not enabled md/raid1-10: factor out a helper to submit normal write md/raid1-10: factor out a helper to add bio to plug md/raid10: prevent soft lockup while flush writes md/raid10: fix io loss while replacement replace rdev md/raid10: Do not add spare disk when recovery fails md/raid10: clean up md_add_new_disk() md/raid10: prioritize adding disk to 'removed' mirror md/raid10: improve code of mrdev in raid10_sync_request md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request md/raid5: don't start reshape when recovery or replace is in progress md: protect md_thread with rcu md/bitmap: factor out a helper to set timeout md/bitmap: always wake up md_thread in timeout_store dm-raid: remove useless checking in raid_message() md: factor out a helper to wake up md_thread directly md: fix duplicate filename for rdev ...
-
David Howells authored
Fix dio_bio_cleanup() to advance the head index into the list of pages past the pages it has released, as __blockdev_direct_IO() will call it twice if do_direct_IO() fails. The issue was causing: WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio This can be triggered by setting up a clean pair of UDF filesystems on loopback devices and running the generic/451 xfstest with them as the scratch and test partitions. Something like the following: fallocate /mnt2/udf_scratch -l 1G fallocate /mnt2/udf_test -l 1G mknod /dev/lo0 b 7 0 mknod /dev/lo1 b 7 1 losetup lo0 /mnt2/udf_scratch losetup lo1 /mnt2/udf_test mkfs -t udf /dev/lo0 mkfs -t udf /dev/lo1 cd xfstests ./check generic/451 with xfstests configured by putting the following into local.config: export FSTYP=udf export DISABLE_UDF_TEST=1 export TEST_DEV=/dev/lo1 export TEST_DIR=/xfstest.test export SCRATCH_DEV=/dev/lo0 export SCRATCH_MNT=/xfstest.scratch Fixes: 1ccf164e ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.comSigned-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@infradead.org> cc: David Hildenbrand <david@redhat.com> cc: Andrew Morton <akpm@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Jason Gunthorpe <jgg@nvidia.com> cc: Logan Gunthorpe <logang@deltatee.com> cc: Hillf Danton <hdanton@sina.com> cc: Christian Brauner <brauner@kernel.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-kernel@vger.kernel.org cc: linux-mm@kvack.org Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.ukSigned-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 Jun, 2023 14 commits
-
-
Yu Kuai authored
bio can be added to plug infinitely, and following writeback test can trigger huge amount of plugged bio: Test script: modprobe brd rd_nr=4 rd_size=10485760 mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal echo 0 > /proc/sys/vm/dirty_background_ratio fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test Test result: Monitor /sys/block/md0/inflight will found that inflight keep increasing until fio finish writing, after running for about 2 minutes: [root@fedora ~]# cat /sys/block/md0/inflight 0 4474191 Fix the problem by limiting the number of plugged bio based on the number of copies for original bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
-
Yu Kuai authored
current->bio_list will be set under submit_bio() context, in this case bitmap io will be added to the list and wait for current io submission to finish, while current io submission must wait for bitmap io to be done. commit 874807a8 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix the deadlock by handling plugged bio by daemon thread. On the one hand, the deadlock won't exist after commit a214b949 ("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On the other hand, current solution makes it impossible to flush plugged bio in raid1/10_make_request(), because this will cause that all the writes will goto daemon thread. In order to limit the number of plugged bio, commit 874807a8 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the deadlock is fixed by handling bitmap io asynchronously. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
-
Yu Kuai authored
If bitmap is enabled, bitmap must update before submitting write io, this is why unplug callback must move these io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which will suffer performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submit bitmap io in raid10_unplug() doesn't require that 'current->bio_list' is empty. This patch prepare to limit the number of plugged bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
-
Yu Kuai authored
Commit 6cce3b23 ("[PATCH] md: write intent bitmap support for raid10") add bitmap support, and it changed that write io is submitted through daemon thread because bitmap need to be updated before write io. And later, plug is used to fix performance regression because all the write io will go to demon thread, which means io can't be issued concurrently. However, if bitmap is not enabled, the write io should not go to daemon thread in the first place, and plug is not needed as well. Fixes: 6cce3b23 ("[PATCH] md: write intent bitmap support for raid10") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
-
Yu Kuai authored
There are multiple places to do the same thing, factor out a helper to prevent redundant code, and the helper will be used in following patch as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
-
Yu Kuai authored
The code in raid1 and raid10 is identical, prepare to limit the number of plugged bios. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
-
Yu Kuai authored
Currently, there is no limit for raid1/raid10 plugged bio. While flushing writes, raid1 has cond_resched() while raid10 doesn't, and too many writes can cause soft lockup. Follow up soft lockup can be triggered easily with writeback test for raid10 with ramdisks: watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293] Call Trace: <TASK> call_rcu+0x16/0x20 put_object+0x41/0x80 __delete_object+0x50/0x90 delete_object_full+0x2b/0x40 kmemleak_free+0x46/0xa0 slab_free_freelist_hook.constprop.0+0xed/0x1a0 kmem_cache_free+0xfd/0x300 mempool_free_slab+0x1f/0x30 mempool_free+0x3a/0x100 bio_free+0x59/0x80 bio_put+0xcf/0x2c0 free_r10bio+0xbf/0xf0 raid_end_bio_io+0x78/0xb0 one_write_done+0x8a/0xa0 raid10_end_write_request+0x1b4/0x430 bio_endio+0x175/0x320 brd_submit_bio+0x3b9/0x9b7 [brd] __submit_bio+0x69/0xe0 submit_bio_noacct_nocheck+0x1e6/0x5a0 submit_bio_noacct+0x38c/0x7e0 flush_pending_writes+0xf0/0x240 raid10d+0xac/0x1ed0 Fix the problem by adding cond_resched() to raid10 like what raid1 did. Note that unlimited plugged bio still need to be optimized, for example, in the case of lots of dirty pages writeback, this will take lots of memory and io will spend a long time in plug, hence io latency is bad. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
-
Li Nan authored
When removing a disk with replacement, the replacement will be used to replace rdev. During this process, there is a brief window in which both rdev and replacement are read as NULL in raid10_write_request(). This will result in io not being submitted but it should be. //remove //write raid10_remove_disk raid10_write_request mirror->rdev = NULL read rdev -> NULL mirror->rdev = mirror->replacement mirror->replacement = NULL read replacement -> NULL Fix it by reading replacement first and rdev later, meanwhile, use smp_mb() to prevent memory reordering. Fixes: 475b0321 ("md/raid10: writes should get directed to replacement as well as original.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
-
Li Nan authored
In raid10_sync_request(), if data cannot be read from any disk for recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will return 'max_sector', indicating that the recovery has been completed. However, the recovery is just aborted and the data remains inconsistent. Fix it by setting mirror->recovery_disabled, which will prevent the spare disk from being added to this mirror. The same issue also exists during resync, it will be fixed afterwards. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602091839.743798-2-linan666@huaweicloud.com
-
Li Nan authored
Commit 1a855a06 ("md: fix bug with re-adding of partially recovered device.") only add device which is set to In_sync. But it let devices without metadata cannot be added when they should be. Commit bf572541 ("md: fix regression with re-adding devices to arrays with no metadata") fix the above issue, it set device without metadata to In_sync when add new disk. However, after commit f466722c ("md: Change handling of save_raid_disk and metadata update during recovery.") deletes changes of the first patch, setting In_sync for devcie without metadata is meanless because the flag will be cleared soon and will not be used during this period. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
-
Li Nan authored
When add a new disk to raid10, it will traverse conf->mirror from start and find one of the following mirror to add: 1. mirror->rdev is set to WantReplacement and it have no replacement, set new disk to mirror->replacement. 2. no mirror->rdev, set new disk to mirror->rdev. There is a array as below (sda is set to WantReplacement): Number Major Minor RaidDevice State 0 8 0 0 active sync set-A /dev/sda - 0 0 1 removed 2 8 32 2 active sync set-A /dev/sdc 3 8 48 3 active sync set-B /dev/sdd Use 'mdadm --add' to add a new disk to this array, the new disk will become sda's replacement instead of add to removed position, which is confusing for users. Meanwhile, after new disk recovery success, sda will be set to Faulty. Prioritize adding disk to 'removed' mirror is a better choice. In the above scenario, the behavior is the same as before, except sda will not be deleted. Before other disks are added, continued use sda is more reliable. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527092007.3008856-1-linan666@huaweicloud.com
-
Li Nan authored
'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and inc mrdev->nr_pending is unreasonable if don't need recovery. Replace 'need_recover' with 'mrdev', and only inc nr_pending when needed. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527072218.2365857-3-linan666@huaweicloud.com
-
Li Nan authored
There are two check of 'mreplace' in raid10_sync_request(). In the first check, 'need_replace' will be set and 'mreplace' will be used later if no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be set to NULL if it is Faulty, but 'need_replace' will not be changed accordingly. null-ptr-deref occurs if Faulty is set between two check. Fix it by merging two checks into one. And replace 'need_replace' with 'mreplace' because their values are always the same. Fixes: ee37d731 ("md/raid10: Fix raid10 replace hang when new added disk faulty") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
-
Yu Kuai authored
When recovery is interrupted (reboot, etc.) check for MD_RECOVERY_RUNNING is not enough to tell recovery is in progress. Also check recovery_cp before starting reshape. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
-