- 27 Aug, 2019 6 commits
-
-
NeilBrown authored
Since commit 4ad23a97 ("MD: use per-cpu counter for writes_pending"), set_in_sync() is substantially more expensive: it can wait for a full RCU grace period, which can be tens of milliseconds. So we should only call it when the cost is justified.

md_check_recovery() currently calls set_in_sync() every time it finds anything to do (on non-external active arrays). For an array performing resync or recovery, this will be quite often. Each call introduces a delay to the md thread, which can noticeably affect IO submission latency.

In md_check_recovery() we only need to call set_in_sync() if 'safemode' was non-zero at entry, meaning that there has been no recent IO. So we save this "safemode was nonzero" state, and only call set_in_sync() if it was non-zero. This measurably reduces mean and maximum IO submission latency during resync/recovery.

Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
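A minimal sketch of the shape of the fix, assuming simplified md fields and locking (the real md_check_recovery() does much more)::

	/* Sketch only: capture safemode state at entry and gate the
	 * expensive set_in_sync() call on it. */
	void md_check_recovery(struct mddev *mddev)
	{
		/* remember whether there had been no recent IO at entry */
		bool try_set_sync = mddev->safemode != 0;

		/* ... existing recovery/resync bookkeeping ... */

		/* only risk waiting for an RCU grace period when justified */
		if (try_set_sync && !mddev->external && !mddev->in_sync) {
			spin_lock(&mddev->lock);
			set_in_sync(mddev);
			spin_unlock(&mddev->lock);
		}

		/* ... */
	}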
-
Tejun Heo authored
There's an inherent mismatch between memcg and writeback. The former tracks ownership per-page while the latter does so per-inode. This was a deliberate design decision because honoring per-page ownership in the writeback path is complicated, may lead to higher CPU and IO overheads, and was deemed unnecessary given that write-sharing an inode across different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works well enough in most cases but there are some pathological cases. For example, let's say there are two cgroups A and B which keep writing to different but confined parts of the same inode. B owns the inode and A's memory is limited far below B's. A's dirty ratio can rise enough to trigger balance_dirty_pages() sleeps but B's can be low enough to avoid triggering background writeback. A will be slowed down without a way to make writeback of the dirty pages happen.

This patch implements foreign dirty recording and a foreign flushing mechanism so that when a memcg encounters a condition as above it can trigger flushes on bdi_writebacks which can clean its pages. Please see the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for details.

A reproducer follows.

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
	int fd;
	unsigned long start, size, end, pos;
	char *endp;
	char buf[4096];

	if (argc < 4) {
		fprintf(stderr, usage);
		return 1;
	}

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	start = strtoul(argv[2], &endp, 0);
	if (*endp != '\0') {
		fprintf(stderr, usage);
		return 1;
	}

	size = strtoul(argv[3], &endp, 0);
	if (*endp != '\0') {
		fprintf(stderr, usage);
		return 1;
	}

	end = start + size;

	while (1) {
		for (pos = start; pos < end; ) {
			long bread, bwritten = 0;

			if (lseek(fd, pos, SEEK_SET) < 0) {
				perror("lseek");
				return 1;
			}

			bread = read(0, buf, sizeof(buf) < end - pos ?
				     sizeof(buf) : end - pos);
			if (bread < 0) {
				perror("read");
				return 1;
			}
			if (bread == 0)
				return 0;

			while (bwritten < bread) {
				long this;

				this = write(fd, buf + bwritten,
					     bread - bwritten);
				if (this < 0) {
					perror("write");
					return 1;
				}

				bwritten += this;
				pos += this;	/* advance by bytes actually written */
			}
		}
	}
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000

  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync

  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

v2: Added comments explaining why the specific intervals are being used.

v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort flushing while avoiding possible livelocks.

v4: Use get_jiffies_64() and time_before/after64() instead of raw jiffies_64 and arithmetic comparisons as suggested by Jan.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tejun Heo authored
Implement cgroup_writeback_by_id(), which initiates cgroup writeback from bdi and memcg IDs. This will be used by memcg foreign inode flushing.

v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating spurious wbs.

v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort flushing while avoiding possible livelocks.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
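A hedged sketch of how a caller might use this; the exact parameter list (and whether a completion is passed) is an assumption beyond what the message states::

	/* hedged sketch: trigger a best-effort foreign flush */
	static void flush_foreign_wb(u64 bdi_id, int memcg_id)
	{
		/* @nr == 0: best effort, sized internally to ~1.25 * nr_dirty */
		int ret = cgroup_writeback_by_id(bdi_id, memcg_id, 0,
						 WB_REASON_FOREIGN_FLUSH);

		if (ret)
			pr_debug("foreign flush skipped: %d\n", ret);
	}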
-
Tejun Heo authored
Separate out wb_get_lookup(), which, unlike wb_get_create(), doesn't try to create a wb when one doesn't already exist. This will be used by later patches.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
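Presumably the split leaves wb_get_create() layered on top of the new lookup; a sketch under that assumption (cgwb_create() as the fallback is illustrative)::

	struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
					    struct cgroup_subsys_state *memcg_css,
					    gfp_t gfp)
	{
		struct bdi_writeback *wb;

		do {
			/* lookup only; takes a ref on success, never allocates */
			wb = wb_get_lookup(bdi, memcg_css);
		} while (!wb && !cgwb_create(bdi, memcg_css, gfp));

		return wb;
	}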
-
Tejun Heo authored
There currently is no way to universally identify and look up a bdi without holding a reference and pointer to it. This patch adds a non-recycling bdi->id and implements bdi_get_by_id(), which looks up bdis by their ids. This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity and because, while rb_tree does support rcu assignment, it doesn't seem to guarantee a lossless walk when the walk races against tree rebalance operations.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
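A sketch of the id-keyed rb-tree lookup this enables; the tree, lock, and field names are assumptions beyond those in the message::

	struct backing_dev_info *bdi_get_by_id(u64 id)
	{
		struct backing_dev_info *found = NULL;
		struct rb_node *node;

		spin_lock_bh(&bdi_lock);
		node = bdi_tree.rb_node;	/* rb-tree keyed by bdi->id */
		while (node) {
			struct backing_dev_info *bdi =
				rb_entry(node, struct backing_dev_info, rb_node);

			if (id < bdi->id)
				node = node->rb_left;
			else if (id > bdi->id)
				node = node->rb_right;
			else {
				found = bdi;
				bdi_get(found);	/* caller gets a reference */
				break;
			}
		}
		spin_unlock_bh(&bdi_lock);

		return found;
	}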
-
Tejun Heo authored
wb_completion is used to track writeback completions. We want to use it from the memcg side for foreign inode flushes. This patch updates it to remember the target waitq instead of assuming bdi->wb_waitq, and exposes it outside of fs-writeback.c.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
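The change presumably amounts to carrying the waitq in the struct; a sketch under that assumption (initializer name is hypothetical)::

	/* moved out of fs-writeback.c so memcg code can use it */
	struct wb_completion {
		atomic_t		cnt;
		wait_queue_head_t	*waitq;	/* was implicitly bdi->wb_waitq */
	};

	#define DEFINE_WB_COMPLETION(cmpl, bdi)			\
		struct wb_completion cmpl = {			\
			.cnt	= ATOMIC_INIT(1),		\
			.waitq	= &(bdi)->wb_waitq,		\
		}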
-
- 23 Aug, 2019 7 commits
-
-
Jens Axboe authored
You can't magically mark a function inline and expect that to work. Fixes: fceb5d1b ("null_blk: create a helper for zoned devices") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
This patch creates a helper function for handling the request completion in the null_handle_cmd(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
This patch creates a helper function for handling zoned block device operations. It also restructures the code in null_blk_zoned.c to return blk_status_t and capture any error in null_handle_cmd() in the cmd->error variable, instead of setting it in a deeper layer, matching how the flush, badblocks, and memory-backed cases are handled in null_handle_cmd(). We also move null_handle_zoned() to null_blk_zoned.c to keep the zoned code separate.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
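An illustrative shape of the pattern; the helper names follow the driver's conventions but are assumptions, not the exact upstream code::

	/* null_blk_zoned.c: report status instead of touching cmd directly */
	static blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
					      enum req_opf op, sector_t sector,
					      sector_t nr_sectors)
	{
		switch (op) {
		case REQ_OP_WRITE:
			return null_zone_write(cmd, sector, nr_sectors);
		case REQ_OP_ZONE_RESET:
			return null_zone_reset(cmd, sector);
		default:
			return BLK_STS_OK;
		}
	}

	/* null_blk_main.c, in null_handle_cmd(): the error is caught here,
	 * like the flush/badblocks/memory-backed cases */
	if (dev->zoned)
		cmd->error = null_handle_zoned(cmd, op, sector, nr_sectors);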
-
Chaitanya Kulkarni authored
This patch creates a helper for handling requests when null_blk is memory backed in the null_handle_cmd(). Although the helper is very simple right now, it makes the code flow consistent with the rest of code in the null_handle_cmd() and provides a uniform code structure for future code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
This patch creates a helper for handling badblocks code in the null_handle_cmd(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
This patch creates a helper for handling throttling code in the null_handle_cmd(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Chaitanya Kulkarni authored
This is a preparation patch which moves the duplicate code for the sectors and nr_sectors calculations for bio vs request mode into their respective callers (null_queue_bio(), null_queue_rq()). Now the core function only deals with the respective actions and commands instead of having to calculate the bio vs request operations and the different sector-related variables. We also move the flush command handling to the top, which significantly simplifies the rest of the code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Aug, 2019 3 commits
-
-
Christoph Hellwig authored
Hiding page refcount manipulation inside a low-level bio helper is somewhat awkward. Instead return the same page information to the callers, where it fits in much better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Passthrough bio handling should be the same as normal bio handling, except that we need to take hardware limitations into account. Thus use the common try_merge implementation after checking the hardware limits. This changes behavior in that we now also check segment and dma boundary settings for same-page merges, which is a little more work but has no effect, as those need to be larger than the page size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
If we can add more data into an existing segment we do not create a gap by definition, so move the check for a gap after the attempt to merge into the segment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 Aug, 2019 8 commits
-
-
Mike Christie authored
This fixes a bug added in 4.10 with commit:

commit 9561a7ad
Author: Josef Bacik <jbacik@fb.com>
Date:   Tue Nov 22 14:04:40 2016 -0500

    nbd: add multi-connection support

that limited the number of devices to 256. Before the patch we could create thousands of devices, but the patch switched us from using our own thread to using a work queue, which has a default limit of 256 active works.

The problem is that our recv_work function sits in a loop until disconnection but only handles IO for one connection. The work is started when the connection is started/restarted, but if we end up creating 257 or more connections, the queue_work call just queues connection 257+'s recv_work, and that waits for one of connection 1-256's recv_work instances to disconnect and complete.

Instead of reverting back to kthreads, this has us allocate a workqueue_struct per device, so we can block in the work.

Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
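A hedged sketch of the allocation; the flags and exact field names are assumptions::

	/* one workqueue per device: each connection's recv_work can
	 * block in it indefinitely without starving other devices */
	nbd->recv_workq = alloc_workqueue("knbd%d-recv",
					  WQ_MEM_RECLAIM | WQ_HIGHPRI |
					  WQ_UNBOUND, 0, nbd->index);
	if (!nbd->recv_workq) {
		dev_err(disk_to_dev(nbd->disk),
			"Could not allocate knbd recv work queue.\n");
		return -ENOMEM;
	}

	/* per-connection receivers now queue here instead of a shared queue */
	queue_work(nbd->recv_workq, &args->work);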
-
Mike Christie authored
This fixes a regression added in 4.9 with commit:

commit 0eadf37a
Author: Josef Bacik <jbacik@fb.com>
Date:   Thu Sep 8 12:33:40 2016 -0700

    nbd: allow block mq to deal with timeouts

where before the patch userspace would set the timeout to 0 to disable it. With the above patch, a zero timeout tells the block layer to use the default value of 30 seconds. For setups where commands can take a long time or experience transient issues like network disruptions, this then results in IO errors being sent to the application.

To fix this, the patch still uses the common block layer timeout framework, but if zero is set, nbd just logs a message and then resets the timer when it expires.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
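A sketch of the resulting timeout-handler behavior; the handler and field names follow the driver, but locking is elided and the details are illustrative::

	static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
							 bool reserved)
	{
		struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
		struct nbd_device *nbd = cmd->nbd;

		/* userspace asked for no timeout: log and rearm instead of
		 * erroring the IO after the block layer's 30s default */
		if (!nbd->tag_set.timeout) {
			dev_info_ratelimited(nbd_to_dev(nbd),
					     "possible stuck request, retrying\n");
			return BLK_EH_RESET_TIMER;
		}

		/* ... normal timeout handling: fail the request ... */
		return BLK_EH_DONE;
	}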
-
Mike Christie authored
Fix a bug added with the patch:

commit 8f3ea359
Author: Josef Bacik <josef@toxicpanda.com>
Date:   Mon Jul 16 12:11:35 2018 -0400

    nbd: handle unexpected replies better

where if the timeout handler runs at the same time as the completion path and we fail to grab the mutex in the timeout handler, we will leak a config reference and cannot free the config later.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Mike Christie authored
This adds a helper function to convert a block req op to an nbd cmd type. It will be used in the last patch to log the type in the timeout handler. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
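The helper is small enough to sketch in full; the fallback value for unknown ops is an assumption::

	static u32 req_to_nbd_cmd_type(struct request *req)
	{
		switch (req_op(req)) {
		case REQ_OP_DISCARD:
			return NBD_CMD_TRIM;
		case REQ_OP_FLUSH:
			return NBD_CMD_FLUSH;
		case REQ_OP_WRITE:
			return NBD_CMD_WRITE;
		case REQ_OP_READ:
			return NBD_CMD_READ;
		default:
			return U32_MAX;	/* unknown op */
		}
	}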
-
Mike Christie authored
Add a helper to set the cmd timeout. It does not really do a lot now, but will be more useful in the next patches. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Revanth Rajashekar authored
The original commit adding the sed-opal library by mistake added two definitions of OPAL_METHOD_LENGTH, remove one of them. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Revanth Rajashekar authored
In the function 'response_parse', num_entries will never be 0 as slen is checked for 0. Hence, the condition 'if (num_entries == 0)' can never be true. Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Revanth Rajashekar authored
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 19 Aug, 2019 1 commit
-
-
Junxiao Bi authored
The dispatch list is not used any more, as the legacy block IO stack has been removed. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Aug, 2019 2 commits
-
-
Tejun Heo authored
As inode wb switching may make sync(2) miss some inodes, they're synchronized using wb_switch_rwsem so that no wb switching happens while sync(2) is in progress. In addition to synchronizing the actual switching, the rwsem is also used to prevent queueing new switch attempts while sync(2) is in progress. This is to avoid queueing too many instances while the rwsem is held by sync(2).

Unfortunately, this is too aggressive and can block wb switching for a long time if sync(2) is frequent. The goal is to avoid exploding the number of scheduled switches, not to avoid scheduling anything. Let's use wb_switch_rwsem only for synchronizing the actual switching with sync(2), and use isw_nr_in_flight instead for limiting the maximum number of scheduled switches. The limit is set to 1024, which should be more than enough while still avoiding extreme situations.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
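A sketch of the new queueing guard; the constant name is an assumption (the message only gives the value 1024)::

	/* cap on scheduled-but-unfinished wb switches */
	#define ISW_MAX_IN_FLIGHT	1024

	static void inode_switch_wbs(struct inode *inode, int new_wb_id)
	{
		/* bound the number of scheduled switches; sync(2) now only
		 * blocks the actual switching, not the queueing */
		if (atomic_read(&isw_nr_in_flight) > ISW_MAX_IN_FLIGHT)
			return;

		/* ... allocate the switch work, bump isw_nr_in_flight,
		 * and queue it ... */
	}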
-
Tejun Heo authored
WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic to ignore short writeback rounds, to prevent getting confused by a burst of short writebacks. The parameter is currently 2, meaning that anything smaller than half of the running average writeback duration will be ignored.

This is unnecessarily aggressive. The detection logic uses 16 history slots and is already reasonably protected against some short bursts confusing it, and the current parameter can lead to tens of seconds of missed detection depending on the writeback pattern. Let's change the parameter to 8, so that it only ignores writebacks that are shorter than 12.5% of the current running average.

v2: Add comment explaining what's going on with the foreign detection parameters.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 14 Aug, 2019 1 commit
-
-
Johannes Weiner authored
psi tracks the time tasks wait for refaulting pages to become uptodate, but it does not track the time spent submitting the IO. The submission part can be significant if backing storage is contended or when cgroup throttling (io.latency) is in effect - a lot of time is spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when the bio is reading userspace workingset pages.

Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
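The mechanics presumably look like the following sketch; the workingset test is simplified to a single bio flag here, which is an assumption::

	blk_qc_t submit_bio(struct bio *bio)
	{
		/* ... existing accounting ... */

		/* reads of userspace workingset pages count as memory stall */
		if (unlikely(bio_op(bio) == REQ_OP_READ &&
			     bio_flagged(bio, BIO_WORKINGSET))) {
			unsigned long pflags;
			blk_qc_t ret;

			psi_memstall_enter(&pflags);
			ret = generic_make_request(bio);
			psi_memstall_leave(&pflags);

			return ret;
		}

		return generic_make_request(bio);
	}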
-
- 12 Aug, 2019 1 commit
-
-
zhengbin authored
If blk_mq_init_allocated_queue->elevator_init_mq fails, we need to release the previously requested resources.

Fixes: d3484991 ("blk-mq-sched: allow setting of default IO scheduler")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 Aug, 2019 3 commits
-
-
Jann Horn authored
As sparse points out, these two copy_from_user() should actually be copy_to_user(). Fixes: 229b53c9 ("take floppy compat ioctls to sodding floppy.c") Cc: stable@vger.kernel.org Acked-by: Alexander Popov <alex.popov@linux.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
A previous commit correctly removed set-but-not-read variables, but this left two new variables now unused. Kill them. Fixes: ba6f7da9 ("lightnvm: remove set but not used variables 'data_len' and 'rq_len'") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Alessio Balsini authored
Enabling Direct I/O with loop devices helps reduce memory usage by avoiding double caching. 32-bit applications running on 64-bit systems are currently not able to request direct I/O because LOOP_SET_DIRECT_IO is missing from lo_compat_ioctl().

This patch fixes the compatibility issue mentioned above by exporting LOOP_SET_DIRECT_IO as an additional lo_compat_ioctl() entry. The input argument for this ioctl is a single long converted to a 1-bit boolean, so compatibility is preserved.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
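Since the argument is a plain scalar, the compat entry can forward straight to the native handler; a sketch assuming the existing lo_compat_ioctl() switch shape::

	static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
				   unsigned int cmd, unsigned long arg)
	{
		switch (cmd) {
		/* ... existing compat cases ... */
		case LOOP_SET_DIRECT_IO:
			/* plain long used as a boolean: no layout translation */
			return lo_ioctl(bdev, mode, cmd, arg);
		default:
			return -ENOIOCTLCMD;
		}
	}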
-
- 08 Aug, 2019 2 commits
-
-
Zhou Wang authored
In the function sg_split, the second sg_calculate_split() will return -EINVAL when in_mapped_nents is 0. Indeed, there is no need to do the second sg_calculate_split() and sg_split_mapped() when in_mapped_nents is 0, as in_mapped_nents indicates that there are no mapped entries in the original sgl. Signed-off-by: Zhou Wang <wangzhou1@hisilicon.com> Acked-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
YueHaibing authored
drivers/lightnvm/pblk-read.c: In function pblk_submit_read_gc:
drivers/lightnvm/pblk-read.c:423:6: warning: variable data_len set but not used [-Wunused-but-set-variable]

drivers/lightnvm/pblk-recovery.c: In function pblk_recov_scan_oob:
drivers/lightnvm/pblk-recovery.c:368:15: warning: variable rq_len set but not used [-Wunused-but-set-variable]

They are not used since commit 48e5da72 ("lightnvm: move metadata mapping to lower level driver").

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 07 Aug, 2019 6 commits
-
-
Jens Axboe authored
Pull MD changes from Song.

* 'md-next' of https://github.com/liu-song-6/linux:
  raid1: factor out a common routine to handle the completion of sync write
  md: don't call spare_active in md_reap_sync_thread if all member devices can't work
  md: don't set In_sync if array is frozen
  md: allow last device to be forcibly removed from RAID1/RAID10.
  md: Convert to use int_pow()
  md/raid10: end bio when the device faulty
  md/raid1: end bio when the device faulty
  md/raid6: Set R5_ReadError when there is read failure on parity disk
  raid1: use an int as the return value of raise_barrier()
-
Hou Tao authored
It's just code clean-up. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
When a disk is added to the array, md_reap_sync_thread() is responsible for activating the spare and setting the In_sync flag for the new member in spare_active(). But suppose raid1 has one member disk A and disk B is added to the array: if we take A offline before all the data has been synchronized from A to B, then B obviously doesn't have the latest data A had, yet B is still marked with the In_sync flag.

So let's not call spare_active() under that condition, otherwise B is still shown in the 'U' state, which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
When a disk is added to the array, the following path is called in mdadm:

Manage_subdevs -> sysfs_freeze_array
               -> Manage_add
               -> sysfs_set_str(&info, NULL, "sync_action", "idle")

Then from the kernel side, Manage_add invokes the path (add_new_disk -> validate_super = super_1_validate) to set the In_sync flag. But In_sync means "device is in_sync with rest of array", and the newly added disk needs the resync thread to synchronize its data first; md_reap_sync_thread() will finally call spare_active() to set In_sync for the newly added disk. So don't set In_sync if the array is frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
-
Guoqing Jiang authored
When the 'last' device in a RAID1 or RAID10 reports an error, we do not mark it as failed. This would serve little purpose, as there is no risk of losing data beyond that which is obviously lost (as there is with RAID5), and there could be other sectors on the device which are readable, and only readable from this device. In general this maximises access to data.

However the current implementation also stops an admin from removing the last device by direct action. This is rarely useful, but in many cases it is not harmful and can make automation easier by removing special cases.

Also, if an attempt to write metadata fails, the device must be marked as faulty, else an infinite loop will result, attempting to update the metadata on all non-faulty devices.

So add a 'fail_last_dev' member to 'struct mddev'; then we can bypass the 'last disk' checks for RAID1 and RAID10, and control the behaviour per array via the sysfs node.

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
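A sketch of how the bypass might look in the RAID1 error path, with the knob consulted per array; the exact condition shape is an assumption::

	/* raid1 error handler: refuse to fail the last working device
	 * unless the admin opted in via fail_last_dev */
	if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev &&
	    (conf->raid_disks - mddev->degraded) == 1) {
		/* don't fail the drive; disable recovery instead */
		conf->recovery_disabled = mddev->recovery_disabled;
		return;
	}

	/* per-array control from userspace, e.g.:
	 *   echo 1 > /sys/block/md0/md/fail_last_dev */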
-
Andy Shevchenko authored
Instead of a linear approach to calculating a power of 10, use the generic int_pow(), which does it better. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-