Commits · 1cb039f3dc1619eb795c54aad0a98fdb379b4237 · nexedi / linux

24 Sep, 2020 11 commits

bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag · 1cb039f3

Christoph Hellwig authored Sep 24, 2020

The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1cb039f3

mm: use SWP_SYNCHRONOUS_IO more intelligently · 5115db10

Christoph Hellwig authored Sep 24, 2020

There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
is not set, as the device won't support it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5115db10

bdi: remove BDI_CAP_SYNCHRONOUS_IO · a8b456d0

Christoph Hellwig authored Sep 24, 2020

BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decided if ->rw_page can be used on a block device.  Just check up for
the method instead.  The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a8b456d0

bdi: remove BDI_CAP_CGROUP_WRITEBACK · ed7b6b4f

Christoph Hellwig authored Sep 24, 2020

Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed7b6b4f

block: lift setting the readahead size into the block layer · c2e4cd57

Christoph Hellwig authored Sep 24, 2020

Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c2e4cd57

md: update the optimal I/O size on reshape · 16ef5101

Christoph Hellwig authored Sep 24, 2020

The raid5 and raid10 drivers currently update the read-ahead size,
but not the optimal I/O size on reshape.  To prepare for deriving the
read-ahead size from the optimal I/O size make sure it is updated
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

16ef5101

bdi: initialize ->ra_pages and ->io_pages in bdi_init · 55b2598e

Christoph Hellwig authored Sep 24, 2020

Set up a readahead size by default, as very few users have a good
reason to change it.  This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

55b2598e

aoe: set an optimal I/O size · 9e82d35b

Christoph Hellwig authored Sep 24, 2020

aoe forces a larger readahead size, but any reason to do larger I/O
is not limited to readahead.  Also set the optimal I/O size, and
remove the local constants in favor of just using SZ_2G.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9e82d35b

bcache: inherit the optimal I/O size · 5d4ce78b

Christoph Hellwig authored Sep 24, 2020

Inherit the optimal I/O size setting just like the readahead window,
as any reason to do larger I/O does not apply to just readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5d4ce78b

drbd: remove dead code in device_to_statistics · b807a2c5

Christoph Hellwig authored Sep 24, 2020

Ever since the switch to blk-mq, a lower device not used for VM
writeback will not be marked congested, so the check will never
trigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b807a2c5

fs: remove the unused SB_I_MULTIROOT flag · 402dd2cf

Christoph Hellwig authored Sep 24, 2020

The last user of SB_I_MULTIROOT is disappeared with commit f2aedb71
("NFS: Add fs_context support.")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

402dd2cf

23 Sep, 2020 18 commits

block: mark blkdev_get static · 1fb1a2ad

Christoph Hellwig authored Sep 21, 2020

There are no users outside the core block code left now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1fb1a2ad

PM: mm: cleanup swsusp_swap_check · 36daaa98

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of bdget + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

36daaa98

mm: split swap_type_of · 21bd9005

Christoph Hellwig authored Sep 21, 2020

swap_type_of is used for two entirely different purposes:

 (1) check what swap type a given device/offset corresponds to
 (2) find the first available swap device that can be written to

Mixing both in a single function creates an unreadable mess.  Create two
separate functions instead, and switch both to pass a dev_t instead of
a struct block_device to further simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

21bd9005

PM: rewrite is_hibernate_resume_dev to not require an inode · bb3247a3

Christoph Hellwig authored Sep 21, 2020

Just check the dev_t to help simplifying the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bb3247a3

mm: cleanup claim_swapfile · ef16e1d9

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ef16e1d9

ocfs2: cleanup o2hb_region_dev_store · e455ed22

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) +
blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e455ed22

dasd: cleanup dasd_scan_partitions · 841ab72c

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of bdget_disk + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

841ab72c

raw: don't keep unopened block device around · 5a56ad78

Christoph Hellwig authored Sep 21, 2020

Turn binding into a normal dev_t as the struct block device doesn't
buy us anything and use blkdev_open_by_dev to actually open it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a56ad78

zram: cleanup backing_dev_store · 0fc66c9d

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0fc66c9d

pktcdvd: use blkdev_get_by_dev instead of open coding it · b8d95484

Christoph Hellwig authored Sep 21, 2020

Replace bdget + blkdev_get by blkdev_get_by_dev.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b8d95484

pktcdvd: remove the if 0'ed pkt_start_recovery function · 2d9e28a9

Christoph Hellwig authored Sep 21, 2020

Remove code which has been dead since the initial commit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d9e28a9

block: cleanup blkdev_bszset · 47816282

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47816282

block: cleanup partition scanning in register_disk · 9301fe73

Christoph Hellwig authored Sep 21, 2020

Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9301fe73

block: move the NEED_PART_SCAN flag to struct gendisk · 38430f08

Christoph Hellwig authored Sep 21, 2020

We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38430f08

block: allow 'chunk_sectors' to be non-power-of-2 · 07d098e6

Mike Snitzer authored Sep 21, 2020

It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07d098e6

block: use lcm_not_zero() when stacking chunk_sectors · 22ada802

Mike Snitzer authored Sep 21, 2020

Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

22ada802

block: fix bmd->is_null_mapped initialization · 03859717

Christoph Hellwig authored Sep 23, 2020

bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.

Fixes: f3256075 ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

03859717

block: drop double zeroing · f952eefe

Julia Lawall authored Sep 20, 2020

sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

the semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x;
@@

x =
- kzalloc
+ kmalloc
 (...)
...
sg_init_table(x,...)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f952eefe

15 Sep, 2020 5 commits

blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited · 87fbeb88

Baolin Wang authored Sep 07, 2020

Do not need check the bps or iops limitation if bps or iops is unlimited.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

87fbeb88

blk-throttle: Avoid calculating bps/iops limitation repeatedly · 4599ea49

Baolin Wang authored Sep 07, 2020

The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().

So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4599ea49

blk-throttle: Define readable macros instead of static variables · e675df2a

Baolin Wang authored Sep 07, 2020

The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, thus better to use readable macros instead of static
variables, which can also save some spaces for .bss area.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e675df2a

blk-throttle: Use readable READ/WRITE macros · ff8b22c0

Baolin Wang authored Sep 07, 2020

Use readable READ/WRITE macros instead of magic numbers.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ff8b22c0

blk-throttle: Fix some comments' typos · b53b072c

Baolin Wang authored Sep 07, 2020

Fix some comments' typos.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b53b072c

14 Sep, 2020 1 commit

iocost: fix infinite loop bug in adjust_inuse_and_calc_cost() · aa67db24

Tejun Heo authored Sep 14, 2020

adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.

Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.

Fix it by fetching the active weight after acquiring ioc->lock.

Fixes: b0853ab4 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

aa67db24

11 Sep, 2020 5 commits

blk-iocost: fix divide-by-zero in transfer_surpluses() · 769b628d

Tejun Heo authored Sep 11, 2020

Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.

Fixes: e08d02aa ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

769b628d

bcache: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct · 0806e60f

Song Liu authored Aug 31, 2020

This enables proper statistics in /proc/diskstats for bcache partitions.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0806e60f

md: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct · 00fe60ea

Song Liu authored Aug 31, 2020

This enables proper statistics in /proc/diskstats for md partitions.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

00fe60ea

block: introduce part_[begin|end]_io_acct · 7b26410b

Song Liu authored Aug 31, 2020

These functions can be used to enable iostat for partitions on devices
like md, bcache.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7b26410b

blk-mq: always allow reserved allocation in hctx_may_queue · 28500850

Ming Lei authored Sep 11, 2020

NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagset. Before error recovery, there is
often lots of in-flight requests which can't be completed, and new
reserved request may be needed in error recovery path. However,
hctx_may_queue() can always return false because there is too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28500850