Commit 6ab9e092 authored by Linus Torvalds

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as:
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
parents 52898511 52990a5f
......@@ -1857,8 +1857,10 @@ following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
associates the bio with the inode's owner cgroup. Can be
called anytime between bio allocation and submission.
associates the bio with the inode's owner cgroup and the
corresponding request queue. This must be called after
a queue (device) has been associated with the bio and
before submission.
wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
......@@ -1877,7 +1879,7 @@ the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
cases by skipping wbc_init_bio() or using bio_associate_create_blkg()
directly.
......
......@@ -190,7 +190,7 @@ whitespace:
notify_free Depending on device usage scenario it may account
a) the number of pages freed because of swap slot free
notifications or b) the number of pages freed because of
REQ_DISCARD requests sent by bio. The former ones are
REQ_OP_DISCARD requests sent by bio. The former ones are
sent to a swap block device when a swap slot is freed,
which implies that this disk is being used as a swap disk.
The latter ones are sent by filesystem mounted with
......
......@@ -38,7 +38,7 @@ inconsistent file system.
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete as those requests will obviously bypass the device cache.
Any REQ_DISCARD requests are treated like WRITE requests. Otherwise we would
Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example:
......
......@@ -28,7 +28,6 @@
#include <asm/byteorder.h>
#include <asm/memory.h>
#include <asm-generic/pci_iomap.h>
#include <xen/xen.h>
/*
* ISA I/O bus memory addresses are 1:1 with the physical address.
......@@ -459,20 +458,6 @@ extern void pci_iounmap(struct pci_dev *dev, void __iomem *addr);
#include <asm-generic/io.h>
/*
* can the hardware map this into one segment or not, given no other
* constraints.
*/
#define BIOVEC_MERGEABLE(vec1, vec2) \
((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2)))
struct bio_vec;
extern bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct bio_vec *vec2);
#define BIOVEC_PHYS_MERGEABLE(vec1, vec2) \
(__BIOVEC_PHYS_MERGEABLE(vec1, vec2) && \
(!xen_domain() || xen_biovec_phys_mergeable(vec1, vec2)))
#ifdef CONFIG_MMU
#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
extern int valid_phys_addr_range(phys_addr_t addr, size_t size);
......
......@@ -31,8 +31,6 @@
#include <asm/alternative.h>
#include <asm/cpufeature.h>
#include <xen/xen.h>
/*
* Generic IO read/write. These perform native-endian accesses.
*/
......@@ -205,12 +203,5 @@ extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t size);
extern int devmem_is_allowed(unsigned long pfn);
struct bio_vec;
extern bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct bio_vec *vec2);
#define BIOVEC_PHYS_MERGEABLE(vec1, vec2) \
(__BIOVEC_PHYS_MERGEABLE(vec1, vec2) && \
(!xen_domain() || xen_biovec_phys_mergeable(vec1, vec2)))
#endif /* __KERNEL__ */
#endif /* __ASM_IO_H */
......@@ -73,7 +73,7 @@ static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio)
len = bvec.bv_len;
len >>= 9;
nfhd_read_write(dev->id, 0, dir, sec >> shift, len >> shift,
bvec_to_phys(&bvec));
page_to_phys(bvec.bv_page) + bvec.bv_offset);
sec += len;
}
bio_endio(bio);
......
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_M68K_FD_H
#define _ASM_M68K_FD_H
/* Definitions for the Atari Floppy driver */
struct atari_format_descr {
int track; /* to be formatted */
int head; /* "" "" */
int sect_offset; /* offset of first sector */
};
#endif
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_FDREG_H
#define _LINUX_FDREG_H
/*
** WD1772 stuff
*/
/* register codes */
#define FDCSELREG_STP (0x80) /* command/status register */
#define FDCSELREG_TRA (0x82) /* track register */
#define FDCSELREG_SEC (0x84) /* sector register */
#define FDCSELREG_DTA (0x86) /* data register */
/* register names for FDC_READ/WRITE macros */
#define FDCREG_CMD 0
#define FDCREG_STATUS 0
#define FDCREG_TRACK 2
#define FDCREG_SECTOR 4
#define FDCREG_DATA 6
/* command opcodes */
#define FDCCMD_RESTORE (0x00) /* - */
#define FDCCMD_SEEK (0x10) /* | */
#define FDCCMD_STEP (0x20) /* | TYP 1 Commands */
#define FDCCMD_STIN (0x40) /* | */
#define FDCCMD_STOT (0x60) /* - */
#define FDCCMD_RDSEC (0x80) /* - TYP 2 Commands */
#define FDCCMD_WRSEC (0xa0) /* - " */
#define FDCCMD_RDADR (0xc0) /* - */
#define FDCCMD_RDTRA (0xe0) /* | TYP 3 Commands */
#define FDCCMD_WRTRA (0xf0) /* - */
#define FDCCMD_FORCI (0xd0) /* - TYP 4 Command */
/* command modifier bits */
#define FDCCMDADD_SR6 (0x00) /* step rate settings */
#define FDCCMDADD_SR12 (0x01)
#define FDCCMDADD_SR2 (0x02)
#define FDCCMDADD_SR3 (0x03)
#define FDCCMDADD_V (0x04) /* verify */
#define FDCCMDADD_H (0x08) /* wait for spin-up */
#define FDCCMDADD_U (0x10) /* update track register */
#define FDCCMDADD_M (0x10) /* multiple sector access */
#define FDCCMDADD_E (0x04) /* head settling flag */
#define FDCCMDADD_P (0x02) /* precompensation off */
#define FDCCMDADD_A0 (0x01) /* DAM flag */
/* status register bits */
#define FDCSTAT_MOTORON (0x80) /* motor on */
#define FDCSTAT_WPROT (0x40) /* write protected (FDCCMD_WR*) */
#define FDCSTAT_SPINUP (0x20) /* motor speed stable (Type I) */
#define FDCSTAT_DELDAM (0x20) /* sector has deleted DAM (Type II+III) */
#define FDCSTAT_RECNF (0x10) /* record not found */
#define FDCSTAT_CRC (0x08) /* CRC error */
#define FDCSTAT_TR00 (0x04) /* Track 00 flag (Type I) */
#define FDCSTAT_LOST (0x04) /* Lost Data (Type II+III) */
#define FDCSTAT_IDX (0x02) /* Index status (Type I) */
#define FDCSTAT_DRQ (0x02) /* DRQ status (Type II+III) */
#define FDCSTAT_BUSY (0x01) /* FDC is busy */
/* PSG Port A Bit Nr 0 .. Side Sel .. 0 -> Side 1 1 -> Side 2 */
#define DSKSIDE (0x01)
#define DSKDRVNONE (0x06)
#define DSKDRV0 (0x02)
#define DSKDRV1 (0x04)
/* step rates */
#define FDCSTEP_6 0x00
#define FDCSTEP_12 0x01
#define FDCSTEP_2 0x02
#define FDCSTEP_3 0x03
#endif
......@@ -23,6 +23,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/ata.h>
#include <linux/hdreg.h>
#include <linux/cdrom.h>
......@@ -142,7 +143,6 @@ struct cow {
#define MAX_SG 64
struct ubd {
struct list_head restart;
/* name (and fd, below) of the file opened for writing, either the
* backing or the cow file. */
char *file;
......@@ -156,11 +156,8 @@ struct ubd {
struct cow cow;
struct platform_device pdev;
struct request_queue *queue;
struct blk_mq_tag_set tag_set;
spinlock_t lock;
struct scatterlist sg[MAX_SG];
struct request *request;
int start_sg, end_sg;
sector_t rq_pos;
};
#define DEFAULT_COW { \
......@@ -182,10 +179,6 @@ struct ubd {
.shared = 0, \
.cow = DEFAULT_COW, \
.lock = __SPIN_LOCK_UNLOCKED(ubd_devs.lock), \
.request = NULL, \
.start_sg = 0, \
.end_sg = 0, \
.rq_pos = 0, \
}
/* Protected by ubd_lock */
......@@ -196,6 +189,9 @@ static int fake_ide = 0;
static struct proc_dir_entry *proc_ide_root = NULL;
static struct proc_dir_entry *proc_ide = NULL;
static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd);
static void make_proc_ide(void)
{
proc_ide_root = proc_mkdir("ide", NULL);
......@@ -436,11 +432,8 @@ __uml_help(udb_setup,
" in the boot output.\n\n"
);
static void do_ubd_request(struct request_queue * q);
/* Only changed by ubd_init, which is an initcall. */
static int thread_fd = -1;
static LIST_HEAD(restart);
/* Function to read several request pointers at a time
* handling fractional reads if (and as) needed
......@@ -498,9 +491,6 @@ static int bulk_req_safe_read(
/* Called without dev->lock held, and only in interrupt context. */
static void ubd_handler(void)
{
struct ubd *ubd;
struct list_head *list, *next_ele;
unsigned long flags;
int n;
int count;
......@@ -520,23 +510,17 @@ static void ubd_handler(void)
return;
}
for (count = 0; count < n/sizeof(struct io_thread_req *); count++) {
blk_end_request(
(*irq_req_buffer)[count]->req,
BLK_STS_OK,
(*irq_req_buffer)[count]->length
);
kfree((*irq_req_buffer)[count]);
struct io_thread_req *io_req = (*irq_req_buffer)[count];
int err = io_req->error ? BLK_STS_IOERR : BLK_STS_OK;
if (!blk_update_request(io_req->req, err, io_req->length))
__blk_mq_end_request(io_req->req, err);
kfree(io_req);
}
}
reactivate_fd(thread_fd, UBD_IRQ);
list_for_each_safe(list, next_ele, &restart){
ubd = container_of(list, struct ubd, restart);
list_del_init(&ubd->restart);
spin_lock_irqsave(&ubd->lock, flags);
do_ubd_request(ubd->queue);
spin_unlock_irqrestore(&ubd->lock, flags);
}
reactivate_fd(thread_fd, UBD_IRQ);
}
static irqreturn_t ubd_intr(int irq, void *dev)
......@@ -857,6 +841,7 @@ static void ubd_device_release(struct device *dev)
struct ubd *ubd_dev = dev_get_drvdata(dev);
blk_cleanup_queue(ubd_dev->queue);
blk_mq_free_tag_set(&ubd_dev->tag_set);
*ubd_dev = ((struct ubd) DEFAULT_UBD);
}
......@@ -891,7 +876,7 @@ static int ubd_disk_register(int major, u64 size, int unit,
disk->private_data = &ubd_devs[unit];
disk->queue = ubd_devs[unit].queue;
device_add_disk(parent, disk);
device_add_disk(parent, disk, NULL);
*disk_out = disk;
return 0;
......@@ -899,6 +884,10 @@ static int ubd_disk_register(int major, u64 size, int unit,
#define ROUND_BLOCK(n) ((n + ((1 << 9) - 1)) & (-1 << 9))
static const struct blk_mq_ops ubd_mq_ops = {
.queue_rq = ubd_queue_rq,
};
static int ubd_add(int n, char **error_out)
{
struct ubd *ubd_dev = &ubd_devs[n];
......@@ -915,15 +904,23 @@ static int ubd_add(int n, char **error_out)
ubd_dev->size = ROUND_BLOCK(ubd_dev->size);
INIT_LIST_HEAD(&ubd_dev->restart);
sg_init_table(ubd_dev->sg, MAX_SG);
ubd_dev->tag_set.ops = &ubd_mq_ops;
ubd_dev->tag_set.queue_depth = 64;
ubd_dev->tag_set.numa_node = NUMA_NO_NODE;
ubd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
ubd_dev->tag_set.driver_data = ubd_dev;
ubd_dev->tag_set.nr_hw_queues = 1;
err = -ENOMEM;
ubd_dev->queue = blk_init_queue(do_ubd_request, &ubd_dev->lock);
if (ubd_dev->queue == NULL) {
*error_out = "Failed to initialize device queue";
err = blk_mq_alloc_tag_set(&ubd_dev->tag_set);
if (err)
goto out;
ubd_dev->queue = blk_mq_init_queue(&ubd_dev->tag_set);
if (IS_ERR(ubd_dev->queue)) {
err = PTR_ERR(ubd_dev->queue);
goto out_cleanup;
}
ubd_dev->queue->queuedata = ubd_dev;
blk_queue_write_cache(ubd_dev->queue, true, false);
......@@ -931,7 +928,7 @@ static int ubd_add(int n, char **error_out)
err = ubd_disk_register(UBD_MAJOR, ubd_dev->size, n, &ubd_gendisk[n]);
if(err){
*error_out = "Failed to register device";
goto out_cleanup;
goto out_cleanup_tags;
}
if (fake_major != UBD_MAJOR)
......@@ -949,6 +946,8 @@ static int ubd_add(int n, char **error_out)
out:
return err;
out_cleanup_tags:
blk_mq_free_tag_set(&ubd_dev->tag_set);
out_cleanup:
blk_cleanup_queue(ubd_dev->queue);
goto out;
......@@ -1290,123 +1289,82 @@ static void cowify_req(struct io_thread_req *req, unsigned long *bitmap,
req->bitmap_words, bitmap_len);
}
/* Called with dev->lock held */
static void prepare_request(struct request *req, struct io_thread_req *io_req,
unsigned long long offset, int page_offset,
int len, struct page *page)
static int ubd_queue_one_vec(struct blk_mq_hw_ctx *hctx, struct request *req,
u64 off, struct bio_vec *bvec)
{
struct gendisk *disk = req->rq_disk;
struct ubd *ubd_dev = disk->private_data;
io_req->req = req;
io_req->fds[0] = (ubd_dev->cow.file != NULL) ? ubd_dev->cow.fd :
ubd_dev->fd;
io_req->fds[1] = ubd_dev->fd;
io_req->cow_offset = -1;
io_req->offset = offset;
io_req->length = len;
io_req->error = 0;
io_req->sector_mask = 0;
io_req->op = (rq_data_dir(req) == READ) ? UBD_READ : UBD_WRITE;
io_req->offsets[0] = 0;
io_req->offsets[1] = ubd_dev->cow.data_offset;
io_req->buffer = page_address(page) + page_offset;
io_req->sectorsize = 1 << 9;
if(ubd_dev->cow.file != NULL)
cowify_req(io_req, ubd_dev->cow.bitmap,
ubd_dev->cow.bitmap_offset, ubd_dev->cow.bitmap_len);
}
struct ubd *dev = hctx->queue->queuedata;
struct io_thread_req *io_req;
int ret;
/* Called with dev->lock held */
static void prepare_flush_request(struct request *req,
struct io_thread_req *io_req)
{
struct gendisk *disk = req->rq_disk;
struct ubd *ubd_dev = disk->private_data;
io_req = kmalloc(sizeof(struct io_thread_req), GFP_ATOMIC);
if (!io_req)
return -ENOMEM;
io_req->req = req;
io_req->fds[0] = (ubd_dev->cow.file != NULL) ? ubd_dev->cow.fd :
ubd_dev->fd;
io_req->op = UBD_FLUSH;
}
if (dev->cow.file)
io_req->fds[0] = dev->cow.fd;
else
io_req->fds[0] = dev->fd;
static bool submit_request(struct io_thread_req *io_req, struct ubd *dev)
{
int n = os_write_file(thread_fd, &io_req,
sizeof(io_req));
if (n != sizeof(io_req)) {
if (n != -EAGAIN)
printk("write to io thread failed, "
"errno = %d\n", -n);
else if (list_empty(&dev->restart))
list_add(&dev->restart, &restart);
if (req_op(req) == REQ_OP_FLUSH) {
io_req->op = UBD_FLUSH;
} else {
io_req->fds[1] = dev->fd;
io_req->cow_offset = -1;
io_req->offset = off;
io_req->length = bvec->bv_len;
io_req->error = 0;
io_req->sector_mask = 0;
io_req->op = rq_data_dir(req) == READ ? UBD_READ : UBD_WRITE;
io_req->offsets[0] = 0;
io_req->offsets[1] = dev->cow.data_offset;
io_req->buffer = page_address(bvec->bv_page) + bvec->bv_offset;
io_req->sectorsize = 1 << 9;
if (dev->cow.file) {
cowify_req(io_req, dev->cow.bitmap,
dev->cow.bitmap_offset, dev->cow.bitmap_len);
}
}
ret = os_write_file(thread_fd, &io_req, sizeof(io_req));
if (ret != sizeof(io_req)) {
if (ret != -EAGAIN)
pr_err("write to io thread failed: %d\n", -ret);
kfree(io_req);
return false;
}
return true;
return ret;
}
/* Called with dev->lock held */
static void do_ubd_request(struct request_queue *q)
static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
struct io_thread_req *io_req;
struct request *req;
while(1){
struct ubd *dev = q->queuedata;
if(dev->request == NULL){
struct request *req = blk_fetch_request(q);
if(req == NULL)
return;
dev->request = req;
dev->rq_pos = blk_rq_pos(req);
dev->start_sg = 0;
dev->end_sg = blk_rq_map_sg(q, req, dev->sg);
}
req = dev->request;
struct request *req = bd->rq;
int ret = 0;
if (req_op(req) == REQ_OP_FLUSH) {
io_req = kmalloc(sizeof(struct io_thread_req),
GFP_ATOMIC);
if (io_req == NULL) {
if (list_empty(&dev->restart))
list_add(&dev->restart, &restart);
return;
}
prepare_flush_request(req, io_req);
if (submit_request(io_req, dev) == false)
return;
}
blk_mq_start_request(req);
while(dev->start_sg < dev->end_sg){
struct scatterlist *sg = &dev->sg[dev->start_sg];
io_req = kmalloc(sizeof(struct io_thread_req),
GFP_ATOMIC);
if(io_req == NULL){
if(list_empty(&dev->restart))
list_add(&dev->restart, &restart);
return;
}
prepare_request(req, io_req,
(unsigned long long)dev->rq_pos << 9,
sg->offset, sg->length, sg_page(sg));
if (submit_request(io_req, dev) == false)
return;
dev->rq_pos += sg->length >> 9;
dev->start_sg++;
if (req_op(req) == REQ_OP_FLUSH) {
ret = ubd_queue_one_vec(hctx, req, 0, NULL);
} else {
struct req_iterator iter;
struct bio_vec bvec;
u64 off = (u64)blk_rq_pos(req) << 9;
rq_for_each_segment(bvec, req, iter) {
ret = ubd_queue_one_vec(hctx, req, off, &bvec);
if (ret < 0)
goto out;
off += bvec.bv_len;
}
dev->end_sg = 0;
dev->request = NULL;
}
out:
if (ret < 0) {
blk_mq_requeue_request(req, true);
}
return BLK_STS_OK;
}
static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
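The offset bookkeeping in ubd_queue_rq() above can be sketched in plain userspace C. This is a minimal illustration, not the driver code: segment_offsets() and its parameters are hypothetical names, and the segment lengths stand in for bvec.bv_len. It shows the start sector being widened to 64 bits before the shift by 9 (512-byte sectors), and each segment's byte offset being the running sum of the preceding segment lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): compute the byte offset of
 * each segment of a request that starts at start_sector, mirroring
 * "off = (u64)pos << 9; ...; off += bvec.bv_len;" from the hunk above.
 */
static void segment_offsets(uint32_t start_sector, const uint32_t *seg_len,
			    size_t nr_segs, uint64_t *off_out)
{
	/* Widen first: shifting a 32-bit sector number by 9 could overflow. */
	uint64_t off = (uint64_t)start_sector << 9;
	size_t i;

	assert(seg_len != NULL && off_out != NULL);

	for (i = 0; i < nr_segs; i++) {
		off_out[i] = off;	/* offset at which segment i starts */
		off += seg_len[i];	/* advance by the segment's length */
	}
}
```

For example, a request starting at sector 3 with segments of 4096, 512 and 1024 bytes yields offsets 1536, 5632 and 6144. A start sector of 2^23 yields 2^32, which no longer fits in 32 bits.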
......
......@@ -369,18 +369,6 @@ extern void __iomem *ioremap_wt(resource_size_t offset, unsigned long size);
extern bool is_early_ioremap_ptep(pte_t *ptep);
#ifdef CONFIG_XEN
#include <xen/xen.h>
struct bio_vec;
extern bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct bio_vec *vec2);
#define BIOVEC_PHYS_MERGEABLE(vec1, vec2) \
(__BIOVEC_PHYS_MERGEABLE(vec1, vec2) && \
(!xen_domain() || xen_biovec_phys_mergeable(vec1, vec2)))
#endif /* CONFIG_XEN */
#define IO_SPACE_LIMIT 0xffff
#include <asm-generic/io.h>
......
......@@ -2,6 +2,8 @@
#ifndef _ASM_X86_XEN_EVENTS_H
#define _ASM_X86_XEN_EVENTS_H
#include <xen/xen.h>
enum ipi_vector {
XEN_RESCHEDULE_VECTOR,
XEN_CALL_FUNCTION_VECTOR,
......
......@@ -5,6 +5,7 @@
#include <linux/kexec.h>
#include <linux/slab.h>
#include <xen/xen.h>
#include <xen/features.h>
#include <xen/page.h>
#include <xen/interface/memory.h>
......
......@@ -11,6 +11,7 @@
#include <asm/xen/interface.h>
#include <asm/xen/hypercall.h>
#include <xen/xen.h>
#include <xen/interface/memory.h>
#include <xen/interface/hvm/start_info.h>
......
......@@ -23,6 +23,7 @@
#include <linux/io.h>
#include <linux/export.h>
#include <xen/xen.h>
#include <xen/platform_pci.h>
#include "xen-ops.h"
......
......@@ -3,6 +3,7 @@
#include <linux/interrupt.h>
#include <asm/xen/hypercall.h>
#include <xen/xen.h>
#include <xen/page.h>
#include <xen/interface/xen.h>
#include <xen/interface/vcpu.h>
......
......@@ -74,7 +74,6 @@ config BLK_DEV_BSG
config BLK_DEV_BSGLIB
bool "Block layer SG support v4 helper lib"
default n
select BLK_DEV_BSG
select BLK_SCSI_REQUEST
help
......@@ -107,7 +106,6 @@ config BLK_DEV_ZONED
config BLK_DEV_THROTTLING
bool "Block layer bio throttling support"
depends on BLK_CGROUP=y
default n
---help---
Block layer bio throttling support. It can be used to limit
the IO rate to a device. IO rate policies are per cgroup and
......@@ -119,7 +117,6 @@ config BLK_DEV_THROTTLING
config BLK_DEV_THROTTLING_LOW
bool "Block throttling .low limit interface support (EXPERIMENTAL)"
depends on BLK_DEV_THROTTLING
default n
---help---
Add .low limit interface for block throttling. The low limit is a best
effort limit to prioritize cgroups. Depending on the setting, the limit
......@@ -130,7 +127,6 @@ config BLK_DEV_THROTTLING_LOW
config BLK_CMDLINE_PARSER
bool "Block device command line partition parser"
default n
---help---
Enabling this option allows you to specify the partition layout from
the kernel boot args. This is typically of use for embedded devices
......@@ -141,7 +137,6 @@ config BLK_CMDLINE_PARSER
config BLK_WBT
bool "Enable support for block device writeback throttling"
default n
---help---
Enabling this option enables the block layer to throttle buffered
background writeback from the VM, making it more smooth and having
......@@ -152,7 +147,6 @@ config BLK_WBT
config BLK_CGROUP_IOLATENCY
bool "Enable support for latency based cgroup IO protection"
depends on BLK_CGROUP=y
default n
---help---
Enabling this option enables the .latency interface for IO throttling.
The IO controller will attempt to maintain average IO latencies below
......@@ -163,7 +157,6 @@ config BLK_CGROUP_IOLATENCY
config BLK_WBT_SQ
bool "Single queue writeback throttling"
default n
depends on BLK_WBT
---help---
Enable writeback throttling by default on legacy single queue devices
......@@ -228,4 +221,7 @@ config BLK_MQ_RDMA
depends on BLOCK && INFINIBAND
default y
config BLK_PM
def_bool BLOCK && PM
source block/Kconfig.iosched
......@@ -36,7 +36,6 @@ config IOSCHED_CFQ
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
default n
---help---
Enable group IO scheduling in CFQ.
......@@ -82,7 +81,6 @@ config MQ_IOSCHED_KYBER
config IOSCHED_BFQ
tristate "BFQ I/O scheduler"
default n
---help---
BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
of the device among all processes according to their weights,
......@@ -94,7 +92,6 @@ config IOSCHED_BFQ
config BFQ_GROUP_IOSCHED
bool "BFQ hierarchical scheduling support"
depends on IOSCHED_BFQ && BLK_CGROUP
default n
---help---
Enable hierarchical scheduling in BFQ, using the blkio
......
......@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT) += blk-wbt.o
obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
obj-$(CONFIG_BLK_PM) += blk-pm.o
......@@ -642,7 +642,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
uint64_t serial_nr;
rcu_read_lock();
serial_nr = bio_blkcg(bio)->css.serial_nr;
serial_nr = __bio_blkcg(bio)->css.serial_nr;
/*
* Check whether blkcg has changed. The condition may trigger
......@@ -651,7 +651,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
goto out;
bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio));
/*
* Update blkg_path for bfq_log_* functions. We cache this
* path, and update it here, for the following
......
......@@ -108,15 +108,14 @@ struct bfq_sched_data {
};
/**
* struct bfq_weight_counter - counter of the number of all active entities
* struct bfq_weight_counter - counter of the number of all active queues
* with a given weight.
*/
struct bfq_weight_counter {
unsigned int weight; /* weight of the entities this counter refers to */
unsigned int num_active; /* nr of active entities with this weight */
unsigned int weight; /* weight of the queues this counter refers to */
unsigned int num_active; /* nr of active queues with this weight */
/*
* Weights tree member (see bfq_data's @queue_weights_tree and
* @group_weights_tree)
* Weights tree member (see bfq_data's @queue_weights_tree)
*/
struct rb_node weights_node;
};
......@@ -151,8 +150,6 @@ struct bfq_weight_counter {
struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
/* pointer to the weight counter associated with this entity */
struct bfq_weight_counter *weight_counter;
/*
* Flag, true if the entity is on a tree (either the active or
......@@ -266,6 +263,9 @@ struct bfq_queue {
/* entity representing this queue in the scheduler */
struct bfq_entity entity;
/* pointer to the weight counter associated with this entity */
struct bfq_weight_counter *weight_counter;
/* maximum budget allowed from the feedback mechanism */
int max_budget;
/* budget expiration (in jiffies) */
......@@ -351,6 +351,32 @@ struct bfq_queue {
unsigned long split_time; /* time of last split */
unsigned long first_IO_time; /* time of first I/O for this queue */
/* max service rate measured so far */
u32 max_service_rate;
/*
* Ratio between the service received by bfqq while it is in
* service, and the cumulative service (of requests of other
* queues) that may be injected while bfqq is empty but still
* in service. To increase precision, the coefficient is
* measured in tenths of unit. Here are some example of (1)
* ratios, (2) resulting percentages of service injected
* w.r.t. to the total service dispatched while bfqq is in
* service, and (3) corresponding values of the coefficient:
* 1 (50%) -> 10
* 2 (33%) -> 20
* 10 (9%) -> 100
* 9.9 (9%) -> 99
* 1.5 (40%) -> 15
* 0.5 (66%) -> 5
* 0.1 (90%) -> 1
*
* So, if the coefficient is lower than 10, then
* injected service is more than bfqq service.
*/
unsigned int inject_coeff;
/* amount of service injected in current service slot */
unsigned int injected_service;
};
/**
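The table in the inject_coeff comment above follows from simple arithmetic. If the coefficient is ten times the ratio between bfqq's own service and the injected service, the injected share of the total is 1 / (1 + coeff/10), i.e. 1000 / (10 + coeff) percent. The sketch below reproduces the comment's table under that reading; injected_share_percent() and check_comment_table() are hypothetical names used only for illustration, and the percentages are rounded down as in the comment.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a BFQ function): percentage of the total service
 * dispatched during a slot that is injected from other queues, given
 * coeff = 10 * (bfqq service / injected service). Rounded down.
 */
static unsigned int injected_share_percent(unsigned int coeff)
{
	/* 64-bit arithmetic so that 10 + coeff cannot wrap around. */
	return (unsigned int)(1000ULL / (10ULL + coeff));
}

/* Re-derive every row of the table in the inject_coeff comment. */
static void check_comment_table(void)
{
	assert(injected_share_percent(10) == 50);	/* ratio 1   */
	assert(injected_share_percent(20) == 33);	/* ratio 2   */
	assert(injected_share_percent(100) == 9);	/* ratio 10  */
	assert(injected_share_percent(99) == 9);	/* ratio 9.9 */
	assert(injected_share_percent(15) == 40);	/* ratio 1.5 */
	assert(injected_share_percent(5) == 66);	/* ratio 0.5 */
	assert(injected_share_percent(1) == 90);	/* ratio 0.1 */
}
```

This also makes the comment's closing remark concrete: any coefficient below 10 gives a share above 50%, so more service is injected than bfqq itself receives.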
......@@ -423,14 +449,9 @@ struct bfq_data {
*/
struct rb_root queue_weights_tree;
/*
* rbtree of non-queue @bfq_entity weight counters, sorted by
* weight. Used to keep track of whether all @bfq_groups have
* the same weight. The tree contains one counter for each
* distinct weight associated to some active @bfq_group (see
* the comments to the functions bfq_weights_tree_[add|remove]
* for further details).
* number of groups with requests still waiting for completion
*/
struct rb_root group_weights_tree;
unsigned int num_active_groups;
/*
* Number of bfq_queues containing requests (including the
......@@ -825,10 +846,10 @@ struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync);
void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct rb_root *root);
void __bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_entity *entity,
struct bfq_queue *bfqq,
struct rb_root *root);
void bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_queue *bfqq);
......
......@@ -788,25 +788,29 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
new_weight = entity->orig_weight *
(bfqq ? bfqq->wr_coeff : 1);
/*
* If the weight of the entity changes, remove the entity
* from its old weight counter (if there is a counter
* associated with the entity), and add it to the counter
* associated with its new weight.
* If the weight of the entity changes, and the entity is a
* queue, remove the entity from its old weight counter (if
* there is a counter associated with the entity).
*/
if (prev_weight != new_weight) {
root = bfqq ? &bfqd->queue_weights_tree :
&bfqd->group_weights_tree;
__bfq_weights_tree_remove(bfqd, entity, root);
if (bfqq) {
root = &bfqd->queue_weights_tree;
__bfq_weights_tree_remove(bfqd, bfqq, root);
} else
bfqd->num_active_groups--;
}
entity->weight = new_weight;
/*
* Add the entity to its weights tree only if it is
* not associated with a weight-raised queue.
* Add the entity, if it is not a weight-raised queue,
* to the counter associated with its new weight.
*/
if (prev_weight != new_weight &&
(bfqq ? bfqq->wr_coeff == 1 : 1))
/* If we get here, root has been initialized. */
bfq_weights_tree_add(bfqd, entity, root);
if (prev_weight != new_weight) {
if (bfqq && bfqq->wr_coeff == 1) {
/* If we get here, root has been initialized. */
bfq_weights_tree_add(bfqd, bfqq, root);
} else
bfqd->num_active_groups++;
}
new_st->wsum += entity->weight;
......@@ -1012,9 +1016,9 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
struct bfq_group *bfqg =
container_of(entity, struct bfq_group, entity);
struct bfq_data *bfqd = bfqg->bfqd;
bfq_weights_tree_add(bfqg->bfqd, entity,
&bfqd->group_weights_tree);
bfqd->num_active_groups++;
}
#endif
......@@ -1181,10 +1185,17 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
st = bfq_entity_service_tree(entity);
is_in_service = entity == sd->in_service_entity;
if (is_in_service) {
bfq_calc_finish(entity, entity->service);
bfq_calc_finish(entity, entity->service);
if (is_in_service)
sd->in_service_entity = NULL;
}
else
/*
* Non in-service entity: nobody will take care of
* resetting its service counter on expiration. Do it
* now.
*/
entity->service = 0;
if (entity->tree == &st->active)
bfq_active_extract(st, entity);
......@@ -1685,7 +1696,7 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
if (!bfqq->dispatched)
if (bfqq->wr_coeff == 1)
bfq_weights_tree_add(bfqd, &bfqq->entity,
bfq_weights_tree_add(bfqd, bfqq,
&bfqd->queue_weights_tree);
if (bfqq->wr_coeff > 1)
......
......@@ -306,6 +306,8 @@ bool bio_integrity_prep(struct bio *bio)
if (bio_data_dir(bio) == WRITE) {
bio_integrity_process(bio, &bio->bi_iter,
bi->profile->generate_fn);
} else {
bip->bio_iter = bio->bi_iter;
}
return true;
......@@ -331,20 +333,14 @@ static void bio_integrity_verify_fn(struct work_struct *work)
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
struct bvec_iter iter = bio->bi_iter;
/*
* At the moment verify is called bio's iterator was advanced
* during split and completion, we need to rewind iterator to
* it's original position.
*/
if (bio_rewind_iter(bio, &iter, iter.bi_done)) {
bio->bi_status = bio_integrity_process(bio, &iter,
bi->profile->verify_fn);
} else {
bio->bi_status = BLK_STS_IOERR;
}
bio->bi_status = bio_integrity_process(bio, &bip->bio_iter,
bi->profile->verify_fn);
bio_integrity_free(bio);
bio_endio(bio);
}
......
......@@ -609,7 +609,9 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
bio_clone_blkcg_association(bio, bio_src);
bio_clone_blkg_association(bio, bio_src);
blkcg_bio_issue_init(bio);
}
EXPORT_SYMBOL(__bio_clone_fast);
......@@ -729,7 +731,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
}
/* If we may be able to merge these biovecs, force a recount */
if (bio->bi_vcnt > 1 && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
if (bio->bi_vcnt > 1 && biovec_phys_mergeable(q, bvec - 1, bvec))
bio_clear_flag(bio, BIO_SEG_VALID);
done:
......@@ -827,6 +829,8 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
/**
* __bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
* @bio: bio to add pages to
......@@ -839,38 +843,35 @@ EXPORT_SYMBOL(bio_add_page);
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt, idx;
unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
ssize_t size, left;
unsigned len, i;
size_t offset;
ssize_t size;
/*
* Move page array up in the allocated memory for the bio vecs as far as
* possible so that we can start filling biovecs from the beginning
* without overwriting the temporary page array.
*/
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
idx = nr_pages = (size + offset + PAGE_SIZE - 1) / PAGE_SIZE;
/*
* Deep magic below: We need to walk the pinned pages backwards
* because we are abusing the space allocated for the bio_vecs
* for the page array. Because the bio_vecs are larger than the
* page pointers by definition this will always work. But it also
* means we can't use bio_add_page, so any changes to it's semantics
* need to be reflected here as well.
*/
bio->bi_iter.bi_size += size;
bio->bi_vcnt += nr_pages;
for (left = size, i = 0; left > 0; left -= len, i++) {
struct page *page = pages[i];
while (idx--) {
bv[idx].bv_page = pages[idx];
bv[idx].bv_len = PAGE_SIZE;
bv[idx].bv_offset = 0;
len = min_t(size_t, PAGE_SIZE - offset, left);
if (WARN_ON_ONCE(bio_add_page(bio, page, len, offset) != len))
return -EINVAL;
offset = 0;
}
bv[0].bv_offset += offset;
bv[0].bv_len -= offset;
bv[nr_pages - 1].bv_len -= nr_pages * PAGE_SIZE - offset - size;
iov_iter_advance(iter, size);
return 0;
}
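The reworked __bio_iov_iter_get_pages() above parks the temporary page-pointer array at the tail of the bio_vec storage and then fills the vectors from the front. The sketch below is a userspace model of that layout trick only, not the kernel code: `struct vec`, `fill_vecs_in_place()` and the fixed 4096-byte length are hypothetical stand-ins, and the model assumes the vector size is an exact multiple of the pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct bio_vec: one pointer plus two 32-bit fields. */
struct vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* How many page pointers fit into the storage of one vector. */
#define PTRS_PER_VEC (sizeof(struct vec) / sizeof(void *))

/* The trick needs at least two pointers per vector and no leftover bytes. */
_Static_assert(sizeof(struct vec) / sizeof(void *) >= 2,
	       "a vector must hold at least two pointers");
_Static_assert(sizeof(struct vec) % sizeof(void *) == 0,
	       "vector size must be a multiple of the pointer size");

/*
 * Fill vecs[0..n) in place. The n source pointers are first copied into the
 * last n pointer-sized slots of the vecs storage (this models the pinned page
 * array), then the vectors are written from index 0 upwards. Writing vecs[i]
 * never touches a pointer slot that has not been read yet.
 */
void fill_vecs_in_place(struct vec *vecs, size_t n, void *const *src)
{
	unsigned char *base = (unsigned char *)vecs;
	size_t tail = n * (PTRS_PER_VEC - 1) * sizeof(void *);
	size_t i;

	/* Stage the pointer array as far up in the storage as possible. */
	memcpy(base + tail, src, n * sizeof(void *));

	for (i = 0; i < n; i++) {
		void *page;

		/* Read slot i before the store to vecs[i] can overwrite it. */
		memcpy(&page, base + tail + i * sizeof(void *), sizeof(page));
		vecs[i].page = page;
		vecs[i].len = 4096;	/* assumed page size for the model */
		vecs[i].offset = 0;
	}
}
```

Because the staging area starts at slot n * (PTRS_PER_VEC - 1), vector i ends exactly where unread pointer i + 1 begins or earlier, which is why a forward walk is safe here while the removed code had to walk backwards.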
......@@ -1807,7 +1808,6 @@ struct bio *bio_split(struct bio *bio, int sectors,
bio_integrity_trim(split);
bio_advance(bio, split->bi_iter.bi_size);
bio->bi_iter.bi_done = 0;
if (bio_flagged(bio, BIO_TRACE_COMPLETION))
bio_set_flag(split, BIO_TRACE_COMPLETION);
......@@ -1956,69 +1956,151 @@ EXPORT_SYMBOL(bioset_init_from_src);
#ifdef CONFIG_BLK_CGROUP
/**
* bio_associate_blkg - associate a bio with a blkg
* @bio: target bio
* @blkg: the blkg to associate
*
* This tries to associate @bio with the specified blkg. Association failure
* is handled by walking up the blkg tree. Therefore, the blkg associated can
* be anything between @blkg and the root_blkg. This situation only happens
* when a cgroup is dying and then the remaining bios will spill to the closest
* alive blkg.
*
* A reference will be taken on the @blkg and will be released when @bio is
* freed.
*/
int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
{
if (unlikely(bio->bi_blkg))
return -EBUSY;
bio->bi_blkg = blkg_tryget_closest(blkg);
return 0;
}
/**
* __bio_associate_blkg_from_css - internal blkg association function
*
* This is the core association function that all association paths rely on.
* A blkg reference is taken which is released upon freeing of the bio.
*/
static int __bio_associate_blkg_from_css(struct bio *bio,
struct cgroup_subsys_state *css)
{
struct request_queue *q = bio->bi_disk->queue;
struct blkcg_gq *blkg;
int ret;
rcu_read_lock();
if (!css || !css->parent)
blkg = q->root_blkg;
else
blkg = blkg_lookup_create(css_to_blkcg(css), q);
ret = bio_associate_blkg(bio, blkg);
rcu_read_unlock();
return ret;
}
/**
* bio_associate_blkg_from_css - associate a bio with a specified css
* @bio: target bio
* @css: target css
*
* Associate @bio with the blkg found by combining the css's blkg and the
* request_queue of the @bio. This falls back to the queue's root_blkg if
* the association fails with the css.
*/
int bio_associate_blkg_from_css(struct bio *bio,
struct cgroup_subsys_state *css)
{
if (unlikely(bio->bi_blkg))
return -EBUSY;
return __bio_associate_blkg_from_css(bio, css);
}
EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css);
#ifdef CONFIG_MEMCG
/**
* bio_associate_blkcg_from_page - associate a bio with the page's blkcg
* bio_associate_blkg_from_page - associate a bio with the page's blkg
* @bio: target bio
* @page: the page to lookup the blkcg from
*
* Associate @bio with the blkcg from @page's owning memcg. This works like
* every other associate function wrt references.
* Associate @bio with the blkg from @page's owning memcg and the respective
* request_queue. If cgroup_e_css returns NULL, fall back to the queue's
* root_blkg.
*
* Note: this must be called after bio has an associated device.
*/
int bio_associate_blkcg_from_page(struct bio *bio, struct page *page)
int bio_associate_blkg_from_page(struct bio *bio, struct page *page)
{
struct cgroup_subsys_state *blkcg_css;
struct cgroup_subsys_state *css;
int ret;
if (unlikely(bio->bi_css))
if (unlikely(bio->bi_blkg))
return -EBUSY;
if (!page->mem_cgroup)
return 0;
blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
&io_cgrp_subsys);
bio->bi_css = blkcg_css;
return 0;
rcu_read_lock();
css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys);
ret = __bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
return ret;
}
#endif /* CONFIG_MEMCG */
/**
* bio_associate_blkcg - associate a bio with the specified blkcg
* bio_associate_create_blkg - associate a bio with a blkg from q
* @q: request_queue where bio is going
* @bio: target bio
* @blkcg_css: css of the blkcg to associate
*
* Associate @bio with the blkcg specified by @blkcg_css. Block layer will
* treat @bio as if it were issued by a task which belongs to the blkcg.
*
* This function takes an extra reference of @blkcg_css which will be put
* when @bio is released. The caller must own @bio and is responsible for
* synchronizing calls to this function.
* Associate @bio with the blkg found from the bio's css and the request_queue.
* If one is not found, bio_lookup_blkg creates the blkg. This falls back to
* the queue's root_blkg if association fails.
*/
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
int bio_associate_create_blkg(struct request_queue *q, struct bio *bio)
{
if (unlikely(bio->bi_css))
return -EBUSY;
css_get(blkcg_css);
bio->bi_css = blkcg_css;
return 0;
struct cgroup_subsys_state *css;
int ret = 0;
/* someone has already associated this bio with a blkg */
if (bio->bi_blkg)
return ret;
rcu_read_lock();
css = blkcg_css();
ret = __bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL_GPL(bio_associate_blkcg);
/**
* bio_associate_blkg - associate a bio with the specified blkg
* bio_reassociate_blkg - reassociate a bio with a blkg from q
* @q: request_queue where bio is going
* @bio: target bio
* @blkg: the blkg to associate
*
* Associate @bio with the blkg specified by @blkg. This is the queue specific
* blkcg information associated with the @bio, a reference will be taken on the
* @blkg and will be freed when the bio is freed.
* When submitting a bio, multiple recursive calls to make_request() may occur.
* This causes the initial associate done in blkcg_bio_issue_check() to be
* incorrect and reference the prior request_queue. This performs reassociation
* when this situation happens.
*/
int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg)
int bio_reassociate_blkg(struct request_queue *q, struct bio *bio)
{
if (unlikely(bio->bi_blkg))
return -EBUSY;
if (!blkg_try_get(blkg))
return -ENODEV;
bio->bi_blkg = blkg;
return 0;
if (bio->bi_blkg) {
blkg_put(bio->bi_blkg);
bio->bi_blkg = NULL;
}
return bio_associate_create_blkg(q, bio);
}
/**
......@@ -2031,10 +2113,6 @@ void bio_disassociate_task(struct bio *bio)
put_io_context(bio->bi_ioc);
bio->bi_ioc = NULL;
}
if (bio->bi_css) {
css_put(bio->bi_css);
bio->bi_css = NULL;
}
if (bio->bi_blkg) {
blkg_put(bio->bi_blkg);
bio->bi_blkg = NULL;
......@@ -2042,16 +2120,16 @@ void bio_disassociate_task(struct bio *bio)
}
/**
* bio_clone_blkcg_association - clone blkcg association from src to dst bio
* bio_clone_blkg_association - clone blkg association from src to dst bio
* @dst: destination bio
* @src: source bio
*/
void bio_clone_blkcg_association(struct bio *dst, struct bio *src)
void bio_clone_blkg_association(struct bio *dst, struct bio *src)
{
if (src->bi_css)
WARN_ON(bio_associate_blkcg(dst, src->bi_css));
if (src->bi_blkg)
bio_associate_blkg(dst, src->bi_blkg);
}
EXPORT_SYMBOL_GPL(bio_clone_blkcg_association);
EXPORT_SYMBOL_GPL(bio_clone_blkg_association);
#endif /* CONFIG_BLK_CGROUP */
static void __init biovec_init_slabs(void)
......
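The comment on bio_associate_blkg() above says that a failed association is handled by walking up the blkg tree until a live group is found. The definition of blkg_tryget_closest() is not part of this hunk, so the following is only a userspace model of the described behaviour: `struct group`, its plain integer refcount and the `dying` flag are hypothetical simplifications of the percpu_ref based blkg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a group with a parent link and a reference count. */
struct group {
	struct group *parent;
	int refs;
	bool dying;	/* models a killed reference counter */
};

/* Take a reference unless the group is already dying. */
bool group_tryget(struct group *g)
{
	if (g->dying)
		return false;
	g->refs++;
	return true;
}

/*
 * Walk from @g towards the root and return the first group on which a
 * reference could be taken, or NULL if every ancestor is dying.
 */
struct group *group_tryget_closest(struct group *g)
{
	while (g && !group_tryget(g))
		g = g->parent;
	return g;
}
```

In this model a bio aimed at a dying leaf ends up charged to the nearest live ancestor, which matches the "spill to the closest alive blkg" wording in the comment.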
......@@ -84,6 +84,37 @@ static void blkg_free(struct blkcg_gq *blkg)
kfree(blkg);
}
static void __blkg_release(struct rcu_head *rcu)
{
struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
percpu_ref_exit(&blkg->refcnt);
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
if (blkg->parent)
blkg_put(blkg->parent);
wb_congested_put(blkg->wb_congested);
blkg_free(blkg);
}
/*
* A group is RCU protected, but having an rcu lock does not mean that one
* can access all the fields of blkg and assume these are valid. For
* example, don't try to follow throtl_data and request queue links.
*
* Having a reference to blkg under an rcu allows accesses to only values
* local to groups like group stats and group rate limits.
*/
static void blkg_release(struct percpu_ref *ref)
{
struct blkcg_gq *blkg = container_of(ref, struct blkcg_gq, refcnt);
call_rcu(&blkg->rcu_head, __blkg_release);
}
/**
* blkg_alloc - allocate a blkg
* @blkcg: block cgroup the new blkg is associated with
......@@ -110,7 +141,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(&blkg->q_node);
blkg->blkcg = blkcg;
atomic_set(&blkg->refcnt, 1);
/* root blkg uses @q->root_rl, init rl only for !root blkgs */
if (blkcg != &blkcg_root) {
......@@ -217,6 +247,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_get(blkg->parent);
}
ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0,
GFP_NOWAIT | __GFP_NOWARN);
if (ret)
goto err_cancel_ref;
/* invoke per-policy init */
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
......@@ -249,6 +284,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);
err_cancel_ref:
percpu_ref_exit(&blkg->refcnt);
err_put_congested:
wb_congested_put(wb_congested);
err_put_css:
......@@ -259,7 +296,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
}
/**
* blkg_lookup_create - lookup blkg, try to create one if not there
* __blkg_lookup_create - lookup blkg, try to create one if not there
* @blkcg: blkcg of interest
* @q: request_queue of interest
*
......@@ -268,12 +305,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
* that all non-root blkg's have access to the parent blkg. This function
* should be called under RCU read lock and @q->queue_lock.
*
* Returns pointer to the looked up or created blkg on success, ERR_PTR()
* value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not
* dead and bypassing, returns ERR_PTR(-EBUSY).
* Returns the blkg or the closest blkg if blkg_create fails as it walks
* down from root.
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
{
struct blkcg_gq *blkg;
......@@ -285,7 +321,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
* we shouldn't allow anything to go through for a bypassing queue.
*/
if (unlikely(blk_queue_bypass(q)))
return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY);
return q->root_blkg;
blkg = __blkg_lookup(blkcg, q, true);
if (blkg)
......@@ -293,23 +329,58 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
/*
* Create blkgs walking down from blkcg_root to @blkcg, so that all
* non-root blkgs have access to their parents.
* non-root blkgs have access to their parents. Returns the closest
* blkg to the intended blkg should blkg_create() fail.
*/
while (true) {
struct blkcg *pos = blkcg;
struct blkcg *parent = blkcg_parent(blkcg);
while (parent && !__blkg_lookup(parent, q, false)) {
struct blkcg_gq *ret_blkg = q->root_blkg;
while (parent) {
blkg = __blkg_lookup(parent, q, false);
if (blkg) {
/* remember closest blkg */
ret_blkg = blkg;
break;
}
pos = parent;
parent = blkcg_parent(parent);
}
blkg = blkg_create(pos, q, NULL);
if (pos == blkcg || IS_ERR(blkg))
if (IS_ERR(blkg))
return ret_blkg;
if (pos == blkcg)
return blkg;
}
}
/**
* blkg_lookup_create - find or create a blkg
* @blkcg: target block cgroup
* @q: target request_queue
*
* This looks up or creates the blkg representing the unique pair
* of the blkcg and the request_queue.
*/
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
struct request_queue *q)
{
struct blkcg_gq *blkg = blkg_lookup(blkcg, q);
unsigned long flags;
if (unlikely(!blkg)) {
spin_lock_irqsave(q->queue_lock, flags);
blkg = __blkg_lookup_create(blkcg, q);
spin_unlock_irqrestore(q->queue_lock, flags);
}
return blkg;
}
static void blkg_destroy(struct blkcg_gq *blkg)
{
struct blkcg *blkcg = blkg->blkcg;
......@@ -353,7 +424,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
blkg_put(blkg);
percpu_ref_kill(&blkg->refcnt);
}
/**
......@@ -380,29 +451,6 @@ static void blkg_destroy_all(struct request_queue *q)
q->root_rl.blkg = NULL;
}
/*
* A group is RCU protected, but having an rcu lock does not mean that one
* can access all the fields of blkg and assume these are valid. For
* example, don't try to follow throtl_data and request queue links.
*
* Having a reference to blkg under an rcu allows accesses to only values
* local to groups like group stats and group rate limits.
*/
void __blkg_release_rcu(struct rcu_head *rcu_head)
{
struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
if (blkg->parent)
blkg_put(blkg->parent);
wb_congested_put(blkg->wb_congested);
blkg_free(blkg);
}
EXPORT_SYMBOL_GPL(__blkg_release_rcu);
/*
* The next function used by blk_queue_for_each_rl(). It's a bit tricky
* because the root blkg uses @q->root_rl instead of its own rl.
......@@ -1748,8 +1796,7 @@ void blkcg_maybe_throttle_current(void)
blkg = blkg_lookup(blkcg, q);
if (!blkg)
goto out;
blkg = blkg_try_get(blkg);
if (!blkg)
if (!blkg_tryget(blkg))
goto out;
rcu_read_unlock();
......
......@@ -566,12 +566,12 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
EXPORT_SYMBOL(blkdev_issue_flush);
struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
int node, int cmd_size)
int node, int cmd_size, gfp_t flags)
{
struct blk_flush_queue *fq;
int rq_sz = sizeof(struct request);
fq = kzalloc_node(sizeof(*fq), GFP_KERNEL, node);
fq = kzalloc_node(sizeof(*fq), flags, node);
if (!fq)
goto fail;
......@@ -579,7 +579,7 @@ struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
spin_lock_init(&fq->mq_flush_lock);
rq_sz = round_up(rq_sz + cmd_size, cache_line_size());
fq->flush_rq = kzalloc_node(rq_sz, GFP_KERNEL, node);
fq->flush_rq = kzalloc_node(rq_sz, flags, node);
if (!fq->flush_rq)
goto fail_rq;
......
......@@ -49,12 +49,8 @@ int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
bio_for_each_integrity_vec(iv, bio, iter) {
if (prev) {
if (!BIOVEC_PHYS_MERGEABLE(&ivprv, &iv))
if (!biovec_phys_mergeable(q, &ivprv, &iv))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, &ivprv, &iv))
goto new_segment;
if (seg_size + iv.bv_len > queue_max_segment_size(q))
goto new_segment;
......@@ -95,12 +91,8 @@ int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
bio_for_each_integrity_vec(iv, bio, iter) {
if (prev) {
if (!BIOVEC_PHYS_MERGEABLE(&ivprv, &iv))
if (!biovec_phys_mergeable(q, &ivprv, &iv))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, &ivprv, &iv))
goto new_segment;
if (sg->length + iv.bv_len > queue_max_segment_size(q))
goto new_segment;
......
......@@ -12,6 +12,69 @@
#include "blk.h"
/*
* Check if the two bvecs from two bios can be merged to one segment. If yes,
* no need to check gap between the two bios since the 1st bio and the 1st bvec
* in the 2nd bio can be handled in one segment.
*/
static inline bool bios_segs_mergeable(struct request_queue *q,
struct bio *prev, struct bio_vec *prev_last_bv,
struct bio_vec *next_first_bv)
{
if (!biovec_phys_mergeable(q, prev_last_bv, next_first_bv))
return false;
if (prev->bi_seg_back_size + next_first_bv->bv_len >
queue_max_segment_size(q))
return false;
return true;
}
static inline bool bio_will_gap(struct request_queue *q,
struct request *prev_rq, struct bio *prev, struct bio *next)
{
struct bio_vec pb, nb;
if (!bio_has_data(prev) || !queue_virt_boundary(q))
return false;
/*
* Don't merge if the 1st bio starts with non-zero offset, otherwise it
* is quite difficult to respect the sg gap limit. We work hard to
* merge a huge number of small single bios in case of mkfs.
*/
if (prev_rq)
bio_get_first_bvec(prev_rq->bio, &pb);
else
bio_get_first_bvec(prev, &pb);
if (pb.bv_offset)
return true;
/*
* We don't need to worry about the situation that the merged segment
* ends in unaligned virt boundary:
*
* - if 'pb' ends aligned, the merged segment ends aligned
* - if 'pb' ends unaligned, the next bio must include
* one single bvec of 'nb', otherwise the 'nb' can't
* merge with 'pb'
*/
bio_get_last_bvec(prev, &pb);
bio_get_first_bvec(next, &nb);
if (bios_segs_mergeable(q, prev, &pb, &nb))
return false;
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
}
static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
{
return bio_will_gap(req->q, req, req->biotail, bio);
}
static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
{
return bio_will_gap(req->q, NULL, bio, req->bio);
}
static struct bio *blk_bio_discard_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
......@@ -134,9 +197,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (bvprvp && blk_queue_cluster(q)) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
if (!BIOVEC_PHYS_MERGEABLE(bvprvp, &bv))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprvp, &bv))
if (!biovec_phys_mergeable(q, bvprvp, &bv))
goto new_segment;
seg_size += bv.bv_len;
......@@ -267,9 +328,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
if (!biovec_phys_mergeable(q, &bvprv, &bv))
goto new_segment;
seg_size += bv.bv_len;
......@@ -349,17 +408,7 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
bio_get_last_bvec(bio, &end_bv);
bio_get_first_bvec(nxt, &nxt_bv);
if (!BIOVEC_PHYS_MERGEABLE(&end_bv, &nxt_bv))
return 0;
/*
* bio and nxt are contiguous in memory; check if the queue allows
* these two to be merged into one
*/
if (BIOVEC_SEG_BOUNDARY(q, &end_bv, &nxt_bv))
return 1;
return 0;
return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
}
static inline void
......@@ -373,10 +422,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
if (*sg && *cluster) {
if ((*sg)->length + nbytes > queue_max_segment_size(q))
goto new_segment;
if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
goto new_segment;
if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
if (!biovec_phys_mergeable(q, bvprv, bvec))
goto new_segment;
(*sg)->length += nbytes;
......
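Throughout the hunks above, the pair BIOVEC_PHYS_MERGEABLE() plus BIOVEC_SEG_BOUNDARY() is replaced by a single biovec_phys_mergeable(q, ...) call. The helper's body is not shown in this diff, so the sketch below is an assumption about what the combined check amounts to: physical contiguity followed by a test that both vectors sit inside the same segment-boundary window. `struct seg` and `seg_phys_mergeable()` are hypothetical names, and the boundary mask is passed directly instead of being read from a request_queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of a bio_vec: physical start and byte length. */
struct seg {
	uint64_t phys;
	uint32_t len;
};

/*
 * Return true if @b directly follows @a in physical memory and the merged
 * range does not cross a boundary described by @boundary_mask (for example
 * 0xFFFF for a 64 KiB segment boundary).
 */
bool seg_phys_mergeable(uint64_t boundary_mask,
			const struct seg *a, const struct seg *b)
{
	uint64_t first = a->phys;
	uint64_t last = b->phys + b->len - 1;

	/* Contiguity: the old BIOVEC_PHYS_MERGEABLE() half. */
	if (a->phys + a->len != b->phys)
		return false;

	/* Same boundary window: the old BIOVEC_SEG_BOUNDARY() half. */
	return (first | boundary_mask) == (last | boundary_mask);
}
```

Folding both tests into one helper is what lets each call site in blk-merge.c and blk-integrity.c drop one of its two `goto new_segment` branches.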
......@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags,
return 0;
}
static int queue_pm_only_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
seq_printf(m, "%d\n", atomic_read(&q->pm_only));
return 0;
}
#define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(QUEUED),
......@@ -132,7 +140,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
QUEUE_FLAG_NAME(QUIESCED),
QUEUE_FLAG_NAME(PREEMPT_ONLY),
};
#undef QUEUE_FLAG_NAME
......@@ -209,6 +216,7 @@ static ssize_t queue_write_hint_store(void *data, const char __user *buf,
static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{ "poll_stat", 0400, queue_poll_stat_show },
{ "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
{ "pm_only", 0600, queue_pm_only_show, NULL },
{ "state", 0600, queue_state_show, queue_state_write },
{ "write_hints", 0600, queue_write_hint_show, queue_write_hint_store },
{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
......@@ -423,8 +431,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
{
const struct show_busy_params *params = data;
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
blk_mq_rq_state(rq) != MQ_RQ_IDLE)
if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
__blk_mq_debugfs_rq_show(params->m,
list_entry_rq(&rq->queuelist));
}
......
......@@ -49,12 +49,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
return true;
}
static inline void blk_mq_sched_completed_request(struct request *rq)
static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
{
struct elevator_queue *e = rq->q->elevator;
if (e && e->type->ops.mq.completed_request)
e->type->ops.mq.completed_request(rq);
e->type->ops.mq.completed_request(rq, now);
}
static inline void blk_mq_sched_started_request(struct request *rq)
......
......@@ -232,13 +232,26 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
/*
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assining ->rqs[].
* test and set the bit before assigning ->rqs[].
*/
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
}
/**
* bt_for_each - iterate over the requests associated with a hardware queue
* @hctx: Hardware queue to examine.
* @bt: sbitmap to examine. This is either the breserved_tags member
* or the bitmap_tags member of struct blk_mq_tags.
* @fn: Pointer to the function that will be called for each request
* associated with @hctx that has been assigned a driver tag.
* @fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
* where rq is a pointer to a request.
* @data: Will be passed as third argument to @fn.
* @reserved: Indicates whether @bt is the breserved_tags member or the
* bitmap_tags member of struct blk_mq_tags.
*/
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
busy_iter_fn *fn, void *data, bool reserved)
{
......@@ -280,6 +293,18 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
return true;
}
/**
* bt_tags_for_each - iterate over the requests in a tag map
* @tags: Tag map to iterate over.
* @bt: sbitmap to examine. This is either the breserved_tags member
* or the bitmap_tags member of struct blk_mq_tags.
* @fn: Pointer to the function that will be called for each started
* request. @fn will be called as follows: @fn(rq, @data,
* @reserved) where rq is a pointer to a request.
* @data: Will be passed as second argument to @fn.
* @reserved: Indicates whether @bt is the breserved_tags member or the
* bitmap_tags member of struct blk_mq_tags.
*/
static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
busy_tag_iter_fn *fn, void *data, bool reserved)
{
......@@ -294,6 +319,15 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
}
/**
* blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
* @tags: Tag map to iterate over.
* @fn: Pointer to the function that will be called for each started
* request. @fn will be called as follows: @fn(rq, @priv,
* reserved) where rq is a pointer to a request. 'reserved'
* indicates whether or not @rq is a reserved request.
* @priv: Will be passed as second argument to @fn.
*/
static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
busy_tag_iter_fn *fn, void *priv)
{
......@@ -302,6 +336,15 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
}
/**
* blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
* @tagset: Tag set to iterate over.
* @fn: Pointer to the function that will be called for each started
* request. @fn will be called as follows: @fn(rq, @priv,
* reserved) where rq is a pointer to a request. 'reserved'
* indicates whether or not @rq is a reserved request.
* @priv: Will be passed as second argument to @fn.
*/
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv)
{
......@@ -314,6 +357,20 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
/**
* blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
* @q: Request queue to examine.
* @fn: Pointer to the function that will be called for each request
* on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
* reserved) where rq is a pointer to a request and hctx points
* to the hardware queue associated with the request. 'reserved'
* indicates whether or not @rq is a reserved request.
* @priv: Will be passed as third argument to @fn.
*
* Note: if @q->tag_set is shared with other request queues then @fn will be
* called for all requests on all queues that share that tag set and not only
* for requests associated with @q.
*/
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
{
......@@ -321,9 +378,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
int i;
/*
* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
* queue_hw_ctx after freeze the queue, so we use q_usage_counter
* to avoid race with it.
* __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
* while the queue is frozen. So we can use q_usage_counter to avoid
* racing with it. __blk_mq_update_nr_hw_queues() uses
* synchronize_rcu() to ensure this function left the critical section
* below.
*/
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
......@@ -332,7 +391,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
struct blk_mq_tags *tags = hctx->tags;
/*
* If not software queues are currently mapped to this
* If no software queues are currently mapped to this
* hardware queue, there's nothing to check
*/
if (!blk_mq_hw_queue_mapped(hctx))
......
......@@ -33,6 +33,7 @@
#include "blk-mq.h"
#include "blk-mq-debugfs.h"
#include "blk-mq-tag.h"
#include "blk-pm.h"
#include "blk-stat.h"
#include "blk-mq-sched.h"
#include "blk-rq-qos.h"
......@@ -198,7 +199,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
percpu_ref_reinit(&q->q_usage_counter);
percpu_ref_resurrect(&q->q_usage_counter);
wake_up_all(&q->mq_freeze_wq);
}
}
......@@ -475,6 +476,7 @@ static void __blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
const int sched_tag = rq->internal_tag;
blk_pm_mark_last_busy(rq);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
......@@ -526,6 +528,9 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
blk_stat_add(rq, now);
}
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq, now);
blk_account_io_done(rq, now);
if (rq->end_io) {
......@@ -562,8 +567,20 @@ static void __blk_mq_complete_request(struct request *rq)
if (!blk_mq_mark_complete(rq))
return;
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
/*
* Most single queue controllers have only one irq vector for
* handling IO completion, and that irq's affinity is set to all
* possible CPUs. On most architectures, this affinity means the
* irq is handled on one specific CPU.
*
* So complete the IO request in softirq context in the single queue
* case, to avoid degrading IO performance through irqsoff latency.
*/
if (rq->q->nr_hw_queues == 1) {
__blk_complete_request(rq);
return;
}
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
......@@ -2137,8 +2154,6 @@ static void blk_mq_exit_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
blk_mq_debugfs_unregister_hctx(hctx);
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_idle(hctx);
......@@ -2165,6 +2180,7 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
queue_for_each_hw_ctx(q, hctx, i) {
if (i == nr_queue)
break;
blk_mq_debugfs_unregister_hctx(hctx);
blk_mq_exit_hctx(q, set, hctx, i);
}
}
......@@ -2194,12 +2210,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
* runtime
*/
hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
GFP_KERNEL, node);
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
if (!hctx->ctxs)
goto unregister_cpu_notifier;
if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8), GFP_KERNEL,
node))
if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
goto free_ctxs;
hctx->nr_ctx = 0;
......@@ -2212,7 +2228,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
goto free_bitmap;
hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
if (!hctx->fq)
goto exit_hctx;
......@@ -2222,8 +2239,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
if (hctx->flags & BLK_MQ_F_BLOCKING)
init_srcu_struct(hctx->srcu);
blk_mq_debugfs_register_hctx(q, hctx);
return 0;
free_fq:
......@@ -2492,6 +2507,39 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
}
EXPORT_SYMBOL(blk_mq_init_queue);
/*
* Helper for setting up a queue with mq ops, given queue depth, and
* the passed in mq ops flags.
*/
struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set,
const struct blk_mq_ops *ops,
unsigned int queue_depth,
unsigned int set_flags)
{
struct request_queue *q;
int ret;
memset(set, 0, sizeof(*set));
set->ops = ops;
set->nr_hw_queues = 1;
set->queue_depth = queue_depth;
set->numa_node = NUMA_NO_NODE;
set->flags = set_flags;
ret = blk_mq_alloc_tag_set(set);
if (ret)
return ERR_PTR(ret);
q = blk_mq_init_queue(set);
if (IS_ERR(q)) {
blk_mq_free_tag_set(set);
return q;
}
return q;
}
EXPORT_SYMBOL(blk_mq_init_sq_queue);
static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
{
int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
......@@ -2506,48 +2554,90 @@ static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
return hw_ctx_size;
}
static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
struct blk_mq_tag_set *set, struct request_queue *q,
int hctx_idx, int node)
{
struct blk_mq_hw_ctx *hctx;
hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
node);
if (!hctx)
return NULL;
if (!zalloc_cpumask_var_node(&hctx->cpumask,
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
node)) {
kfree(hctx);
return NULL;
}
atomic_set(&hctx->nr_active, 0);
hctx->numa_node = node;
hctx->queue_num = hctx_idx;
if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
free_cpumask_var(hctx->cpumask);
kfree(hctx);
return NULL;
}
blk_mq_hctx_kobj_init(hctx);
return hctx;
}
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
struct request_queue *q)
{
int i, j;
int i, j, end;
struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
blk_mq_sysfs_unregister(q);
/* protect against switching io scheduler */
mutex_lock(&q->sysfs_lock);
for (i = 0; i < set->nr_hw_queues; i++) {
int node;
if (hctxs[i])
continue;
struct blk_mq_hw_ctx *hctx;
node = blk_mq_hw_queue_to_node(q->mq_map, i);
hctxs[i] = kzalloc_node(blk_mq_hw_ctx_size(set),
GFP_KERNEL, node);
if (!hctxs[i])
break;
if (!zalloc_cpumask_var_node(&hctxs[i]->cpumask, GFP_KERNEL,
node)) {
kfree(hctxs[i]);
hctxs[i] = NULL;
break;
}
atomic_set(&hctxs[i]->nr_active, 0);
hctxs[i]->numa_node = node;
hctxs[i]->queue_num = i;
/*
* If the hw queue has been mapped to another numa node,
* we need to realloc the hctx. If allocation fails, fallback
* to use the previous one.
*/
if (hctxs[i] && (hctxs[i]->numa_node == node))
continue;
if (blk_mq_init_hctx(q, set, hctxs[i], i)) {
free_cpumask_var(hctxs[i]->cpumask);
kfree(hctxs[i]);
hctxs[i] = NULL;
break;
hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
if (hctx) {
if (hctxs[i]) {
blk_mq_exit_hctx(q, set, hctxs[i], i);
kobject_put(&hctxs[i]->kobj);
}
hctxs[i] = hctx;
} else {
if (hctxs[i])
pr_warn("Allocate new hctx on node %d fails,\
fallback to previous one on node %d\n",
node, hctxs[i]->numa_node);
else
break;
}
blk_mq_hctx_kobj_init(hctxs[i]);
}
for (j = i; j < q->nr_hw_queues; j++) {
/*
* Increasing nr_hw_queues fails. Free the newly allocated
* hctxs and keep the previous q->nr_hw_queues.
*/
if (i != set->nr_hw_queues) {
j = q->nr_hw_queues;
end = i;
} else {
j = i;
end = q->nr_hw_queues;
q->nr_hw_queues = set->nr_hw_queues;
}
for (; j < end; j++) {
struct blk_mq_hw_ctx *hctx = hctxs[j];
if (hctx) {
......@@ -2559,9 +2649,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
}
}
q->nr_hw_queues = i;
mutex_unlock(&q->sysfs_lock);
blk_mq_sysfs_register(q);
}
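The cleanup range chosen at the end of blk_mq_realloc_hw_ctxs() can be hard to follow in diff form. The following is a minimal userspace sketch of just that decision, assuming plain integers in place of the queue and tag set; `plan_hctx_cleanup` and `struct hctx_cleanup` are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>

/* Hypothetical result type: which hctx slots to free and the resulting count. */
struct hctx_cleanup {
	int first;		/* first index to free */
	int end;		/* one past the last index to free */
	int nr_hw_queues;	/* value q->nr_hw_queues ends up with */
};

/*
 * allocated: index at which the allocation loop stopped (the 'i' above)
 * set_nr:    requested count (set->nr_hw_queues)
 * old_nr:    count before the update (q->nr_hw_queues)
 */
static struct hctx_cleanup plan_hctx_cleanup(int allocated, int set_nr,
					     int old_nr)
{
	struct hctx_cleanup c;

	if (allocated != set_nr) {
		/* Growing failed: free only the newly allocated hctxs. */
		c.first = old_nr;
		c.end = allocated;
		c.nr_hw_queues = old_nr;
	} else {
		/* Success: free any surplus hctxs beyond the new count. */
		c.first = allocated;
		c.end = old_nr;
		c.nr_hw_queues = set_nr;
	}
	return c;
}
```

For example, growing from 4 to 8 queues with a failure at index 6 frees slots 4 and 5 and leaves the count at 4, while shrinking from 8 to 4 frees slots 4 through 7.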
struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
......@@ -2659,25 +2747,6 @@ void blk_mq_free_queue(struct request_queue *q)
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
}
/* Basically redo blk_mq_init_queue with queue frozen */
static void blk_mq_queue_reinit(struct request_queue *q)
{
WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister(q);
/*
* redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
* we should change hctx numa_node according to the new topology (this
* involves freeing and re-allocating memory, worth doing?)
*/
blk_mq_map_swqueue(q);
blk_mq_sysfs_register(q);
blk_mq_debugfs_register_hctxs(q);
}
static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
{
int i;
......@@ -2964,6 +3033,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
{
struct request_queue *q;
LIST_HEAD(head);
int prev_nr_hw_queues;
lockdep_assert_held(&set->tag_list_lock);
......@@ -2987,11 +3057,30 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (!blk_mq_elv_switch_none(&head, q))
goto switch_back;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister(q);
}
prev_nr_hw_queues = set->nr_hw_queues;
set->nr_hw_queues = nr_hw_queues;
blk_mq_update_queue_map(set);
fallback:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
blk_mq_queue_reinit(q);
if (q->nr_hw_queues != set->nr_hw_queues) {
pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
nr_hw_queues, prev_nr_hw_queues);
set->nr_hw_queues = prev_nr_hw_queues;
blk_mq_map_queues(set);
goto fallback;
}
blk_mq_map_swqueue(q);
}
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_sysfs_register(q);
blk_mq_debugfs_register_hctxs(q);
}
switch_back:
......
// SPDX-License-Identifier: GPL-2.0
#include <linux/blk-mq.h>
#include <linux/blk-pm.h>
#include <linux/blkdev.h>
#include <linux/pm_runtime.h>
#include "blk-mq.h"
#include "blk-mq-tag.h"
/**
* blk_pm_runtime_init - Block layer runtime PM initialization routine
* @q: the queue of the device
* @dev: the device the queue belongs to
*
* Description:
* Initialize runtime-PM-related fields for @q and start auto suspend for
* @dev. Drivers that want to take advantage of request-based runtime PM
* should call this function after @dev has been initialized and its
* request queue @q has been allocated, while runtime PM for it cannot yet
* happen (either because it is disabled/forbidden or because its
* usage_count > 0). In most cases, drivers should call this function
* before any I/O has taken place.
*
* This function sets up autosuspend for the device. The autosuspend delay
* is set to -1 to make runtime suspend impossible until an updated value
* is set by either the user or the driver. Drivers do not need to touch
* other autosuspend settings.
*
* The block layer runtime PM is request based, so it only works for drivers
* that use requests as their I/O unit, not for those that use bios directly.
*/
void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
{
q->dev = dev;
q->rpm_status = RPM_ACTIVE;
pm_runtime_set_autosuspend_delay(q->dev, -1);
pm_runtime_use_autosuspend(q->dev);
}
EXPORT_SYMBOL(blk_pm_runtime_init);
/**
* blk_pre_runtime_suspend - Pre runtime suspend check
* @q: the queue of the device
*
* Description:
* This function will check if runtime suspend is allowed for the device
* by examining if there are any requests pending in the queue. If there
* are requests pending, the device can not be runtime suspended; otherwise,
* the queue's status will be updated to SUSPENDING and the driver can
* proceed to suspend the device.
*
* For the not allowed case, we mark last busy for the device so that
* runtime PM core will try to autosuspend it some time later.
*
* This function should be called near the start of the device's
* runtime_suspend callback.
*
* Return:
* 0 - OK to runtime suspend the device
* -EBUSY - Device should not be runtime suspended
*/
int blk_pre_runtime_suspend(struct request_queue *q)
{
int ret = 0;
if (!q->dev)
return ret;
WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
/*
* Increase the pm_only counter before checking whether any
* non-PM blk_queue_enter() calls are in progress to avoid that any
* new non-PM blk_queue_enter() calls succeed before the pm_only
* counter is decreased again.
*/
blk_set_pm_only(q);
ret = -EBUSY;
/* Switch q_usage_counter from per-cpu to atomic mode. */
blk_freeze_queue_start(q);
/*
* Wait until atomic mode has been reached. Since that
* involves calling call_rcu(), it is guaranteed that later
* blk_queue_enter() calls see the pm-only state. See also
* http://lwn.net/Articles/573497/.
*/
percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
if (percpu_ref_is_zero(&q->q_usage_counter))
ret = 0;
/* Switch q_usage_counter back to per-cpu mode. */
blk_mq_unfreeze_queue(q);
spin_lock_irq(q->queue_lock);
if (ret < 0)
pm_runtime_mark_last_busy(q->dev);
else
q->rpm_status = RPM_SUSPENDING;
spin_unlock_irq(q->queue_lock);
if (ret)
blk_clear_pm_only(q);
return ret;
}
EXPORT_SYMBOL(blk_pre_runtime_suspend);
/**
* blk_post_runtime_suspend - Post runtime suspend processing
* @q: the queue of the device
* @err: return value of the device's runtime_suspend function
*
* Description:
* Update the queue's runtime status according to the return value of the
* device's runtime suspend function and mark last busy for the device so
* that PM core will try to auto suspend the device at a later time.
*
* This function should be called near the end of the device's
* runtime_suspend callback.
*/
void blk_post_runtime_suspend(struct request_queue *q, int err)
{
if (!q->dev)
return;
spin_lock_irq(q->queue_lock);
if (!err) {
q->rpm_status = RPM_SUSPENDED;
} else {
q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
}
spin_unlock_irq(q->queue_lock);
if (err)
blk_clear_pm_only(q);
}
EXPORT_SYMBOL(blk_post_runtime_suspend);
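The kernel-doc above says the pre/post helpers bracket the device's runtime_suspend callback. The following userspace model sketches that ordering and the resulting state transitions. It is an illustration under stated assumptions, not kernel code: `struct suspend_model_queue`, the `model_*` functions and the `in_flight` counter are hypothetical stand-ins for the request queue, the blk_* helpers and the q_usage_counter check.

```c
#include <assert.h>
#include <errno.h>

enum suspend_model_rpm { SM_ACTIVE, SM_SUSPENDING, SM_SUSPENDED };

/* Toy model of the queue fields the suspend helpers touch. */
struct suspend_model_queue {
	enum suspend_model_rpm rpm_status;
	int pm_only;	/* models the pm_only counter */
	int in_flight;	/* models non-PM users holding q_usage_counter */
};

static int model_pre_runtime_suspend(struct suspend_model_queue *q)
{
	q->pm_only++;			/* like blk_set_pm_only() */
	if (q->in_flight) {
		q->pm_only--;		/* like blk_clear_pm_only() */
		return -EBUSY;
	}
	q->rpm_status = SM_SUSPENDING;
	return 0;
}

static void model_post_runtime_suspend(struct suspend_model_queue *q, int err)
{
	if (!err) {
		q->rpm_status = SM_SUSPENDED;
	} else {
		q->rpm_status = SM_ACTIVE;
		q->pm_only--;
	}
}

/* Hypothetical driver callback; hw_err is what the hardware suspend returned. */
static int model_driver_runtime_suspend(struct suspend_model_queue *q,
					int hw_err)
{
	int err = model_pre_runtime_suspend(q);

	if (err)
		return err;
	model_post_runtime_suspend(q, hw_err);
	return hw_err;
}
```

In this model the pm_only counter stays raised only when the suspend succeeds, which matches the way the real helpers leave the queue restricted to PM requests while the device is suspended.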
/**
* blk_pre_runtime_resume - Pre runtime resume processing
* @q: the queue of the device
*
* Description:
* Update the queue's runtime status to RESUMING in preparation for the
* runtime resume of the device.
*
* This function should be called near the start of the device's
* runtime_resume callback.
*/
void blk_pre_runtime_resume(struct request_queue *q)
{
if (!q->dev)
return;
spin_lock_irq(q->queue_lock);
q->rpm_status = RPM_RESUMING;
spin_unlock_irq(q->queue_lock);
}
EXPORT_SYMBOL(blk_pre_runtime_resume);
/**
* blk_post_runtime_resume - Post runtime resume processing
* @q: the queue of the device
* @err: return value of the device's runtime_resume function
*
* Description:
* Update the queue's runtime status according to the return value of the
* device's runtime_resume function. If it is successfully resumed, process
* the requests that are queued into the device's queue when it is resuming
* and then mark last busy and initiate autosuspend for it.
*
* This function should be called near the end of the device's
* runtime_resume callback.
*/
void blk_post_runtime_resume(struct request_queue *q, int err)
{
if (!q->dev)
return;
spin_lock_irq(q->queue_lock);
if (!err) {
q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
pm_request_autosuspend(q->dev);
} else {
q->rpm_status = RPM_SUSPENDED;
}
spin_unlock_irq(q->queue_lock);
if (!err)
blk_clear_pm_only(q);
}
EXPORT_SYMBOL(blk_post_runtime_resume);
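The resume side mirrors the suspend side. The sketch below is a userspace model, assuming a toy queue type; `struct resume_model_queue` and the `model_*` names are hypothetical and only imitate the status and pm_only updates made by blk_pre_runtime_resume() and blk_post_runtime_resume().

```c
#include <assert.h>
#include <errno.h>

enum resume_model_rpm { RM_ACTIVE, RM_RESUMING, RM_SUSPENDED };

/* Toy model of the queue fields the resume helpers touch. */
struct resume_model_queue {
	enum resume_model_rpm rpm_status;
	int pm_only;	/* models the pm_only counter, raised while suspended */
};

static void model_pre_runtime_resume(struct resume_model_queue *q)
{
	q->rpm_status = RM_RESUMING;
}

static void model_post_runtime_resume(struct resume_model_queue *q, int err)
{
	if (!err) {
		q->rpm_status = RM_ACTIVE;
		q->pm_only--;	/* like blk_clear_pm_only(): normal I/O may enter again */
	} else {
		q->rpm_status = RM_SUSPENDED;
	}
}

/* Hypothetical driver callback; hw_err is what the hardware resume returned. */
static int model_driver_runtime_resume(struct resume_model_queue *q, int hw_err)
{
	model_pre_runtime_resume(q);
	model_post_runtime_resume(q, hw_err);
	return hw_err;
}
```

A failed resume leaves the model queue suspended with pm_only still raised, so only PM requests would be admitted until a later resume succeeds.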
/**
* blk_set_runtime_active - Force runtime status of the queue to be active
* @q: the queue of the device
*
* If the device is left runtime suspended during system suspend the resume
* hook typically resumes the device and corrects runtime status
* accordingly. However, that does not affect the queue runtime PM status
* which is still "suspended". This prevents processing requests from the
* queue.
*
* This function can be used in driver's resume hook to correct queue
* runtime PM status and re-enable peeking requests from the queue. It
* should be called before first request is added to the queue.
*/
void blk_set_runtime_active(struct request_queue *q)
{
spin_lock_irq(q->queue_lock);
q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
pm_request_autosuspend(q->dev);
spin_unlock_irq(q->queue_lock);
}
EXPORT_SYMBOL(blk_set_runtime_active);
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BLOCK_BLK_PM_H_
#define _BLOCK_BLK_PM_H_
#include <linux/pm_runtime.h>
#ifdef CONFIG_PM
static inline void blk_pm_request_resume(struct request_queue *q)
{
if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
q->rpm_status == RPM_SUSPENDING))
pm_request_resume(q->dev);
}
static inline void blk_pm_mark_last_busy(struct request *rq)
{
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
pm_runtime_mark_last_busy(rq->q->dev);
}
static inline void blk_pm_requeue_request(struct request *rq)
{
lockdep_assert_held(rq->q->queue_lock);
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
}
static inline void blk_pm_add_request(struct request_queue *q,
struct request *rq)
{
lockdep_assert_held(q->queue_lock);
if (q->dev && !(rq->rq_flags & RQF_PM))
q->nr_pending++;
}
static inline void blk_pm_put_request(struct request *rq)
{
lockdep_assert_held(rq->q->queue_lock);
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
--rq->q->nr_pending;
}
#else
static inline void blk_pm_request_resume(struct request_queue *q)
{
}
static inline void blk_pm_mark_last_busy(struct request *rq)
{
}
static inline void blk_pm_requeue_request(struct request *rq)
{
}
static inline void blk_pm_add_request(struct request_queue *q,
struct request *rq)
{
}
static inline void blk_pm_put_request(struct request *rq)
{
}
#endif
#endif /* _BLOCK_BLK_PM_H_ */
......@@ -97,8 +97,8 @@ static int blk_softirq_cpu_dead(unsigned int cpu)
void __blk_complete_request(struct request *req)
{
int ccpu, cpu;
struct request_queue *q = req->q;
int cpu, ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
unsigned long flags;
bool shared = false;
......@@ -110,8 +110,7 @@ void __blk_complete_request(struct request *req)
/*
* Select completion CPU
*/
if (req->cpu != -1) {
ccpu = req->cpu;
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
shared = cpus_share_cache(cpu, ccpu);
} else
......
......@@ -190,6 +190,7 @@ void blk_stat_enable_accounting(struct request_queue *q)
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
spin_unlock(&q->stats->lock);
}
EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);
struct blk_queue_stats *blk_alloc_queue_stats(void)
{
......
......@@ -84,8 +84,7 @@ struct throtl_service_queue {
* RB tree of active children throtl_grp's, which are sorted by
* their ->disptime.
*/
struct rb_root pending_tree; /* RB tree of active tgs */
struct rb_node *first_pending; /* first node in the tree */
struct rb_root_cached pending_tree; /* RB tree of active tgs */
unsigned int nr_pending; /* # queued in the tree */
unsigned long first_pending_disptime; /* disptime of the first tg */
struct timer_list pending_timer; /* fires on first_pending_disptime */
......@@ -475,7 +474,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
{
INIT_LIST_HEAD(&sq->queued[0]);
INIT_LIST_HEAD(&sq->queued[1]);
sq->pending_tree = RB_ROOT;
sq->pending_tree = RB_ROOT_CACHED;
timer_setup(&sq->pending_timer, throtl_pending_timer_fn, 0);
}
......@@ -616,31 +615,23 @@ static void throtl_pd_free(struct blkg_policy_data *pd)
static struct throtl_grp *
throtl_rb_first(struct throtl_service_queue *parent_sq)
{
struct rb_node *n;
/* Service tree is empty */
if (!parent_sq->nr_pending)
return NULL;
if (!parent_sq->first_pending)
parent_sq->first_pending = rb_first(&parent_sq->pending_tree);
if (parent_sq->first_pending)
return rb_entry_tg(parent_sq->first_pending);
return NULL;
}
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
RB_CLEAR_NODE(n);
n = rb_first_cached(&parent_sq->pending_tree);
WARN_ON_ONCE(!n);
if (!n)
return NULL;
return rb_entry_tg(n);
}
static void throtl_rb_erase(struct rb_node *n,
struct throtl_service_queue *parent_sq)
{
if (parent_sq->first_pending == n)
parent_sq->first_pending = NULL;
rb_erase_init(n, &parent_sq->pending_tree);
rb_erase_cached(n, &parent_sq->pending_tree);
RB_CLEAR_NODE(n);
--parent_sq->nr_pending;
}
......@@ -658,11 +649,11 @@ static void update_min_dispatch_time(struct throtl_service_queue *parent_sq)
static void tg_service_queue_add(struct throtl_grp *tg)
{
struct throtl_service_queue *parent_sq = tg->service_queue.parent_sq;
struct rb_node **node = &parent_sq->pending_tree.rb_node;
struct rb_node **node = &parent_sq->pending_tree.rb_root.rb_node;
struct rb_node *parent = NULL;
struct throtl_grp *__tg;
unsigned long key = tg->disptime;
int left = 1;
bool leftmost = true;
while (*node != NULL) {
parent = *node;
......@@ -672,15 +663,13 @@ static void tg_service_queue_add(struct throtl_grp *tg)
node = &parent->rb_left;
else {
node = &parent->rb_right;
left = 0;
leftmost = false;
}
}
if (left)
parent_sq->first_pending = &tg->rb_node;
rb_link_node(&tg->rb_node, parent, node);
rb_insert_color(&tg->rb_node, &parent_sq->pending_tree);
rb_insert_color_cached(&tg->rb_node, &parent_sq->pending_tree,
leftmost);
}
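The switch to rb_root_cached replaces the hand-maintained first_pending pointer with a leftmost flag computed during the insertion walk. The idea can be sketched in userspace with a plain, unbalanced binary search tree; this is an assumption-laden illustration, not the kernel rbtree API, and `struct tnode`, `struct troot_cached` and `insert_cached` are hypothetical names. Rebalancing is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tnode {
	unsigned long key;
	struct tnode *left, *right;
};

/* Hypothetical analogue of rb_root_cached: root plus cached leftmost node. */
struct troot_cached {
	struct tnode *root;
	struct tnode *leftmost;
};

static void insert_cached(struct troot_cached *t, struct tnode *n)
{
	struct tnode **link = &t->root;
	bool leftmost = true;

	n->left = n->right = NULL;
	while (*link) {
		if (n->key < (*link)->key) {
			link = &(*link)->left;
		} else {
			link = &(*link)->right;
			/* Going right even once means n is not the minimum. */
			leftmost = false;
		}
	}
	*link = n;
	if (leftmost)
		t->leftmost = n;
}
```

With the minimum cached at insertion time, looking up the first pending entry becomes a pointer read instead of a walk down the left spine.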
static void __throtl_enqueue_tg(struct throtl_grp *tg)
......@@ -2126,21 +2115,11 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
}
#endif
static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
{
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
/* fallback to root_blkg if we fail to get a blkg ref */
if (bio->bi_css && (bio_associate_blkg(bio, tg_to_blkg(tg)) == -ENODEV))
bio_associate_blkg(bio, bio->bi_disk->queue->root_blkg);
bio_issue_init(&bio->bi_issue, bio_sectors(bio));
#endif
}
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
struct throtl_qnode *qn = NULL;
struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
struct throtl_grp *tg = blkg_to_tg(blkg);
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
......@@ -2159,7 +2138,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;
blk_throtl_assoc_bio(tg, bio);
blk_throtl_update_idletime(tg);
sq = &tg->service_queue;
......
......@@ -4,6 +4,7 @@
#include <linux/idr.h>
#include <linux/blk-mq.h>
#include <xen/xen.h>
#include "blk-mq.h"
/* Amount of time in which a process may batch requests */
......@@ -124,7 +125,7 @@ static inline void __blk_get_queue(struct request_queue *q)
}
struct blk_flush_queue *blk_alloc_flush_queue(struct request_queue *q,
int node, int cmd_size);
int node, int cmd_size, gfp_t flags);
void blk_free_flush_queue(struct blk_flush_queue *q);
int blk_init_rl(struct request_list *rl, struct request_queue *q,
......@@ -149,6 +150,41 @@ static inline void blk_queue_enter_live(struct request_queue *q)
percpu_ref_get(&q->q_usage_counter);
}
static inline bool biovec_phys_mergeable(struct request_queue *q,
struct bio_vec *vec1, struct bio_vec *vec2)
{
unsigned long mask = queue_segment_boundary(q);
phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
if (addr1 + vec1->bv_len != addr2)
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2))
return false;
if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
return false;
return true;
}
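To make the address checks in biovec_phys_mergeable() concrete, here is a userspace sketch that works on plain physical addresses rather than struct bio_vec. `struct seg` and `phys_mergeable` are hypothetical stand-ins, the boundary mask is passed in directly instead of being read from the queue, and the Xen check is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio_vec already resolved to a physical address. */
struct seg {
	uint64_t addr;	/* physical start address */
	uint32_t len;	/* length in bytes */
};

/*
 * Two segments can be merged when they are physically contiguous and the
 * merged range does not cross a segment boundary. boundary_mask is, for
 * example, 0xffff for a 64 KiB boundary.
 */
static bool phys_mergeable(uint64_t boundary_mask,
			   const struct seg *s1, const struct seg *s2)
{
	if (s1->addr + s1->len != s2->addr)
		return false;
	if ((s1->addr | boundary_mask) !=
	    ((s2->addr + s2->len - 1) | boundary_mask))
		return false;
	return true;
}
```

OR-ing both the first and the last byte address with the mask maps each to the end of its boundary window, so the comparison succeeds only when both fall inside the same window.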
static inline bool __bvec_gap_to_prev(struct request_queue *q,
struct bio_vec *bprv, unsigned int offset)
{
return offset ||
((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}
/*
* Check if adding a bio_vec after bprv with offset would create a gap in
* the SG list. Most drivers don't care about this, but some do.
*/
static inline bool bvec_gap_to_prev(struct request_queue *q,
struct bio_vec *bprv, unsigned int offset)
{
if (!queue_virt_boundary(q))
return false;
return __bvec_gap_to_prev(q, bprv, offset);
}
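The gap check above reduces to a small amount of mask arithmetic. The sketch below reproduces it in userspace, assuming the virtual boundary mask and the vector fields are passed as plain integers; `gap_to_prev` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * virt_boundary is a mask such as 0xfff (4 KiB); 0 means the queue imposes
 * no virtual boundary. prev_offset and prev_len describe the previous
 * vector, next_offset is the offset of the vector about to be appended.
 */
static bool gap_to_prev(uint32_t virt_boundary, uint32_t prev_offset,
			uint32_t prev_len, uint32_t next_offset)
{
	if (!virt_boundary)
		return false;
	/* A gap exists if the next vector does not start at offset 0 or the
	 * previous one does not end exactly on the boundary. */
	return next_offset || ((prev_offset + prev_len) & virt_boundary);
}
```

With a 4 KiB boundary, a full-page vector followed by one starting at offset 0 leaves no gap, while a 512-byte vector followed by anything does.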
#ifdef CONFIG_BLK_DEV_INTEGRITY
void blk_flush_integrity(void);
bool __bio_integrity_endio(struct bio *);
......@@ -158,7 +194,38 @@ static inline bool bio_integrity_endio(struct bio *bio)
return __bio_integrity_endio(bio);
return true;
}
#else
static inline bool integrity_req_gap_back_merge(struct request *req,
struct bio *next)
{
struct bio_integrity_payload *bip = bio_integrity(req->bio);
struct bio_integrity_payload *bip_next = bio_integrity(next);
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
bip_next->bip_vec[0].bv_offset);
}
static inline bool integrity_req_gap_front_merge(struct request *req,
struct bio *bio)
{
struct bio_integrity_payload *bip = bio_integrity(bio);
struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
bip_next->bip_vec[0].bv_offset);
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static inline bool integrity_req_gap_back_merge(struct request *req,
struct bio *next)
{
return false;
}
static inline bool integrity_req_gap_front_merge(struct request *req,
struct bio *bio)
{
return false;
}
static inline void blk_flush_integrity(void)
{
}
......@@ -166,7 +233,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
{
return true;
}
#endif
#endif /* CONFIG_BLK_DEV_INTEGRITY */
void blk_timeout_work(struct work_struct *work);
unsigned long blk_rq_timeout(unsigned long timeout);
......
......@@ -31,6 +31,24 @@
static struct bio_set bounce_bio_set, bounce_bio_split;
static mempool_t page_pool, isa_page_pool;
static void init_bounce_bioset(void)
{
static bool bounce_bs_setup;
int ret;
if (bounce_bs_setup)
return;
ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
BUG_ON(ret);
if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE))
BUG_ON(1);
ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0);
BUG_ON(ret);
bounce_bs_setup = true;
}
#if defined(CONFIG_HIGHMEM)
static __init int init_emergency_pool(void)
{
......@@ -44,14 +62,7 @@ static __init int init_emergency_pool(void)
BUG_ON(ret);
pr_info("pool size: %d pages\n", POOL_SIZE);
ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
BUG_ON(ret);
if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE))
BUG_ON(1);
ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0);
BUG_ON(ret);
init_bounce_bioset();
return 0;
}
......@@ -86,6 +97,8 @@ static void *mempool_alloc_pages_isa(gfp_t gfp_mask, void *data)
return mempool_alloc_pages(gfp_mask | GFP_DMA, data);
}
static DEFINE_MUTEX(isa_mutex);
/*
* gets called "every" time someone init's a queue with BLK_BOUNCE_ISA
* as the max address, so check if the pool has already been created.
......@@ -94,14 +107,20 @@ int init_emergency_isa_pool(void)
{
int ret;
if (mempool_initialized(&isa_page_pool))
mutex_lock(&isa_mutex);
if (mempool_initialized(&isa_page_pool)) {
mutex_unlock(&isa_mutex);
return 0;
}
ret = mempool_init(&isa_page_pool, ISA_POOL_SIZE, mempool_alloc_pages_isa,
mempool_free_pages, (void *) 0);
BUG_ON(ret);
pr_info("isa pool size: %d pages\n", ISA_POOL_SIZE);
init_bounce_bioset();
mutex_unlock(&isa_mutex);
return 0;
}
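The isa_mutex added here turns a racy "check, then initialize" sequence into a serialized one. A minimal userspace sketch of the same pattern, assuming POSIX threads, follows; `init_pool_once`, `pool_initialized` and `pool_init_count` are hypothetical names, and the counter merely stands in for the real mempool and bioset setup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool pool_initialized;
static int pool_init_count;	/* counts how often the real setup ran */

static int init_pool_once(void)
{
	pthread_mutex_lock(&pool_mutex);
	if (pool_initialized) {
		/* Someone else already did the work; nothing to do. */
		pthread_mutex_unlock(&pool_mutex);
		return 0;
	}

	pool_init_count++;	/* stands in for the one-time pool setup */
	pool_initialized = true;

	pthread_mutex_unlock(&pool_mutex);
	return 0;
}
```

Because the check and the setup happen under the same lock, two callers racing into the function cannot both observe an uninitialized pool.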
......@@ -257,7 +276,9 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
}
}
bio_clone_blkcg_association(bio, bio_src);
bio_clone_blkg_association(bio, bio_src);
blkcg_bio_issue_init(bio);
return bio;
}
......
......@@ -1644,14 +1644,20 @@ static void cfq_pd_offline(struct blkg_policy_data *pd)
int i;
for (i = 0; i < IOPRIO_BE_NR; i++) {
if (cfqg->async_cfqq[0][i])
if (cfqg->async_cfqq[0][i]) {
cfq_put_queue(cfqg->async_cfqq[0][i]);
if (cfqg->async_cfqq[1][i])
cfqg->async_cfqq[0][i] = NULL;
}
if (cfqg->async_cfqq[1][i]) {
cfq_put_queue(cfqg->async_cfqq[1][i]);
cfqg->async_cfqq[1][i] = NULL;
}
}
if (cfqg->async_idle_cfqq)
if (cfqg->async_idle_cfqq) {
cfq_put_queue(cfqg->async_idle_cfqq);
cfqg->async_idle_cfqq = NULL;
}
/*
* @blkg is going offline and will be ignored by
......@@ -3753,7 +3759,7 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
uint64_t serial_nr;
rcu_read_lock();
serial_nr = bio_blkcg(bio)->css.serial_nr;
serial_nr = __bio_blkcg(bio)->css.serial_nr;
rcu_read_unlock();
/*
......@@ -3818,7 +3824,7 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
struct cfq_group *cfqg;
rcu_read_lock();
cfqg = cfq_lookup_cfqg(cfqd, bio_blkcg(bio));
cfqg = cfq_lookup_cfqg(cfqd, __bio_blkcg(bio));
if (!cfqg) {
cfqq = &cfqd->oom_cfqq;
goto out;
......
......@@ -41,6 +41,7 @@
#include "blk.h"
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-wbt.h"
static DEFINE_SPINLOCK(elv_list_lock);
......@@ -557,27 +558,6 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
}
#ifdef CONFIG_PM
static void blk_pm_requeue_request(struct request *rq)
{
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
}
static void blk_pm_add_request(struct request_queue *q, struct request *rq)
{
if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
(q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
pm_request_resume(q->dev);
}
#else
static inline void blk_pm_requeue_request(struct request *rq) {}
static inline void blk_pm_add_request(struct request_queue *q,
struct request *rq)
{
}
#endif
void elv_requeue_request(struct request_queue *q, struct request *rq)
{
/*
......
......@@ -567,7 +567,8 @@ static int exact_lock(dev_t devt, void *data)
return 0;
}
static void register_disk(struct device *parent, struct gendisk *disk)
static void register_disk(struct device *parent, struct gendisk *disk,
const struct attribute_group **groups)
{
struct device *ddev = disk_to_dev(disk);
struct block_device *bdev;
......@@ -582,6 +583,10 @@ static void register_disk(struct device *parent, struct gendisk *disk)
/* delay uevents, until we scanned partition table */
dev_set_uevent_suppress(ddev, 1);
if (groups) {
WARN_ON(ddev->groups);
ddev->groups = groups;
}
if (device_add(ddev))
return;
if (!sysfs_deprecated) {
......@@ -647,6 +652,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
* __device_add_disk - add disk information to kernel list
* @parent: parent device for the disk
* @disk: per-device partitioning information
* @groups: Additional per-device sysfs groups
* @register_queue: register the queue if set to true
*
* This function registers the partitioning information in @disk
......@@ -655,6 +661,7 @@ static void register_disk(struct device *parent, struct gendisk *disk)
* FIXME: error handling
*/
static void __device_add_disk(struct device *parent, struct gendisk *disk,
const struct attribute_group **groups,
bool register_queue)
{
dev_t devt;
......@@ -698,7 +705,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
}
register_disk(parent, disk);
register_disk(parent, disk, groups);
if (register_queue)
blk_register_queue(disk);
......@@ -712,15 +719,17 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
blk_integrity_add(disk);
}
void device_add_disk(struct device *parent, struct gendisk *disk)
void device_add_disk(struct device *parent, struct gendisk *disk,
const struct attribute_group **groups)
{
__device_add_disk(parent, disk, true);
__device_add_disk(parent, disk, groups, true);
}
EXPORT_SYMBOL(device_add_disk);
void device_add_disk_no_queue_reg(struct device *parent, struct gendisk *disk)
{
__device_add_disk(parent, disk, false);
__device_add_disk(parent, disk, NULL, false);
}
EXPORT_SYMBOL(device_add_disk_no_queue_reg);
......
This diff is collapsed.
......@@ -121,18 +121,6 @@ source "drivers/block/mtip32xx/Kconfig"
source "drivers/block/zram/Kconfig"
config BLK_DEV_DAC960
tristate "Mylex DAC960/DAC1100 PCI RAID Controller support"
depends on PCI
help
This driver adds support for the Mylex DAC960, AcceleRAID, and
eXtremeRAID PCI RAID controllers. See the file
<file:Documentation/blockdev/README.DAC960> for further information
about this driver.
To compile this driver as a module, choose M here: the
module will be called DAC960.
config BLK_DEV_UMEM
tristate "Micro Memory MM5415 Battery Backed RAM support"
depends on PCI
......@@ -461,7 +449,6 @@ config BLK_DEV_RBD
select LIBCRC32C
select CRYPTO_AES
select CRYPTO
default n
help
Say Y here if you want include the Rados block device, which stripes
a block device over objects stored in the Ceph distributed object
......
......@@ -16,7 +16,6 @@ obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o
obj-$(CONFIG_BLK_DEV_RAM) += brd.o
obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
obj-$(CONFIG_BLK_DEV_DAC960) += DAC960.o
obj-$(CONFIG_XILINX_SYSACE) += xsysace.o
obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
obj-$(CONFIG_SUNVDC) += sunvdc.o
......
This diff is collapsed.
/* Copyright (c) 2013 Coraid, Inc. See COPYING for GPL terms. */
#include <linux/blk-mq.h>
#define VERSION "85"
#define AOE_MAJOR 152
#define DEVICE_NAME "aoe"
......@@ -164,6 +166,8 @@ struct aoedev {
struct gendisk *gd;
struct dentry *debugfs;
struct request_queue *blkq;
struct list_head rq_list;
struct blk_mq_tag_set tag_set;
struct hd_geometry geo;
sector_t ssize;
struct timer_list timer;
......@@ -201,7 +205,6 @@ int aoeblk_init(void);
void aoeblk_exit(void);
void aoeblk_gdalloc(void *);
void aoedisk_rm_debugfs(struct aoedev *d);
void aoedisk_rm_sysfs(struct aoedev *d);
int aoechr_init(void);
void aoechr_exit(void);
......
......@@ -6,7 +6,7 @@
#include <linux/kernel.h>
#include <linux/hdreg.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/ioctl.h>
......@@ -177,10 +177,15 @@ static struct attribute *aoe_attrs[] = {
NULL,
};
static const struct attribute_group attr_group = {
static const struct attribute_group aoe_attr_group = {
.attrs = aoe_attrs,
};
static const struct attribute_group *aoe_attr_groups[] = {
&aoe_attr_group,
NULL,
};
static const struct file_operations aoe_debugfs_fops = {
.open = aoe_debugfs_open,
.read = seq_read,
......@@ -219,17 +224,6 @@ aoedisk_rm_debugfs(struct aoedev *d)
d->debugfs = NULL;
}
static int
aoedisk_add_sysfs(struct aoedev *d)
{
return sysfs_create_group(&disk_to_dev(d->gd)->kobj, &attr_group);
}
void
aoedisk_rm_sysfs(struct aoedev *d)
{
sysfs_remove_group(&disk_to_dev(d->gd)->kobj, &attr_group);
}
static int
aoeblk_open(struct block_device *bdev, fmode_t mode)
{
......@@ -274,23 +268,25 @@ aoeblk_release(struct gendisk *disk, fmode_t mode)
spin_unlock_irqrestore(&d->lock, flags);
}
static void
aoeblk_request(struct request_queue *q)
static blk_status_t aoeblk_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
struct aoedev *d;
struct request *rq;
struct aoedev *d = hctx->queue->queuedata;
spin_lock_irq(&d->lock);
d = q->queuedata;
if ((d->flags & DEVFL_UP) == 0) {
pr_info_ratelimited("aoe: device %ld.%d is not up\n",
d->aoemajor, d->aoeminor);
while ((rq = blk_peek_request(q))) {
blk_start_request(rq);
aoe_end_request(d, rq, 1);
}
return;
spin_unlock_irq(&d->lock);
blk_mq_start_request(bd->rq);
return BLK_STS_IOERR;
}
list_add_tail(&bd->rq->queuelist, &d->rq_list);
aoecmd_work(d);
spin_unlock_irq(&d->lock);
return BLK_STS_OK;
}
static int
......@@ -345,6 +341,10 @@ static const struct block_device_operations aoe_bdops = {
.owner = THIS_MODULE,
};
static const struct blk_mq_ops aoeblk_mq_ops = {
.queue_rq = aoeblk_queue_rq,
};
/* alloc_disk and add_disk can sleep */
void
aoeblk_gdalloc(void *vp)
......@@ -353,9 +353,11 @@ aoeblk_gdalloc(void *vp)
struct gendisk *gd;
mempool_t *mp;
struct request_queue *q;
struct blk_mq_tag_set *set;
enum { KB = 1024, MB = KB * KB, READ_AHEAD = 2 * MB, };
ulong flags;
int late = 0;
int err;
spin_lock_irqsave(&d->lock, flags);
if (d->flags & DEVFL_GDALLOC
......@@ -382,10 +384,25 @@ aoeblk_gdalloc(void *vp)
d->aoemajor, d->aoeminor);
goto err_disk;
}
q = blk_init_queue(aoeblk_request, &d->lock);
if (q == NULL) {
set = &d->tag_set;
set->ops = &aoeblk_mq_ops;
set->nr_hw_queues = 1;
set->queue_depth = 128;
set->numa_node = NUMA_NO_NODE;
set->flags = BLK_MQ_F_SHOULD_MERGE;
err = blk_mq_alloc_tag_set(set);
if (err) {
pr_err("aoe: cannot allocate tag set for %ld.%d\n",
d->aoemajor, d->aoeminor);
goto err_mempool;
}
q = blk_mq_init_queue(set);
if (IS_ERR(q)) {
pr_err("aoe: cannot allocate block queue for %ld.%d\n",
d->aoemajor, d->aoeminor);
blk_mq_free_tag_set(set);
goto err_mempool;
}
......@@ -417,8 +434,7 @@ aoeblk_gdalloc(void *vp)
spin_unlock_irqrestore(&d->lock, flags);
add_disk(gd);
aoedisk_add_sysfs(d);
device_add_disk(NULL, gd, aoe_attr_groups);
aoedisk_add_debugfs(d);
spin_lock_irqsave(&d->lock, flags);
......
This diff is collapsed.
......@@ -11,7 +11,6 @@ config BLK_DEV_DRBD
depends on PROC_FS && INET
select LRU_CACHE
select LIBCRC32C
default n
help
NOTE: In order to authenticate connections you have to select
......
This diff is collapsed.
// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 2016 CNEX Labs
* Initial release: Javier Gonzalez <javier@cnexlabs.com>
......
This diff is collapsed.