- 09 Oct, 2018 19 commits
-
-
Hans Holmberg authored
The prefix when printing ppas in pblk_read_check_rand should be "rnd" not "seq", so fix this so that we can differentiate between lba mismatches in random and sequential reads. Also change the print order so we align with pblk_read_check_seq, printing the read lba first. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
The parameters nr_ppas and ppa_list are not used, so remove them. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Line map bitmap allocations are fairly large and can fail. Allocation failures are fatal to pblk, stopping the write pipeline. To avoid this, allocate the bitmaps using a mempool instead. Mempool allocations never fail if called from process context, and pblk *should* only allocate map bitmaps in process context, but keep the failure handling for robustness' sake. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
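A minimal sketch of the pattern, assuming a dedicated mempool created at init time; the pool and field names here are illustrative, not the exact pblk symbols:

```c
/* Sketch: take line map bitmaps from a dedicated mempool instead of a
 * plain kmalloc(), so a process-context allocation can always make
 * forward progress. Names are illustrative, not the exact pblk symbols. */
#include <linux/mempool.h>
#include <linux/string.h>

static int line_alloc_map_bitmap(mempool_t *bitmap_pool, void **map_bitmap,
				 size_t bitmap_len)
{
	void *bitmap = mempool_alloc(bitmap_pool, GFP_KERNEL);

	if (!bitmap)	/* kept for robustness, as the commit message notes */
		return -ENOMEM;

	memset(bitmap, 0, bitmap_len);
	*map_bitmap = bitmap;
	return 0;
}
```

The mempool keeps a small reserve of preallocated elements, so even under memory pressure a GFP_KERNEL allocation from process context sleeps and retries rather than failing outright.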
-
Hans Holmberg authored
There are a number of places in the lightnvm subsystem where the user iterates over the ppa list. Before iterating, the user must know whether the command addresses a single LBA or multiple LBAs, because commands use either the nvm_rq ->ppa_addr field (single LBA) or the ->ppa_list field (vector commands) on submission, which leads to open-coding the if/else statement. Instead of having multiple if/else's, move it into a function that can be called by its users. A nice side effect of this cleanup is that this patch also fixes up a bunch of cases where the single-ppa case was not considered in pblk. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
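A sketch of the open-coded pattern and the helper that replaces it, assuming the struct nvm_rq fields named in the message; do_something() stands in for whatever per-ppa work a caller does:

```c
/* Single-LBA commands use rqd->ppa_addr; vector commands use
 * rqd->ppa_list. The helper hides that distinction from callers. */
static inline struct ppa_addr *nvm_rq_to_ppa_list(struct nvm_rq *rqd)
{
	return (rqd->nr_ppas > 1) ? rqd->ppa_list : &rqd->ppa_addr;
}

/* Callers then iterate uniformly instead of open-coding the if/else: */
static void for_each_ppa(struct nvm_rq *rqd)
{
	struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd);
	int i;

	for (i = 0; i < rqd->nr_ppas; i++)
		do_something(ppa_list[i]);	/* hypothetical per-ppa work */
}
```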
-
Javier González authored
If a line is recovered from open chunks, the memory structures for emeta have not necessarily been properly set on line initialization. When closing a line, make sure that emeta is consistent so that the line can be recovered on the fast path on next reboot. Also, remove a couple of empty lines at the end of the function. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Remove an unused struct ppa_addr variable. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
Fix comment typo Decrese -> Decrease Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Javier González authored
The current helper to obtain a line from a ppa returns the line id, which requires its users to explicitly retrieve the pointer to the line from that id. Make two different helpers: one returning the line id and one returning the line pointer directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
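A rough sketch of the two helpers, assuming 1.2-style geometry fields in struct ppa_addr and a lines array in struct pblk; the field and array names are illustrative:

```c
/* One helper returns the id encoded in the ppa, the other resolves the
 * line structure directly so callers no longer do the lookup themselves. */
static inline int pblk_ppa_to_line_id(struct ppa_addr p)
{
	return p.g.blk;		/* illustrative: the block bits double as the line id */
}

static inline struct pblk_line *pblk_ppa_to_line(struct pblk *pblk,
						 struct ppa_addr p)
{
	return &pblk->lines[pblk_ppa_to_line_id(p)];
}
```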
-
Javier González authored
Implement helpers to go from ppas to a chunk within a line and an address within a chunk. These helpers will be used in the patches adding trace support in pblk, which will be sent in this window. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
The read completion path uses the put_line variable to decide whether the reference on a line should be released. The function name used for that is pblk_read_put_rqd_kref, which could lead one to believe that it is the rqd that is releasing the reference, while it is the line reference that is put. Rename and also split the function in two to account for either rqd or single ppa callers and move it to core, such that it later can be used in the write path as well. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Heiner Litz <hlitz@ucsc.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
The I/O size and capacity checks are already done by the block layer. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
The calculation of pblk->min_write_pgs should only use the optimal write size attribute provided by the drive; it does not correlate to the memory page size of the system, which can be smaller or larger than the LBA size reported. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
Both NVM_MAX_VLBA and PBLK_MAX_REQ_ADDRS define how many LBAs are available in a vector command. pblk uses them interchangeably in its implementation. Use NVM_MAX_VLBA as the main one and remove usages of PBLK_MAX_REQ_ADDRS. Also remove the power-of-two representation, which only has one user, and instead calculate it at runtime. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
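One way to derive the exponent at runtime instead of carrying a second define is the generic log2 helper, as in this sketch; the wrapper name is illustrative:

```c
#include <linux/log2.h>
#include <linux/lightnvm.h>	/* NVM_MAX_VLBA: max LBAs per vector command */

static inline int max_vlba_order(void)
{
	return ilog2(NVM_MAX_VLBA);	/* power-of-two exponent, computed on demand */
}
```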
-
Matias Bjørling authored
pblk implements two data paths for recovering line state, one for 1.2 and another for 2.0. Instead of having pblk implement these, combine them in the core to reduce complexity and make the functionality available to other targets. The new interface adheres to the 2.0 chunk definition, including managing open chunks with an active write pointer. To provide this interface, a 1.2 device recovers the state of the chunks by manually detecting whether a chunk is free/open/closed/offline, and if open, scanning the flash pages sequentially to find the next writable page. This process takes on average ~10 seconds on a device with 64 dies, 1024 blocks and 60us read access time. The process could be parallelized, but that is left out for maintenance simplicity, as the 1.2 specification is deprecated. For 2.0 devices, the logic is maintained internally in the drive and retrieved through the 2.0 interface. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
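A sketch of the write-pointer scan for an open 1.2 chunk, under the assumption that an unwritten page reads back as empty; nth_sector_in_chunk() and read_sector_is_empty() are hypothetical helpers standing in for the core implementation:

```c
/* Sketch: walk an open chunk sector by sector until the first unwritten
 * one is found; that index becomes the recovered write pointer. */
static int recover_chunk_wp(struct nvm_dev *dev, struct nvm_chk_meta *meta,
			    struct ppa_addr chk_start, int secs_per_chunk)
{
	int sec;

	for (sec = 0; sec < secs_per_chunk; sec++) {
		struct ppa_addr ppa = nth_sector_in_chunk(chk_start, sec); /* hypothetical */

		if (read_sector_is_empty(dev, ppa)) {	/* hypothetical helper */
			meta->wp = sec;			/* next writable sector */
			return 0;
		}
	}

	meta->wp = secs_per_chunk;			/* chunk is fully written */
	return 0;
}
```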
-
Javier González authored
In pblk, when a new line is allocated, metadata for the previously written line is scheduled. This is done through a fixed memory region that is shared over time and across contexts for different lines, and is therefore protected by a lock. Unfortunately, this lock does not properly cover all the metadata used for sharing this memory region, resulting in a race condition. This patch fixes the race condition by protecting this metadata properly. Fixes: dd2a4343 ("lightnvm: pblk: sched. metadata on write thread") Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
A 1.2 device is able to manage the logical to physical mapping table internally or leave it to the host. A target only supports one of those approaches, and therefore must check on initialization. Move this check to core to avoid having each target implement it. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Matias Bjørling authored
rqd.error is masked by the return value of pblk_submit_io_sync. The rqd structure is then passed on to the end_io function, which assumes that any error should lead to a chunk being marked offline/bad. Since pblk_submit_io_sync can fail before the command is issued to the device, the error value may not correspond to a media failure, leading to chunks being prematurely retired. Also, the pblk_blk_erase_sync function prints an error message when the erase fails. Since the caller prints an error message itself, remove the error message in this function. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
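The distinction being drawn, sketched with hedged names; only pblk_submit_io_sync() is an actual pblk function, mark_chunk_bad() is a stand-in for the end_io handling:

```c
static int erase_sync_sketch(struct pblk *pblk, struct nvm_rq *rqd)
{
	int ret = pblk_submit_io_sync(pblk, rqd);

	if (ret)		/* command never reached the device: not a
				 * media error, so do not retire the chunk */
		return ret;

	if (rqd->error) {	/* genuine media error reported by the drive */
		mark_chunk_bad(pblk, rqd->ppa_addr);	/* hypothetical */
		return -EIO;
	}

	return 0;
}
```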
-
Matias Bjørling authored
Add an nvm_set_flags helper to enable core to appropriately set the command flags for read/write/erase depending on which version a drive supports. The flags argument can be distilled into the access hint, scrambling, and program/erase suspend. Replace the access hint with an "is_seq" parameter. The rest of the flags depend on the command opcode, which is trivial to detect and set. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
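Roughly the shape such a helper can take, sketched from the 1.2 flag defines in include/linux/lightnvm.h; the exact per-opcode flag combinations here are an assumption, not the upstream code:

```c
static u32 nvm_set_flags_sketch(struct nvm_geo *geo, struct nvm_rq *rqd)
{
	u32 flags = 0;

	if (geo->version == NVM_OCSSD_SPEC_20)
		return 0;			/* 2.0 drives take no 1.2-style flags */

	if (rqd->is_seq)			/* sequential access hint */
		flags |= geo->pln_mode >> 1;

	if (rqd->opcode == NVM_OP_PWRITE || rqd->opcode == NVM_OP_PREAD)
		flags |= NVM_IO_SCRAMBLE_ENABLE;	/* assumed opcode mapping */

	return flags;
}
```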
-
Matias Bjørling authored
No need to force NVMe device driver to be compiled in if the lightnvm subsystem is selected. Also no need for PCI to be selected as well, as it would be selected by the device driver that hooks into the subsystem. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 Oct, 2018 16 commits
-
-
Ming Lei authored
Lots of controllers have only one irq vector for completing IO requests. Usually the affinity of that single irq vector covers all possible CPUs; however, on most architectures only one specific CPU actually handles the interrupt. So if all IOs are completed in hardirq context, IO performance inevitably degrades because of increased irq latency. This patch addresses the issue by allowing requests to be completed in softirq context, like the legacy IO path. IOPS improves by ~13% in the following randread test on raid0 over virtio-scsi. mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Zach Marano <zmarano@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dongbo Cao authored
When the nbuckets of a cache device is smaller than 1024, making the cache device triggers a BUG_ON in the kernel. Add a condition to avoid this. Reported-by: nitroxis <n@nxs.re> Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
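A sketch of the kind of guard this adds during super block validation; the threshold constant, error string and exact location are illustrative, not the precise upstream check:

```c
#define CACHE_MIN_NBUCKETS	128	/* illustrative lower bound */

/* Reject cache devices with too few buckets up front, before later
 * allocation/btree code can hit a BUG_ON(). */
static const char *check_nbuckets(struct cache_sb *sb)
{
	if (sb->nbuckets < CACHE_MIN_NBUCKETS)
		return "Not enough buckets";

	return NULL;	/* NULL means the super block passed this check */
}
```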
-
Dongbo Cao authored
Split the combined '||' statements in the if() check to make the code easier to debug. Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
A cache_set has at most MAX_CACHES_PER_SET caches, and the macro is used for "struct cache *cache_by_alloc[MAX_CACHES_PER_SET];" in the definition of struct cache_set. Use MAX_CACHES_PER_SET instead of the magic number 8 in __bch_bucket_alloc_set. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In extents.c:bch_extent_bad(), the number 96 is used as a parameter to call btree_bug_on(). The purpose is to check whether the stale gen value exceeds BUCKET_GC_GEN_MAX, so it is better to use the macro BUCKET_GC_GEN_MAX to make the code more understandable. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dongbo Cao authored
Parameter "struct kobject *kobj" in bch_debug_init() is useless, remove it in this patch. Signed-off-by: Dongbo Cao <cdbdyx@163.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
struct kmem_cache *bch_passthrough_cache is not used in bcache code. Remove it. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
Recalculate cached_dev_sectors on cached_dev detach, just as is done on cached_dev attach. Update cached_dev_sectors before bcache_device_detach is called, as bcache_device_detach will set bcache_device->c to NULL. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
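A sketch of the ordering constraint, assuming the calc_cached_dev_sectors() helper used on the attach path; the surrounding detach steps are elided:

```c
static void cached_dev_detach_finish_sketch(struct cached_dev *dc)
{
	struct cache_set *c = dc->disk.c;

	/* ... writeback shutdown, uuid entry update, etc. elided ... */

	calc_cached_dev_sectors(c);		/* must run while dc->disk.c is valid */
	bcache_device_detach(&dc->disk);	/* this sets dc->disk.c = NULL */
}
```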
-
Tang Junhui authored
refill->end records the last key of writeback. For example, the first time, keys (1,128K) to (1,1024K) are flushed to the backing device, but the end key (1,1024K) is not included, because of the code below: if (bkey_cmp(k, refill->end) >= 0) { ret = MAP_DONE; goto out; } The next time we refill the writeback keybuf, we search for keys starting from (1,1024K) and get a key bigger than it, so the key (1,1024K) is missed. This patch modifies the above code to let the end key be included in the writeback key buffer. Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
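In terms of the quoted check, the natural fix is to stop the scan only once a key is strictly greater than refill->end, so the end key itself still lands in the keybuf; a sketch of the changed comparison, assuming this is how the patch resolves it:

```c
/* Sketch: terminate the refill only for keys strictly past the end key,
 * so a key equal to refill->end is still added to the writeback keybuf. */
if (bkey_cmp(k, refill->end) > 0) {	/* was: >= 0 */
	ret = MAP_DONE;
	goto out;
}
```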
-
Ben Peddell authored
Somewhere between Michael Lyle's original "bcache: PI controller for writeback rate V2" patch dated 07 Sep 2017 and 1d316e65 bcache: implement PI controller for writeback rate, the mapping of the writeback_rate_minimum attribute was dropped. Re-add the missing sysfs writeback_rate_minimum attribute mapping to "allow the user to specify a minimum rate at which dirty blocks are retired." Fixes: 1d316e65 ("bcache: implement PI controller for writeback rate") Signed-off-by: Ben Peddell <klightspeed@killerwolves.net> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tang Junhui authored
When a bcache device is clean, dirty keys may still exist after journal replay, so we need to count these dirty keys even when the device is in a clean state; otherwise, after writeback, the amount of dirty data would be incorrect. Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
The code comment of closure_return_with_destructor() in closure.h marks the function name as closure_return(). This patch fixes the typo with the correct name - closure_return_with_destructor. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Tang Junhui authored
When doing an ioctl on a flash device, ioctl_dev() in super.c is called; we should not fetch the cached device there, since a flash-only device has no backing device. This patch moves the dc->io_disable check into cached_dev_ioctl() so that ioctls on flash devices work correctly. Fixes: 0f0709e6 ("bcache: stop bcache device when backing device is offline") Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Coly Li authored
In cached_dev_cache_miss() and check_should_bypass(), REQ_META is used to check whether a bio is a metadata request. But REQ_META is meant for blktrace; the correct REQ_ flag is REQ_PRIO. This flag means the bio should take priority over other bios, and it is frequently used to indicate metadata I/O in file system code. This patch replaces REQ_META with the correct flag REQ_PRIO. CC Adam Manzanares because he explained to me what REQ_PRIO is for. Signed-off-by: Coly Li <colyli@suse.de> Cc: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
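A sketch of the kind of check being corrected; the wrapper name is illustrative of the pattern in cached_dev_cache_miss()/check_should_bypass(), not a bcache function:

```c
/* REQ_PRIO, not REQ_META, marks a bio as high-priority (typically
 * metadata) I/O for decisions such as bypassing the cache. */
static bool bio_is_metadata_io(struct bio *bio)
{
	return (bio->bi_opf & REQ_PRIO) != 0;	/* was: bio->bi_opf & REQ_META */
}
```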
-
Tang Junhui authored
Missed read IOs are identified by s->cache_missed, not s->cache_miss, so use s->cache_missed in trace_bcache_read() to identify whether the IO missed or not. Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
UUIDs are considered metadata. __uuid_write should add the number of buckets written to disk (in sectors) to ca->meta_sectors_written. Currently only 1 bucket is used in a uuid write. Steps to test: 1) create a fresh backing device and a fresh cache device separately; the backing device is not attached to any cache set. 2) cd /sys/block/<cache device>/bcache; cat metadata_written (record the output value); cat bucket_size 3) attach the backing device to the cache set 4) cat metadata_written. Before the change, the output value is almost the same as the value recorded in step 2. After the change, the value is bigger by about 1 bucket size. Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Reviewed-by: Tang Junhui <tang.junhui.linux@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Oct, 2018 3 commits
-
-
Bart Van Assche authored
When debugging e.g. the SCSI timeout handler it is important that requests that have not yet been started or that already have completed are also reported through debugfs. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
git://git.infradead.org/nvmeJens Axboe authored
Pull NVMe updates from Christoph: "A relatively boring merge window: - better AEN tracing (Chaitanya) - NUMA aware PCIe multipathing (me) - RDMA workqueue fixes (Sagi) - better bio usage in the target (Sagi) - FC rework for target removal (James) - better multipath handling of ->queue_rq failures (James) - various cleanups (Milan)" * 'nvme-4.20' of git://git.infradead.org/nvme: nvmet-rdma: use a private workqueue for delete nvme: take node locality into account when selecting a path nvmet: don't split large I/Os unconditionally nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O nvme-core: add async event trace helper nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport device nvmet_fc: support target port removal with nvmet layer nvme-fc: fix for a minor typos nvmet: remove redundant module prefix nvme: fix typo in nvme_identify_ns_descs
-
Sagi Grimberg authored
Queue deletion is done asynchronous when the last reference on the queue is dropped. Thus, in order to make sure we don't over allocate under a connect/disconnect storm, we let queue deletion complete before making forward progress. However, given that we flush the system_wq from rdma_cm context which runs from a workqueue context, we can have a circular locking complaint [1]. Fix that by using a private workqueue for queue deletion. [1]: ====================================================== WARNING: possible circular locking dependency detected 4.19.0-rc4-dbg+ #3 Not tainted ------------------------------------------------------ kworker/5:0/39 is trying to acquire lock: 00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm] but task is already holding lock: 00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 ((work_completion)(&queue->release_work)){+.+.}: process_one_work+0x474/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #2 ((wq_completion)"events"){+.+.}: flush_workqueue+0xf3/0x970 nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma] cma_ib_req_handler+0x72f/0xf90 [rdma_cm] cm_process_work+0x2e/0x110 [ib_cm] cm_req_handler+0x135b/0x1c30 [ib_cm] cm_work_handler+0x2b7/0x38cd [ib_cm] process_one_work+0x4ae/0xa20 nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000040357082 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 nvme nvme0: Reconnecting in 10 seconds... -> #1 (&id_priv->handler_mutex/1){+.+.}: __mutex_lock+0xfe/0xbe0 mutex_lock_nested+0x1b/0x20 cma_ib_req_handler+0x6aa/0xf90 [rdma_cm] cm_process_work+0x2e/0x110 [ib_cm] cm_req_handler+0x135b/0x1c30 [ib_cm] cm_work_handler+0x2b7/0x38cd [ib_cm] process_one_work+0x4ae/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 (&id_priv->handler_mutex){+.+.}: lock_acquire+0xc5/0x200 __mutex_lock+0xfe/0xbe0 mutex_lock_nested+0x1b/0x20 rdma_destroy_id+0x6f/0x440 [rdma_cm] nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma] process_one_work+0x4ae/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 Fixes: 777dc823 ("nvmet-rdma: occasionally flush ongoing controller teardown") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
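A sketch of the shape of the fix: allocate a module-private workqueue for queue release work so the flush no longer targets system_wq; the workqueue name and flags here are illustrative:

```c
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_rdma_delete_wq;

static int nvmet_rdma_init_delete_wq(void)
{
	nvmet_rdma_delete_wq = alloc_workqueue("nvmet-rdma-delete-wq",
					       WQ_UNBOUND, 0);	/* flags are illustrative */
	if (!nvmet_rdma_delete_wq)
		return -ENOMEM;

	return 0;
}

/* Queue teardown then queues and flushes against the private workqueue:
 *	queue_work(nvmet_rdma_delete_wq, &queue->release_work);
 *	flush_workqueue(nvmet_rdma_delete_wq);
 * so rdma_cm callbacks no longer flush the global system workqueue whose
 * work items they themselves depend on, breaking the reported cycle. */
```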
-
- 03 Oct, 2018 2 commits
-
-
Bart Van Assche authored
Some time ago REQ_DISCARD was renamed into REQ_OP_DISCARD. Some comments and documentation files were not updated however. Update these comments and documentation files. See also commit 4e1b2d52 ("block, fs, drivers: remove REQ_OP compat defs and related code"). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Mike Christie <mchristi@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Young_X authored
There is another cast from unsigned long to int which causes a bounds check to fail with specially crafted input. The value is then used as an index in the slot array in cdrom_slot_status(). This issue is similar to CVE-2018-16658 and CVE-2018-10940. Signed-off-by: Young_X <YangX92@hotmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
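A sketch of the class of check involved, with illustrative names loosely following the cdrom ioctl path; the point is to validate the full-width unsigned argument before any narrowing cast:

```c
static int select_disc_arg_to_slot(struct cdrom_device_info *cdi,
				   unsigned long arg)
{
	/* Validate before truncating to int, so a crafted value such as
	 * 0x100000000 cannot pass a check done on the narrowed copy. */
	if (arg >= (unsigned long)cdi->capacity)
		return -EINVAL;

	return (int)arg;	/* now a safe index into the slot array */
}
```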
-