- 06 Apr, 2019 10 commits
-
-
Jonas Rabenstein authored
Add function address (and if available its symbol) to the message if a step function fails. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
response_get_token had already been in place, its functionality had been duplicated within response_get_{u64,bytestring} with the same error handling. Unify the handling by reusing response_get_token within the other functions. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
response_get_{string,u64} include error handling for argument resp being NULL but response_get_token does not handle this. Make all three of response_get_{string,u64,token} handle NULL resp in the same way. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
Every step starts with resetting the cmd buffer as well as the comid and constructs the appropriate OPAL_CALL command. Consequently, those actions may be combined into one generic function. On should take care that the opening and closing tokens for the argument list are already emitted by cmd_start and cmd_finalize respectively and thus must not be additionally added. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
Every step ends by calling cmd_finalize (via finalize_and_send) yet every step adds the token OPAL_ENDLIST on its own. Moving this into cmd_finalize decreases code duplication. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jonas Rabenstein authored
All add_token_* functions have a common set of conditions that have to be checked. Use a common function for those checks in order to avoid different behaviour as well as code duplication. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jonas Rabenstein authored
Also the values of OPAL_UID_LENGTH and OPAL_METHOD_LENGTH are the same, it is weird to use OPAL_UID_LENGTH for the definition of the methods. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
This should make no change in functionality. The formatting changes were triggered by checkpatch.pl. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
David Kozub authored
The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value opal_mbr_data.enable_disable incorrectly: enable_disable is expected to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable was passed directly to set_mbr_done and set_mbr_enable_disable where is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE actually disabled the shadow MBR and vice versa. This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to OPAL_FALSE/TRUE. The change affects existing programs using IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when setting up an Opal drive. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Apr, 2019 25 commits
-
-
git://git.infradead.org/nvmeJens Axboe authored
Pull NVMe changes from Christoph: "Below is the first batch of nvme updates for 5.2. This includes the performance improvements for single segment I/O on PCIe, which introduce new block helpers, so it might be a good idea to get them in early. - various performance optimizations in the PCIe code (Keith and me) - new block helpers to support the above (me) - nvmet error conversion cleanup (me) - nvmet-fc variable sized array cleanup (Gustavo) - passthrough ioctl error printk cleanup (Kenneth) - small nvmet fixes (Max) - endianess conversion cleanup (Max) - nvmet-tcp faspath completion optimization (Sagi)" * 'nvme-5.2' of git://git.infradead.org/nvme: (24 commits) nvme: log the error status on Identify Namespace failure nvmet: add safety check for subsystem lock during nvmet_ns_changed nvmet: never fail double namespace enablement nvme-pci: tidy up nvme_map_data nvme-pci: optimize mapping single segment requests using SGLs nvme-pci: optimize mapping of small single segment requests nvme-pci: remove the inline scatterlist optimization nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data nvme-pci: do not build a scatterlist to map metadata nvme-pci: only call nvme_unmap_data for requests transferring data nvme-pci: merge nvme_free_iod into nvme_unmap_data nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data nvme-pci: remove nvme_init_iod block: add dma_map_bvec helper block: add a rq_dma_dir helper block: add a rq_integrity_vec helper block: add a req_bvec helper nvme-pci: remove unused nvme_iod member nvme-pci: remove q_dmadev from nvme_queue nvme-pci: use a flag for polled queues ...
-
Kenneth Heitke authored
Identify Namespace failures are logged as a warning but there is not an indication of the cause for the failure. Update the log message to include the error status. Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
we need to make sure that subsystem lock is taken during ctrl's list traversing. nvmet_ns_changed function is not static and can be used from various callers simultaneously. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Max Gurtovoy authored
In case we create N namespaces while N < NVMET_MAX_NAMESPACES, we can perform "echo 1 > <nsid>/enable" as much as we want. In case N == NVMET_MAX_NAMESPACES we fail. Make sure we have the same flow for any N. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
This prepares for some bigger changes to the data mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
Provide a nice little shortcut for mapping a single bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
In a lot of places we want to know the DMA direction for a given struct request. Add a little helper to make it a littler easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
This provides a nice little shortcut to get the integrity data for drivers like NVMe that only support a single integrity segment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Christoph Hellwig authored
Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Keith Busch authored
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
We don't need to save the dma device as it's not used in the hot path and hasn't in a long time. Shrink the struct nvme_queue removing this unnecessary member. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Keith Busch authored
A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
TP 8000 says that the use of the SUCCESS flag depends on weather the controller support disabling sq_head pointer updates. Given that we support it by default, makes sense that we go the extra mile to actually use the SUCCESS flag. When we create the C2HData PDU header, we check if sqhd_disabled is set on our queue, if so, we set the SUCCESS flag in the PDU header and skip sending a completion response capsule. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com> Tested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Gustavo A. R. Silva authored
Update the code to use a zero-sized array instead of a pointer in structure nvmet_fc_tgt_queue and use struct_size() in kzalloc(). Notice that one of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(struct boo) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Christoph Hellwig authored
Use errno_to_nvme_status to convert from a negative errno to a nvme status field instead of going through a blk_status_t. Also remove the pointless status variable in nvmet_bdev_execute_write_zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Max Gurtovoy authored
Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of le64_to_cpup. This will also align the code to nvme-core driver convention. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 04 Apr, 2019 1 commit
-
-
Johannes Thumshirn authored
With the introduction of BIO_NO_PAGE_REF we've used up all available bits in bio::bi_flags. Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX causing crashes. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 Apr, 2019 4 commits
-
-
NeilBrown authored
Currently if many flush requests are submitted to an md device is quick succession, they are serialized and can take a long to process them all. We don't really need to call flush all those times - a single flush call can satisfy all requests submitted before it started. So keep track of when the current flush started and when it finished, allow any pending flush that was requested before the flush started to complete without waiting any more. Test results from Xiao: Test is done on a raid10 device which is created by 4 SSDs. The tool is dbench. 1. The latest linux stable kernel Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 10.509 78.305 Flush 2078376 0.013 10.094 Close 21787697 0.019 18.821 LockX 96580 0.007 3.184 Mkdir 384 0.008 0.062 Rename 1255883 0.191 23.534 ReadX 46495589 0.020 14.230 WriteX 14790591 7.123 60.706 Unlink 5989118 0.440 54.551 UnlockX 96580 0.005 2.736 FIND_FIRST 10393845 0.042 12.079 SET_FILE_INFORMATION 2415558 0.129 10.088 QUERY_FILE_INFORMATION 4711725 0.005 8.462 QUERY_PATH_INFORMATION 26883327 0.032 21.715 QUERY_FS_INFORMATION 4929409b 0.010 8.238 NTCreateX 29660080 0.100 53.268 Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs max_latency=60.712 ms 2. With patch1 "Revert "MD: fix lock contention for flush bios"" Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 256 8.326 36.761 Flush 693291 3.974 180.269 Close 7266404 0.009 36.929 LockX 32160 0.006 0.840 Mkdir 128 0.008 0.021 Rename 418755 0.063 29.945 ReadX 15498708 0.007 7.216 WriteX 4932310 22.482 267.928 Unlink 1997557 0.109 47.553 UnlockX 32160 0.004 1.110 FIND_FIRST 3465791 0.036 7.320 SET_FILE_INFORMATION 805825 0.015 1.561 QUERY_FILE_INFORMATION 1570950 0.005 2.403 QUERY_PATH_INFORMATION 8965483 0.013 14.277 QUERY_FS_INFORMATION 1643626 0.009 3.314 NTCreateX 9892174 0.061 41.278 Throughput 345.009 MB/sec (sync open) 128 clients 128 procs max_latency=267.939 m 3. With patch1 and patch2 Operation Count AvgLat MaxLat -------------------------------------------------- Deltree 768 9.570 54.588 Flush 2061354 0.666 15.102 Close 21604811 0.012 25.697 LockX 95770 0.007 1.424 Mkdir 384 0.008 0.053 Rename 1245411 0.096 12.263 ReadX 46103198 0.011 12.116 WriteX 14667988 7.375 60.069 Unlink 5938936 0.173 30.905 UnlockX 95770 0.005 4.147 FIND_FIRST 10306407 0.041 11.715 SET_FILE_INFORMATION 2395987 0.048 7.640 QUERY_FILE_INFORMATION 4672371 0.005 9.291 QUERY_PATH_INFORMATION 26656735 0.018 19.719 QUERY_FS_INFORMATION 4887940 0.010 7.654 NTCreateX 29410811 0.059 28.551 Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs max_latency=60.075 ms Cc: <stable@vger.kernel.org> # v4.19+ Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
NeilBrown authored
This reverts commit 5a409b4f. This patch has two problems. 1/ it make multiple calls to submit_bio() from inside a make_request_fn. The bios thus submitted will be queued on current->bio_list and not submitted immediately. As the bios are allocated from a mempool, this can theoretically result in a deadlock - all the pool of requests could be in various ->bio_list queues and a subsequent mempool_alloc could block waiting for one of them to be released. 2/ It aims to handle a case when there are many concurrent flush requests. It handles this by submitting many requests in parallel - all of which are identical and so most of which do nothing useful. It would be more efficient to just send one lower-level request, but allow that to satisfy multiple upper-level requests. Fixes: 5a409b4f ("MD: fix lock contention for flush bios") Cc: <stable@vger.kernel.org> # v4.19+ Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Nigel Croxon authored
Changing state from check_state_check_result to check_state_compute_result not only is unsafe but also doesn't appear to serve a valid purpose. A raid6 check should only be pushing out extra writes if doing repair and a mis-match occurs. The stripe dev management will already try and do repair writes for failing sectors. This patch makes the raid6 check_state_check_result handling work more like raid5's. If somehow too many failures for a check, just quit the check operation for the stripe. When any checks pass, don't try and use check_state_compute_result for a purpose it isn't needed for and is unsafe for. Just mark the stripe as in sync for passing its parity checks and let the stripe dev read/write code and the bad blocks list do their job handling I/O errors. Repro steps from Xiao: These are the steps to reproduce this problem: 1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c 2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1 3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6 sde is the disk created by scsi_debug 4. echo "2" >/sys/module/scsi_debug/parameters/opts 5. raid-check It panic: [ 4854.730899] md: data-check of RAID array md127 [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current] [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00 [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0 [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current] [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00 [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000 [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current] [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00 [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000 [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current] [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00 [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0 [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1). [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1). [ 4854.906201] ------------[ cut here ]------------ [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190! raid5.c:4190 above is this BUG_ON: handle_parity_checks6() ... BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ Cc: <stable@vger.kernel.org> # v3.16+ OriginalAuthor: David Jeffery <djeffery@redhat.com> Cc: Xiao Ni <xni@redhat.com> Tested-by: David Jeffery <djeffery@redhat.com> Signed-off-by: David Jeffy <djeffery@redhat.com> Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
loop is one block device, for any bio submitted to this device, the upper layer does guarantee that pages added to loop's bio won't go away when the bio is in-flight. So mark loop's bvec as ITER_BVEC_FLAG_NO_REF then get_page/put_page can be saved for serving loop's IO. Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-