1. 24 Jan, 2019 18 commits
  2. 23 Jan, 2019 12 commits
    • Revert "Change mincore() to count "mapped" pages rather than "cached" pages" · 30bac164
      Linus Torvalds authored
      This reverts commit 574823bf.
      
      It turns out that my hope that we could just remove the code that
      exposes the cache residency status from mincore() was too optimistic.
      
      There are various random users that want it, and one example would be
      the Netflix database cluster maintenance. To quote Josh Snyder:
      
       "For Netflix, losing accurate information from the mincore syscall
        would lengthen database cluster maintenance operations from days to
        months. We rely on cross-process mincore to migrate the contents of a
        page cache from machine to machine, and across reboots.
      
        To do this, I wrote and maintain happycache [1], a page cache
        dumper/loader tool. It is quite similar in architecture to pgfincore,
        except that it is agnostic to workload. The gist of happycache's
        operation is "produce a dump of residence status for each page, do
        some operation, then reload exactly the same pages which were present
        before." happycache is entirely dependent on accurate reporting of the
        in-core status of file-backed pages, as accessed by another process.
      
        We primarily use happycache with Cassandra, which (like Postgres +
        pgfincore) relies heavily on OS page cache to reduce disk accesses.
        Because our workloads never experience a cold page cache, we are able
        to provision hardware for a peak utilization level that is far lower
        than the hypothetical "every query is a cache miss" peak.
      
        A database warmed by happycache can be ready for service in seconds
        (bounded only by the performance of the drives and the I/O subsystem),
        with no period of in-service degradation. By contrast, putting a
        database in service without a page cache entails a potentially
        unbounded period of degradation (at Netflix, the time to populate a
        single node's cache via natural cache misses varies by workload from
        hours to weeks). If a single node upgrade were to take weeks, then
        upgrading an entire cluster would take months. Since we want to apply
        security upgrades (and other things) on a somewhat tighter schedule,
        we would have to develop more complex solutions to provide the same
        functionality already provided by mincore.
      
        At the bottom line, happycache is designed to benignly exploit the
        same information leak documented in the paper [2]. I think it makes
        perfect sense to remove cross-process mincore functionality from
        unprivileged users, but not to remove it entirely"
      
      We do have an alternate approach that limits the cache residency
      reporting only to processes that have write permissions to the file, so
      we can fix the original information leak issue that way.  It involves
      _adding_ code rather than removing it, which is sad, but hey, at least
      we haven't found any users that would find the restrictions
      unacceptable.
      
      So revert the optimistic first approach to make room for that alternate
      fix instead.
      Reported-by: Josh Snyder <joshs@netflix.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Dominique Martinet <asmadeus@codewreck.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daniel Gruss <daniel@gruss.cc>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
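      The per-page residency dump described in the quote above can be
      illustrated with a minimal userspace sketch. It assumes Linux and the
      mincore(2) system call named in the commit; the helper name
      count_resident_pages() is hypothetical and is not taken from happycache.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `path` read-only and ask mincore() which of its
 * pages are currently resident. Returns the number of resident pages, or -1
 * on error. If total_pages is non-NULL it receives the file's page count.
 */
long count_resident_pages(const char *path, long *total_pages)
{
    long resident = -1;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t len = (size_t)st.st_size;
    size_t pages = (len + (size_t)page - 1) / (size_t)page;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* mincore() fills one byte per page; bit 0 set means "resident". */
    unsigned char *vec = malloc(pages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;
    }

    free(vec);
    munmap(map, len);
    close(fd);
    if (total_pages != NULL)
        *total_pages = (long)pages;
    return resident;
}
```

      A dump/reload tool would record the per-page vector rather than just
      the count; the count is used here only to keep the sketch short.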
    • Merge tag 'for-linus-5.0' of git://github.com/cminyard/linux-ipmi · db781446
      Linus Torvalds authored
      Pull IPMI fixes from Corey Minyard:
       "I missed the merge window, which wasn't really important at the time
        as there was nothing that critical that I had for 5.0.
      
        However, I say that, and then a number of critical fixes come in:
      
         - ipmi: fix use-after-free of user->release_barrier.rda
         - ipmi: Prevent use-after-free in deliver_response
         - ipmi: msghandler: Fix potential Spectre v1 vulnerabilities
      
        which are obvious candidates for 5.0.  Then there is:
      
         - ipmi:ssif: Fix handling of multi-part return messages
      
        which is less critical, but it still has some off-by-one things that
        are not great, so it seemed appropriate. Some machines are broken
        without it. Then:
      
         - ipmi: Don't initialize anything in the core until something uses it
      
        It turns out that using SRCU causes large chunks of memory to be used
        on big iron machines, even if IPMI is never used. This was causing
        some issues for people on those machines.
      
        Everything here is destined for stable"
      
      * tag 'for-linus-5.0' of git://github.com/cminyard/linux-ipmi:
        ipmi: Don't initialize anything in the core until something uses it
        ipmi: fix use-after-free of user->release_barrier.rda
        ipmi: Prevent use-after-free in deliver_response
        ipmi: msghandler: Fix potential Spectre v1 vulnerabilities
        ipmi:ssif: Fix handling of multi-part return messages
    • Merge tag 's390-5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 09c2fe60
      Linus Torvalds authored
      Pull s390 fixes from Martin Schwidefsky:
      
       - Do not claim to run under z/VM if the hypervisor can not be
         identified
      
       - Fix crashes due to outdated ASCEs in CR1
      
       - Avoid a deadlock in regard to CPU hotplug
      
       - Really fix the vdso mapping issue for compat tasks
      
       - Avoid crash on restart due to an incorrect stack address
      
      * tag 's390-5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU
        s390/vdso: correct vdso mapping for compat tasks
        s390/smp: fix CPU hotplug deadlock with CPU rescan
        s390/mm: always force a load of the primary ASCE on context switch
        s390/early: improve machine detection
    • ipmi: Don't initialize anything in the core until something uses it · 913a89f0
      Corey Minyard authored
      The IPMI driver was recently modified to use SRCU, but it turns out
      this uses a chunk of percpu memory, even if IPMI is never used.
      
      So modify things to only initialize on the first use.  There was already
      code to sort of handle this for handling init races, so piggy back
      on top of that, and simplify it in the process.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Reported-by: Tejun Heo <tj@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org # 4.18
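      The "initialize on first use" idea in this commit can be sketched in
      userspace C. This is an analogy only, not the driver's code: every name
      below (ensure_initialized(), expensive_setup(), open_user()) is
      hypothetical, and a pthread mutex stands in for whatever locking the
      driver uses to handle init races.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static bool initialized;
static int setup_calls; /* instrumentation for the sketch only */

/* Stand-in for the costly setup that should not run unless needed. */
static int expensive_setup(void)
{
    setup_calls++;
    return 0;
}

/*
 * Run the setup at most once, on the first caller. The lock makes
 * concurrent first callers wait instead of initializing twice.
 */
int ensure_initialized(void)
{
    int rv = 0;

    pthread_mutex_lock(&init_lock);
    if (!initialized) {
        rv = expensive_setup();
        if (rv == 0)
            initialized = true;
    }
    assert(rv != 0 || initialized);
    pthread_mutex_unlock(&init_lock);
    return rv;
}

/* Every entry point that needs the subsystem calls the guard first. */
int open_user(void)
{
    return ensure_initialized();
}

int setup_call_count(void)
{
    return setup_calls;
}
```

      With this shape, a system that never calls open_user() never pays for
      expensive_setup(), which is the property the commit is after.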
    • ipmi: fix use-after-free of user->release_barrier.rda · 77f82696
      Yang Yingliang authored
      When we run the following test, we get an oops in the ipmi_msghandler driver:
      while((1))
      do
      	service ipmievd restart & service ipmievd restart
      done
      
      ---------------------------------------------------------------
      [  294.230186] Unable to handle kernel paging request at virtual address 0000803fea6ea008
      [  294.230188] Mem abort info:
      [  294.230190]   ESR = 0x96000004
      [  294.230191]   Exception class = DABT (current EL), IL = 32 bits
      [  294.230193]   SET = 0, FnV = 0
      [  294.230194]   EA = 0, S1PTW = 0
      [  294.230195] Data abort info:
      [  294.230196]   ISV = 0, ISS = 0x00000004
      [  294.230197]   CM = 0, WnR = 0
      [  294.230199] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000a1c1b75a
      [  294.230201] [0000803fea6ea008] pgd=0000000000000000
      [  294.230204] Internal error: Oops: 96000004 [#1] SMP
      [  294.235211] Modules linked in: nls_utf8 isofs rpcrdma ib_iser ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad rdma_cm ib_cm iw_cm dm_mirror dm_region_hash dm_log dm_mod aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce ses sha256_arm64 sha1_ce hibmc_drm hisi_sas_v2_hw enclosure sg hisi_sas_main sbsa_gwdt ip_tables mlx5_ib ib_uverbs marvell ib_core mlx5_core ixgbe ipmi_si mdio hns_dsaf ipmi_devintf ipmi_msghandler hns_enet_drv hns_mdio
      [  294.277745] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.0.0-rc2+ #113
      [  294.285511] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.37 11/21/2017
      [  294.292835] pstate: 80000005 (Nzcv daif -PAN -UAO)
      [  294.297695] pc : __srcu_read_lock+0x38/0x58
      [  294.301940] lr : acquire_ipmi_user+0x2c/0x70 [ipmi_msghandler]
      [  294.307853] sp : ffff00001001bc80
      [  294.311208] x29: ffff00001001bc80 x28: ffff0000117e5000
      [  294.316594] x27: 0000000000000000 x26: dead000000000100
      [  294.321980] x25: dead000000000200 x24: ffff803f6bd06800
      [  294.327366] x23: 0000000000000000 x22: 0000000000000000
      [  294.332752] x21: ffff00001001bd04 x20: ffff80df33d19018
      [  294.338137] x19: ffff80df33d19018 x18: 0000000000000000
      [  294.343523] x17: 0000000000000000 x16: 0000000000000000
      [  294.348908] x15: 0000000000000000 x14: 0000000000000002
      [  294.354293] x13: 0000000000000000 x12: 0000000000000000
      [  294.359679] x11: 0000000000000000 x10: 0000000000100000
      [  294.365065] x9 : 0000000000000000 x8 : 0000000000000004
      [  294.370451] x7 : 0000000000000000 x6 : ffff80df34558678
      [  294.375836] x5 : 000000000000000c x4 : 0000000000000000
      [  294.381221] x3 : 0000000000000001 x2 : 0000803fea6ea000
      [  294.386607] x1 : 0000803fea6ea008 x0 : 0000000000000001
      [  294.391994] Process swapper/3 (pid: 0, stack limit = 0x0000000083087293)
      [  294.398791] Call trace:
      [  294.401266]  __srcu_read_lock+0x38/0x58
      [  294.405154]  acquire_ipmi_user+0x2c/0x70 [ipmi_msghandler]
      [  294.410716]  deliver_response+0x80/0xf8 [ipmi_msghandler]
      [  294.416189]  deliver_local_response+0x28/0x68 [ipmi_msghandler]
      [  294.422193]  handle_one_recv_msg+0x158/0xcf8 [ipmi_msghandler]
      [  294.432050]  handle_new_recv_msgs+0xc0/0x210 [ipmi_msghandler]
      [  294.441984]  smi_recv_tasklet+0x8c/0x158 [ipmi_msghandler]
      [  294.451618]  tasklet_action_common.isra.5+0x88/0x138
      [  294.460661]  tasklet_action+0x2c/0x38
      [  294.468191]  __do_softirq+0x120/0x2f8
      [  294.475561]  irq_exit+0x134/0x140
      [  294.482445]  __handle_domain_irq+0x6c/0xc0
      [  294.489954]  gic_handle_irq+0xb8/0x178
      [  294.497037]  el1_irq+0xb0/0x140
      [  294.503381]  arch_cpu_idle+0x34/0x1a8
      [  294.510096]  do_idle+0x1d4/0x290
      [  294.516322]  cpu_startup_entry+0x28/0x30
      [  294.523230]  secondary_start_kernel+0x184/0x1d0
      [  294.530657] Code: d538d082 d2800023 8b010c81 8b020021 (c85f7c25)
      [  294.539746] ---[ end trace 8a7a880dee570b29 ]---
      [  294.547341] Kernel panic - not syncing: Fatal exception in interrupt
      [  294.556837] SMP: stopping secondary CPUs
      [  294.563996] Kernel Offset: disabled
      [  294.570515] CPU features: 0x002,21006008
      [  294.577638] Memory Limit: none
      [  294.587178] Starting crashdump kernel...
      [  294.594314] Bye!
      
      user->release_barrier.rda is freed in ipmi_destroy_user() while the
      refcount is still nonzero, so when acquire_ipmi_user() later uses
      user->release_barrier.rda in __srcu_read_lock(), the kernel oopses.
      Fix this by calling cleanup_srcu_struct() only when the refcount drops to zero.
      
      Fixes: e86ee2d4 ("ipmi: Rework locking and shutdown for hot remove")
      Cc: stable@vger.kernel.org # 4.18
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
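      The ordering rule behind this fix (tear down per-object state on the
      final reference drop, not in the destroy call) can be sketched in plain
      C. This is a single-threaded illustration with hypothetical names; the
      barrier_state field merely stands in for the SRCU data, and a real
      implementation would use an atomic reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct user {
    int refcount;       /* plain int: sketch is single-threaded */
    bool valid;         /* cleared by destroy_user() */
    int *barrier_state; /* stand-in for state other holders still use */
};

static int cleanups; /* counts how often the state was torn down */

struct user *create_user(void)
{
    struct user *u = malloc(sizeof(*u));

    if (u == NULL)
        return NULL;
    u->barrier_state = malloc(sizeof(*u->barrier_state));
    if (u->barrier_state == NULL) {
        free(u);
        return NULL;
    }
    u->refcount = 1;
    u->valid = true;
    return u;
}

void user_get(struct user *u)
{
    assert(u->refcount > 0);
    u->refcount++;
}

/* Drop one reference; tear everything down only on the last one. */
bool user_put(struct user *u)
{
    assert(u->refcount > 0);
    if (--u->refcount > 0)
        return false;
    free(u->barrier_state);
    cleanups++;
    free(u);
    return true;
}

/* Destroy marks the user invalid but leaves shared state to user_put(). */
bool destroy_user(struct user *u)
{
    u->valid = false;
    return user_put(u);
}

int cleanup_count(void)
{
    return cleanups;
}
```

      Freeing barrier_state inside destroy_user() instead would reproduce the
      bug pattern: a concurrent holder of a reference could still touch it.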
    • ipmi: Prevent use-after-free in deliver_response · 479d6b39
      Fred Klassen authored
      Some IPMI modules (e.g. ibmpex_msg_handler()) will have ipmi_usr_hdlr
      handlers that call ipmi_free_recv_msg() directly. This will essentially
      kfree(msg), leading to use-after-free.
      
      This does not happen in the ipmi_devintf module, which will queue the
      message and run ipmi_free_recv_msg() later.
      
      BUG: KASAN: use-after-free in deliver_response+0x12f/0x1b0
      Read of size 8 at addr ffff888a7bf20018 by task ksoftirqd/3/27
      CPU: 3 PID: 27 Comm: ksoftirqd/3 Tainted: G           O      4.19.11-amd64-ani99-debug #12.0.1.601133+pv
      Hardware name: AppNeta r1000/X11SPW-TF, BIOS 2.1a-AP 09/17/2018
      Call Trace:
      dump_stack+0x92/0xeb
      print_address_description+0x73/0x290
      kasan_report+0x258/0x380
      deliver_response+0x12f/0x1b0
      ? ipmi_free_recv_msg+0x50/0x50
      deliver_local_response+0xe/0x50
      handle_one_recv_msg+0x37a/0x21d0
      handle_new_recv_msgs+0x1ce/0x440
      ...
      
      Allocated by task 9885:
      kasan_kmalloc+0xa0/0xd0
      kmem_cache_alloc_trace+0x116/0x290
      ipmi_alloc_recv_msg+0x28/0x70
      i_ipmi_request+0xb4a/0x1640
      ipmi_request_settime+0x1b8/0x1e0
      ...
      
      Freed by task 27:
      __kasan_slab_free+0x12e/0x180
      kfree+0xe9/0x280
      deliver_response+0x122/0x1b0
      deliver_local_response+0xe/0x50
      handle_one_recv_msg+0x37a/0x21d0
      handle_new_recv_msgs+0x1ce/0x440
      tasklet_action_common.isra.19+0xc4/0x250
      __do_softirq+0x11f/0x51f
      
      Fixes: e86ee2d4 ("ipmi: Rework locking and shutdown for hot remove")
      Cc: stable@vger.kernel.org # 4.18
      Signed-off-by: Fred Klassen <fklassen@appneta.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
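      The hazard described above (a handler that frees the message it was
      handed) is commonly avoided by copying whatever is still needed out of
      the message before the callback runs. The sketch below shows that
      general pattern with hypothetical types; it is not claimed to be the
      exact change made in this commit.

```c
#include <assert.h>
#include <stdlib.h>

struct recv_msg;

struct owner {
    int released;                        /* how many messages were completed */
    void (*handler)(struct recv_msg *m); /* may free the message */
};

struct recv_msg {
    struct owner *user;
    int payload;
};

/* A handler that consumes and frees the message immediately. */
void freeing_handler(struct recv_msg *msg)
{
    free(msg);
}

/*
 * Deliver a message. Ownership passes to the handler, so `msg` must be
 * treated as dangling once the handler returns.
 */
void deliver(struct recv_msg *msg)
{
    assert(msg != NULL && msg->user != NULL);

    struct owner *user = msg->user; /* cache before handing msg over */

    user->handler(msg);             /* msg may be freed in here */
    user->released++;               /* uses the cached pointer, not msg->user */
}
```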
    • ipmi: msghandler: Fix potential Spectre v1 vulnerabilities · a7102c74
      Gustavo A. R. Silva authored
      channel and addr->channel are indirectly controlled by user-space,
      hence leading to a potential exploitation of the Spectre variant 1
      vulnerability.
      
      These issues were detected with the help of Smatch:
      
      drivers/char/ipmi/ipmi_msghandler.c:1381 ipmi_set_my_address() warn: potential spectre issue 'user->intf->addrinfo' [w] (local cap)
      drivers/char/ipmi/ipmi_msghandler.c:1401 ipmi_get_my_address() warn: potential spectre issue 'user->intf->addrinfo' [r] (local cap)
      drivers/char/ipmi/ipmi_msghandler.c:1421 ipmi_set_my_LUN() warn: potential spectre issue 'user->intf->addrinfo' [w] (local cap)
      drivers/char/ipmi/ipmi_msghandler.c:1441 ipmi_get_my_LUN() warn: potential spectre issue 'user->intf->addrinfo' [r] (local cap)
      drivers/char/ipmi/ipmi_msghandler.c:2260 check_addr() warn: potential spectre issue 'intf->addrinfo' [r] (local cap)
      
      Fix this by sanitizing channel and addr->channel before using them to
      index user->intf->addrinfo and intf->addrinfo, correspondingly.
      
      Notice that given that speculation windows are large, the policy is
      to kill the speculation on the first load and not worry if it can be
      completed with a dependent load/store [1].
      
      [1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
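      "Sanitizing" an index here means clamping it with a data dependency
      rather than relying only on a branch. The userspace sketch below shows
      just the masking arithmetic, under these assumptions: the names
      index_mask(), clamp_index() and lookup_channel() are hypothetical, the
      compiler sign-extends on right shift of negative values (as GCC and
      Clang do), and no claim is made that this code by itself stops
      speculation outside the kernel's own helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * If index >= size, either index or (size - 1 - index) has its top bit set,
 * so the inverted value is non-negative and the shift yields 0.
 */
unsigned long index_mask(unsigned long index, unsigned long size)
{
    long inv = ~(long)(index | (size - 1UL - index));

    return (unsigned long)(inv >> (sizeof(long) * CHAR_BIT - 1));
}

/* Clamp an out-of-range index to 0; in-range indices pass through. */
unsigned long clamp_index(unsigned long index, unsigned long size)
{
    assert(size > 0 && size <= (unsigned long)LONG_MAX);
    return index & index_mask(index, size);
}

/* Bounds-check first, then clamp before the dependent array load. */
int lookup_channel(const int *table, size_t n, unsigned long channel)
{
    if (channel >= n)
        return -1;
    channel = clamp_index(channel, n);
    return table[channel];
}
```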
    • ipmi:ssif: Fix handling of multi-part return messages · 7d6380cd
      Corey Minyard authored
      The block number was not being compared correctly; it was off by one
      when checking the response.

      Some statistics wouldn't be incremented properly in some cases.

      Check that middle-part messages always have 31 bytes of
      data.
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Cc: stable@vger.kernel.org # 4.4
    • drm/i915/gvt: release shadow batch buffer and wa_ctx before destroy one workload · 0f755512
      Weinan Li authored
      GVT-g shadows the privileged batch buffer and the indirect context
      during command scan. Move the release process into
      intel_vgpu_destroy_workload() to ensure the resources are recycled
      properly.
      
      Fixes: 0cce2823 ("drm/i915/gvt/kvmgt:Refine error handling for prepare_execlist_workload")
      Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
      Signed-off-by: Weinan Li <weinan.z.li@intel.com>
      Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
    • Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 333478a7
      Linus Torvalds authored
      Pull thermal management fixes from Zhang Rui:
      
       - Fix a race condition in which sysfs could be accessed before the
         necessary initialization in the int340x thermal driver. (Aaron Hill)
      
       - Fix a NULL vs IS_ERR() check in int340x thermal driver. (Dan
         Carpenter)
      
      * 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
        drivers: thermal: int340x_thermal: Fix sysfs race condition
        thermal: int340x_thermal: Fix a NULL vs IS_ERR() check
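      The "NULL vs IS_ERR()" item above refers to functions that report
      failure by encoding a negative errno in the returned pointer, so a NULL
      test never fires. A userspace sketch of that convention follows; the
      lowercase helpers mirror the idea of the kernel's ERR_PTR()/IS_ERR()
      macros, while alloc_zone() and probe() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Error pointers occupy the top MAX_ERRNO addresses; NULL is not one. */
static inline bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that fails with an error pointer, never NULL. */
void *alloc_zone(int fail)
{
    static int zone;

    return fail ? err_ptr(-ENOMEM) : (void *)&zone;
}

/* The caller must test with is_err(); `if (!p)` would miss the failure. */
int probe(int fail)
{
    void *p = alloc_zone(fail);

    if (is_err(p))
        return (int)ptr_err(p);
    return 0;
}
```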
    • Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 0b0d4be6
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "This is a sort of random collection of clk fixes that have come in
        since the merge window:
      
         - Handful of memory allocation and potentially bad pointer usage
           fixes
      
         - JSON format was incorrect for clk_dump because it missed a comma
      
         - Two Kconfig fixes, one duplicate and one missing select line
      
         - Compiler warning fix for the VC5 clk driver
      
         - Name and rate fixes for PLLs in the stratix10 driver so it can
           properly detect PLL rates and parents"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: socfpga: stratix10: fix naming convention for the fixed-clocks
        clk: socfpga: stratix10: fix rate calculation for pll clocks
        clk: qcom: Select QCOM_GDSC with MSM_GCC_8998
        clk: vc5: Abort clock configuration without upstream clock
        clk: sysfs: fix invalid JSON in clk_dump
        clk: imx: Remove Kconfig duplicate include
        clk: zynqmp: Fix memory allocation in zynqmp_clk_setup
        clk: tegra: dfll: Fix a potential Oop in remove()
        clk: imx: fix potential NULL dereference in imx8qxp_lpcg_clk_probe()
    • Merge tag 'linux-kselftest-5.0-rc4' of... · 8f45fa27
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest fixes from Shuah Khan:
       "Fixes to rtc, seccomp and other tests"
      
      * tag 'linux-kselftest-5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/seccomp: Abort without user notification support
        selftests: gpio-mockup-chardev: Check asprintf() for error
        selftests: seccomp: use LDLIBS instead of LDFLAGS
        selftests/vm/gup_benchmark.c: match gup struct to kernel
        tools/testing/selftests/x86/unwind_vdso.c: Remove duplicate header
        x86/mpx/selftests: fix spelling mistake "succeded" -> "succeeded"
        selftests: rtc: rtctest: add alarm test on minute boundary
        selftests: rtc: rtctest: fix alarm tests
  3. 22 Jan, 2019 6 commits
  4. 21 Jan, 2019 4 commits
    • drm/amd/powerplay: OD setting fix on Vega10 · 6d87dc97
      Kenneth Feng authored
      gfxclk for OD setting is limited to 1980M for non-acg
      ASICs of Vega10
      Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • Merge tag 'iommu-fixes-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 52e60b75
      Linus Torvalds authored
      Pull IOMMU fix from Joerg Roedel:
       "One fix only for now: Fix probe deferral in iommu/of code (broke with
        recent changes to iommu_ops->add_device invocation)"
      
      * tag 'iommu-fixes-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/of: Fix probe-deferral
    • Merge tag 'arc-5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 57ef300e
      Linus Torvalds authored
      Pull ARC architecture updates from Vineet Gupta:
      
       - Perf support for raw events
      
       - boot log printing: return stack, action points
      
       - fix memset to avoid prefetchw bleeding past end of buffer
      
       - do_page_fault fix for mmap_sem held while returning to userspace
      
       - other misc fixes
      
      * tag 'arc-5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARCv2: lib: memeset: fix doing prefetchw outside of buffer
        ARC: mm: do_page_fault fixes #1: relinquish mmap_sem if signal arrives while handle_mm_fault
        ARC: show_regs: lockdep: re-enable preemption
        ARC: show_regs: lockdep: avoid page allocator...
        ARC: perf: avoid kernel killing where it is possible
        ARC: perf: move HW events mapping to separate function
        ARC: perf: introduce Kernel PMU events support
        ARC: perf: trivial code cleanup
        ARC: perf: map generic branches to correct hardware condition
        ARC: adjust memblock_reserve of kernel memory
        arc: remove redundant kernel-space generic-y
        ARC: fix __ffs return value to avoid build warnings
        ARC: boot log: print Action point details
        ARCv2: boot log: BPU return stack depth
    • dm: fix redundant IO accounting for bios that need splitting · a1e1cb72
      Mike Snitzer authored
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
      entry.  This repeat oscillation of the IO accounting, up then down,
      isn't ideal but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: Bryan Gurney <bgurney@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
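      The before/after figures in this commit come from the seventh field of
      /sys/block/<dev>/stat (sectors written, in 512-byte units). The sketch
      below redoes that arithmetic; the helper names are hypothetical, and the
      stat line used in any test is made up apart from the sector counts
      quoted above.

```c
#include <assert.h>
#include <stdio.h>

#define SECTOR_BYTES 512LL

/*
 * Parse one line in /sys/block/<dev>/stat format and return the number of
 * bytes written (field 7 times 512), or -1 if the line is too short.
 */
long long written_bytes_from_stat(const char *stat_line)
{
    long long f[7];
    int n = sscanf(stat_line, "%lld %lld %lld %lld %lld %lld %lld",
                   &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6]);

    if (n != 7)
        return -1;
    return f[6] * SECTOR_BYTES;
}

/* Whole mebibytes in a byte count. */
long long bytes_to_mib(long long bytes)
{
    assert(bytes >= 0);
    return bytes / (1024LL * 1024LL);
}
```

      Feeding in the two sector counts from the commit message gives 136 MiB
      for 278528 sectors and 16 MiB for 32768 sectors, matching the text.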