Commits · 39218ff4c625dbf2e68224024fe0acaa60bcd51a · Kirill Smelkov / linux

08 Apr, 2021 3 commits

stack: Optionally randomize kernel stack offset each syscall · 39218ff4

Kees Cook authored Apr 01, 2021

This provides the ability for architectures to enable kernel stack base
address offset randomization. This feature is controlled by the boot
param "randomize_kstack_offset=on/off", with its default value set by
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.

This feature is based on the original idea from the last public release
of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
All the credit for the original idea goes to the PaX team. Note that
the design and implementation of this upstream randomize_kstack_offset
feature differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature aims to make harder the various stack-based attacks that
rely on deterministic stack structure. We have had many such attacks in
past (just to name few):

https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html

As Linux kernel stack protections have been constantly improving
(vmap-based stack allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have had to find new ways for their exploits
to work. They have done so, continuing to rely on the kernel's stack
determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
were not relevant. For example, the following recent attacks would have
been hampered if the stack offset was non-deterministic between syscalls:

https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
(page 70: targeting the pt_regs copy with linear stack overflow)

https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
(leaked stack address from one syscall as a target during next syscall)

The main idea is that since the stack offset is randomized on each system
call, it is harder for an attack to reliably land in any particular place
on the thread stack, even with address exposures, as the stack base will
change on the next syscall. Also, since randomization is performed after
placing pt_regs, the ptrace-based approach[1] to discover the randomized
offset during a long-running syscall should not be possible.

Design description:

During most of the kernel's execution, it runs on the "thread stack",
which is pretty deterministic in its structure: it is fixed in size,
and on every entry from userspace to kernel on a syscall the thread
stack starts construction from an address fetched from the per-cpu
cpu_current_top_of_stack variable. The first element to be pushed to the
thread stack is the pt_regs struct that stores all required CPU registers
and syscall parameters. Finally the specific syscall function is called,
with the stack being used as the kernel executes the resulting request.

The goal of randomize_kstack_offset feature is to add a random offset
after the pt_regs has been pushed to the stack and before the rest of the
thread stack is used during the syscall processing, and to change it every
time a process issues a syscall. The source of randomness is currently
architecture-defined (but x86 is using the low byte of rdtsc()). Future
improvements for different entropy sources is possible, but out of scope
for this patch. Further more, to add more unpredictability, new offsets
are chosen at the end of syscalls (the timing of which should be less
easy to measure from userspace than at syscall entry time), and stored
in a per-CPU variable, so that the life of the value does not stay
explicitly tied to a single task.

As suggested by Andy Lutomirski, the offset is added using alloca()
and an empty asm() statement with an output constraint, since it avoids
changes to assembly syscall entry code, to the unwinder, and provides
correct stack alignment as defined by the compiler.

In order to make this available by default with zero performance impact
for those that don't want it, it is boot-time selectable with static
branches. This way, if the overhead is not wanted, it can just be
left turned off with no performance impact.

The generated assembly for x86_64 with GCC looks like this:

...
ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
					    # 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
ffffffff8100398c: 48 29 c4              sub %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
...

As a result of the above stack alignment, this patch introduces about
5 bits of randomness after pt_regs is spilled to the thread stack on
x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for
stack alignment). The amount of entropy could be adjusted based on how
much of the stack space we wish to trade for security.

My measure of syscall performance overhead (on x86_64):

lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
    randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
    randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds

So, roughly 0.9% overhead growth for a no-op syscall, which is very
manageable. And for people that don't want this, it's off by default.

There are two gotchas with using the alloca() trick. First,
compilers that have Stack Clash protection (-fstack-clash-protection)
enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
any dynamic stack allocations. While the randomization offset is
always less than a page, the resulting assembly would still contain
(unreachable!) probing routines, bloating the resulting assembly. To
avoid this, -fno-stack-clash-protection is unconditionally added to
the kernel Makefile since this is the only dynamic stack allocation in
the kernel (now that VLAs have been removed) and it is provably safe
from Stack Clash style attacks.

The second gotcha with alloca() is a negative interaction with
-fstack-protector*, in that it sees the alloca() as an array allocation,
which triggers the unconditional addition of the stack canary function
pre/post-amble which slows down syscalls regardless of the static
branch. In order to avoid adding this unneeded check and its associated
performance impact, architectures need to carefully remove uses of
-fstack-protector-strong (or -fstack-protector) in the compilation units
that use the add_random_kstack() macro and to audit the resulting stack
mitigation coverage (to make sure no desired coverage disappears). No
change is visible for this on x86 because the stack protector is already
unconditionally disabled for the compilation unit, but the change is
required on arm64. There is, unfortunately, no attribute that can be
used to disable stack protector for specific functions.

Comparison to PaX RANDKSTACK feature:

The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. including the location of pt_regs
structure itself on the stack. Initially this patch followed the same
approach, but during the recent discussions[2], it has been determined
to be of a little value since, if ptrace functionality is available for
an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
different offsets in the pt_regs struct, observe the cache behavior of
the pt_regs accesses, and figure out the random stack offset. Another
difference is that the random offset is stored in a per-cpu variable,
rather than having it be per-thread. As a result, these implementations
differ a fair bit in their implementation details and results, though
obviously the intent is similar.

[1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
[2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
[3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.htmlCo-developed-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org

39218ff4

init_on_alloc: Optimize static branches · 51cba1eb

Kees Cook authored Apr 01, 2021

The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
change the assembly ordering of the static branches: they were always out
of line. Use the new jump_label macros to check the CONFIG settings to
default to the "expected" state, which slightly optimizes the resulting
assembly code.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org

51cba1eb

jump_label: Provide CONFIG-driven build state defaults · 0d66ccc1

Kees Cook authored Apr 01, 2021

As shown in the comment in jump_label.h, choosing the initial state of
static branches changes the assembly layout. If the condition is expected
to be likely it's inline, and if unlikely it is out of line via a jump.

A few places in the kernel use (or could be using) a CONFIG to choose the
default state, which would give a small performance benefit to their
compile-time declared default. Provide the infrastructure to do this.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210401232347.2791257-2-keescook@chromium.org

0d66ccc1

04 Apr, 2021 2 commits

Linux 5.12-rc6 · e49d033b
Linus Torvalds authored Apr 04, 2021

e49d033b

firewire: nosy: Fix a use-after-free bug in nosy_ioctl() · 829933ef

Zheyu Ma authored Apr 03, 2021

For each device, the nosy driver allocates a pcilynx structure.
A use-after-free might happen in the following scenario:

 1. Open nosy device for the first time and call ioctl with command
    NOSY_IOC_START, then a new client A will be malloced and added to
    doubly linked list.
 2. Open nosy device for the second time and call ioctl with command
    NOSY_IOC_START, then a new client B will be malloced and added to
    doubly linked list.
 3. Call ioctl with command NOSY_IOC_START for client A, then client A
    will be readded to the doubly linked list. Now the doubly linked
    list is messed up.
 4. Close the first nosy device and nosy_release will be called. In
    nosy_release, client A will be unlinked and freed.
 5. Close the second nosy device, and client A will be referenced,
    resulting in UAF.

The root cause of this bug is that the element in the doubly linked list
is reentered into the list.

Fix this bug by adding a check before inserting a client.  If a client
is already in the linked list, don't insert it.

The following KASAN report reveals it:

   BUG: KASAN: use-after-free in nosy_release+0x1ea/0x210
   Write of size 8 at addr ffff888102ad7360 by task poc
   CPU: 3 PID: 337 Comm: poc Not tainted 5.12.0-rc5+ #6
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
   Call Trace:
     nosy_release+0x1ea/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Allocated by task 337:
     nosy_open+0x154/0x4d0
     misc_open+0x2ec/0x410
     chrdev_open+0x20d/0x5a0
     do_dentry_open+0x40f/0xe80
     path_openat+0x1cf9/0x37b0
     do_filp_open+0x16d/0x390
     do_sys_openat2+0x11d/0x360
     __x64_sys_open+0xfd/0x1a0
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Freed by task 337:
     kfree+0x8f/0x210
     nosy_release+0x158/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   The buggy address belongs to the object at ffff888102ad7300 which belongs to the cache kmalloc-128 of size 128
   The buggy address is located 96 bytes inside of 128-byte region [ffff888102ad7300, ffff888102ad7380)

[ Modified to use 'list_empty()' inside proper lock  - Linus ]

Link: https://lore.kernel.org/lkml/1617433116-5930-1-git-send-email-zheyuma97@gmail.com/Reported-and-tested-by: 马哲宇 (Zheyu Ma) <zheyuma97@gmail.com>
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

829933ef

03 Apr, 2021 14 commits

Merge tag 'for-linus' of git://github.com/openrisc/linux · 2023a53b

Linus Torvalds authored Apr 03, 2021

Pull OpenRISC fix from Stafford Horne:
 "Fix duplicate header include in Litex SOC driver"

* tag 'for-linus' of git://github.com/openrisc/linux:
  soc: litex: Remove duplicated header file inclusion

2023a53b

Merge tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block · d83e98f9

Linus Torvalds authored Apr 03, 2021

POull io_uring fix from Jens Axboe:
 "Just fixing a silly braino in a previous patch, where we'd end up
  failing to compile if CONFIG_BLOCK isn't enabled.

  Not that a lot of people do that, but kernel bot spotted it and it's
  probably prudent to just flush this out now before -rc6.

  Sorry about that, none of my test compile configs have !CONFIG_BLOCK"

* tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block:
  io_uring: fix !CONFIG_BLOCK compilation failure

d83e98f9

soc: litex: Remove duplicated header file inclusion · 1683f7de

Zhen Lei authored Mar 31, 2021

The header file <linux/errno.h> is already included above and can be
removed here.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Mateusz Holenko <mholenko@antmicro.com>
Signed-off-by: Stafford Horne <shorne@gmail.com>

1683f7de

Merge tag 'gfs2-v5.12-rc2-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 8e29be34

Linus Torvalds authored Apr 03, 2021

Pull gfs2 fixes from Andreas Gruenbacher:
 "Two more gfs2 fixes"

* tag 'gfs2-v5.12-rc2-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: report "already frozen/thawed" errors
  gfs2: Flag a withdraw if init_threads() fails

8e29be34

Merge tag 'riscv-for-linus-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 7fd7d5c2

Linus Torvalds authored Apr 03, 2021

Pull RISC-V fixes from Palmer Dabbelt:
 "A handful of fixes for 5.12:

   - fix a stack tracing regression related to "const register asm"
     variables, which have unexpected behavior.

   - ensure the value to be written by put_user() is evaluated before
     enabling access to userspace memory..

   - align the exception vector table correctly, so we don't rely on the
     firmware's handling of unaligned accesses.

   - build fix to make NUMA depend on MMU, which triggered on some
     randconfigs"

* tag 'riscv-for-linus-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Make NUMA depend on MMU
  riscv: remove unneeded semicolon
  riscv,entry: fix misaligned base for excp_vect_table
  riscv: evaluate put_user() arg before enabling user access
  riscv: Drop const annotation for sp

7fd7d5c2

Merge tag 'powerpc-5.12-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9c2ef23e

Linus Torvalds authored Apr 03, 2021

Pull powerpc fixes from Michael Ellerman:
 "Fix a bug on pseries where spurious wakeups from H_PROD would prevent
  partition migration from succeeding.

  Fix oopses seen in pcpu_alloc(), caused by parallel faults of the
  percpu mapping causing us to corrupt the protection key used for the
  mapping, and cause a fatal key fault.

  Thanks to Aneesh Kumar K.V, Murilo Opsfelder Araujo, and Nathan Lynch"

* tag 'powerpc-5.12-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/book3s64: Use the correct storage key value when calling H_PROTECT
  powerpc/pseries/mobility: handle premature return from H_JOIN
  powerpc/pseries/mobility: use struct for shared state

9c2ef23e

Merge tag 'hyperv-fixes-signed-20210402' of... · fa161995

Linus Torvalds authored Apr 03, 2021

Merge tag 'hyperv-fixes-signed-20210402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V fixes from Wei Liu:
 "One fix from Lu Yunlong for a double free in hvfb_probe"

* tag 'hyperv-fixes-signed-20210402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  video: hyperv_fb: Fix a double free in hvfb_probe

fa161995

Merge tag 'driver-core-5.12-rc6' of... · f5664825

Linus Torvalds authored Apr 03, 2021

Merge tag 'driver-core-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here is a single driver core fix for a reported problem with differed
  probing. It has been in linux-next for a while with no reported
  problems"

* tag 'driver-core-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: clear deferred probe reason on probe retry

f5664825

Merge tag 'char-misc-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · a443930a

Linus Torvalds authored Apr 03, 2021

Pull char/misc driver fixes from Greg KH:
 "Here are a few small driver char/misc changes for 5.12-rc6.

  Nothing major here, a few fixes for reported issues:

   - interconnect fixes for problems found

   - fbcon syzbot-found fix

   - extcon fixes

   - firmware stratix10 bugfix

   - MAINTAINERS file update.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  drivers: video: fbcon: fix NULL dereference in fbcon_cursor()
  mei: allow map and unmap of client dma buffer only for disconnected client
  MAINTAINERS: Add linux-phy list and patchwork
  interconnect: Fix kerneldoc warning
  firmware: stratix10-svc: reset COMMAND_RECONFIG_FLAG_PARTIAL to 0
  extcon: Fix error handling in extcon_dev_register
  extcon: Add stubs for extcon_register_notifier_all() functions
  interconnect: core: fix error return code of icc_link_destroy()
  interconnect: qcom: msm8939: remove rpm-ids from non-RPM nodes

a443930a

Merge tag 'staging-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 3e707eb6

Linus Torvalds authored Apr 03, 2021

Pull staging driver fixes from Greg KH:
 "Here are two rtl8192e staging driver fixes for reported problems.

  Both of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8192e: Change state information from u16 to u8
  staging: rtl8192e: Fix incorrect source in memcpy()

3e707eb6

Merge tag 'tty-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 0d2c5a9e

Linus Torvalds authored Apr 03, 2021

Pull serial driver fix from Greg KH:
 "Here is a single serial driver fix for 5.12-rc6. Is is a revert of a
  change that showed up in 5.9 that has been reported to cause problems.

  It has been in linux-next for a while with no reported issues"

* tag 'tty-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  soc: qcom-geni-se: Cleanup the code to remove proxy votes

0d2c5a9e

Merge tag 'usb-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · de879a8d

Linus Torvalds authored Apr 03, 2021

Pull USB fixes from Greg KH:
 "Here are a few small USB driver fixes for 5.12-rc6 to resolve reported
  problems.

  They include:

   - a number of cdc-acm fixes for reported problems. It seems more
     people are using this driver lately...

   - dwc3 driver fixes for reported problems, and fixes for the fixes :)

   - dwc2 driver fixes for reported issues.

   - musb driver fix.

   - new USB quirk additions.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (23 commits)
  usb: dwc2: Prevent core suspend when port connection flag is 0
  usb: dwc2: Fix HPRT0.PrtSusp bit setting for HiKey 960 board.
  usb: musb: Fix suspend with devices connected for a64
  usb: xhci-mtk: fix broken streams issue on 0.96 xHCI
  usb: dwc3: gadget: Clear DEP flags after stop transfers in ep disable
  usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()
  USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem
  USB: cdc-acm: do not log successful probe on later errors
  USB: cdc-acm: always claim data interface
  USB: cdc-acm: use negation for NULL checks
  USB: cdc-acm: clean up probe error labels
  USB: cdc-acm: drop redundant driver-data reset
  USB: cdc-acm: drop redundant driver-data assignment
  USB: cdc-acm: fix use-after-free after probe failure
  USB: cdc-acm: fix double free on probe failure
  USB: cdc-acm: downgrade message to debug
  USB: cdc-acm: untangle a circular dependency between callback and softint
  cdc-acm: fix BREAK rx code path adding necessary calls
  usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference
  usb: dwc3: pci: Enable dis_uX_susphy_quirk for Intel Merrifield
  ...

de879a8d

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 57fbdb15

Linus Torvalds authored Apr 03, 2021

Pull SCSI fix from James Bottomley:
 "A single fix to iscsi for a rare race condition which can cause a
  kernel panic"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: iscsi: Fix race condition between login and sync thread

57fbdb15

io_uring: fix !CONFIG_BLOCK compilation failure · e82ad485

Jens Axboe authored Apr 02, 2021

kernel test robot correctly pinpoints a compilation failure if
CONFIG_BLOCK isn't set:

fs/io_uring.c: In function '__io_complete_rw':
>> fs/io_uring.c:2509:48: error: implicit declaration of function 'io_rw_should_reissue'; did you mean 'io_rw_reissue'? [-Werror=implicit-function-declaration]
    2509 |  if ((res == -EAGAIN || res == -EOPNOTSUPP) && io_rw_should_reissue(req)) {
         |                                                ^~~~~~~~~~~~~~~~~~~~
         |                                                io_rw_reissue
    cc1: some warnings being treated as errors

Ensure that we have a stub declaration of io_rw_should_reissue() for
!CONFIG_BLOCK.

Fixes: 230d50d4 ("io_uring: move reissue into regular IO path")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e82ad485

02 Apr, 2021 17 commits

Merge tag 'block-5.12-2021-04-02' of git://git.kernel.dk/linux-block · d93a0d43

Linus Torvalds authored Apr 02, 2021

Pull block fixes from Jens Axboe:

 - Remove comment that never came to fruition in 22 years of development
   (Christoph)

 - Remove unused request flag (Christoph)

 - Fix for null_blk fake timeout handling (Damien)

 - Fix for IOCB_NOWAIT being ignored for O_DIRECT on raw bdevs (Pavel)

 - Error propagation fix for multiple split bios (Yufen)

* tag 'block-5.12-2021-04-02' of git://git.kernel.dk/linux-block:
  block: remove the unused RQF_ALLOCED flag
  block: update a few comments in uapi/linux/blkpg.h
  block: don't ignore REQ_NOWAIT for direct IO
  null_blk: fix command timeout completion handling
  block: only update parent bi_status when bio fail

d93a0d43

Merge tag 'io_uring-5.12-2021-04-02' of git://git.kernel.dk/linux-block · 1faccb63

Linus Torvalds authored Apr 02, 2021

Pull io_uring fixes from Jens Axboe:
 "Nothing really major in here, and finally nothing really related to
  signals. A few minor fixups related to the threading changes, and some
  general fixes, that's it.

  There's the pending gdb-get-confused-about-arch, but that's more of a
  cosmetic issue, nothing that hinder use of it. And given that other
  archs will likely be affected by that oddity too, better to postpone
  any changes there until 5.13 imho"

* tag 'io_uring-5.12-2021-04-02' of git://git.kernel.dk/linux-block:
  io_uring: move reissue into regular IO path
  io_uring: fix EIOCBQUEUED iter revert
  io_uring/io-wq: protect against sprintf overflow
  io_uring: don't mark S_ISBLK async work as unbounded
  io_uring: drop sqd lock before handling signals for SQPOLL
  io_uring: handle setup-failed ctx in kill_timeouts
  io_uring: always go for cancellation spin on exec

1faccb63

Merge tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 0a84c2e4

Linus Torvalds authored Apr 02, 2021

Pull ACPI fixes from Rafael Wysocki:
 "These fix an ACPI tables management issue, an issue related to the
  ACPI enumeration of devices and CPU wakeup in the ACPI processor
  driver.

  Specifics:

   - Ensure that the memory occupied by ACPI tables on x86 will always
     be reserved to prevent it from being allocated for other purposes
     which was possible in some cases (Rafael Wysocki).

   - Fix the ACPI device enumeration code to prevent it from attempting
     to evaluate the _STA control method for devices with unmet
     dependencies which is likely to fail (Hans de Goede).

   - Fix the handling of CPU0 wakeup in the ACPI processor driver to
     prevent CPU0 online failures from occurring (Vitaly Kuznetsov)"

* tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

0a84c2e4

Merge tag 'pm-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 9314a0e9

Linus Torvalds authored Apr 02, 2021

Pull power management fixes from Rafael Wysocki:
 "These fix a race condition and an ordering issue related to using
  device links in the runtime PM framework and two kerneldoc comments in
  cpufreq.

  Specifics:

   - Fix race condition related to the handling of supplier devices
     during consumer device probe and fix the order of decrementation of
     two related reference counters in the runtime PM core code handling
     supplier devices (Adrian Hunter).

   - Fix kerneldoc comments in cpufreq that have not been updated along
     with the functions documented by them (Geert Uytterhoeven)"

* tag 'pm-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: runtime: Fix race getting/putting suppliers at probe
  PM: runtime: Fix ordering in pm_runtime_get_suppliers()
  cpufreq: Fix scaling_{available,boost}_frequencies_show() comments

9314a0e9

block: remove the unused RQF_ALLOCED flag · f06c6096
Christoph Hellwig authored Apr 02, 2021
```
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
```
f06c6096

block: update a few comments in uapi/linux/blkpg.h · b9c6cdc3

Christoph Hellwig authored Apr 02, 2021

The big top of the file comment talk about grand plans that never
happened, so remove them to not confuse the readers.  Also mark the
devname and volname fields as ignored as they were never used by the
kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9c6cdc3

Merge tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 05de4538

Linus Torvalds authored Apr 02, 2021

Pull tracing fix from Steven Rostedt:
 "Fix stack trace entry size to stop showing garbage

  The macro that creates both the structure and the format displayed to
  user space for the stack trace event was changed a while ago to fix
  the parsing by user space tooling. But this change also modified the
  structure used to store the stack trace event. It changed the caller
  array field from [0] to [8].

  Even though the size in the ring buffer is dynamic and can be
  something other than 8 (user space knows how to handle this), the 8
  extra words was not accounted for when reserving the event on the ring
  buffer, and added 8 more entries, due to the calculation of
  "sizeof(*entry) + nr_entries * sizeof(long)", as the sizeof(*entry)
  now contains 8 entries.

  The size of the caller field needs to be subtracted from the size of
  the entry to create the correct allocation size"

* tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix stack trace event size

05de4538

io_uring: move reissue into regular IO path · 230d50d4

Jens Axboe authored Apr 01, 2021

It's non-obvious how retry is done for block backed files, when it happens
off the kiocb done path. It also makes it tricky to deal with the iov_iter
handling.

Just mark the req as needing a reissue, and handling it from the
submission path instead. This makes it directly obvious that we're not
re-importing the iovec from userspace past the submit point, and it means
that we can just reuse our usual -EAGAIN retry path from the read/write
handling.

At some point in the future, we'll gain the ability to always reliably
return -EAGAIN through the stack. A previous attempt on the block side
didn't pan out and got reverted, hence the need to check for this
information out-of-band right now.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

230d50d4

Merge branches 'acpi-tables' and 'acpi-scan' · 91463ebf

Rafael J. Wysocki authored Apr 02, 2021

* acpi-tables:
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

* acpi-scan:
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies

91463ebf

Merge branch 'pm-cpufreq' · ac1790ad

Rafael J. Wysocki authored Apr 02, 2021

* pm-cpufreq:
  cpufreq: Fix scaling_{available,boost}_frequencies_show() comments

ac1790ad

block: don't ignore REQ_NOWAIT for direct IO · f8b78caf

Pavel Begunkov authored Nov 20, 2020

If IOCB_NOWAIT is set on submission, then that needs to get propagated to
REQ_NOWAIT on the block side. Otherwise we completely lose this
information, and any issuer of IOCB_NOWAIT IO will potentially end up
blocking on eg request allocation on the storage side.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f8b78caf

riscv: Make NUMA depend on MMU · 1adbc294

Kefeng Wang authored Mar 30, 2021

NUMA is useless when NOMMU, and it leads some build error,
make it depend on MMU.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

1adbc294

riscv: remove unneeded semicolon · 9d8c7d92

Yang Li authored Mar 22, 2021

Eliminate the following coccicheck warning:
./arch/riscv/mm/kasan_init.c:219:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

9d8c7d92

riscv,entry: fix misaligned base for excp_vect_table · ac8d0b90

Zihao Yu authored Mar 17, 2021

In RV64, the size of each entry in excp_vect_table is 8 bytes. If the
base of the table is not 8-byte aligned, loading an entry in the table
will raise a misaligned exception. Although such exception will be
handled by opensbi/bbl, this still causes performance degradation.
Signed-off-by: Zihao Yu <yuzihao@ict.ac.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

ac8d0b90

riscv: evaluate put_user() arg before enabling user access · 285a76bb

Ben Dooks authored Mar 29, 2021

The <asm/uaccess.h> header has a problem with put_user(a, ptr) if
the 'a' is not a simple variable, such as a function. This can lead
to the compiler producing code as so:

1:	enable_user_access()
2:	evaluate 'a' into register 'r'
3:	put 'r' to 'ptr'
4:	disable_user_acess()

The issue is that 'a' is now being evaluated with the user memory
protections disabled. So we try and force the evaulation by assigning
'x' to __val at the start, and hoping the compiler barriers in
 enable_user_access() do the job of ordering step 2 before step 1.

This has shown up in a bug where 'a' sleeps and thus schedules out
and loses the SR_SUM flag. This isn't sufficient to fully fix, but
should reduce the window of opportunity. The first instance of this
we found is in scheudle_tail() where the code does:

$ less -N kernel/sched/core.c

4263  if (current->set_child_tid)
4264         put_user(task_pid_vnr(current), current->set_child_tid);

Here, the task_pid_vnr(current) is called within the block that has
enabled the user memory access. This can be made worse with KASAN
which makes task_pid_vnr() a rather large call with plenty of
opportunity to sleep.
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com
Suggested-by: Arnd Bergman <arnd@arndb.de>

--
Changes since v1:
- fixed formatting and updated the patch description with more info

Changes since v2:
- fixed commenting on __put_user() (schwab@linux-m68k.org)

Change since v3:
- fixed RFC in patch title. Should be ready to merge.
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

285a76bb

riscv: Drop const annotation for sp · 23c1075a

Kefeng Wang authored Mar 17, 2021

The const annotation should not be used for 'sp', or it will
become read only and lead to bad stack output.

Fixes: dec82277 ("riscv: stacktrace: Move register keyword to beginning of declaration")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

23c1075a

Merge tag 'lto-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 1678e493

Linus Torvalds authored Apr 01, 2021

Pull LTO fix from Kees Cook:
 "It seems that there is a bug in ld.bfd when doing module section
  merging.

  As explicit merging is only needed for LTO, the work-around is to only
  do it under LTO, leaving the original section layout choices alone
  under normal builds:

   - Only perform explicit module section merges under LTO (Sean
     Christopherson)"

* tag 'lto-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled

1678e493

01 Apr, 2021 4 commits

kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled · 6a3193cd

Sean Christopherson authored Mar 22, 2021

Merge module sections only when using Clang LTO. With ld.bfd, merging
sections does not appear to update the symbol tables for the module,
e.g. 'readelf -s' shows the value that a symbol would have had, if
sections were not merged. ld.lld does not show this problem.

The stale symbol table breaks gdb's function disassembler, and presumably
other things, e.g.

  gdb -batch -ex "file arch/x86/kvm/kvm.ko" -ex "disassemble kvm_init"

reads the wrong bytes and dumps garbage.

Fixes: dd277622 ("kbuild: lto: merge module sections")
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210322234438.502582-1-seanjc@google.com

6a3193cd

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 6905b1dc

Linus Torvalds authored Apr 01, 2021

Pull kvm fixes from Paolo Bonzini:
 "It's a bit larger than I (and probably you) would like by the time we
  get to -rc6, but perhaps not entirely unexpected since the changes in
  the last merge window were larger than usual.

  x86:
   - Fixes for missing TLB flushes with TDP MMU

   - Fixes for race conditions in nested SVM

   - Fixes for lockdep splat with Xen emulation

   - Fix for kvmclock underflow

   - Fix srcdir != builddir builds

   - Other small cleanups

  ARM:
   - Fix GICv3 MMIO compatibility probing

   - Prevent guests from using the ARMv8.4 self-hosted tracing
     extension"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0)
  KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()
  KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken
  KVM: x86: reduce pvclock_gtod_sync_lock critical sections
  KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit
  KVM: SVM: load control fields from VMCB12 before checking them
  KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  KVM: make: Fix out-of-source module builds
  selftests: kvm: make hardware_disable_test less verbose
  KVM: x86/vPMU: Forbid writing to MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
  KVM: x86: remove unused declaration of kvm_write_tsc()
  KVM: clean up the unused argument
  tools/kvm_stat: Add restart delay
  KVM: arm64: Fix CPU interface MMIO compatibility detection
  KVM: arm64: Disable guest access to trace filter controls
  KVM: arm64: Hide system instruction access to Trace registers

6905b1dc

Merge tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm · a80314c3

Linus Torvalds authored Apr 01, 2021

Pull drm fixes from Dave Airlie:
 "Things have settled down in time for Easter, a random smattering of
  small fixes across a few drivers.

  I'm guessing though there might be some i915 and misc fixes out there
  I haven't gotten yet, but since today is a public holiday here, I'm
  sending this early so I can have the day off, I'll see if more
  requests come in and decide what to do with them later.

  amdgpu:
   - Polaris idle power fix
   - VM fix
   - Vangogh S3 fix
   - Fixes for non-4K page sizes

  amdkfd:
   - dqm fence memory corruption fix

  tegra:
   - lockdep warning fix
   - runtine PM reference fix
   - display controller fix
   - PLL Fix

  imx:
   - memory leak in error path fix
   - LDB driver channel registration fix
   - oob array warning in LDB driver

  exynos
   - unused header file removal"

* tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: check alignment on CPU page for bo map
  drm/amdgpu: Set a suitable dev_info.gart_page_size
  drm/amdgpu/vangogh: don't check for dpm in is_dpm_running when in suspend
  drm/amdkfd: dqm fence memory corruption
  drm/tegra: sor: Grab runtime PM reference across reset
  drm/tegra: dc: Restore coupling of display controllers
  gpu: host1x: Use different lock classes for each client
  drm/tegra: dc: Don't set PLL clock to 0Hz
  drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
  drm/amd/pm: no need to force MCLK to highest when no display connected
  drm/exynos/decon5433: Remove the unused include statements
  drm/imx: imx-ldb: fix out of bounds array access warning
  drm/imx: imx-ldb: Register LDB channel1 when it is the only channel to be used
  drm/imx: fix memory leak when fails to init

a80314c3

Merge tag 'imx-drm-fixes-2021-04-01' of git://git.pengutronix.de/git/pza/linux into drm-fixes · 6fdb8e5a

Dave Airlie authored Apr 02, 2021

drm/imx: imx-drm-core and imx-ldb fixes

Fix a memory leak in an error path during DRM device initialization,
fix the LDB driver to register channel 1 even if channel 0 is unused,
and fix an out of bounds array access warning in the LDB driver.
Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Philipp Zabel <p.zabel@pengutronix.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210401092235.GA13586@pengutronix.de

6fdb8e5a