    stack: Optionally randomize kernel stack offset each syscall · 39218ff4
    Kees Cook authored
    This provides the ability for architectures to enable kernel stack base
    address offset randomization. This feature is controlled by the boot
    param "randomize_kstack_offset=on/off", with its default value set by
    CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
    
    This feature is based on the original idea from the last public release
    of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
    All the credit for the original idea goes to the PaX team. Note that
    the design and implementation of this upstream randomize_kstack_offset
    feature differs greatly from the RANDKSTACK feature (see below).
    
    Reasoning for the feature:
    
    This feature aims to hamper the various stack-based attacks that
    rely on deterministic stack structure. We have had many such attacks in
    the past (to name just a few):
    
    https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
    https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
    https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
    
    As Linux kernel stack protections have been constantly improving
    (vmap-based stack allocation with guard pages, removal of thread_info,
    STACKLEAK), attackers have had to find new ways for their exploits
    to work. They have done so, continuing to rely on the kernel's stack
    determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
    were not relevant. For example, the following recent attacks would have
    been hampered if the stack offset was non-deterministic between syscalls:
    
    https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
    (page 70: targeting the pt_regs copy with linear stack overflow)
    
    https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
    (leaked stack address from one syscall as a target during next syscall)
    
    The main idea is that since the stack offset is randomized on each system
    call, it is harder for an attack to reliably land in any particular place
    on the thread stack, even with address exposures, as the stack base will
    change on the next syscall. Also, since randomization is performed after
    placing pt_regs, the ptrace-based approach[1] to discover the randomized
    offset during a long-running syscall should not be possible.
    
    Design description:
    
    During most of the kernel's execution, it runs on the "thread stack",
    which is pretty deterministic in its structure: it is fixed in size,
    and on every entry from userspace to kernel on a syscall the thread
    stack starts construction from an address fetched from the per-cpu
    cpu_current_top_of_stack variable. The first element to be pushed to the
    thread stack is the pt_regs struct that stores all required CPU registers
    and syscall parameters. Finally the specific syscall function is called,
    with the stack being used as the kernel executes the resulting request.
    
    The goal of the randomize_kstack_offset feature is to add a random offset
    after the pt_regs has been pushed to the stack and before the rest of the
    thread stack is used during the syscall processing, and to change it every
    time a process issues a syscall. The source of randomness is currently
    architecture-defined (x86 uses the low byte of rdtsc()). Future
    improvements using different entropy sources are possible, but out of scope
    for this patch. Furthermore, to add more unpredictability, new offsets
    are chosen at the end of syscalls (the timing of which should be less
    easy to measure from userspace than at syscall entry time), and stored
    in a per-CPU variable, so that the life of the value does not stay
    explicitly tied to a single task.
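
    The entry/exit split described above can be illustrated with a small
    userspace model. This is only a sketch: the array standing in for the
    per-CPU variable, the function names, the fixed CPU count, and the XOR
    mixing step are all assumptions made for illustration, not code taken
    from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS     4       /* hypothetical CPU count for the model */
#define MODEL_OFFSET_MASK 0x3FFu  /* 10-bit cap, as in the generated assembly */

/* Stand-in for the per-CPU kstack_offset variable: one slot per CPU. */
static uint32_t model_kstack_offset[MODEL_NR_CPUS];

/* Syscall exit: fold fresh entropy into the slot of the current CPU. */
void model_choose_offset(unsigned cpu, uint32_t rand)
{
    assert(cpu < MODEL_NR_CPUS);
    model_kstack_offset[cpu] ^= rand;  /* XOR mixing is an assumption */
}

/* Syscall entry: consume whatever offset was left behind on this CPU. */
uint32_t model_entry_offset(unsigned cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    return model_kstack_offset[cpu] & MODEL_OFFSET_MASK;
}
```

    In this model, whichever task next enters the kernel on a given CPU
    consumes the stored value, which is one way to read "not tied to a
    single task".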
    
    As suggested by Andy Lutomirski, the offset is added using alloca()
    and an empty asm() statement with an output constraint, since this avoids
    changes to the assembly syscall entry code and to the unwinder, and
    provides correct stack alignment as defined by the compiler.
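
    A userspace sketch of the alloca() approach follows. It assumes GCC or
    Clang (for __builtin_alloca, __builtin_frame_address and the asm
    syntax) and a downward-growing stack. The function names are
    hypothetical, and the kernel macro itself is not reproduced here.

```c
#include <stdint.h>

/* Cap the offset at 10 bits, as the 0x3ff mask in the generated code does. */
#define SKETCH_OFFSET_MAX(x) ((x) & 0x3FFu)

/* Stand-in for the "syscall body": report where its frame ended up. */
__attribute__((noinline)) static uintptr_t sketch_callee_frame(void)
{
    /* Compiler barrier so the call is not treated as side-effect free. */
    __asm__ volatile("" : : : "memory");
    return (uintptr_t)__builtin_frame_address(0);
}

/*
 * Consume a variable amount of stack, then call into the callee.
 * noinline keeps the alloca() confined to this frame, so the space
 * is released again on return.
 */
__attribute__((noinline)) uintptr_t sketch_frame_with_offset(uint32_t offset)
{
    uint8_t *ptr = __builtin_alloca(SKETCH_OFFSET_MAX(offset));

    /*
     * Empty asm with an output constraint: the compiler must assume
     * *ptr is written here, so it cannot drop the unused allocation.
     */
    __asm__ volatile("" : "=m"(*ptr) : : "memory");

    /* Store through a volatile so the call cannot become a tail call. */
    volatile uintptr_t frame = sketch_callee_frame();
    return frame;
}
```

    Calling sketch_frame_with_offset() with a larger offset should place
    the callee's frame at a lower address, which is the effect the patch
    relies on for everything that runs after the allocation.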
    
    In order to make this available by default with zero performance impact
    for those that don't want it, it is boot-time selectable with static
    branches, so it can simply be left turned off when the overhead is
    not wanted.
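
    As a rough illustration of the boot-time selection, the sketch below
    resolves the effective setting from a command line string. The helper
    name and the SKETCH_DEFAULT_ON macro (a stand-in for
    CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT) are hypothetical, only the
    "on" and "off" values are handled, and the static branch patching
    itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. */
#ifndef SKETCH_DEFAULT_ON
#define SKETCH_DEFAULT_ON false
#endif

/*
 * Resolve the effective setting from a kernel-style command line.
 * Anything other than "on" or "off" (or a missing parameter) falls
 * back to the build-time default.
 */
bool sketch_kstack_randomization_enabled(const char *cmdline)
{
    static const char key[] = "randomize_kstack_offset=";
    const char *p;

    assert(cmdline != NULL);
    p = strstr(cmdline, key);
    if (!p)
        return SKETCH_DEFAULT_ON;

    p += sizeof(key) - 1;  /* skip past "randomize_kstack_offset=" */
    if (strncmp(p, "on", 2) == 0 && (p[2] == ' ' || p[2] == '\0'))
        return true;
    if (strncmp(p, "off", 3) == 0 && (p[3] == ' ' || p[3] == '\0'))
        return false;
    return SKETCH_DEFAULT_ON;
}
```

    The result would then be used once at boot to flip the static branch,
    so the per-syscall path carries no runtime check of the string.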
    
    The generated assembly for x86_64 with GCC looks like this:
    
    ...
    ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
    					    # 12380 <kstack_offset>
    ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
    ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
    ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
    ffffffff8100398c: 48 29 c4              sub %rax,%rsp
    ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
    ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
    ...
    
    As a result of the above stack alignment, this patch introduces about
    5 bits of randomness after pt_regs is spilled to the thread stack on
    x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
    stack alignment). The amount of entropy could be adjusted based on how
    much of the stack space we wish to trade for security.
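
    The bit counts can be sanity-checked by replaying the mask arithmetic
    from the listing. The sketch below assumes the only varying input is a
    single low byte (as the x86 rdtsc() description suggests) and that a
    32-bit build would round to 4 bytes using a 0x7fc mask. That second
    mask is an assumption, since no 32-bit listing is given here. The
    function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stack adjustment as in the listing: cap at 10 bits, add 0xf, then mask. */
static uint32_t sketch_stack_adjustment(uint32_t offset, uint32_t align_mask)
{
    return ((offset & 0x3FFu) + 0xFu) & align_mask;
}

/* Count distinct adjustments over every value of an 8-bit entropy input. */
unsigned sketch_count_adjustments(uint32_t align_mask)
{
    uint8_t seen[2048] = {0};
    unsigned distinct = 0;

    for (uint32_t off = 0; off < 256; off++) {
        uint32_t adj = sketch_stack_adjustment(off, align_mask);

        assert(adj < sizeof(seen));
        if (!seen[adj]) {
            seen[adj] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

    Under these assumptions the model yields 33 distinct adjustments with
    the 0x7f8 mask and 65 with 0x7fc, in line with "about 5" and "about 6"
    bits respectively.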
    
    My measure of syscall performance overhead (on x86_64):
    
    lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
        randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
        randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
    
    So, roughly 0.9% overhead growth for a no-op syscall
    ((0.7082 - 0.7016) / 0.7016), which is very manageable. And for
    people that don't want this, it's off by default.
    
    There are two gotchas with using the alloca() trick. First,
    compilers that have Stack Clash protection (-fstack-clash-protection)
    enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
    any dynamic stack allocations. While the randomization offset is
    always less than a page, the resulting assembly would still contain
    (unreachable!) probing routines, bloating the resulting assembly. To
    avoid this, -fno-stack-clash-protection is unconditionally added to
    the kernel Makefile since this is the only dynamic stack allocation in
    the kernel (now that VLAs have been removed) and it is provably safe
    from Stack Clash style attacks.
    
    The second gotcha with alloca() is a negative interaction with
    -fstack-protector*, in that it sees the alloca() as an array allocation,
    which triggers the unconditional addition of the stack canary function
    pre/post-amble which slows down syscalls regardless of the static
    branch. In order to avoid adding this unneeded check and its associated
    performance impact, architectures need to carefully remove uses of
    -fstack-protector-strong (or -fstack-protector) in the compilation units
    that use the add_random_kstack() macro and to audit the resulting stack
    mitigation coverage (to make sure no desired coverage disappears). No
    change is visible for this on x86 because the stack protector is already
    unconditionally disabled for the compilation unit, but the change is
    required on arm64. There is, unfortunately, no attribute that can be
    used to disable stack protector for specific functions.
    
    Comparison to PaX RANDKSTACK feature:
    
    The RANDKSTACK feature randomizes the location of the stack start
    (cpu_current_top_of_stack), i.e. including the location of pt_regs
    structure itself on the stack. Initially this patch followed the same
    approach, but during the recent discussions[2], it was determined
    to be of little value since, if ptrace functionality is available for
    an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
    different offsets in the pt_regs struct, observe the cache behavior of
    the pt_regs accesses, and figure out the random stack offset. Another
    difference is that the random offset is stored in a per-cpu variable,
    rather than having it be per-thread. As a result, these implementations
    differ a fair bit in their implementation details and results, though
    obviously the intent is similar.
    
    [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
    [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
    [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
    
    Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
    Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org