1. 11 Jun, 2021 1 commit
  2. 10 Jun, 2021 7 commits
    • coredump: Limit what can interrupt coredumps · 06af8679
      Eric W. Biederman authored
      Olivier Langlois has been struggling with coredumps being incompletely written in
      processes using io_uring.
      
      Olivier Langlois <olivier@trillion01.com> writes:
      > io_uring is a big user of task_work and any event that io_uring made a
      > task waiting for that occurs during the core dump generation will
      > generate a TIF_NOTIFY_SIGNAL.
      >
      > Here are the detailed steps of the problem:
      > 1. io_uring calls vfs_poll() to install a task to a file wait queue
      >    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
      > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
      > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
      >    set_notify_signal()
      
      The coredump code deliberately supports being interrupted by SIGKILL,
      and depends upon prepare_signal to filter out all other signals.  Now
      that signal_pending includes wake-ups for TIF_NOTIFY_SIGNAL, this hack
      in dump_emit, used by the coredump code, no longer works.
      
      Make the coredump code more robust by explicitly testing for all of
      the wakeup conditions the coredump code supports.  This prevents
      new wakeup conditions from breaking the coredump code, as well
      as fixing the current issue.
      
      The filesystem code that the coredump code uses already limits
      itself to only aborting on fatal_signal_pending.  So it should
      not develop surprising wake-up reasons either.
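
A minimal user-space sketch of the idea described above, not the kernel code: instead of stopping on any pending wakeup, the dump loop consults a predicate that names each condition it supports. The `task_model` struct and both function names are hypothetical stand-ins for kernel task state and helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of per-task wakeup state; not a kernel structure. */
struct task_model {
	bool fatal_signal;  /* SIGKILL is pending */
	bool notify_signal; /* TIF_NOTIFY_SIGNAL set, e.g. via task_work_add() */
};

/* Broad test: true for any wakeup reason, including task_work notifications. */
bool signal_pending_model(const struct task_model *t)
{
	assert(t != NULL);
	return t->fatal_signal || t->notify_signal;
}

/*
 * Narrow test: only the conditions the dump supports. Any further supported
 * condition would be listed here explicitly, so a new wakeup source cannot
 * silently truncate the dump.
 */
bool dump_interrupted_model(const struct task_model *t)
{
	assert(t != NULL);
	return t->fatal_signal;
}
```

With only `notify_signal` set, the broad test fires while the narrow one does not, which is the case that caused the truncated dumps in the report.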
      
      v2: Don't remove the now unnecessary code in prepare_signal.
      
      Cc: stable@vger.kernel.org
      Fixes: 12db8b69 ("entry: Add support for TIF_NOTIFY_SIGNAL")
      Reported-by: Olivier Langlois <olivier@trillion01.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · f09eacca
      Linus Torvalds authored
      Pull cgroup fix from Tejun Heo:
       "This is a high priority but low risk fix for a cgroup1 bug where
        rename(2) can change a cgroup's name to something which can break
        parsing of /proc/PID/cgroup"
      
      * 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup1: don't allow '\n' in renaming
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 29a877d5
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "A mixture of small bug fixes and a small security issue:
      
         - WARN_ON when IPoIB is automatically moved between namespaces
      
         - Long standing bug where mlx5 would use the wrong page for the
           doorbell recovery memory if fork is used
      
         - Security fix for mlx4 that disables the timestamp feature
      
         - Several crashers for mlx5
      
         - Plug a recent mlx5 memory leak for the sig_mr"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        IB/mlx5: Fix initializing CQ fragments buffer
        RDMA/mlx5: Delete right entry from MR signature database
        RDMA: Verify port when creating flow rule
        RDMA/mlx5: Block FDB rules when not in switchdev mode
        RDMA/mlx4: Do not map the core_clock page to user space unless enabled
        RDMA/mlx5: Use different doorbell memory for different processes
        RDMA/ipoib: Fix warning caused by destroying non-initial netns
    • cgroup1: don't allow '\n' in renaming · b7e24eb1
      Alexander Kuznetsov authored
      cgroup_mkdir() has a restriction on newline usage in names:
      $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2'
      mkdir: cannot create directory
      '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument
      
      But in cgroup1_rename() such a check is missing.
      This allows us to make /proc/<pid>/cgroup unparsable:
      $ mkdir /sys/fs/cgroup/cpu/test
      $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2'
      $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2'
      $ cat /proc/self/cgroup
      11:pids:/
      10:freezer:/
      9:hugetlb:/
      8:cpuset:/
      7:blkio:/user.slice
      6:memory:/user.slice
      5:net_cls,net_prio:/
      4:perf_event:/
      3:devices:/user.slice
      2:cpu,cpuacct:/test
      test2
      1:name=systemd:/
      0::/
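
The missing check can be sketched as a small user-space helper. This is an illustration under assumptions, not the kernel patch: `cgroup_name_check` is a hypothetical name, and the only rule modelled is the newline restriction described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns 0 if the name is acceptable, or -EINVAL if it
 * contains a newline, which would split one /proc/<pid>/cgroup record into
 * two lines.
 */
int cgroup_name_check(const char *name)
{
	assert(name != NULL);
	return strchr(name, '\n') ? -EINVAL : 0;
}
```

Applying the same test on both the mkdir and the rename path keeps the two entry points consistent.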
      Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru>
      Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru>
      Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • IB/mlx5: Fix initializing CQ fragments buffer · 2ba0aa2f
      Alaa Hleihel authored
      The function init_cq_frag_buf() can be called to initialize the current CQ
      fragments buffer cq->buf, or the temporary cq->resize_buf that is filled
      during CQ resize operation.
      
      However, the offending commit started to use the function get_cqe() for
      getting the CQEs.  The issue with this change is that get_cqe() always
      returns CQEs from cq->buf, which leads us to initialize the wrong buffer.
      When enlarging the CQ, we then try to access elements beyond the size
      of the current cq->buf and eventually hit a kernel panic.
      
       [exception RIP: init_cq_frag_buf+103]
        [ffff9f799ddcbcd8] mlx5_ib_resize_cq at ffffffffc0835d60 [mlx5_ib]
        [ffff9f799ddcbdb0] ib_resize_cq at ffffffffc05270df [ib_core]
        [ffff9f799ddcbdc0] llt_rdma_setup_qp at ffffffffc0a6a712 [llt]
        [ffff9f799ddcbe10] llt_rdma_cc_event_action at ffffffffc0a6b411 [llt]
        [ffff9f799ddcbe98] llt_rdma_client_conn_thread at ffffffffc0a6bb75 [llt]
        [ffff9f799ddcbec8] kthread at ffffffffa66c5da1
        [ffff9f799ddcbf50] ret_from_fork_nospec_begin at ffffffffa6d95ddd
      
      Fix it by getting the needed CQE by calling mlx5_frag_buf_get_wqe() that
      takes the correct source buffer as a parameter.
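
A simplified user-space model of the pattern behind the fix, under stated assumptions: the accessor takes the buffer to index as a parameter, so initializing a resize buffer can never reach into the current one. The struct layout, the marker byte and all names ending in `_model` are hypothetical and do not mirror the mlx5 data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat buffer of fixed-size entries. */
struct frag_buf_model {
	unsigned char *data;
	size_t nent;     /* number of entries */
	size_t cqe_size; /* bytes per entry */
};

/* The accessor indexes the buffer it is given and checks the bound. */
unsigned char *buf_get_entry_model(struct frag_buf_model *buf, size_t n)
{
	assert(buf != NULL && n < buf->nent);
	return buf->data + n * buf->cqe_size;
}

/* Initialize every entry of the given buffer with a marker in its last byte. */
void init_frag_buf_model(struct frag_buf_model *buf, unsigned char marker)
{
	for (size_t i = 0; i < buf->nent; i++) {
		unsigned char *cqe = buf_get_entry_model(buf, i);

		cqe[buf->cqe_size - 1] = marker;
	}
}
```

Because the loop bound and the accessor both come from the same `buf`, a larger resize buffer is walked in full and the smaller current buffer is left untouched.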
      
      Fixes: 388ca8be ("IB/mlx5: Implement fragmented completion queue (CQ)")
      Link: https://lore.kernel.org/r/90a0e8c924093cfa50a482880ad7e7edb73dc19a.1623309971.git.leonro@nvidia.com
      Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA/mlx5: Delete right entry from MR signature database · 6466f03f
      Aharon Landau authored
      The value mr->sig is stored in the entry upon mr allocation.  However, ibmr
      is wrongly passed here as "old", so xa_cmpxchg() does not replace
      the entry with NULL, which leads to the following trace:
      
       WARNING: CPU: 28 PID: 2078 at drivers/infiniband/hw/mlx5/main.c:3643 mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib]
       Modules linked in: nvme_rdma nvme_fabrics nvme_core 8021q garp mrp bonding bridge stp llc rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_tad
       CPU: 28 PID: 2078 Comm: reboot Tainted: G               X --------- ---  5.13.0-0.rc2.19.el9.x86_64 #1
       Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.9.1 12/07/2018
       RIP: 0010:mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib]
       Code: 8d bb 70 1f 00 00 be 00 01 00 00 e8 9d 94 ce da 48 3d 00 01 00 00 75 02 5b c3 0f 0b 5b c3 0f 0b 48 83 bb b0 20 00 00 00 74 d5 <0f> 0b eb d1 4
       RSP: 0018:ffffa8db06d33c90 EFLAGS: 00010282
       RAX: 0000000000000000 RBX: ffff97f890a44000 RCX: ffff97f900ec0160
       RDX: 0000000000000000 RSI: 0000000080080001 RDI: ffff97f890a44000
       RBP: ffffffffc0c189b8 R08: 0000000000000001 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000300 R12: ffff97f890a44000
       R13: ffffffffc0c36030 R14: 00000000fee1dead R15: 0000000000000000
       FS:  00007f0d5a8a3b40(0000) GS:ffff98077fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000555acbf4f450 CR3: 00000002a6f56002 CR4: 00000000001706e0
       Call Trace:
        mlx5r_remove+0x39/0x60 [mlx5_ib]
        auxiliary_bus_remove+0x1b/0x30
        __device_release_driver+0x17a/0x230
        device_release_driver+0x24/0x30
        bus_remove_device+0xdb/0x140
        device_del+0x18b/0x3e0
        mlx5_detach_device+0x59/0x90 [mlx5_core]
        mlx5_unload_one+0x22/0x60 [mlx5_core]
        shutdown+0x31/0x3a [mlx5_core]
        pci_device_shutdown+0x34/0x60
        device_shutdown+0x15b/0x1c0
        __do_sys_reboot.cold+0x2f/0x5b
        ? vfs_writev+0xc7/0x140
        ? handle_mm_fault+0xc5/0x290
        ? do_writev+0x6b/0x110
        do_syscall_64+0x40/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
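
The compare-and-exchange mismatch behind this warning can be modelled with a single slot. This is a sketch under assumptions, not the XArray implementation: `slot_cmpxchg_model` is a hypothetical, non-atomic stand-in that only shows why passing the wrong "old" value leaves the entry in place.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replace *slot with new_entry only if it currently holds old.
 * Returns the value found in the slot before the call.
 */
void *slot_cmpxchg_model(void **slot, void *old, void *new_entry)
{
	void *cur;

	assert(slot != NULL);
	cur = *slot;
	if (cur == old)
		*slot = new_entry;
	return cur;
}
```

If the slot was filled with one pointer at allocation time and a different pointer is passed as `old` at teardown, the comparison fails and the stale entry survives until cleanup notices it.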
      
      Fixes: e6fb246c ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
      Link: https://lore.kernel.org/r/f3f585ea0db59c2a78f94f65eedeafc5a2374993.1623309971.git.leonro@nvidia.com
      Signed-off-by: Aharon Landau <aharonl@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • RDMA: Verify port when creating flow rule · 2adcb4c5
      Maor Gottlieb authored
      Validate the port value provided by the user and, with that, remove the
      no longer needed validation by the driver.  The missing check in the
      mlx5_ib driver could cause the oops below.
      
      Call trace:
        _create_flow_rule+0x2d4/0xf28 [mlx5_ib]
        mlx5_ib_create_flow+0x2d0/0x5b0 [mlx5_ib]
        ib_uverbs_ex_create_flow+0x4cc/0x624 [ib_uverbs]
        ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd4/0x150 [ib_uverbs]
        ib_uverbs_cmd_verbs.isra.7+0xb28/0xc50 [ib_uverbs]
        ib_uverbs_ioctl+0x158/0x1d0 [ib_uverbs]
        do_vfs_ioctl+0xd0/0xaf0
        ksys_ioctl+0x84/0xb4
        __arm64_sys_ioctl+0x28/0xc4
        el0_svc_common.constprop.3+0xa4/0x254
        el0_svc_handler+0x84/0xa0
        el0_svc+0x10/0x26c
       Code: b9401260 f9615681 51000400 8b001c20 (f9403c1a)
      
      Fixes: 436f2ad0 ("IB/core: Export ib_create/destroy_flow through uverbs")
      Link: https://lore.kernel.org/r/faad30dc5219a01727f47db3dc2f029d07c82c00.1623309971.git.leonro@nvidia.com
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  3. 09 Jun, 2021 6 commits
  4. 08 Jun, 2021 15 commits
    • kvm: avoid speculation-based attacks from out-of-range memslot accesses · da27a83f
      Paolo Bonzini authored
      KVM's mechanism for accessing guest memory translates a guest physical
      address (gpa) to a host virtual address using the right-shifted gpa
      (also known as gfn) and a struct kvm_memory_slot.  The translation is
      performed in __gfn_to_hva_memslot using the following formula:
      
            hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
      
      It is expected that gfn falls within the boundaries of the guest's
      physical memory.  However, a guest can access invalid physical addresses
      in such a way that the gfn is invalid.
      
      __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
      retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
      does check whether the gfn falls within the boundaries of the guest's
      physical memory, a CPU can speculate the result of the check and
      continue execution speculatively using an illegal gfn. The speculation
      can result in calculating an out-of-bounds hva.  If the resulting host
      virtual address is used to load another guest physical address, this
      is effectively a Spectre gadget consisting of two consecutive reads,
      the second of which is data dependent on the first.
      
      Right now it's not clear if there are any cases in which this is
      exploitable.  One interesting case was reported by the original author
      of this patch, and involves visiting guest page tables on x86.  Right
      now these are not vulnerable because the hva read goes through get_user(),
      which contains an LFENCE speculation barrier.  However, there are
      patches in progress for x86 uaccess.h to mask kernel addresses instead of
      using LFENCE; once these land, a guest could use speculation to read
      from the VMM's ring 3 address space.  Other architectures such as ARM
      already use the address masking method, and would be susceptible to
      this same kind of data-dependent access gadgets.  Therefore, this patch
      proactively protects from these attacks by masking out-of-bounds gfns
      in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
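
The masking idea can be sketched in user-space C. This is a model under assumptions, not the kernel implementation: the `_model` names are hypothetical, the mask relies on arithmetic right shift of negative values (as gcc and clang provide), and unlike the kernel's helpers nothing here prevents the compiler from turning the arithmetic back into a branch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL 4096UL
#define BITS_PER_LONG_MODEL (sizeof(long) * CHAR_BIT)

/* Hypothetical reduced memslot: only the fields used by the formula. */
struct memslot_model {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned long userspace_addr;
};

/*
 * All ones when index < size, zero otherwise, computed without a
 * conditional. Requires size <= LONG_MAX.
 */
unsigned long index_mask_model(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG_MODEL - 1);
}

/* hva = userspace_addr + (gfn - base_gfn) * PAGE_SIZE, with the offset masked. */
unsigned long gfn_to_hva_model(const struct memslot_model *slot,
			       unsigned long gfn)
{
	unsigned long offset;

	assert(slot != NULL && slot->npages > 0);
	offset = gfn - slot->base_gfn;
	/* An out-of-range offset collapses to 0 instead of escaping the slot. */
	offset &= index_mask_model(offset, slot->npages);
	return slot->userspace_addr + offset * PAGE_SIZE_MODEL;
}
```

An in-range gfn yields the usual address, while a gfn below `base_gfn` or past the end of the slot maps to the first page of the slot, so even a mispredicted bounds check cannot produce an address outside it.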
      
      Sean Christopherson noted that this patch does not cover
      kvm_read_guest_offset_cached.  This however is limited to a few bytes
      past the end of the cache, and therefore it is unlikely to be useful in
      the context of building a chain of data dependent accesses.
      Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync · b53e84ee
      Lai Jiangshan authored
      When using shadow paging, unload the guest MMU when emulating a guest TLB
      flush to ensure all roots are synchronized.  From the guest's perspective,
      flushing the TLB ensures any and all modifications to its PTEs will be
      recognized by the CPU.
      
      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
      can be done to more precisely synchronize roots when servicing a guest
      TLB flush.
      
      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
      guest-physical mappings; for NPT, guest mappings are always tagged with
      an ASID and thus can only be invalidated via the VMCB's ASID control.
      
      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
      It was only recently exposed after Linux guests stopped flushing the
      local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eab,
      "x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
      visible in Windows 10 guests.
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: f38a7b75 ("KVM: X86: support paravirtualized help for TLB shootdowns")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      [sean: massaged comment and changelog]
      Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • RDMA/mlx5: Block FDB rules when not in switchdev mode · edc0b0bc
      Mark Bloch authored
      Allow creating FDB steering rules only when in switchdev mode.
      
      The only software model where a userspace application can manipulate
      FDB entries is when it manages the eswitch. This is only possible in
      switchdev mode where we expose a single RDMA device with representors
      for all the vports that are connected to the eswitch.
      
      Fixes: 52438be4 ("RDMA/mlx5: Allow inserting a steering rule to the FDB")
      Link: https://lore.kernel.org/r/e928ae7c58d07f104716a2a8d730963d1bd01204.1623052923.git.leonro@nvidia.com
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
    • KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message · f31500b0
      Sean Christopherson authored
      Use the __string() machinery provided by the tracing subsystem to make a
      copy of the string literals consumed by the "nested VM-Enter failed"
      tracepoint.  A complete copy is necessary to ensure that the tracepoint
      can't outlive the data/memory it consumes and dereference stale memory.
      
      Because the tracepoint itself is defined by kvm, if kvm-intel and/or
      kvm-amd are built as modules, the memory holding the string literals
      defined by the vendor modules will be freed when the module is unloaded,
      whereas the tracepoint and its data in the ring buffer will live until
      kvm is unloaded (or "indefinitely" if kvm is built-in).
      
      This bug has existed since the tracepoint was added, but was recently
      exposed by a new check in tracing to detect exactly this type of bug.
      
        fmt: '%s%s
        ' current_buffer: ' vmx_dirty_log_t-140127  [003] ....  kvm_nested_vmenter_failed: '
        WARNING: CPU: 3 PID: 140134 at kernel/trace/trace.c:3759 trace_check_vprintf+0x3be/0x3e0
        CPU: 3 PID: 140134 Comm: less Not tainted 5.13.0-rc1-ce2e73ce600a-req #184
        Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
        RIP: 0010:trace_check_vprintf+0x3be/0x3e0
        Code: <0f> 0b 44 8b 4c 24 1c e9 a9 fe ff ff c6 44 02 ff 00 49 8b 97 b0 20
        RSP: 0018:ffffa895cc37bcb0 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: ffffa895cc37bd08 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff9766cfad74f8
        RBP: ffffffffc0a041d4 R08: ffff9766cfad74f0 R09: ffffa895cc37bad8
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffc0a041d4
        R13: ffffffffc0f4dba8 R14: 0000000000000000 R15: ffff976409f2c000
        FS:  00007f92fa200740(0000) GS:ffff9766cfac0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000559bd11b0000 CR3: 000000019fbaa002 CR4: 00000000001726e0
        Call Trace:
         trace_event_printf+0x5e/0x80
         trace_raw_output_kvm_nested_vmenter_failed+0x3a/0x60 [kvm]
         print_trace_line+0x1dd/0x4e0
         s_show+0x45/0x150
         seq_read_iter+0x2d5/0x4c0
         seq_read+0x106/0x150
         vfs_read+0x98/0x180
         ksys_read+0x5f/0xe0
         do_syscall_64+0x40/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Fixes: 380e0055 ("KVM: nVMX: trace nested VM-Enter failures detected by H/W")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Message-Id: <20210607175748.674002-1-seanjc@google.com>
    • Merge tag 'for-linus-5.13b-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 368094df
      Linus Torvalds authored
      Pull xen fix from Juergen Gross:
       "A single patch fixing a Xen related security bug: a malicious guest
        might be able to trigger a 'use after free' issue in the xen-netback
        driver"
      
      * tag 'for-linus-5.13b-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-netback: take a reference to the RX task thread
    • selftests: kvm: Add support for customized slot0 memory size · f53b16ad
      Zhenzhong Duan authored
      Until commit 39fe2fc9 ("selftests: kvm: make allocation of extra
      memory take effect", 2021-05-27), parameter extra_mem_pages was used
      only to calculate the page table size for all the memory chunks,
      because real memory allocation happened with calls of
      vm_userspace_mem_region_add() after vm_create_default().
      
      Commit 39fe2fc9 however changed the meaning of extra_mem_pages to
      the size of memory slot 0.  This makes the memory allocation more
      flexible, but makes it harder to account for the number of
      pages needed for the page tables.  For example, memslot_perf_test
      has a small amount of memory in slot 0 but a lot in other slots,
      and adding that memory twice (both in slot 0 and with later
      calls to vm_userspace_mem_region_add()) causes an error that
      was fixed in commit 000ac429 ("selftests: kvm: fix overlapping
      addresses in memslot_perf_test", 2021-05-29).
      
      Since both uses are sensible, add a new parameter slot0_mem_pages
      to vm_create_with_vcpus() and some comments to clarify the meaning of
      slot0_mem_pages and extra_mem_pages.  With this change,
      memslot_perf_test can go back to passing the number of memory
      pages as extra_mem_pages.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Message-Id: <20210608233816.423958-4-zhenzhong.duan@intel.com>
      [Squashed in a single patch and rewrote the commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'orphans-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 374aeb91
      Linus Torvalds authored
      Pull orphan section fixes from Kees Cook:
       "These two corner case fixes have been in -next for about a week:
      
         - Avoid orphan section in ARM cpuidle (Arnd Bergmann)
      
         - Avoid orphan section with !SMP (Nathan Chancellor)"
      
      * tag 'orphans-v5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        vmlinux.lds.h: Avoid orphan section with !SMP
        ARM: cpuidle: Avoid orphan section warning
    • proc: Track /proc/$pid/attr/ opener mm_struct · 591a22c1
      Kees Cook authored
      Commit bfb819ea ("proc: Check /proc/$pid/attr/ writes against file opener")
      tried to make sure that there could not be a confusion between the opener of
      a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
      the privileges didn't change. However, there were existing cases where a more
      privileged thread was passing the opened fd to a differently privileged thread
      (during container setup). Instead, use mm_struct to track whether the opener
      and writer are still the same process. (This is what several other proc files
      already do, though for different reasons.)
      Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reported-by: Andrea Righi <andrea.righi@canonical.com>
      Tested-by: Andrea Righi <andrea.righi@canonical.com>
      Fixes: bfb819ea ("proc: Check /proc/$pid/attr/ writes against file opener")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KVM: selftests: introduce P47V64 for s390x · 1bc603af
      Christian Borntraeger authored
      s390x can have up to 47 bits of guest physical address and 64 bits of
      virtual address.  Add a new address mode to avoid errors in test cases
      that go beyond 47 bits.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Message-Id: <20210608123954.10991-1-borntraeger@de.ibm.com>
      Fixes: ef4c9f4f ("KVM: selftests: Fix 32-bit truncation of vm_get_max_gfn()")
      Cc: stable@vger.kernel.org
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior · af3511ff
      Lai Jiangshan authored
      In record_steal_time(), st->preempted is read twice, and
      trace_kvm_pv_tlb_flush() might output an inconsistent result if
      kvm_vcpu_flush_tlb_guest() sees a different st->preempted later.
      
      This is a trivial problem that hardly causes any actual harm, and it
      can be avoided by resetting and reading st->preempted atomically via
      xchg().
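
A user-space sketch of the read-and-reset pattern, using C11 atomics in place of the kernel's xchg(). The flag value and the names ending in `_model` are assumptions made for illustration; the point is only that one atomic exchange yields a single snapshot that both the trace output and the flush decision can share.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit; the name and value are assumptions for this model. */
#define FLUSH_TLB_FLAG_MODEL 0x2u

/*
 * Atomically fetch the flags and reset them to zero. The snapshot is stored
 * in *seen so the caller traces exactly the value the decision was based on.
 * Returns true when the snapshot requested a TLB flush.
 */
bool consume_preempted_model(atomic_uchar *preempted, unsigned char *seen)
{
	unsigned char old;

	assert(preempted != NULL && seen != NULL);
	old = atomic_exchange(preempted, 0);
	*seen = old;
	return (old & FLUSH_TLB_FLAG_MODEL) != 0;
}
```

Reading the field once and clearing it in the same operation removes the window in which a second read could observe a different value.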
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'spi-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi · 4c8684fe
      Linus Torvalds authored
      Pull spi fixes from Mark Brown:
       "A small set of SPI fixes that have come up since the merge window, all
        fairly small fixes for rare cases"
      
      * tag 'spi-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
        spi: stm32-qspi: Always wait BUSY bit to be cleared in stm32_qspi_wait_cmd()
        spi: spi-zynq-qspi: Fix some wrong goto jumps & missing error code
        spi: Cleanup on failure of initial setup
        spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
    • Merge tag 'regulator-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · 9b1111fa
      Linus Torvalds authored
      
      Pull regulator fixes from Mark Brown:
       "A collection of fixes for the regulator API that have come up since
        the merge window, including a big batch of fixes from Axel Lin's usual
        careful and detailed review.
      
        The one stand out fix here is Dmitry Baryshkov's fix for an issue
        where we fail to power on the parents of always on regulators during
        system startup if they weren't already powered on"
      
      * tag 'regulator-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (21 commits)
        regulator: rt4801: Fix NULL pointer dereference if priv->enable_gpios is NULL
        regulator: hi6421v600: Fix .vsel_mask setting
        regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837
        regulator: atc260x: Fix n_voltages and min_sel for pickable linear ranges
        regulator: rtmv20: Fix to make regcache value first reading back from HW
        regulator: mt6315: Fix function prototype for mt6315_map_mode
        regulator: rtmv20: Add Richtek to Kconfig text
        regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks
        regulator: hisilicon: use the correct HiSilicon copyright
        regulator: bd71828: Fix .n_voltages settings
        regulator: bd70528: Fix off-by-one for buck123 .n_voltages setting
        regulator: max77620: Silence deferred probe error
        regulator: max77620: Use device_set_of_node_from_dev()
        regulator: scmi: Fix off-by-one for linear regulators .n_voltages setting
        regulator: core: resolve supply for boot-on/always-on regulators
        regulator: fixed: Ensure enable_counter is correct if reg_domain_disable fails
        regulator: Check ramp_delay_table for regulator_set_ramp_delay_regmap
        regulator: fan53880: Fix missing n_voltages setting
        regulator: da9121: Return REGULATOR_MODE_INVALID for invalid mode
        regulator: fan53555: fix TCS4525 voltage calulation
        ...
    • KVM: X86: MMU: Use the correct inherited permissions to get shadow page · b1bd5cba
      Lai Jiangshan authored
      When computing the access permissions of a shadow page, use the effective
      permissions of the walk up to that point, i.e. the logical AND of its parents'
      permissions.  Two guest PxE entries that point at the same table gfn need to
      be shadowed with different shadow pages if their parents' permissions are
      different.  KVM currently uses the effective permissions of the last
      non-leaf entry for all non-leaf entries.  Because all non-leaf SPTEs have
      full ("uwx") permissions, and the effective permissions are recorded only
      in role.access and merged into the leaves, this can lead to incorrect
      reuse of a shadow page and eventually to a missing guest protection page
      fault.
      
      For example, here is a shared pagetable:
      
         pgd[]   pud[]        pmd[]            virtual address pointers
                           /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
              /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
         pgd-|           (shared pmd[] as above)
              \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                           \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
      
        pud1 and pud2 point to the same pmd table, so:
        - ptr1 and ptr3 point to the same page.
        - ptr2 and ptr4 point to the same page.
      
      (pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
      
      - First, the guest reads from ptr1 and KVM prepares a shadow
        page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
        "u--" comes from the effective permissions of pgd, pud1 and
        pmd1, which are stored in pt->access.  "u--" is also used to get
        the pagetable for pud1, instead of "uw-".
      
      - Then the guest writes to ptr2 and KVM reuses pud1, which is present.
        The hypervisor sets up a shadow page for ptr2 with pt->access "uw-",
        even though the pud1 pmd (because of the incorrect argument to
        kvm_mmu_get_page in the previous step) has role.access="u--".
      
      - Then the guest reads from ptr3.  The hypervisor reuses pud1's
        shadow pmd for pud2, because both use "u--" for their permissions.
        Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
      
      - At last, the guest writes to ptr4.  This causes no vmexit or pagefault,
        because pud1's shadow page structures included an "uw-" page even though
        its role.access was "u--".
      
      Any kind of shared pagetable might have a similar problem in a
      virtual machine without TDP enabled if the permissions inherited
      from different ancestors differ.
      
      In order to fix the problem, we change pt->access to be an array, and
      any access in it will not include permissions ANDed from child ptes.
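
The per-level bookkeeping can be sketched as follows. This is a simplified model, not KVM's page walker: the permission bits, the function name and the array layout are assumptions, and the pgd entry is assumed to grant full permissions. Each array slot records the AND of the entries from the root down to that level only, so a table reached through different parents gets different effective permissions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative permission bits for this model ("uwx" in the text above). */
#define ACC_EXEC_MODEL  0x1u
#define ACC_WRITE_MODEL 0x2u
#define ACC_USER_MODEL  0x4u

/*
 * entry_access[i] is the permission of the i-th guest entry on the walk,
 * root first. table_access[i] receives the effective permission of the
 * table that entry i points at: the AND of entries 0..i, and nothing from
 * entries further down the walk.
 */
void walk_effective_access_model(const unsigned *entry_access, size_t depth,
				 unsigned *table_access)
{
	unsigned acc = ACC_EXEC_MODEL | ACC_WRITE_MODEL | ACC_USER_MODEL;

	assert(entry_access != NULL && table_access != NULL);
	for (size_t i = 0; i < depth; i++) {
		acc &= entry_access[i];
		table_access[i] = acc;
	}
}
```

In the example above, the shared pmd table is reached through pud1 (uw-) and through pud2 (u--). The two walks produce different values in slot 0, which is what tells the shadow MMU to use two distinct shadow pages for the same guest table.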
      
      The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
      Remember to test it with TDP disabled.
      
      The problem had existed long before the commit 41074d07 ("KVM: MMU:
      Fix inherited permissions for emulated guest pte updates"), and it
      is hard to find which is the culprit.  So there is no fixes tag here.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: cea0f0e7 ("[PATCH] KVM: MMU: Shadow page table caching")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Wanpeng Li's avatar
      KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer · e898da78
      Wanpeng Li authored
      According to the SDM 10.5.4.1:
      
        A write of 0 to the initial-count register effectively stops the local
        APIC timer, in both one-shot and periodic mode.
      
      However, when the lapic timer's one-shot or periodic mode is emulated by
      the vmx-preemption timer, writing 0 to TMICT does not stop it:
      vmx->hv_deadline_tsc is still programmed, so the guest receives a spurious
      timer interrupt later. Fix this by also cancelling the vmx-preemption
      timer when 0 is written to the initial-count register.
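
      The intended behaviour can be illustrated with a small userspace
      model. This is only a sketch: struct model_timer and the model_*
      functions are hypothetical and do not correspond to KVM functions;
      only the field name hv_deadline_tsc is borrowed from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of an emulated APIC timer. */
struct model_timer {
    uint32_t initial_count;   /* last value written to TMICT */
    uint64_t hv_deadline_tsc; /* deadline programmed into the hardware timer */
    bool     hv_armed;        /* true while a hardware deadline is pending */
};

/* Drop any pending hardware deadline. */
static void model_cancel_hv_timer(struct model_timer *t)
{
    assert(t);
    t->hv_armed = false;
    t->hv_deadline_tsc = 0;
}

/* Model a guest write to TMICT: 0 stops the timer, anything else arms it. */
static void model_write_tmict(struct model_timer *t, uint32_t count,
                              uint64_t now_tsc, uint64_t tsc_per_count)
{
    assert(t);
    t->initial_count = count;
    if (count == 0) {
        /* The step the fix adds: also cancel the hardware-backed timer. */
        model_cancel_hv_timer(t);
        return;
    }
    t->hv_deadline_tsc = now_tsc + (uint64_t)count * tsc_per_count;
    t->hv_armed = true;
}
```

      In this model, a non-zero write followed by a write of 0 leaves
      hv_armed false, so no late interrupt can be delivered.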

      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623050385-100988-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e898da78
    • Ashish Kalra's avatar
      KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length... · 4f13d471
      Ashish Kalra authored
      KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca82
      
      Commit 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      allocates the structures used to communicate with the PSP on the local
      stack; they were previously allocated with kzalloc(). This breaks SEV
      live migration when computing the SEND_START session length and the
      SEND_UPDATE_DATA query length: the session_len field (for SEND_START)
      and the trans_len and hdr_len fields (for SEND_UPDATE_DATA) are no
      longer zeroed before the SEV Firmware API call is issued, so the
      firmware returns an incorrect session length and incorrect update-data
      header and trans lengths.
      
      Also, the SEV Firmware API returns the SEV_RET_INVALID_LEN firmware error
      for these length query API calls, and the return value and the
      firmware error need to be passed to userspace as they are, so
      remove the return check in the KVM code.
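
      The length-query pattern can be sketched in userspace C as below.
      Everything here is a hypothetical stand-in: the struct, the MODEL_*
      constants and the mock firmware function are assumptions made for
      illustration and are not the kernel or PSP interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_RET_SUCCESS     0
#define MODEL_RET_INVALID_LEN 1   /* stand-in for SEV_RET_INVALID_LEN */
#define MODEL_REQUIRED_LEN    64u /* arbitrary length used by the mock */

/* Hypothetical command buffer with a single length field. */
struct model_send_start {
    uint32_t session_len;
};

/*
 * Mock firmware: if the supplied length is too small, write back the
 * required length and report an invalid-length error.
 */
static int model_fw_send_start(struct model_send_start *cmd)
{
    assert(cmd);
    if (cmd->session_len < MODEL_REQUIRED_LEN) {
        cmd->session_len = MODEL_REQUIRED_LEN;
        return MODEL_RET_INVALID_LEN;
    }
    return MODEL_RET_SUCCESS;
}

/*
 * Query the required length.  The on-stack command is zeroed first, so
 * the mock firmware never sees leftover stack contents.  The firmware
 * error is handed back to the caller unchanged.
 */
static uint32_t model_query_session_len(int *fw_error)
{
    struct model_send_start cmd;

    assert(fw_error);
    memset(&cmd, 0, sizeof(cmd));
    *fw_error = model_fw_send_start(&cmd);
    return cmd.session_len;
}
```

      Note that the caller sees both the returned length and the
      invalid-length error, which mirrors the point that the error should
      reach userspace rather than be treated as a failure of the query.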

      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <20210607061532.27459-1-Ashish.Kalra@amd.com>
      Fixes: 238eca82 ("KVM: SVM: Allocate SEV command structures on local stack")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f13d471
  5. 07 Jun, 2021 2 commits
  6. 06 Jun, 2021 9 commits
    • Linus Torvalds's avatar
      Linux 5.13-rc5 · 614124be
      Linus Torvalds authored
      614124be
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 90d56a3d
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Five small and fairly minor fixes, all in drivers"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: scsi_devinfo: Add blacklist entry for HPE OPEN-V
        scsi: ufs: ufs-mediatek: Fix HCI version in some platforms
        scsi: qedf: Do not put host in qedf_vport_create() unconditionally
        scsi: lpfc: Fix failure to transmit ABTS on FC link
        scsi: target: core: Fix warning on realtime kernels
      90d56a3d
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 20e41d9b
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous ext4 bug fixes"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: Only advertise encrypted_casefold when encryption and unicode are enabled
        ext4: fix no-key deletion for encrypt+casefold
        ext4: fix memory leak in ext4_fill_super
        ext4: fix fast commit alignment issues
        ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
        ext4: fix accessing uninit percpu counter variable with fast_commit
        ext4: fix memory leak in ext4_mb_init_backend on error path.
      20e41d9b
    • Linus Torvalds's avatar
      Merge tag 'arm-soc-fixes-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · decad3e1
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A set of fixes that have been coming in over the last few weeks, the
        usual mix of fixes:
      
         - DT fixups for TI K3
      
         - SATA drive detection fix for TI DRA7
      
         - Power management fixes and a few build warning removals for OMAP
      
         - OP-TEE fix to use standard API for UUID exporting
      
         - DT fixes for a handful of i.MX boards
      
        And a few other smaller items"
      
      * tag 'arm-soc-fixes-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits)
        arm64: meson: select COMMON_CLK
        soc: amlogic: meson-clk-measure: remove redundant dev_err call in meson_msr_probe()
        ARM: OMAP1: ams-delta: remove unused function ams_delta_camera_power
        bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act
        ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells
        ARM: dts: imx7d-pico: Fix the 'tuning-step' property
        ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property
        arm64: dts: freescale: sl28: var1: fix RGMII clock and voltage
        arm64: dts: freescale: sl28: var4: fix RGMII clock and voltage
        ARM: imx: pm-imx27: Include "common.h"
        arm64: dts: zii-ultra: fix 12V_MAIN voltage
        arm64: dts: zii-ultra: remove second GEN_3V3 regulator instance
        arm64: dts: ls1028a: fix memory node
        bus: ti-sysc: Fix am335x resume hang for usb otg module
        ARM: OMAP2+: Fix build warning when mmc_omap is not built
        ARM: OMAP1: isp1301-omap: Add missing gpiod_add_lookup_table function
        ARM: OMAP1: Fix use of possibly uninitialized irq variable
        optee: use export_uuid() to copy client UUID
        arm64: dts: ti: k3*: Introduce reg definition for interrupt routers
        arm64: dts: ti: k3-am65|j721e|am64: Map the dma / navigator subsystem via explicit ranges
        ...
      decad3e1
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · bd7b12aa
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Fix our KVM reverse map real-mode handling since we enabled huge
        vmalloc (in some configurations).
      
        Revert a recent change to our IOMMU code which broke some devices.
      
        Fix KVM handling of FSCR on P7/P8, which could have possibly let a
        guest crash its Qemu.
      
        Fix kprobes validation of prefixed instructions across page boundary.
      
        Thanks to Alexey Kardashevskiy, Christophe Leroy, Fabiano Rosas,
        Frederic Barrat, Naveen N. Rao, and Nicholas Piggin"
      
      * tag 'powerpc-5.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        Revert "powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs"
        KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path
        powerpc: Fix reverse map real-mode address lookup with huge vmalloc
        powerpc/kprobes: Fix validation of prefixed instructions across page boundary
      bd7b12aa
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 773ac53b
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
       "A bunch of x86/urgent stuff accumulated for the last two weeks so
        lemme unload it to you.
      
        It should be all totally risk-free, of course. :-)
      
         - Fix out-of-spec hardware (1st gen Hygon) which does not implement
           MSR_AMD64_SEV even though the spec clearly states so, and check
           CPUID bits first.
      
         - Send only one signal to a task when it is a SEGV_PKUERR si_code
           type.
      
         - Do away with all the wankery of reserving X amount of memory in the
           first megabyte to prevent BIOS corrupting it and simply and
           unconditionally reserve the whole first megabyte.
      
         - Make alternatives NOP optimization work at an arbitrary position
           within the patched sequence because the compiler can put
           single-byte NOPs for alignment anywhere in the sequence (32-bit
           retpoline), vs our previous assumption that the NOPs are only
           appended.
      
         - Force-disable ENQCMD[S] instructions support and remove
           update_pasid() because of insufficient protection against FPU state
           modification in an interrupt context, among other xstate horrors
           which are being addressed at the moment. This one limits the
           fallout until proper enablement.
      
         - Use cpu_feature_enabled() in the idxd driver so that it can be
           build-time disabled through the defines in disabled-features.h.
      
         - Fix LVT thermal setup for SMI delivery mode by making sure the APIC
           LVT value is read before APIC initialization so that softlockups
           during boot do not happen at least on one machine.
      
         - Mark all legacy interrupts as legacy vectors when the IO-APIC is
           disabled and when all legacy interrupts are routed through the PIC"
      
      * tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sev: Check SME/SEV support in CPUID first
        x86/fault: Don't send SIGSEGV twice on SEGV_PKUERR
        x86/setup: Always reserve the first 1M of RAM
        x86/alternative: Optimize single-byte NOPs at an arbitrary position
        x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
        dmaengine: idxd: Use cpu_feature_enabled()
        x86/thermal: Fix LVT thermal setup for SMI delivery mode
        x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
      773ac53b
    • Daniel Rosenberg's avatar
      ext4: Only advertise encrypted_casefold when encryption and unicode are enabled · e71f99f2
      Daniel Rosenberg authored
      Encrypted casefolding is only supported when encryption and
      casefolding are both enabled in the config.
      
      Fixes: 471fbbea ("ext4: handle casefolding with encryption")
      Cc: stable@vger.kernel.org # 5.13+
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      e71f99f2
    • Daniel Rosenberg's avatar
      ext4: fix no-key deletion for encrypt+casefold · 63e7f128
      Daniel Rosenberg authored
      Commit 471fbbea ("ext4: handle casefolding with encryption") is
      missing a few checks for the encryption key which are needed to
      support deleting encrypted casefolded files when the key is not
      present.
      
      This bug made it impossible to delete encrypted+casefolded directories
      without the encryption key, due to errors like:
      
          W         : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key
      
      Repro steps in kvm-xfstests test appliance:
            mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc
            mount /vdc
            mkdir /vdc/dir
            chattr +F /vdc/dir
            keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}')
            xfs_io -c "set_encpolicy $keyid" /vdc/dir
            for i in `seq 1 100`; do
                mkdir /vdc/dir/$i
            done
            xfs_io -c "rm_enckey $keyid" /vdc
            rm -rf /vdc/dir # fails with the bug
      
      Fixes: 471fbbea ("ext4: handle casefolding with encryption")
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      63e7f128
    • Alexey Makhalov's avatar
      ext4: fix memory leak in ext4_fill_super · afd09b61
      Alexey Makhalov authored
      Buffer head references must be released before calling kill_bdev();
      otherwise the buffer head (and its page referenced by b_data) will not
      be freed by kill_bdev, and subsequently that bh will be leaked.
      
      If blocksizes differ, sb_set_blocksize() will kill current buffers and
      page cache by using kill_bdev(). The super block is then reread
      using the correct blocksize. However, the superblock page and buffer
      head were still busy at that point, so sb_set_blocksize() could not
      free them and they were leaked.
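
      The ordering requirement can be illustrated with a toy reference-count
      model. This is a sketch under stated assumptions: struct model_bh,
      model_brelse() and model_kill_bdev() are hypothetical names that only
      mimic the roles of the buffer head, its release, and kill_bdev().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cached buffer with a reference count. */
struct model_bh {
    int  refcount; /* references held by callers */
    bool freed;    /* set once the cache has dropped the buffer */
};

/* Drop one reference, as a caller does when it is done with the buffer. */
static void model_brelse(struct model_bh *bh)
{
    assert(bh);
    if (bh->refcount > 0)
        bh->refcount--;
}

/*
 * Model of cache teardown: only buffers with no remaining references
 * can be freed; a busy buffer is skipped and stays allocated.
 */
static bool model_kill_bdev(struct model_bh *bh)
{
    assert(bh);
    if (bh->refcount == 0)
        bh->freed = true;
    return bh->freed;
}
```

      In this model, tearing down the cache while a reference is still held
      leaves the buffer allocated; releasing the reference first lets the
      teardown free it.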
      
      This can easily be reproduced by running an infinite loop of:
      
        systemctl start <ext4_on_lvm>.mount, and
        systemctl stop <ext4_on_lvm>.mount
      
      ... since systemd creates a cgroup for each slice which it mounts, and
      the bh leak gets amplified by a dying memory cgroup that also never
      gets freed, so the memory consumption is much more easily noticed.
      
      Fixes: ce40733c ("ext4: Check for return value from sb_set_blocksize")
      Fixes: ac27a0ec ("ext4: initial copy of files from ext3")
      Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
      Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      afd09b61