1. 19 Jan, 2023 19 commits
    • Peter Xu's avatar
      mm/hugetlb: make hugetlb_follow_page_mask() safe to pmd unshare · 7d049f3a
      Peter Xu authored
      Since hugetlb_follow_page_mask() walks the pgtable, it needs the vma lock
      to make sure the pgtable page will not be freed concurrently.
      
      Link: https://lkml.kernel.org/r/20221216155219.2043714-1-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7d049f3a
    • Peter Xu's avatar
      mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare · b8da2e46
      Peter Xu authored
      We can take the hugetlb walker lock, here taking vma lock directly.
      
      Link: https://lkml.kernel.org/r/20221216155217.2043700-1-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b8da2e46
    • Peter Xu's avatar
      mm/hugetlb: move swap entry handling into vma lock when faulted · fcd48540
      Peter Xu authored
      In hugetlb_fault(), there used to have a special path to handle swap entry
      at the entrance using huge_pte_offset().  That's unsafe because
      huge_pte_offset() for a pmd sharable range can access freed pgtables if
      without any lock to protect the pgtable from being freed after pmd
      unshare.
      
      Here the simplest solution to make it safe is to move the swap handling to
      be after the vma lock being held.  We may need to take the fault mutex on
      either migration or hwpoison entries now (also the vma lock, but that's
      really needed), however neither of them is hot path.
      
      Note that the vma lock cannot be released in hugetlb_fault() when the
      migration entry is detected, because in migration_entry_wait_huge() the
      pgtable page will be used again (by taking the pgtable lock), so that also
      need to be protected by the vma lock.  Modify migration_entry_wait_huge()
      so that it must be called with vma read lock held, and properly release
      the lock in __migration_entry_wait_huge().
      
      Link: https://lkml.kernel.org/r/20221216155100.2043537-5-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fcd48540
    • Peter Xu's avatar
      mm/hugetlb: document huge_pte_offset usage · fe7d4c6d
      Peter Xu authored
      huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
      hugetlb address.
      
      Normally, it's always safe to walk a generic pgtable as long as we're with
      the mmap lock held for either read or write, because that guarantees the
      pgtable pages will always be valid during the process.
      
      But it's not true for hugetlbfs, especially shared: hugetlbfs can have its
      pgtable freed by pmd unsharing, it means that even with mmap lock held for
      current mm, the PMD pgtable page can still go away from under us if pmd
      unsharing is possible during the walk.
      
      So we have two ways to make it safe even for a shared mapping:
      
        (1) If we're with the hugetlb vma lock held for either read/write, it's
            okay because pmd unshare cannot happen at all.
      
        (2) If we're with the i_mmap_rwsem lock held for either read/write, it's
            okay because even if pmd unshare can happen, the pgtable page cannot
            be freed from under us.
      
      Document it.
      
      Link: https://lkml.kernel.org/r/20221216155100.2043537-4-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fe7d4c6d
    • Peter Xu's avatar
      mm/hugetlb: don't wait for migration entry during follow page · bb373dce
      Peter Xu authored
      That's what the code does with !hugetlb pages, so we should logically do
      the same for hugetlb, so migration entry will also be treated as no page.
      
      This is probably also the last piece in follow_page code that may sleep,
      the last one should be removed in cf994dd8af27 ("mm/gup: remove
      FOLL_MIGRATION", 2022-11-16).
      
      Link: https://lkml.kernel.org/r/20221216155100.2043537-3-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bb373dce
    • Peter Xu's avatar
      mm/hugetlb: let vma_offset_start() to return start · 243b1f2d
      Peter Xu authored
      Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
      unshare", v4.
      
      Problem
      =======
      
      huge_pte_offset() is a major helper used by hugetlb code paths to walk a
      hugetlb pgtable.  It's used mostly everywhere since that's needed even
      before taking the pgtable lock.
      
      huge_pte_offset() is always called with mmap lock held with either read or
      write.  It was assumed to be safe but it's actually not.  One race
      condition can easily trigger by: (1) firstly trigger pmd share on a memory
      range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
      another thread unshare the pmd range, and the pgtable page is prone to lost
      if the other shared process wants to free it completely (by either munmap
      or exit mm).
      
      The recent work from Mike on vma lock can resolve most of this already.
      It's achieved by forbidden pmd unsharing during the lock being taken, so no
      further risk of the pgtable page being freed.  It means if we can take the
      vma lock around all huge_pte_offset() callers it'll be safe.
      
      There're already a bunch of them that we did as per the latest mm-unstable,
      but also quite a few others that we didn't for various reasons especially
      on huge_pte_offset() usage.
      
      One more thing to mention is that besides the vma lock, i_mmap_rwsem can
      also be used to protect the pgtable page (along with its pgtable lock) from
      being freed from under us.  IOW, huge_pte_offset() callers need to either
      hold the vma lock or i_mmap_rwsem to safely walk the pgtables.
      
      A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
      very hard to trigger, one needs to apply another kernel delay patch too,
      see below):
      
      ======8<=======
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <errno.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <fcntl.h>
        #include <linux/memfd.h>
        #include <assert.h>
        #include <pthread.h>
      
        #define  MSIZE  (1UL << 30)     /* 1GB */
        #define  PSIZE  (2UL << 20)     /* 2MB */
      
        #define  HOLD_SEC  (1)
      
        int pipefd[2];
        void *buf;
      
        void *do_map(int fd)
        {
            unsigned char *tmpbuf, *p;
            int ret;
      
            ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
            if (ret) {
                perror("posix_memalign() failed");
                return NULL;
            }
      
            tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_FIXED, fd, 0);
            if (tmpbuf == MAP_FAILED) {
                perror("mmap() failed");
                return NULL;
            }
            printf("mmap() -> %p\n", tmpbuf);
      
            for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
                *p = 1;
            }
      
            return tmpbuf;
        }
      
        void do_unmap(void *buf)
        {
            munmap(buf, MSIZE);
        }
      
        void proc2(int fd)
        {
            unsigned char c;
      
            buf = do_map(fd);
            if (!buf)
                return;
      
            read(pipefd[0], &c, 1);
            /*
             * This frees the shared pgtable page, causing use-after-free in
             * proc1_thread1 when soft walking hugetlb pgtable.
             */
            do_unmap(buf);
      
            printf("Proc2 quitting\n");
        }
      
        void *proc1_thread1(void *data)
        {
            /*
             * Trigger follow-page on 1st 2m page.  Kernel hack patch needed to
             * withhold this procedure for easier reproduce.
             */
            madvise(buf, PSIZE, MADV_POPULATE_WRITE);
            printf("Proc1-thread1 quitting\n");
            return NULL;
        }
      
        void *proc1_thread2(void *data)
        {
            unsigned char c;
      
            /* Wait a while until proc1_thread1() start to wait */
            sleep(0.5);
            /* Trigger pmd unshare */
            madvise(buf, PSIZE, MADV_DONTNEED);
            /* Kick off proc2 to release the pgtable */
            write(pipefd[1], &c, 1);
      
            printf("Proc1-thread2 quitting\n");
            return NULL;
        }
      
        void proc1(int fd)
        {
            pthread_t tid1, tid2;
            int ret;
      
            buf = do_map(fd);
            if (!buf)
                return;
      
            ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
            assert(ret == 0);
            ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
            assert(ret == 0);
      
            /* Kick the child to share the PUD entry */
            pthread_join(tid1, NULL);
            pthread_join(tid2, NULL);
      
            do_unmap(buf);
        }
      
        int main(void)
        {
            int fd, ret;
      
            fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
            if (fd < 0) {
                perror("open failed");
                return -1;
            }
      
            ret = ftruncate(fd, MSIZE);
            if (ret) {
                perror("ftruncate() failed");
                return -1;
            }
      
            ret = pipe(pipefd);
            if (ret) {
                perror("pipe() failed");
                return -1;
            }
      
            if (fork()) {
                proc1(fd);
            } else {
                proc2(fd);
            }
      
            close(pipefd[0]);
            close(pipefd[1]);
            close(fd);
      
            return 0;
        }
      ======8<=======
      
      The kernel patch needed to present such a race so it'll trigger 100%:
      
      ======8<=======
      : diff --git a/mm/hugetlb.c b/mm/hugetlb.c
      : index 9d97c9a2a15d..f8d99dad5004 100644
      : --- a/mm/hugetlb.c
      : +++ b/mm/hugetlb.c
      : @@ -38,6 +38,7 @@
      :  #include <asm/page.h>
      :  #include <asm/pgalloc.h>
      :  #include <asm/tlb.h>
      : +#include <asm/delay.h>
      : 
      :  #include <linux/io.h>
      :  #include <linux/hugetlb.h>
      : @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
      :                 bool unshare = false;
      :                 int absent;
      :                 struct page *page;
      : +               unsigned long c = 0;
      : 
      :                 /*
      :                  * If we have a pending SIGKILL, don't keep faulting pages and
      : @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
      :                  */
      :                 pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
      :                                       huge_page_size(h));
      : +
      : +               pr_info("%s: withhold 1 sec...\n", __func__);
      : +               for (c = 0; c < 100; c++) {
      : +                       udelay(10000);
      : +               }
      : +               pr_info("%s: withhold 1 sec...done\n", __func__);
      : +
      :                 if (pte)
      :                         ptl = huge_pte_lock(h, mm, pte);
      :                 absent = !pte || huge_pte_none(huge_ptep_get(pte));
      : ======8<=======
      
      It'll trigger use-after-free of the pgtable spinlock:
      
      ======8<=======
      [   16.959907] follow_hugetlb_page: withhold 1 sec...
      [   17.960315] follow_hugetlb_page: withhold 1 sec...done
      [   17.960550] ------------[ cut here ]------------
      [   17.960742] DEBUG_LOCKS_WARN_ON(1)
      [   17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
      [   17.961264] Modules linked in:
      [   17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
      [   17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      [   17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
      [   17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
      [   17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
      [   17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
      [   17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
      [   17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
      [   17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
      [   17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
      [   17.965123] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
      [   17.965443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
      [   17.965956] PKRU: 55555554
      [   17.966068] Call Trace:
      [   17.966172]  <TASK>
      [   17.966268]  ? tick_nohz_tick_stopped+0x12/0x30
      [   17.966455]  lock_acquire+0xbf/0x2b0
      [   17.966603]  ? follow_hugetlb_page.cold+0x75/0x5c4
      [   17.966799]  ? _printk+0x48/0x4e
      [   17.966934]  _raw_spin_lock+0x2f/0x40
      [   17.967087]  ? follow_hugetlb_page.cold+0x75/0x5c4
      [   17.967285]  follow_hugetlb_page.cold+0x75/0x5c4
      [   17.967473]  __get_user_pages+0xbb/0x620
      [   17.967635]  faultin_vma_page_range+0x9a/0x100
      [   17.967817]  madvise_vma_behavior+0x3c0/0xbd0
      [   17.967998]  ? mas_prev+0x11/0x290
      [   17.968141]  ? find_vma_prev+0x5e/0xa0
      [   17.968304]  ? madvise_vma_anon_name+0x70/0x70
      [   17.968486]  madvise_walk_vmas+0xa9/0x120
      [   17.968650]  do_madvise.part.0+0xfa/0x270
      [   17.968813]  __x64_sys_madvise+0x5a/0x70
      [   17.968974]  do_syscall_64+0x37/0x90
      [   17.969123]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [   17.969329] RIP: 0033:0x7f1840f0efdb
      [   17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
      [   17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
      [   17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
      [   17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
      [   17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
      [   17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
      [   17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
      [   17.972083]  </TASK>
      [   17.972199] irq event stamp: 2353
      [   17.972372] hardirqs last  enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
      [   17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
      [   17.973365] softirqs last  enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
      [   17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
      [   17.974341] ---[ end trace 0000000000000000 ]---
      [   17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
      [   17.975012] #PF: supervisor read access in kernel mode
      [   17.975314] #PF: error_code(0x0000) - not-present page
      [   17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
      [   17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [   17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G        W          6.1.0-rc4-peterx+ #46
      [   17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      [   17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
      [   17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
      [   17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
      [   17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
      [   17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
      [   17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
      [   17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
      [   17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
      [   17.984815] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
      [   17.985924] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
      [   17.986674] PKRU: 55555554
      [   17.986832] Call Trace:
      [   17.987012]  <TASK>
      [   17.987266]  ? tick_nohz_tick_stopped+0x12/0x30
      [   17.987770]  lock_acquire+0xbf/0x2b0
      [   17.988118]  ? follow_hugetlb_page.cold+0x75/0x5c4
      [   17.988575]  ? _printk+0x48/0x4e
      [   17.988889]  _raw_spin_lock+0x2f/0x40
      [   17.989243]  ? follow_hugetlb_page.cold+0x75/0x5c4
      [   17.989687]  follow_hugetlb_page.cold+0x75/0x5c4
      [   17.990119]  __get_user_pages+0xbb/0x620
      [   17.990500]  faultin_vma_page_range+0x9a/0x100
      [   17.990928]  madvise_vma_behavior+0x3c0/0xbd0
      [   17.991354]  ? mas_prev+0x11/0x290
      [   17.991678]  ? find_vma_prev+0x5e/0xa0
      [   17.992024]  ? madvise_vma_anon_name+0x70/0x70
      [   17.992421]  madvise_walk_vmas+0xa9/0x120
      [   17.992793]  do_madvise.part.0+0xfa/0x270
      [   17.993166]  __x64_sys_madvise+0x5a/0x70
      [   17.993539]  do_syscall_64+0x37/0x90
      [   17.993879]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      ======8<=======
      
      Resolution
      ==========
      
      This patchset protects all the huge_pte_offset() callers to also take the
      vma lock properly.
      
      Patch Layout
      ============
      
      Patch 1-2:         cleanup, or dependency of the follow up patches
      Patch 3:           before fixing, document huge_pte_offset() on lock required
      Patch 4-8:         each patch resolves one possible race condition
      Patch 9:           introduce hugetlb_walk() to replace huge_pte_offset()
      
      Tests
      =====
      
      The series is verified with the above reproducer so the race cannot
      trigger anymore.  It also passes all hugetlb kselftests.
      
      
      This patch (of 9):
      
      Even though vma_offset_start() is named like that, it's not returning "the
      start address of the range" but rather the offset we should use to offset
      the vma->vm_start address.
      
      Make it return the real value of the start vaddr, and it also helps for
      all the callers because whenever the retval is used, it'll be ultimately
      added into the vma->vm_start anyway, so it's better.
      
      Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      243b1f2d
    • Mike Kravetz's avatar
      hugetlb: update vma flag check for hugetlb vma lock · 379c2e60
      Mike Kravetz authored
      The check for whether a hugetlb vma lock exists partially depends on the
      vma's flags.  Currently, it checks for either VM_MAYSHARE or VM_SHARED. 
      The reason both flags are used is because VM_MAYSHARE was previously
      cleared in hugetlb vmas as they are tore down.  This is no longer the
      case, and only the VM_MAYSHARE check is required.
      
      Link: https://lkml.kernel.org/r/20221212235042.178355-2-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      379c2e60
    • Jeff Xu's avatar
      selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC · 11f75a01
      Jeff Xu authored
      Tests to verify MFD_NOEXEC, MFD_EXEC and vm.memfd_noexec sysctl.
      
      Link: https://lkml.kernel.org/r/20221215001205.51969-6-jeffxu@google.comSigned-off-by: default avatarJeff Xu <jeffxu@google.com>
      Co-developed-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      11f75a01
    • Jeff Xu's avatar
      mm/memfd: add write seals when apply SEAL_EXEC to executable memfd · c4f75bc8
      Jeff Xu authored
      In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to
      an executable memfd, so W^X from start.
      
      This implys application need to fill the content of the memfd first, after
      F_SEAL_EXEC is applied, application can no longer modify the content of
      the memfd.
      
      Typically, application seals the memfd right after writing to it.
      For example:
      1. memfd_create(MFD_EXEC).
      2. write() code to the memfd.
      3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
      4. call exec() on the memfd.
      
      Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.comSigned-off-by: default avatarJeff Xu <jeffxu@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c4f75bc8
    • Jeff Xu's avatar
      mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC · 105ff533
      Jeff Xu authored
      The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
      executable bit at creation time (memfd_create).
      
      When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
      (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
      executable (mode: 0777) after creation.
      
      when MFD_EXEC flag is set, memfd is created with executable bit
      (mode:0777), this is the same as the old behavior of memfd_create.
      
      The new pid namespaced sysctl vm.memfd_noexec has 3 values:
      0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
              MFD_EXEC was set.
      1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
              MFD_NOEXEC_SEAL was set.
      2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
      
      The sysctl allows finer control of memfd_create for old-software that
      doesn't set the executable bit, for example, a container with
      vm.memfd_noexec=1 means the old-software will create non-executable memfd
      by default.  Also, the value of memfd_noexec is passed to child namespace
      at creation time.  For example, if the init namespace has
      vm.memfd_noexec=2, all its children namespaces will be created with 2.
      
      [akpm@linux-foundation.org: add stub functions to fix build]
      [akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
      [akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
      [akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
      Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.comSigned-off-by: default avatarJeff Xu <jeffxu@google.com>
      Co-developed-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      105ff533
    • Daniel Verkamp's avatar
      selftests/memfd: add tests for F_SEAL_EXEC · 32d118ad
      Daniel Verkamp authored
      Basic tests to ensure that user/group/other execute bits cannot be changed
      after applying F_SEAL_EXEC to a memfd.
      
      Link: https://lkml.kernel.org/r/20221215001205.51969-3-jeffxu@google.comSigned-off-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Co-developed-by: default avatarJeff Xu <jeffxu@google.com>
      Signed-off-by: default avatarJeff Xu <jeffxu@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      32d118ad
    • Daniel Verkamp's avatar
      mm/memfd: add F_SEAL_EXEC · 6fd73538
      Daniel Verkamp authored
      Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.
      
      Since Linux introduced the memfd feature, memfd have always had their
      execute bit set, and the memfd_create() syscall doesn't allow setting it
      differently.
      
      However, in a secure by default system, such as ChromeOS, (where all
      executables should come from the rootfs, which is protected by Verified
      boot), this executable nature of memfd opens a door for NoExec bypass and
      enables “confused deputy attack”.  E.g, in VRP bug [1]: cros_vm
      process created a memfd to share the content with an external process,
      however the memfd is overwritten and used for executing arbitrary code and
      root escalation.  [2] lists more VRP in this kind.
      
      On the other hand, executable memfd has its legit use, runc uses memfd’s
      seal and executable feature to copy the contents of the binary then
      execute them, for such system, we need a solution to differentiate runc's
      use of executable memfds and an attacker's [3].
      
      To address those above, this set of patches add following:
      1> Let memfd_create() set X bit at creation time.
      2> Let memfd to be sealed for modifying X bit.
      3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
         X bit.For example, if a container has vm.memfd_noexec=2, then
         memfd_create() without MFD_NOEXEC_SEAL will be rejected.
      4> A new security hook in memfd_create(). This make it possible to a new
         LSM, which rejects or allows executable memfd based on its security policy.
      
      
      This patch (of 5):
      
      The new F_SEAL_EXEC flag will prevent modification of the exec bits:
      written as traditional octal mask, 0111, or as named flags, S_IXUSR |
      S_IXGRP | S_IXOTH.  Any chmod(2) or similar call that attempts to modify
      any of these bits after the seal is applied will fail with errno EPERM.
      
      This will preserve the execute bits as they are at the time of sealing, so
      the memfd will become either permanently executable or permanently
      un-executable.
      
      Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
      Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.comSigned-off-by: default avatarDaniel Verkamp <dverkamp@chromium.org>
      Co-developed-by: default avatarJeff Xu <jeffxu@google.com>
      Signed-off-by: default avatarJeff Xu <jeffxu@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6fd73538
    • Peter Xu's avatar
      mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp() · f1eb1bac
      Peter Xu authored
      This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.
      
      The reasons I still think this patch is worthwhile, are:
      
        (1) It is a cleanup already; diffstat tells.
      
        (2) It just feels natural after I thought about this, if the pte is uffd
            protected, let's remove the write bit no matter what it was.
      
        (2) Since x86 is the only arch that supports uffd-wp, it also redefines
            pte|pmd_mkuffd_wp() in that it should always contain removals of
            write bits.  It means any future arch that want to implement uffd-wp
            should naturally follow this rule too.  It's good to make it a
            default, even if with vm_page_prot changes on VM_UFFD_WP.
      
        (3) It covers more than vm_page_prot.  So no chance of any potential
            future "accident" (like pte_mkdirty() sparc64 or loongarch, even
            though it just got its pte_mkdirty fixed <1 month ago).  It'll be
            fairly clear when reading the code too that we don't worry anything
            before a pte_mkuffd_wp() on uncertainty of the write bit.
      
      We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
      but that should be fully local bitop instruction so the overhead should be
      negligible.
      
      Although this patch should logically also fix all the known issues on
      uffd-wp too recently on page migration (not for numa hint recovery - that
      may need another explcit pte_wrprotect), but this is not the plan for that
      fix.  So no fixes, and stable doesn't need this.
      
      Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ives van Hoorne <ives@codesandbox.io>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f1eb1bac
    • Sidhartha Kumar's avatar
      mm: move folio_set_compound_order() to mm/internal.h · 04a42e72
      Sidhartha Kumar authored
      folio_set_compound_order() is moved to an mm-internal location so external
      folio users cannot misuse this function.  Change the name of the function
      to folio_set_order() and use WARN_ON_ONCE() rather than BUG_ON.  Also,
      handle the case if a non-large folio is passed and add clarifying comments
      to the function.
      
      Link: https://lore.kernel.org/lkml/20221207223731.32784-1-sidhartha.kumar@oracle.com/T/
      Link: https://lkml.kernel.org/r/20221215061757.223440-1-sidhartha.kumar@oracle.com
      Fixes: 9fd33058 ("mm: add folio dtor and order setter functions")
      Signed-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Suggested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Suggested-by: default avatarMatthew Wilcox <willy@infradead.org>
      Suggested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      04a42e72
    • Andrew Morton's avatar
      Pull mm-hotfixes-stable dependencies into mm-stable. · 1301f931
      Andrew Morton authored
      Merge branch 'mm-hotfixes-stable' into mm-stable
      1301f931
    • Peter Xu's avatar
      mm: fix a few rare cases of using swapin error pte marker · 7e3ce3f8
      Peter Xu authored
      This patch should harden commit 15520a3f ("mm: use pte markers for
      swap errors") on using pte markers for swapin errors on a few corner
      cases.
      
      1. Propagate swapin errors across fork()s: if there're swapin errors in
         the parent mm, after fork()s the child should sigbus too when an error
         page is accessed.
      
      2. Fix a rare condition race in pte_marker_clear() where a uffd-wp pte
         marker can be quickly switched to a swapin error.
      
      3. Explicitly ignore swapin error pte markers in change_protection().
      
      I mostly don't worry on (2) or (3) at all, but we should still have them. 
      Case (1) is special because it can potentially cause silent data corrupt
      on child when parent has swapin error triggered with swapoff, but since
      swapin error is rare itself already it's probably not easy to trigger
      either.
      
      Currently there is a priority difference between the uffd-wp bit and the
      swapin error entry, in which the swapin error always has higher priority
      (e.g.  we don't need to wr-protect a swapin error pte marker).
      
      If there will be a 3rd bit introduced, we'll probably need to consider a
      more involved approach so we may need to start operate on the bits.  Let's
      leave that for later.
      
      This patch is tested with case (1) explicitly where we'll get corrupted
      data before in the child if there's existing swapin error pte markers, and
      after patch applied the child can be rightfully killed.
      
      We don't need to copy stable for this one since 15520a3f just landed
      as part of v6.2-rc1, only "Fixes" applied.
      
      Link: https://lkml.kernel.org/r/20221214200453.1772655-3-peterx@redhat.com
      Fixes: 15520a3f ("mm: use pte markers for swap errors")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7e3ce3f8
    • Peter Xu's avatar
      mm/uffd: fix pte marker when fork() without fork event · 49d6d7fb
      Peter Xu authored
      Patch series "mm: Fixes on pte markers".
      
      Patch 1 resolves the syzkiller report from Pengfei.
      
      Patch 2 further harden pte markers when used with the recent swapin error
      markers.  The major case is we should persist a swapin error marker after
      fork(), so child shouldn't read a corrupted page.
      
      
      This patch (of 2):
      
      When fork(), dst_vma is not guaranteed to have VM_UFFD_WP even if src may
      have it and has pte marker installed.  The warning is improper along with
      the comment.  The right thing is to inherit the pte marker when needed, or
      keep the dst pte empty.
      
      A vague guess is this happened by an accident when there's the prior patch
      to introduce src/dst vma into this helper during the uffd-wp feature got
      developed and I probably messed up in the rebase, since if we replace
      dst_vma with src_vma the warning & comment it all makes sense too.
      
      Hugetlb did exactly the right here (copy_hugetlb_page_range()).  Fix the
      general path.
      
      Reproducer:
      
      https://github.com/xupengfe/syzkaller_logs/blob/main/221208_115556_copy_page_range/repro.c
      
      Bugzilla report: https://bugzilla.kernel.org/show_bug.cgi?id=216808
      
      Link: https://lkml.kernel.org/r/20221214200453.1772655-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221214200453.1772655-2-peterx@redhat.com
      Fixes: c56d1b62 ("mm/shmem: handle uffd-wp during fork()")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reported-by: default avatarPengfei Xu <pengfei.xu@intel.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.19+
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      49d6d7fb
    • Andrew Morton's avatar
      Sync with v6.2-rc4 · bd86d2ea
      Andrew Morton authored
      Merge branch 'master' into mm-hotfixes-stable
      bd86d2ea
    • Andrew Morton's avatar
      Sync with v6.2-rc4 · 0e18a6b4
      Andrew Morton authored
      Merge branch 'master' into mm-stable
      0e18a6b4
  2. 15 Jan, 2023 4 commits
  3. 14 Jan, 2023 7 commits
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 7c698440
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Core: Fix an iommu-group refcount leak
      
       - Fix overflow issue in IOVA alloc path
      
       - ARM-SMMU fixes from Will:
          - Fix VFIO regression on NXP SoCs by reporting IOMMU_CAP_CACHE_COHERENCY
          - Fix SMMU shutdown paths to avoid device unregistration race
      
       - Error handling fix for Mediatek IOMMU driver
      
      * tag 'iommu-fixes-v6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
        iommu/iova: Fix alloc iova overflows issue
        iommu: Fix refcount leak in iommu_device_claim_dma_owner
        iommu/arm-smmu-v3: Don't unregister on shutdown
        iommu/arm-smmu: Don't unregister on shutdown
        iommu/arm-smmu: Report IOMMU_CAP_CACHE_COHERENCY even betterer
      7c698440
    • Linus Torvalds's avatar
      Merge tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock · 4f43ade4
      Linus Torvalds authored
      Pull memblock fix from Mike Rapoport:
       "memblock: always release pages to the buddy allocator in
        memblock_free_late()
      
        If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
        only releases pages to the buddy allocator if they are not in the
        deferred range. This is correct for free pages (as defined by
        for_each_free_mem_pfn_range_in_zone()) because free pages in the
        deferred range will be initialized and released as part of the
        deferred init process.
      
        memblock_free_pages() is called by memblock_free_late(), which is used
        to free reserved ranges after memblock_free_all() has run. All pages
        in reserved ranges have been initialized at that point, and
        accordingly, those pages are not touched by the deferred init process.
      
        This means that currently, if the pages that memblock_free_late()
        intends to release are in the deferred range, they will never be
        released to the buddy allocator. They will forever be reserved.
      
        In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
        which is also correct for free pages but is not correct for reserved
        pages. KMSAN metadata for reserved pages is initialized by
        kmsan_init_shadow(), which runs shortly before memblock_free_all().
      
        For both of these reasons, memblock_free_pages() should only be called
        for free pages, and memblock_free_late() should call
        __free_pages_core() directly instead.
      
        One case where this issue can occur in the wild is EFI boot on x86_64.
        The x86 EFI code reserves all EFI boot services memory ranges via
        memblock_reserve() and frees them later via memblock_free_late()
        (efi_reserve_boot_services() and efi_free_boot_services(),
        respectively).
      
        If any of those ranges happens to fall within the deferred init range,
        the pages will not be released and that memory will be unavailable.
      
        For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
      
          v6.2-rc2:
          Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
          Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  178867
      
          v6.2-rc2 + patch:
          Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
          Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  222816   # +43,949 pages"
      
      * tag 'fixes-2023-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
        mm: Always release pages to the buddy allocator in memblock_free_late().
      4f43ade4
    • Linus Torvalds's avatar
      Merge tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 880ca43e
      Linus Torvalds authored
      Pull kernel hardening fixes from Kees Cook:
      
       - Fix CFI hash randomization with KASAN (Sami Tolvanen)
      
       - Check size of coreboot table entry and use flex-array
      
      * tag 'hardening-v6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        kbuild: Fix CFI hash randomization with KASAN
        firmware: coreboot: Check size of table entry and use flex-array
      880ca43e
    • Linus Torvalds's avatar
      Merge tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux · 8b7be52f
      Linus Torvalds authored
      Pull module fix from Luis Chamberlain:
       "Just one fix for modules by Nick"
      
      * tag 'modules-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
        kallsyms: Fix scheduling with interrupts disabled in self-test
      8b7be52f
    • Linus Torvalds's avatar
      Merge tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · b35ad63e
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
      
       - memory leak and double free fix
      
       - two symlink fixes
      
       - minor cleanup fix
      
       - two smb1 fixes
      
      * tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: Fix uninitialized memory read for smb311 posix symlink create
        cifs: fix potential memory leaks in session setup
        cifs: do not query ifaces on smb1 mounts
        cifs: fix double free on failed kerberos auth
        cifs: remove redundant assignment to the variable match
        cifs: fix file info setting in cifs_open_file()
        cifs: fix file info setting in cifs_query_path_info()
      b35ad63e
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 8e768130
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two minor fixes in the hisi_sas driver which only impact enterprise
        style multi-expander and shared disk situations and no core changes"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: hisi_sas: Set a port invalid only if there are no devices attached when refreshing port id
        scsi: hisi_sas: Use abort task set to reset SAS disks when discovered
      8e768130
    • Linus Torvalds's avatar
      Merge tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata · 34cbf89a
      Linus Torvalds authored
      Pull ATA fix from Damien Le Moal:
       "A single fix to prevent building the pata_cs5535 driver with user mode
        linux as it uses msr operations that are not defined with UML"
      
      * tag 'ata-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
        ata: pata_cs5535: Don't build on UML
      34cbf89a
  4. 13 Jan, 2023 10 commits
    • Linus Torvalds's avatar
      Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux · 97ec4d55
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Nothing major in here, just a collection of NVMe fixes and dropping a
        wrong might_sleep() that static checkers tripped over but which isn't
        valid"
      
      * tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux:
        MAINTAINERS: stop nvme matching for nvmem files
        nvme: don't allow unprivileged passthrough on partitions
        nvme: replace the "bool vec" arguments with flags in the ioctl path
        nvme: remove __nvme_ioctl
        nvme-pci: fix error handling in nvme_pci_enable()
        nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers
        nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression
        block: Drop spurious might_sleep() from blk_put_queue()
      97ec4d55
    • Linus Torvalds's avatar
      Merge tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux · 2ce7592d
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "A fix for a regression that happened last week, rest is fixes that
        will be headed to stable as well. In detail:
      
         - Fix for a regression added with the leak fix from last week (me)
      
         - In writing a test case for that leak, inadvertently discovered a
           case where we a poll request can race. So fix that up and mark it
           for stable, and also ensure that fdinfo covers both the poll tables
           that we have. The latter was an oversight when the split poll table
           were added (me)
      
         - Fix for a lockdep reported issue with IOPOLL (Pavel)"
      
      * tag 'io_uring-6.2-2023-01-13' of git://git.kernel.dk/linux:
        io_uring: lock overflowing for IOPOLL
        io_uring/poll: attempt request issue after racy poll wakeup
        io_uring/fdinfo: include locked hash table in fdinfo output
        io_uring/poll: add hash if ready poll request can't complete inline
        io_uring/io-wq: only free worker if it was allocated for creation
      2ce7592d
    • Linus Torvalds's avatar
      Merge tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 9e058c29
      Linus Torvalds authored
      Pull pci fixes from Bjorn Helgaas:
      
       - Work around apparent firmware issue that made Linux reject MMCONFIG
         space, which broke PCI extended config space (Bjorn Helgaas)
      
       - Fix CONFIG_PCIE_BT1 dependency due to mid-air collision between a
         PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and addition of PCIE_BT1 (Lukas
         Bulwahn)
      
      * tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
        x86/pci: Simplify is_mmconf_reserved() messages
        PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN
      9e058c29
    • Sami Tolvanen's avatar
      kbuild: Fix CFI hash randomization with KASAN · 42633ed8
      Sami Tolvanen authored
      Clang emits a asan.module_ctor constructor to each object file
      when KASAN is enabled, and these functions are indirectly called
      in do_ctors. With CONFIG_CFI_CLANG, the compiler also emits a CFI
      type hash before each address-taken global function so they can
      pass indirect call checks.
      
      However, in commit 0c3e806e ("x86/cfi: Add boot time hash
      randomization"), x86 implemented boot time hash randomization,
      which relies on the .cfi_sites section generated by objtool. As
      objtool is run against vmlinux.o instead of individual object
      files with X86_KERNEL_IBT (enabled by default), CFI types in
      object files that are not part of vmlinux.o end up not being
      included in .cfi_sites, and thus won't get randomized and trip
      CFI when called.
      
      Only .vmlinux.export.o and init/version-timestamp.o are linked
      into vmlinux separately from vmlinux.o. As these files don't
      contain any functions, disable KASAN for both of them to avoid
      breaking hash randomization.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1742
      Fixes: 0c3e806e ("x86/cfi: Add boot time hash randomization")
      Signed-off-by: default avatarSami Tolvanen <samitolvanen@google.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20230112224948.1479453-2-samitolvanen@google.com
      42633ed8
    • Kees Cook's avatar
      firmware: coreboot: Check size of table entry and use flex-array · 3b293487
      Kees Cook authored
      The memcpy() of the data following a coreboot_table_entry couldn't
      be evaluated by the compiler under CONFIG_FORTIFY_SOURCE. To make it
      easier to reason about, add an explicit flexible array member to struct
      coreboot_device so the entire entry can be copied at once. Additionally,
      validate the sizes before copying. Avoids this run-time false positive
      warning:
      
        memcpy: detected field-spanning write (size 168) of single field "&device->entry" at drivers/firmware/google/coreboot_table.c:103 (size 8)
      Reported-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Link: https://lore.kernel.org/all/03ae2704-8c30-f9f0-215b-7cdf4ad35a9a@molgen.mpg.de/
      Cc: Jack Rosenthal <jrosenth@chromium.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Julius Werner <jwerner@chromium.org>
      Cc: Brian Norris <briannorris@chromium.org>
      Cc: Stephen Boyd <swboyd@chromium.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarJulius Werner <jwerner@chromium.org>
      Reviewed-by: default avatarGuenter Roeck <groeck@chromium.org>
      Link: https://lore.kernel.org/r/20230107031406.gonna.761-kees@kernel.orgReviewed-by: default avatarStephen Boyd <swboyd@chromium.org>
      Reviewed-by: default avatarJack Rosenthal <jrosenth@chromium.org>
      Link: https://lore.kernel.org/r/20230112230312.give.446-kees@kernel.org
      3b293487
    • Nicholas Piggin's avatar
      kallsyms: Fix scheduling with interrupts disabled in self-test · da35048f
      Nicholas Piggin authored
      kallsyms_on_each* may schedule so must not be called with interrupts
      disabled. The iteration function could disable interrupts, but this
      also changes lookup_symbol() to match the change to the other timing
      code.
      Reported-by: default avatarErhard F. <erhard_f@mailbox.org>
      Link: https://lore.kernel.org/all/bug-216902-206035@https.bugzilla.kernel.org%2F/Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/oe-lkp/202212251728.8d0872ff-oliver.sang@intel.com
      Fixes: 30f3bb09 ("kallsyms: Add self-test facility")
      Tested-by: default avatar"Erhard F." <erhard_f@mailbox.org>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      da35048f
    • Peter Foley's avatar
      ata: pata_cs5535: Don't build on UML · 22eebaa6
      Peter Foley authored
      This driver uses MSR functions that aren't implemented under UML.
      Avoid building it to prevent tripping up allyesconfig.
      
      e.g.
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3a3): undefined reference to `__tracepoint_read_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x3d2): undefined reference to `__tracepoint_write_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x457): undefined reference to `__tracepoint_write_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x481): undefined reference to `do_trace_write_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4d5): undefined reference to `do_trace_write_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x4f5): undefined reference to `do_trace_read_msr'
      /usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: pata_cs5535.c:(.text+0x51c): undefined reference to `do_trace_write_msr'
      Signed-off-by: default avatarPeter Foley <pefoley2@pefoley.com>
      Reviewed-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@opensource.wdc.com>
      22eebaa6
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 92783a90
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "ARM:
      
         - Fix the PMCR_EL0 reset value after the PMU rework
      
         - Correctly handle S2 fault triggered by a S1 page table walk by not
           always classifying it as a write, as this breaks on R/O memslots
      
         - Document why we cannot exit with KVM_EXIT_MMIO when taking a write
           fault from a S1 PTW on a R/O memslot
      
         - Put the Apple M2 on the naughty list for not being able to
           correctly implement the vgic SEIS feature, just like the M1 before
           it
      
         - Reviewer updates: Alex is stepping down, replaced by Zenghui
      
        x86:
      
         - Fix various rare locking issues in Xen emulation and teach lockdep
           to detect them
      
         - Documentation improvements
      
         - Do not return host topology information from KVM_GET_SUPPORTED_CPUID"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock
        KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
        KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()
        KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking
        Documentation: kvm: fix SRCU locking order docs
        KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID
        KVM: nSVM: clarify recalc_intercepts() wrt CR8
        MAINTAINERS: Remove myself as a KVM/arm64 reviewer
        MAINTAINERS: Add Zenghui Yu as a KVM/arm64 reviewer
        KVM: arm64: vgic: Add Apple M2 cpus to the list of broken SEIS implementations
        KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_*
        KVM: arm64: Document the behaviour of S1PTW faults on RO memslots
        KVM: arm64: Fix S1PTW handling on RO memslots
        KVM: arm64: PMU: Fix PMCR_EL0 reset value
      92783a90
    • Mateusz Guzik's avatar
      lockref: stop doing cpu_relax in the cmpxchg loop · f5fe24ef
      Mateusz Guzik authored
      On the x86-64 architecture even a failing cmpxchg grants exclusive
      access to the cacheline, making it preferable to retry the failed op
      immediately instead of stalling with the pause instruction.
      
      To illustrate the impact, below are benchmark results obtained by
      running various will-it-scale tests on top of the 6.2-rc3 kernel and
      Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.
      
      All results in ops/s.  Note there is some variance in re-runs, but the
      code is consistently faster when contention is present.
      
        open3 ("Same file open/close"):
        proc          stock       no-pause
           1         805603         814942       (+%1)
           2        1054980        1054781       (-0%)
           8        1544802        1822858      (+18%)
          24        1191064        2199665      (+84%)
          48         851582        1469860      (+72%)
          96         609481        1427170     (+134%)
      
        fstat2 ("Same file fstat"):
        proc          stock       no-pause
           1        3013872        3047636       (+1%)
           2        4284687        4400421       (+2%)
           8        3257721        5530156      (+69%)
          24        2239819        5466127     (+144%)
          48        1701072        5256609     (+209%)
          96        1269157        6649326     (+423%)
      
      Additionally, a kernel with a private patch to help access() scalability:
      access2 ("Same file access"):
      
        proc          stock        patched      patched
                                               +nopause
          24        2378041        2005501      5370335  (-15% / +125%)
      
      That is, fixing the problems in access itself *reduces* scalability
      after the cacheline ping-pong only happens in lockref with the pause
      instruction.
      
      Note that fstat and access benchmarks are not currently integrated into
      will-it-scale, but interested parties can find them in pull requests to
      said project.
      
      Code at hand has a rather tortured history.  First modification showed
      up in commit d472d9d9 ("lockref: Relax in cmpxchg loop"), written
      with Itanium in mind.  Later it got patched up to use an arch-dependent
      macro to stop doing it on s390 where it caused a significant regression.
      Said macro had undergone revisions and was ultimately eliminated later,
      going back to cpu_relax.
      
      While I intended to only remove cpu_relax for x86-64, I got the
      following comment from Linus:
      
          I would actually prefer just removing it entirely and see if
          somebody else hollers. You have the numbers to prove it hurts on
          real hardware, and I don't think we have any numbers to the
          contrary.
      
          So I think it's better to trust the numbers and remove it as a
          failure, than say "let's just remove it on x86-64 and leave
          everybody else with the potentially broken code"
      
      Additionally, Will Deacon (maintainer of the arm64 port, one of the
      architectures previously benchmarked):
      
          So, from the arm64 side of the fence, I'm perfectly happy just
          removing the cpu_relax() calls from lockref.
      
      As such, come back full circle in history and whack it altogether.
      Signed-off-by: default avatarMateusz Guzik <mjguzik@gmail.com>
      Link: https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/
      Acked-by: Tony Luck <tony.luck@intel.com> # ia64
      Acked-by: Nicholas Piggin <npiggin@gmail.com> # powerpc
      Acked-by: Will Deacon <will@kernel.org> # arm64
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5fe24ef
    • Bjorn Helgaas's avatar
      x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space · fd3a8cff
      Bjorn Helgaas authored
      Normally we reject ECAM space unless it is reported as reserved in the E820
      table or via a PNP0C02 _CRS method (PCI Firmware, r3.3, sec 4.1.2).
      
      07eab090 ("efi/x86: Remove EfiMemoryMappedIO from E820 map"), removes
      E820 entries that correspond to EfiMemoryMappedIO regions because some
      other firmware uses EfiMemoryMappedIO for PCI host bridge windows, and the
      E820 entries prevent Linux from allocating BAR space for hot-added devices.
      
      Some firmware doesn't report ECAM space via PNP0C02 _CRS methods, but does
      mention it as an EfiMemoryMappedIO region via EFI GetMemoryMap(), which is
      normally converted to an E820 entry by a bootloader or EFI stub.  After
      07eab090, that E820 entry is removed, so we reject this ECAM space,
      which makes PCI extended config space (offsets 0x100-0xfff) inaccessible.
      
      The lack of extended config space breaks anything that relies on it,
      including perf, VSEC telemetry, EDAC, QAT, SR-IOV, etc.
      
      Allow use of ECAM for extended config space when the region is covered by
      an EfiMemoryMappedIO region, even if it's not included in E820 or PNP0C02
      _CRS.
      
      Link: https://lore.kernel.org/r/ac2693d8-8ba3-72e0-5b66-b3ae008d539d@linux.intel.com
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216891
      Fixes: 07eab090 ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
      Link: https://lore.kernel.org/r/20230110180243.1590045-3-helgaas@kernel.orgReported-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Reported-by: default avatarTony Luck <tony.luck@intel.com>
      Reported-by: default avatarGiovanni Cabiddu <giovanni.cabiddu@intel.com>
      Reported-by: default avatarYunying Sun <yunying.sun@intel.com>
      Reported-by: default avatarBaowen Zheng <baowen.zheng@corigine.com>
      Reported-by: default avatarZhenzhong Duan <zhenzhong.duan@intel.com>
      Reported-by: default avatarYang Lixiao <lixiao.yang@intel.com>
      Tested-by: default avatarTony Luck <tony.luck@intel.com>
      Tested-by: default avatarGiovanni Cabiddu <giovanni.cabiddu@intel.com>
      Tested-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Tested-by: default avatarYunying Sun <yunying.sun@intel.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reviewed-by: default avatarRafael J. Wysocki <rafael@kernel.org>
      fd3a8cff