1. 30 Sep, 2023 14 commits
    • Crash: add lock to serialize crash hotplug handling · e2a8f20d
      Baoquan He authored
      Eric reported that handling of a crash hotplug event can easily fail
      when many memory hotplug events are notified in a short period.  The
      handling fails because it cannot take __kexec_lock.
      
      =======
      [   78.714569] Fallback order for Node 0: 0
      [   78.714575] Built 1 zonelists, mobility grouping on.  Total pages: 1817886
      [   78.717133] Policy zone: Normal
      [   78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
      [   80.056643] PEFILE: Unsigned PE binary
      =======
      
      Memory hotplug events are notified in large numbers and in quick
      succession, while crash hotplug handling is comparatively slow.  So the
      atomic variable __kexec_lock and kexec_trylock() can't guarantee the
      serialization of crash hotplug handling.
      
      Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
      handling specifically.  This doesn't impact the usage of __kexec_lock.
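The difference between a try-lock and a serializing lock can be sketched in userspace. This is a single-threaded toy model, not the kernel code: every `sk_`-prefixed name is hypothetical, the atomic flag only imitates an atomic try-lock, and the blocking behavior of a real mutex is represented by handlers simply running one after another.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an atomic try-lock such as __kexec_lock. */
static atomic_int sk_kexec_lock;

static bool sk_kexec_trylock(void)
{
    int expected = 0;
    /* Succeeds only if the lock was free (0) and is now taken (1). */
    return atomic_compare_exchange_strong(&sk_kexec_lock, &expected, 1);
}

static void sk_kexec_unlock(void)
{
    atomic_store(&sk_kexec_lock, 0);
}

struct sk_hp_stats {
    int handled;
    int dropped;
};

/*
 * A burst of events arrives while an earlier, slow handler still holds
 * the try-lock: every event in the burst fails the try-lock and is dropped.
 */
static struct sk_hp_stats sk_burst_trylock_only(int nr_events)
{
    struct sk_hp_stats st = { 0, 0 };
    bool held = sk_kexec_trylock();   /* the earlier handler */

    assert(held);
    (void)held;
    for (int i = 0; i < nr_events; i++) {
        if (!sk_kexec_trylock()) {
            st.dropped++;
            continue;
        }
        st.handled++;
        sk_kexec_unlock();
    }
    sk_kexec_unlock();                /* the earlier handler finishes */
    return st;
}

/*
 * With an outer serializing lock, each handler waits for the previous one.
 * In this single-threaded model that waiting is implicit: handlers run
 * back to back, so the inner try-lock is always free when it is attempted.
 */
static struct sk_hp_stats sk_burst_serialized(int nr_events)
{
    struct sk_hp_stats st = { 0, 0 };

    for (int i = 0; i < nr_events; i++) {
        if (!sk_kexec_trylock()) {
            st.dropped++;
            continue;
        }
        st.handled++;
        sk_kexec_unlock();
    }
    return st;
}
```

Calling `sk_burst_trylock_only(5)` drops all five events in this model, while `sk_burst_serialized(5)` handles all five.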
      
      Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
      Fixes: 24726275 ("crash: add generic infrastructure for crash hotplug support")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Tested-by: Eric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: Eric DeVolder <eric.devolder@oracle.com>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error · bbe246f8
      Juntong Deng authored
      
      According to the awk manual, the -e option does not need to be specified
      in front of 'program' (unless you need to mix program-file).
      
      The redundant -e option can cause an error when users use awk
      implementations other than gawk (for example, mawk does not support the
      -e option).
      
      Error Example:
      awk: not an option: -e
      
      Link: https://lkml.kernel.org/r/VI1P193MB075228810591AF2FDD7D42C599C3A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified · 24526268
      Yang Shi authored
      When mbind() is called with MPOL_MF_{MOVE|MOVEALL} | MPOL_MF_STRICT, the
      kernel should attempt to migrate all existing pages and return -EIO if
      there is a misplaced or unmovable page.  Commit 6f4576e3 ("mempolicy:
      apply page table walker on queue_pages_range()") then messed up the
      return value and no longer broke the VMA scan early when MPOL_MF_STRICT
      is specified alone.  The return value problem was fixed by commit
      a7f40cfe ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is
      specified"), but that fix breaks out of the VMA walk early when an
      unmovable page is met, so some pages may not be migrated as expected.
      
      The code should conceptually do:
      
       if (MPOL_MF_MOVE|MOVEALL)
           scan all vmas
           try to migrate the existing pages
           return success
       else if (MPOL_MF_MOVE* | MPOL_MF_STRICT)
           scan all vmas
           try to migrate the existing pages
           return -EIO if unmovable or migration failed
       else /* MPOL_MF_STRICT alone */
           break early if meets unmovable and don't call mbind_range() at all
       else /* none of those flags */
           check the ranges in test_walk, EFAULT without mbind_range() if discontig.
      
      Fixed the behavior.
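The flag handling outlined above can be modeled in userspace as a minimal sketch. It is not the kernel's queue_pages_range(): the `sk_` names and the three-state page model are hypothetical, the flag values are redefined locally (they match `<linux/mempolicy.h>`), and the discontiguous-range EFAULT case is left out. Treating any misplaced page as a violation when MPOL_MF_STRICT is used alone is an assumption of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local copies of the mbind() flag bits, so the sketch builds anywhere. */
#define SK_MPOL_MF_STRICT   (1u << 0)
#define SK_MPOL_MF_MOVE     (1u << 1)
#define SK_MPOL_MF_MOVE_ALL (1u << 2)

/* Hypothetical page model: correctly placed, misplaced but movable,
 * or misplaced and unmovable. */
enum sk_page { SK_PAGE_OK, SK_PAGE_MOVABLE, SK_PAGE_UNMOVABLE };

struct sk_walk {
    int err;          /* 0 or -EIO */
    size_t visited;   /* pages looked at before the walk ended */
    size_t queued;    /* pages queued for migration */
};

static struct sk_walk sk_queue_pages(const enum sk_page *pages, size_t n,
                                     unsigned int flags)
{
    struct sk_walk w = { 0, 0, 0 };
    int move = (flags & (SK_MPOL_MF_MOVE | SK_MPOL_MF_MOVE_ALL)) != 0;
    int strict = (flags & SK_MPOL_MF_STRICT) != 0;
    int failed = 0;

    for (size_t i = 0; i < n; i++) {
        w.visited++;
        if (pages[i] == SK_PAGE_OK)
            continue;
        if (move) {
            /* Keep walking even after a failure so that every
             * movable page still gets queued. */
            if (pages[i] == SK_PAGE_MOVABLE)
                w.queued++;
            else
                failed = 1;
            continue;
        }
        if (strict) {
            /* MPOL_MF_STRICT alone: nothing may move, stop early. */
            w.err = -EIO;
            return w;
        }
        /* No flags: misplaced pages are left where they are. */
    }
    if (move && strict && failed)
        w.err = -EIO;
    return w;
}
```

For the page sequence movable, unmovable, movable, the model queues both movable pages under MOVE | STRICT and still reports -EIO, whereas STRICT alone stops at the first page.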
      
      Link: https://lkml.kernel.org/r/20230920223242.3425775-1-yang@os.amperecomputing.com
      Fixes: a7f40cfe ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions() · 45120b15
      Jinjie Ruan authored
      With CONFIG_DAMON_VADDR_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the memory leak below is detected.

      Since commit 9f86d624 ("mm/damon/vaddr-test: remove unnecessary
      variables"), damon_destroy_ctx() is no longer called, but
      damon_new_target() and damon_new_region() still are.  As a result, the
      damon_region allocated by kmem_cache_alloc() in damon_new_region() and
      the damon_target allocated by kmalloc() in damon_new_target() are not
      freed.  The damon_region allocated by damon_new_region() in
      damon_set_regions() is also not freed.

      So use damon_destroy_target() to free all the damon_regions and the
      damon_target.
      
          unreferenced object 0xffff888107c9a940 (size 64):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              60 c7 9c 07 81 88 ff ff f8 cb 9c 07 81 88 ff ff  `...............
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079cc740 (size 56):
            comm "kunit_try_catch", pid 1069, jiffies 4294670592 (age 732.761s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c82be>] damon_test_apply_three_regions1+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9ac40 (size 64):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              a0 cc 9c 07 81 88 ff ff 78 a1 76 07 81 88 ff ff  ........x.v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff8881079ccc80 (size 56):
            comm "kunit_try_catch", pid 1071, jiffies 4294670595 (age 732.843s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c851e>] damon_test_apply_three_regions2+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888107c9af40 (size 64):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 06 00 00 00 6b 6b 6b 6b  ............kkkk
              20 a2 76 07 81 88 ff ff b8 a6 76 07 81 88 ff ff   .v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a200 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.011s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819c7d91>] damon_do_test_apply_three_regions.constprop.0+0xd1/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776a740 (size 56):
            comm "kunit_try_catch", pid 1073, jiffies 4294670597 (age 733.025s)
            hex dump (first 32 bytes):
              3d 00 00 00 00 00 00 00 3f 00 00 00 00 00 00 00  =.......?.......
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c877e>] damon_test_apply_three_regions3+0x21e/0x260
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff888108038240 (size 64):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              00 00 00 00 00 00 00 00 03 00 00 00 6b 6b 6b 6b  ............kkkk
              48 ad 76 07 81 88 ff ff 98 ae 76 07 81 88 ff ff  H.v.......v.....
            backtrace:
              [<ffffffff817e0167>] kmalloc_trace+0x27/0xa0
              [<ffffffff819c11cf>] damon_new_target+0x3f/0x1b0
              [<ffffffff819c7d55>] damon_do_test_apply_three_regions.constprop.0+0x95/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
          unreferenced object 0xffff88810776ad28 (size 56):
            comm "kunit_try_catch", pid 1075, jiffies 4294670600 (age 733.022s)
            hex dump (first 32 bytes):
              05 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
              6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b  kkkkkkkk....kkkk
            backtrace:
              [<ffffffff819bc492>] damon_new_region+0x22/0x1c0
              [<ffffffff819bfcc2>] damon_set_regions+0x4c2/0x8e0
              [<ffffffff819c7dbb>] damon_do_test_apply_three_regions.constprop.0+0xfb/0x3e0
              [<ffffffff819c898d>] damon_test_apply_three_regions4+0x1cd/0x210
              [<ffffffff829fce6a>] kunit_generic_run_threadfn_adapter+0x4a/0x90
              [<ffffffff81237cf6>] kthread+0x2b6/0x380
              [<ffffffff81097add>] ret_from_fork+0x2d/0x70
              [<ffffffff81003791>] ret_from_fork_asm+0x11/0x20
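The ownership pattern behind this fix can be sketched in plain C. This is a userspace analogy with hypothetical `sk_` names, not the DAMON API: a target owns a list of regions, and one destroy call frees the regions and then the target, so a test that builds a target leaves no allocations behind.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts allocations that have not been freed yet (a crude leak detector). */
static int sk_live_allocs;

struct sk_region {
    unsigned long start, end;
    struct sk_region *next;
};

struct sk_target {
    struct sk_region *regions;   /* singly linked list owned by the target */
};

static void *sk_alloc(size_t size)
{
    void *p = malloc(size);

    if (p)
        sk_live_allocs++;
    return p;
}

static void sk_free(void *p)
{
    if (p) {
        sk_live_allocs--;
        free(p);
    }
}

static struct sk_target *sk_new_target(void)
{
    struct sk_target *t = sk_alloc(sizeof(*t));

    if (t)
        t->regions = NULL;
    return t;
}

static int sk_add_region(struct sk_target *t, unsigned long start,
                         unsigned long end)
{
    struct sk_region *r = sk_alloc(sizeof(*r));

    if (!r)
        return -1;
    r->start = start;
    r->end = end;
    r->next = t->regions;
    t->regions = r;
    return 0;
}

/* Frees every region the target owns, then the target itself. */
static void sk_destroy_target(struct sk_target *t)
{
    struct sk_region *r = t->regions;

    while (r) {
        struct sk_region *next = r->next;

        sk_free(r);
        r = next;
    }
    sk_free(t);
}

/* Shape of a test case: build, (check), clean up. Returns leaked allocations. */
static int sk_run_test_case(void)
{
    struct sk_target *t = sk_new_target();

    if (!t)
        return -1;
    (void)sk_add_region(t, 5, 20);
    (void)sk_add_region(t, 61, 63);
    sk_destroy_target(t);
    return sk_live_allocs;
}
```

In this model `sk_run_test_case()` returns 0; removing the `sk_destroy_target()` call would leave three live allocations, the same kind of leak kmemleak reports above.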
      
      Link: https://lkml.kernel.org/r/20230925072100.3725620-1-ruanjinjie@huawei.com
      Fixes: 9f86d624 ("mm/damon/vaddr-test: remove unnecessary variables")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, memcg: reconsider kmem.limit_in_bytes deprecation · 4597648f
      Michal Hocko authored
      This reverts commit 86327e8e ("memcg: drop kmem.limit_in_bytes") and
      partially reverts 58056f77 ("memcg, kmem: further deprecate
      kmem.limit_in_bytes"), which have incrementally removed support for the
      kernel memory accounting hard limit.  Unfortunately it has turned out that
      there is still userspace depending on the existence of
      memory.kmem.limit_in_bytes [1].  The underlying functionality is not
      really required, but the non-existent file confuses that userspace,
      which fails as a result.  The patch to fix this on the userspace side
      has been submitted but it is hard to predict how it will propagate through
      the maze of 3rd party consumers of the software.
      
      Now, reverting 86327e8e alone is not an option because there is
      another set of userspace which cannot cope with ENOTSUPP returned when
      writing to the file.  Therefore we have to go and revisit 58056f77 as
      well.  There are two ways to go ahead.  Either we give up on the
      deprecation and fully revert 58056f77 as well or we can keep
      kmem.limit_in_bytes but make the write a noop and warn about the fact. 
      This should work for both known breaking workloads which depend on the
      existence but do not depend on the hard limit enforcement.
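The "keep the file, make the write a noop and warn" option can be illustrated with a small userspace sketch. The handler name, the warning text and the warn-once counter are hypothetical and do not reproduce the memcg code; the point is only that the write reports success instead of an error while the value is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of deprecation warnings emitted so far (at most one). */
static int sk_warnings;

/*
 * Hypothetical write handler for a deprecated limit file.
 * The value in `buf` is ignored; the caller is told that all
 * `nbytes` bytes were consumed, as for any successful write.
 */
static long sk_kmem_limit_write(const char *buf, size_t nbytes)
{
    (void)buf;
    if (sk_warnings == 0) {
        fprintf(stderr,
                "kmem.limit_in_bytes is deprecated; the value is ignored\n");
        sk_warnings++;
    }
    return (long)nbytes;
}
```

A caller that only checks for a successful write keeps working, and a caller that only needs the file to exist is unaffected by the ignored value.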
      
      Note to backporters to stable trees.  a8c49af3 ("memcg: add per-memcg
      total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
      so the accounting is not done by obj_cgroup_charge_pages directly for v1
      anymore.  Prior kernels need to add it explicitly (thanks to Johannes for
      pointing this out).
      
      [akpm@linux-foundation.org: fix build - remove unused local]
      Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1]
      Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
      Fixes: 86327e8e ("memcg: drop kmem.limit_in_bytes")
      Fixes: 58056f77 ("memcg, kmem: further deprecate kmem.limit_in_bytes")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix potential memory corruption on duplicate store · ca56489c
      Domenico Cerasuolo authored
      While stress-testing zswap, memory corruption was observed when writing
      back pages.  __frontswap_store used to check for duplicate entries before
      attempting to store a page in zswap, because if the store fails
      the old entry isn't removed from the tree.  This change removes duplicate
      entries in zswap_store before the actual attempt.
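Why the duplicate has to go before the store attempt can be shown with a toy key/value cache. This is a sketch with hypothetical `sk_` names and a fixed-size array in place of the zswap tree; `can_store` stands in for whatever can make a real store fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_SLOTS 8

struct sk_entry {
    bool used;
    unsigned long offset;   /* key: swap offset */
    int data;               /* stand-in for the compressed page */
};

static struct sk_entry sk_tree[SK_SLOTS];

static struct sk_entry *sk_lookup(unsigned long offset)
{
    for (size_t i = 0; i < SK_SLOTS; i++)
        if (sk_tree[i].used && sk_tree[i].offset == offset)
            return &sk_tree[i];
    return NULL;
}

static void sk_invalidate(unsigned long offset)
{
    struct sk_entry *e = sk_lookup(offset);

    if (e)
        e->used = false;
}

/* Stores `data` under `offset`; `can_store` simulates a store failure. */
static bool sk_store(unsigned long offset, int data, bool can_store)
{
    /* Drop any older copy first, so that a failed store can never
     * leave stale data reachable under this offset. */
    sk_invalidate(offset);

    if (!can_store)
        return false;
    for (size_t i = 0; i < SK_SLOTS; i++) {
        if (!sk_tree[i].used) {
            sk_tree[i].used = true;
            sk_tree[i].offset = offset;
            sk_tree[i].data = data;
            return true;
        }
    }
    return false;
}

static bool sk_load(unsigned long offset, int *data)
{
    struct sk_entry *e = sk_lookup(offset);

    if (!e)
        return false;
    *data = e->data;
    return true;
}
```

If `sk_store()` skipped the invalidation, a failed second store for the same offset would leave the first value in place, and a later `sk_load()` would return outdated contents.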
      
      [cerasuolodomenico@gmail.com: add a warning and a comment, per Johannes]
        Link: https://lkml.kernel.org/r/20230925130002.1929369-1-cerasuolodomenico@gmail.com
      Link: https://lkml.kernel.org/r/20230922172211.1704917-1-cerasuolodomenico@gmail.com
      Fixes: 42c06a0e ("mm: kill frontswap")
      Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries · 6f1bace9
      Ryan Roberts authored
      When called with a swap entry that does not embed a PFN (e.g. 
      PTE_MARKER_POISONED or PTE_MARKER_UFFD_WP), the previous implementation of
      set_huge_pte_at() would either cause a BUG() to fire (if CONFIG_DEBUG_VM
      is enabled) or cause a dereference of an invalid address and subsequent
      panic.
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      Arguably, the root cause is really due to commit 18f39629 ("mm:
      hugetlb: kill set_huge_swap_pte_at()"), which aimed to simplify the
      interface to the core code by removing set_huge_swap_pte_at() (which took
      a page size parameter) and replacing it with calls to set_huge_pte_at()
      where the size was inferred from the folio, as described above.  While that
      commit didn't break anything at the time, it did break the interface
      because it couldn't handle swap entries without PFNs.  And since then new
      callers have come along which rely on this working.  But given the
      brokenness is only observable after commit 8a13897f ("mm: userfaultfd:
      support UFFDIO_POISON for hugetlbfs"), that one gets the Fixes tag.
      
      Now that we have modified the set_huge_pte_at() interface to pass the huge
      page size in the previous patch, we can trivially fix this issue.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-3-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb: add huge page size param to set_huge_pte_at() · 935d4f0c
      Ryan Roberts authored
      Patch series "Fix set_huge_pte_at() panic on arm64", v2.
      
      This series fixes a bug in arm64's implementation of set_huge_pte_at(),
      which can result in an unprivileged user causing a kernel panic.  The
      problem was triggered when running the new uffd poison mm selftest for
      HUGETLB memory.  This test (and the uffd poison feature) was merged for
      v6.5-rc7.
      
      Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
      (correctly this time) to get it backported to v6.5, where the issue first
      showed up.
      
      
      Description of Bug
      ==================
      
      arm64's huge pte implementation supports multiple huge page sizes, some of
      which are implemented in the page table with multiple contiguous entries. 
      So set_huge_pte_at() needs to work out how big the logical pte is, so that
      it can also work out how many physical ptes (or pmds) need to be written. 
      It previously did this by grabbing the folio out of the pte and querying
      its size.
      
      However, there are cases when the pte being set is actually a swap entry. 
      But this also used to work fine, because for huge ptes, we only ever saw
      migration entries and hwpoison entries.  And both of these types of swap
      entries have a PFN embedded, so the code would grab that and everything
      still worked out.
      
      But over time, more calls to set_huge_pte_at() have been added that set
      swap entry types that do not embed a PFN.  And this causes the code to go
      bang.  The triggering case is for the uffd poison test, commit
      99aa7721 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
      causes a PTE_MARKER_POISONED swap entry to be set, courtesy of commit
      8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
      added in v6.5-rc7.  Although review shows that there are other call sites
      that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
      on arm64 because arm64 doesn't support UFFD WP.
      
      If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
      it will dereference a bad pointer in page_folio():
      
          static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
          {
              VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
      
              return page_folio(pfn_to_page(swp_offset_pfn(entry)));
          }
      
      
      Fix
      ===
      
      The simplest fix would have been to revert the dodgy cleanup commit
      18f39629 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
      things have moved on, this would have required an audit of all the new
      set_huge_pte_at() call sites to see if they should be converted to
      set_huge_swap_pte_at().  As per the original intent of the change, it
      would also leave us open to future bugs when people invariably get it
      wrong and call the wrong helper.
      
      So instead, I've added a huge page size parameter to set_huge_pte_at(). 
      This means that the arm64 code has the size in all cases.  It's a bigger
      change, due to needing to touch the arches that implement the function,
      but it is entirely mechanical, so in my view, low risk.
      
      I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
      s390, sparc (and additionally x86_64).  I've additionally booted and run
      mm selftests against arm64, where I observe the uffd poison test is fixed,
      and there are no other regressions.
      
      
      This patch (of 2):
      
      In order to fix a bug, arm64 needs to be told the size of the huge page
      for which the pte is being set in set_huge_pte_at().  Provide for this by
      adding an `unsigned long sz` parameter to the function.  This follows the
      same pattern as huge_pte_clear().
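Once the size is passed in, working out how many page table entries to write no longer depends on what the entry contains. The sketch below assumes an arm64 configuration with a 4 KiB granule (16 contiguous PTEs or PMDs per contiguous mapping); the `sk_` names are hypothetical and the function is an illustration, not the arch code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry: arm64 with 4 KiB base pages. */
#define SK_PAGE_SIZE (1UL << 12)   /* 4 KiB */
#define SK_PMD_SIZE  (1UL << 21)   /* 2 MiB */
#define SK_PUD_SIZE  (1UL << 30)   /* 1 GiB */
#define SK_CONT_PTES 16            /* 64 KiB contiguous-PTE mapping */
#define SK_CONT_PMDS 16            /* 32 MiB contiguous-PMD mapping */

/*
 * Given the huge page size, return how many table entries back the
 * logical huge pte and store the size covered by each entry in *pgsize.
 * Returns 0 for a size this model does not know.
 */
static int sk_num_contig(unsigned long sz, unsigned long *pgsize)
{
    *pgsize = sz;
    if (sz == SK_PUD_SIZE || sz == SK_PMD_SIZE)
        return 1;                      /* one block entry */
    if (sz == SK_CONT_PMDS * SK_PMD_SIZE) {
        *pgsize = SK_PMD_SIZE;
        return SK_CONT_PMDS;
    }
    if (sz == SK_CONT_PTES * SK_PAGE_SIZE) {
        *pgsize = SK_PAGE_SIZE;
        return SK_CONT_PTES;
    }
    *pgsize = 0;
    return 0;
}
```

Because the result depends only on `sz`, the same answer is available for present entries, migration entries and marker entries that carry no PFN.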
      
      This commit makes the required interface modifications to the core mm as
      well as all arches that implement this function (arm64, parisc, powerpc,
      riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
      commit.
      
      No behavioral changes intended.
      
      Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
      Fixes: 8a13897f ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>	[powerpc 8xx]
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>	[vmalloc change]
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states · a8091f03
      Liam R. Howlett authored
      When updating the maple tree iterator to avoid rewalks, an issue was
      introduced when shifting beyond the limits.  This can be seen by trying to
      go to the previous address of 0, which would set the maple node to
      MAS_NONE and keep the range as the last entry.
      
      Subsequent calls to mas_find() would then search upwards from mas->last
      and skip the value at mas->index/mas->last.  This showed up as a bug in
      mprotect which skips the actual VMA at the current range after attempting
      to go to the previous VMA from 0.
      
      Since MAS_NONE may already be set when searching for a value that isn't
      contained within a node, changing the handling of MAS_NONE in mas_find()
      would make the code more complicated and error prone.  Furthermore, there
      was no way to tell which limit was hit, and thus which action to take
      (next or the entry at the current range).
      
      The solution is to add two states to track what happened with the
      previous iterator action.  This allows for the expected behaviour of the
      next command to return the correct item (either the item at the range
      requested, or the next/previous).
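The idea of remembering which limit was hit can be sketched with a toy iterator over a sorted array. This is not the maple tree API: the `sk_` names are hypothetical, the array must be non-empty, and handling the overflow direction symmetrically is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NONE (-1)   /* "no entry" marker; stored values are non-negative */

enum sk_state { SK_ACTIVE, SK_UNDERFLOW, SK_OVERFLOW };

struct sk_iter {
    const int *vals;   /* sorted, non-empty */
    size_t n;
    size_t pos;        /* index of the current entry */
    enum sk_state state;
};

/* Step to the previous entry, remembering if the lower limit was hit. */
static int sk_prev(struct sk_iter *it)
{
    if (it->state == SK_OVERFLOW) {
        /* We ran off the top earlier: the current entry is the answer. */
        it->state = SK_ACTIVE;
        return it->vals[it->pos];
    }
    if (it->state == SK_UNDERFLOW || it->pos == 0) {
        it->state = SK_UNDERFLOW;
        return SK_NONE;
    }
    it->pos--;
    return it->vals[it->pos];
}

/* Search forward, taking the recorded limit into account. */
static int sk_find(struct sk_iter *it)
{
    if (it->state == SK_UNDERFLOW) {
        /* We ran off the bottom earlier: return the entry at the
         * current position instead of skipping past it. */
        it->state = SK_ACTIVE;
        return it->vals[it->pos];
    }
    if (it->state == SK_OVERFLOW || it->pos + 1 >= it->n) {
        it->state = SK_OVERFLOW;
        return SK_NONE;
    }
    it->pos++;
    return it->vals[it->pos];
}
```

Without the underflow state, a `sk_prev()` from the first entry followed by `sk_find()` would advance to the second entry and skip the first, which is the shape of the mprotect regression described above.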
      
      Tests are also added and updated accordingly.
      
      Link: https://lkml.kernel.org/r/20230921181236.509072-3-Liam.Howlett@oracle.com
      Link: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Link: https://lore.kernel.org/linux-mm/20230921181236.509072-1-Liam.Howlett@oracle.com/
      Fixes: 39193685 ("maple_tree: try harder to keep active node with mas_prev()")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
      Closes: https://gist.github.com/heatd/85d2971fae1501b55b6ea401fbbe485b
      Closes: https://bugs.archlinux.org/task/79656
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: add mas_is_active() to detect in-tree walks · 5c590804
      Liam R. Howlett authored
      Patch series "maple_tree: Fix mas_prev() state regression".
      
      Pedro Falcato reported an mprotect regression [1] which was bisected back
      to the iterator changes for maple tree.  Root cause analysis showed the
      mas_prev() running off the end of the VMA space (previous from 0) followed
      by mas_find(), would skip the first value.
      
      This patchset introduces maple state underflow/overflow so the sequence of
      calls on the maple state will return what the user expects.
      
      Users who encounter this bug may see mprotect(), userfaultfd_register(),
      and mlock() fail on VMAs mapped with address 0.
      
      
      This patch (of 2):
      
      Instead of constantly checking each possibility of the maple state,
      create a fast path that will skip over checking unlikely states.
      
      Link: https://lkml.kernel.org/r/20230921181236.509072-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20230921181236.509072-2-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5c590804
    • Pan Bian's avatar
      nilfs2: fix potential use after free in nilfs_gccache_submit_read_data() · 7ee29fac
      Pan Bian authored
      In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the
      reference count of bh when the call to nilfs_dat_translate() fails.  If
      the reference count hits 0 and its owner page gets unlocked, bh may be
      freed.  However, bh->b_page is dereferenced to put the page after that,
      which may result in a use-after-free bug.  This patch moves the release
      operation after unlocking and putting the page.
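
      The ordering can be sketched in userspace as follows, assuming toy
      stand-ins for the buffer head and its page.  toy_buf, toy_page,
      toy_buf_put() and toy_fail_path() are hypothetical names, not nilfs2 or
      buffer-cache functions; the only point is that every access through the
      buffer happens before its reference is dropped.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-ins for a page and a buffer head that points at it. */
struct toy_page {
	bool locked;
	int refs;
};

struct toy_buf {
	int refs;
	struct toy_page *page;
};

/* Drop one reference; the buffer is freed when the count reaches zero. */
static void toy_buf_put(struct toy_buf *b)
{
	if (--b->refs == 0)
		free(b);
}

/* Failure path in the fixed order: finish with b->page first, then
 * release b.  Calling toy_buf_put() first could free b and turn the
 * two b->page accesses into a use after free. */
static int toy_fail_path(struct toy_buf *b)
{
	b->page->locked = false;   /* unlock the page */
	b->page->refs--;           /* put the page */
	toy_buf_put(b);            /* only now release the buffer */
	return -1;
}
```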
      
      NOTE: The function in question is only called in GC, and in combination
      with current userland tools, address translation using DAT does not occur
      in that function, so the code path that causes this issue will not be
      executed.  However, it is possible to run that code path by intentionally
      modifying the userland GC library or by calling the GC ioctl directly.
      
      [konishi.ryusuke@gmail.com: NOTE added to the commit log]
      Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com
      Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com
      Fixes: a3d93f70 ("nilfs2: block cache for garbage collection")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reported-by: Ferry Meng <mengferry@linux.alibaba.com>
      Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7ee29fac
    • Matthew Wilcox (Oracle)'s avatar
      mm: abstract moving to the next PFN · ce60f27b
      Matthew Wilcox (Oracle) authored
      In order to fix the L1TF vulnerability, x86 can invert the PTE bits for
      PROT_NONE VMAs, which means we cannot move from one PTE to the next by
      adding 1 to the PFN field of the PTE.  This results in the BUG reported at
      [1].
      
      Abstract advancing the PTE to the next PFN through a pte_next_pfn()
      function/macro.
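
      A rough illustration of why adding 1 to the raw PFN field is wrong when
      the field is stored inverted.  The entry layout below is a toy
      assumption (PFN in bits 12..51, bit 0 marking an inverted entry) and
      does not reproduce the real x86 PTE format; toy_pte_next_pfn() is a
      hypothetical analogue of the helper, not its kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy layout: PFN lives in bits 12..51; bit 0 marks an inverted PFN field. */
#define TOY_PFN_SHIFT 12
#define TOY_PFN_MASK  (((UINT64_C(1) << 40) - 1) << TOY_PFN_SHIFT)
#define TOY_INVERTED  UINT64_C(1)

typedef uint64_t toy_pte_t;

/* Encode a PFN, inverting the field when requested. */
static toy_pte_t toy_mk_pte(uint64_t pfn, bool inverted)
{
	uint64_t field = (pfn << TOY_PFN_SHIFT) & TOY_PFN_MASK;

	if (inverted)
		return (~field & TOY_PFN_MASK) | TOY_INVERTED;
	return field;
}

/* Decode the PFN, undoing the inversion if the entry carries it. */
static uint64_t toy_pte_pfn(toy_pte_t pte)
{
	uint64_t field = pte & TOY_PFN_MASK;

	if (pte & TOY_INVERTED)
		field = ~field & TOY_PFN_MASK;
	return field >> TOY_PFN_SHIFT;
}

/* Advance by decoding, incrementing, and re-encoding. */
static toy_pte_t toy_pte_next_pfn(toy_pte_t pte)
{
	return toy_mk_pte(toy_pte_pfn(pte) + 1, (pte & TOY_INVERTED) != 0);
}

/* The naive approach: bump the raw field by one PFN unit. */
static toy_pte_t toy_pte_naive_next(toy_pte_t pte)
{
	return pte + (UINT64_C(1) << TOY_PFN_SHIFT);
}
```

      For a non-inverted entry both helpers agree.  For an inverted entry the
      naive version moves the decoded PFN backwards, because incrementing the
      complement decrements the value it encodes.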
      
      Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org
      Fixes: bcc6cc83 ("mm: add default definition of set_ptes()")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com
      Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1]
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ce60f27b
    • Matthew Wilcox (Oracle)'s avatar
      mm: report success more often from filemap_map_folio_range() · a501a070
      Matthew Wilcox (Oracle) authored
      Even though we had successfully mapped the relevant page, we would rarely
      return success from filemap_map_folio_range().  That leads to falling back
      from the VMA lock path to the mmap_lock path, which is a speed &
      scalability issue.  Found by inspection.
      
      Link: https://lkml.kernel.org/r/20230920035336.854212-1-willy@infradead.org
      Fixes: 617c28ec ("filemap: batch PTE mappings")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a501a070
    • Greg Ungerer's avatar
      fs: binfmt_elf_efpic: fix personality for ELF-FDPIC · 7c315158
      Greg Ungerer authored
      The elf-fdpic loader hard sets the process personality to either
      PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF
      binaries (in this case they would be constant displacement compiled with
      -pie for example).  The problem with that is that it will lose any other
      bits that may be in the ELF header personality (such as the "bug
      emulation" bits).
      
      On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a
      normal 32bit binary - as opposed to a legacy 26bit address binary.  This
      matters since start_thread() will set the ARM CPSR register as required
      based on this flag.  If the elf-fdpic loader loses this bit the process
      will be mis-configured and crash out pretty quickly.
      
      Modify elf-fdpic loader personality setting so that it preserves the upper
      three bytes by using the SET_PERSONALITY macro to set it.  This macro in
      the generic case sets PER_LINUX and preserves the upper bytes. 
      Architectures can override this for their specific use case, and ARM does
      exactly this.
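
      The difference between the two approaches can be sketched in userspace
      like this.  The TOY_ constants are assumed to mirror the values in the
      kernel's uapi personality header, and toy_hard_set() and
      toy_preserving_set() are hypothetical helpers written for this
      illustration; neither is the loader's actual code.

```c
#include <stdbool.h>

/* Assumed to mirror the uapi personality values; treat as illustrative. */
#define TOY_PER_MASK          0x00ffu      /* low byte: execution domain */
#define TOY_PER_LINUX         0x0000u
#define TOY_FDPIC_FUNCPTRS    0x0080000u
#define TOY_ADDR_LIMIT_32BIT  0x0800000u
#define TOY_PER_LINUX_FDPIC   (TOY_PER_LINUX | TOY_FDPIC_FUNCPTRS)

/* Old behaviour: overwrite the whole personality, dropping every flag. */
static unsigned int toy_hard_set(bool fdpic)
{
	return fdpic ? TOY_PER_LINUX_FDPIC : TOY_PER_LINUX;
}

/* New behaviour: replace only the low byte, keep the upper three bytes,
 * then add the FDPIC flag for true elf-fdpic binaries. */
static unsigned int toy_preserving_set(unsigned int old, bool fdpic)
{
	unsigned int p = TOY_PER_LINUX | (old & ~TOY_PER_MASK);

	if (fdpic)
		p |= TOY_PER_LINUX_FDPIC;
	return p;
}
```

      Starting from a personality that has the 32-bit address limit flag set,
      the first helper returns a value without it, while the second keeps it.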
      
      The problem shows up quite easily running under qemu using the ARM
      architecture, but not necessarily on all types of real ARM hardware.  If
      the underlying ARM processor does not support the legacy 26-bit addressing
      mode then everything will work as expected.
      
      Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org
      Fixes: 1bde925d ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries")
      Signed-off-by: Greg Ungerer <gerg@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7c315158
  2. 19 Sep, 2023 13 commits
  3. 17 Sep, 2023 11 commits
  4. 16 Sep, 2023 2 commits