1. 16 Apr, 2023 12 commits
    • Andrew Morton's avatar
    • Peter Xu's avatar
      Revert "userfaultfd: don't fail on unrecognized features" · 2ff559f3
      Peter Xu authored
      This is a proposal to revert commit 914eedcb.
      
      I found this when writing a simple UFFDIO_API test to be the first unit
      test in this set.  Two things breaks with the commit:
      
        - UFFDIO_API check was lost and missing.  According to man page, the
        kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa.  This
        check is needed if the api version will be extended in the future, or
        user app won't be able to identify which is a new kernel.
      
        - Feature flags checks were removed, which means UFFDIO_API with a
        feature that does not exist will also succeed.  According to the man
        page, we should (and it makes sense) to reject ioctl(UFFDIO_API) if
        unknown features passed in.
      
      Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com
      Fixes: 914eedcb ("userfaultfd: don't fail on unrecognized features")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2ff559f3
    • Baokun Li's avatar
      writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs · 1ba1199e
      Baokun Li authored
      KASAN report null-ptr-deref:
      ==================================================================
      BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
      Write of size 8 at addr 0000000000000000 by task sync/943
      CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7f/0xc0
       print_report+0x2ba/0x340
       kasan_report+0xc4/0x120
       kasan_check_range+0x1b7/0x2e0
       __kasan_check_write+0x24/0x40
       bdi_split_work_to_wbs+0x5c5/0x7b0
       sync_inodes_sb+0x195/0x630
       sync_inodes_one_sb+0x3a/0x50
       iterate_supers+0x106/0x1b0
       ksys_sync+0x98/0x160
      [...]
      ==================================================================
      
      The race that causes the above issue is as follows:
      
                 cpu1                     cpu2
      -------------------------|-------------------------
      inode_switch_wbs
       INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
       queue_rcu_work(isw_wq, &isw->work)
       // queue_work async
        inode_switch_wbs_work_fn
         wb_put_many(old_wb, nr_switched)
          percpu_ref_put_many
           ref->data->release(ref)
           cgwb_release
            queue_work(cgwb_release_wq, &wb->release_work)
            // queue_work async
             &wb->release_work
             cgwb_release_workfn
                                  ksys_sync
                                   iterate_supers
                                    sync_inodes_one_sb
                                     sync_inodes_sb
                                      bdi_split_work_to_wbs
                                       kmalloc(sizeof(*work), GFP_ATOMIC)
                                       // alloc memory failed
              percpu_ref_exit
               ref->data = NULL
               kfree(data)
                                       wb_get(wb)
                                        percpu_ref_get(&wb->refcnt)
                                         percpu_ref_get_many(ref, 1)
                                          atomic_long_add(nr, &ref->data->count)
                                           atomic64_add(i, v)
                                           // trigger null-ptr-deref
      
      bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
      wbs.  If the allocation of new work fails, the on-stack fallback will be
      used and the reference count of the current wb is increased afterwards. 
      If cgroup writeback membership switches occur before getting the reference
      count and the current wb is released as old_wd, then calling wb_get() or
      wb_put() will trigger the null pointer dereference above.
      
      This issue was introduced in v4.3-rc7 (see fix tag1).  Both
      sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
      bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
      sync_inodes_sb(), originally commit 7fc5854f ("writeback: synchronize
      sync(2) against cgroup writeback membership switches") reduced the
      possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
      fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
      inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
      thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
      and the issue becomes easily reproducible again.
      
      To solve this problem, percpu_ref_exit() is called under RCU protection to
      avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). 
      Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
      and skip the current wb if wb_tryget() fails because the wb has already
      been shutdown.
      
      Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
      Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1ba1199e
    • Peng Zhang's avatar
      maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug · 1f5f12ec
      Peng Zhang authored
      In mas_alloc_nodes(), "node->node_count = 0" means to initialize the
      node_count field of the new node, but the node may not be a new node.  It
      may be a node that existed before and node_count has a value, setting it
      to 0 will cause a memory leak.  At this time, mas->alloc->total will be
      greater than the actual number of nodes in the linked list, which may
      cause many other errors.  For example, out-of-bounds access in
      mas_pop_node(), and mas_pop_node() may return addresses that should not be
      used.  Fix it by initializing node_count only for new nodes.
      
      Also, by the way, an if-else statement was removed to simplify the code.
      
      Link: https://lkml.kernel.org/r/20230411041005.26205-1-zhangpeng.00@bytedance.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: default avatarPeng Zhang <zhangpeng.00@bytedance.com>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1f5f12ec
    • Steve Chou's avatar
      tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used · 92357568
      Steve Chou authored
      When using cull option with 'tg' flag, the fprintf is using pid instead
      of tgid. It should use tgid instead.
      
      Link: https://lkml.kernel.org/r/20230411034929.2071501-1-steve_chou@pesi.com.tw
      Fixes: 9c8a0a8e ("tools/vm/page_owner_sort.c: support for user-defined culling rules")
      Signed-off-by: default avatarSteve Chou <steve_chou@pesi.com.tw>
      Cc: Jiajian Ye <yejiajian2018@email.szu.edu.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      92357568
    • Jonathan Toppins's avatar
      mailmap: update jtoppins' entry to reference correct email · d2c115ba
      Jonathan Toppins authored
      Link: https://lkml.kernel.org/r/d79bc6eaf65e68bd1c2a1e1510ab6291ce5926a6.1681162487.git.jtoppins@redhat.comSigned-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d2c115ba
    • Liam R. Howlett's avatar
      mm/mempolicy: fix use-after-free of VMA iterator · f4e9e0e6
      Liam R. Howlett authored
      set_mempolicy_home_node() iterates over a list of VMAs and calls
      mbind_range() on each VMA, which also iterates over the singular list of
      the VMA passed in and potentially splits the VMA.  Since the VMA iterator
      is not passed through, set_mempolicy_home_node() may now point to a stale
      node in the VMA tree.  This can result in a UAF as reported by syzbot.
      
      Avoid the stale maple tree node by passing the VMA iterator through to the
      underlying call to split_vma().
      
      mbind_range() is also overly complicated, since there are two calling
      functions and one already handles iterating over the VMAs.  Simplify
      mbind_range() to only handle merging and splitting of the VMAs.
      
      Align the new loop in do_mbind() and existing loop in
      set_mempolicy_home_node() to use the reduced mbind_range() function.  This
      allows for a single location of the range calculation and avoids
      constantly looking up the previous VMA (since this is a loop over the
      VMAs).
      
      Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
      Fixes: 66850be5 ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
        Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
      Tested-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f4e9e0e6
    • Naoya Horiguchi's avatar
      mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO · 4737edbb
      Naoya Horiguchi authored
      split_huge_page_to_list() WARNs when called for huge zero pages, which
      sounds to me too harsh because it does not imply a kernel bug, but just
      notifies the event to admins.  On the other hand, this is considered as
      critical by syzkaller and makes its testing less efficient, which seems to
      me harmful.
      
      So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited.
      
      Link: https://lkml.kernel.org/r/20230406082004.2185420-1-naoya.horiguchi@linux.dev
      Fixes: 478d134e ("mm/huge_memory: do not overkill when splitting huge_zero_page")
      Signed-off-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: syzbot+07a218429c8d19b1fb25@syzkaller.appspotmail.com
        Link: https://lore.kernel.org/lkml/000000000000a6f34a05e6efcd01@google.com/Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4737edbb
    • Liam R. Howlett's avatar
      mm/mprotect: fix do_mprotect_pkey() return on error · 82f95134
      Liam R. Howlett authored
      When the loop over the VMA is terminated early due to an error, the return
      code could be overwritten with ENOMEM.  Fix the return code by only
      setting the error on early loop termination when the error is not set.
      
      User-visible effects include: attempts to run mprotect() against a
      special mapping or with a poorly-aligned hugetlb address should return
      -EINVAL, but they presently return -ENOMEM.  In other cases an -EACCESS
      should be returned.
      
      Link: https://lkml.kernel.org/r/20230406193050.1363476-1-Liam.Howlett@oracle.com
      Fixes: 2286a691 ("mm: change mprotect_fixup to vma iterator")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      82f95134
    • Peter Xu's avatar
      mm/khugepaged: check again on anon uffd-wp during isolation · dd47ac42
      Peter Xu authored
      Khugepaged collapse an anonymous thp in two rounds of scans.  The 2nd
      round done in __collapse_huge_page_isolate() after
      hpage_collapse_scan_pmd(), during which all the locks will be released
      temporarily.  It means the pgtable can change during this phase before 2nd
      round starts.
      
      It's logically possible some ptes got wr-protected during this phase, and
      we can errornously collapse a thp without noticing some ptes are
      wr-protected by userfault.  e1e267c7 wanted to avoid it but it only
      did that for the 1st phase, not the 2nd phase.
      
      Since __collapse_huge_page_isolate() happens after a round of small page
      swapins, we don't need to worry on any !present ptes - if it existed
      khugepaged will already bail out.  So we only need to check present ptes
      with uffd-wp bit set there.
      
      This is something I found only but never had a reproducer, I thought it
      was one caused a bug in Muhammad's recent pagemap new ioctl work, but it
      turns out it's not the cause of that but an userspace bug.  However this
      seems to still be a real bug even with a very small race window, still
      worth to have it fixed and copy stable.
      
      Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com
      Fixes: e1e267c7 ("khugepaged: skip collapse if uffd-wp detected")
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      dd47ac42
    • David Hildenbrand's avatar
      mm/userfaultfd: fix uffd-wp handling for THP migration entries · 24bf08c4
      David Hildenbrand authored
      Looks like what we fixed for hugetlb in commit 44f86392 ("mm/hugetlb:
      fix uffd-wp handling for migration entries in
      hugetlb_change_protection()") similarly applies to THP.
      
      Setting/clearing uffd-wp on THP migration entries is not implemented
      properly.  Further, while removing migration PMDs considers the uffd-wp
      bit, inserting migration PMDs does not consider the uffd-wp bit.
      
      We have to set/clear independently of the migration entry type in
      change_huge_pmd() and properly copy the uffd-wp bit in
      set_pmd_migration_entry().
      
      Verified using a simple reproducer that triggers migration of a THP, that
      the set_pmd_migration_entry() no longer loses the uffd-wp bit.
      
      Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      24bf08c4
    • Qi Zheng's avatar
      mm: swap: fix performance regression on sparsetruncate-tiny · 998ad18b
      Qi Zheng authored
      The ->percpu_pvec_drained was originally introduced by commit d9ed0d08
      ("mm: only drain per-cpu pagevecs once per pagevec usage") to drain
      per-cpu pagevecs only once per pagevec usage.  But after converting the
      swap code to be more folio-based, the commit c2bc1681 ("mm/swap: add
      folio_batch_move_lru()") breaks this logic, which would cause
      ->percpu_pvec_drained to be reset to false, that means per-cpu pagevecs
      will be drained multiple times per pagevec usage.
      
      In theory, there should be no functional changes when converting code to
      be more folio-based.  We should call folio_batch_reinit() in
      folio_batch_move_lru() instead of folio_batch_init().  And to verify that
      we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and
      got the following data:
      
                                   baseline                   with
                                  baseline/                 patch/
      Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
      1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
      2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
      3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
      Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
      Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
      Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
      Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
      Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
      Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
      Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
      Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
      Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
      CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
      BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
      BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
      BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
      BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
      BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
      BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)
      
      From the above it can be seen that we get similar data to when
      ->percpu_pvec_drained was introduced, so we still need it.  Let's call
      folio_batch_reinit() in folio_batch_move_lru() to restore the original
      logic.
      
      Link: https://lkml.kernel.org/r/20230405161854.6931-1-zhengqi.arch@bytedance.com
      Fixes: c2bc1681 ("mm/swap: add folio_batch_move_lru()")
      Signed-off-by: default avatarQi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      998ad18b
  2. 06 Apr, 2023 28 commits