1. 21 Apr, 2023 22 commits
    • mm: do not increment pgfault stats when page fault handler retries · 53156443
      Suren Baghdasaryan authored
      If the page fault handler requests a retry, we will count the fault
      multiple times.  This is a relatively harmless problem as the retry paths
      are not often requested, and the only user-visible problem is that the
      fault counter will be slightly higher than it should be.  Nevertheless,
      userspace only took one fault, and should not see the fact that the kernel
      had to retry the fault multiple times.
      
      Move page fault accounting into mm_account_fault() and skip incomplete
      faults which will be accounted upon completion.
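      
      As a rough userspace model of the rule described above (a sketch only,
      not the kernel code): the struct, the flag value and the function name
      account_fault() are hypothetical stand-ins.  Only the idea of skipping
      faults that will be retried comes from the text above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the "retry requested" fault result bit. */
#define FAULT_RETRY 0x1u

/* Hypothetical counter; the kernel keeps several such statistics. */
struct fault_stats {
	unsigned long pgfault;
};

/*
 * Account one fault handler invocation.  An incomplete fault (the handler
 * asked for a retry) is skipped here and counted once it finally completes.
 * Returns true if the fault was counted.
 */
static bool account_fault(struct fault_stats *stats, unsigned int result)
{
	if (result & FAULT_RETRY)
		return false;	/* will be accounted upon completion */

	stats->pgfault++;
	return true;
}
```

      With this shape, a fault that is retried twice before completing still
      bumps the counter exactly once.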
      
      Link: https://lkml.kernel.org/r/20230419175836.3857458-1-surenb@google.com
      Fixes: d065bd81 ("mm: retry page fault when blocking on disk transfer")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: allow only one active pool compaction context · d2658f20
      Sergey Senozhatsky authored
      zsmalloc pool can be compacted concurrently by many contexts,
      e.g.
      
       cc1 handle_mm_fault()
            do_anonymous_page()
             __alloc_pages_slowpath()
              try_to_free_pages()
               do_try_to_free_pages()
                lru_gen_shrink_node()
                 shrink_slab()
                  do_shrink_slab()
                   zs_shrinker_scan()
                    zs_compact()
      
      Pool compaction is currently (basically) single-threaded as
      it is performed under pool->lock. Having multiple compaction
      threads results in unnecessary contention, as each thread
      competes for pool->lock. This, in turn, affects all zsmalloc
      operations such as zs_malloc(), zs_map_object(), zs_free(), etc.
      
      Introduce the pool->compaction_in_progress atomic variable,
      which ensures that only one compaction context can run at a
      time. This reduces overall pool->lock contention in (corner)
      cases when many contexts attempt to shrink zspool simultaneously.
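      
      The single-compactor guard can be modelled in userspace with C11
      atomics.  This is a minimal sketch under assumptions: struct pool and
      both helper names are hypothetical, and the kernel uses its own atomic
      API rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified pool: only the field relevant to this sketch. */
struct pool {
	atomic_int compaction_in_progress;
};

/*
 * Try to become the single active compaction context.
 * atomic_exchange() returns the previous value, so a non-zero result
 * means another context already set the flag and we should back off.
 */
static bool compaction_try_begin(struct pool *pool)
{
	return atomic_exchange(&pool->compaction_in_progress, 1) == 0;
}

/* Clear the flag so that a later context may compact the pool. */
static void compaction_end(struct pool *pool)
{
	atomic_store(&pool->compaction_in_progress, 0);
}
```

      A context that loses the exchange returns immediately instead of
      queueing on pool->lock, which is where the contention saving comes from.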
      
      Link: https://lkml.kernel.org/r/20230418074639.1903197-1-senozhatsky@chromium.org
      Fixes: c0547d0b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks")
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add new selftests for KSM · 07115fcc
      Stefan Roesch authored
      This adds three new tests to the selftests for KSM.  These tests use the
      new prctl APIs to enable and disable KSM.
      
      1) add new prctl flags to prctl header file in tools dir
      
         This adds the new prctl flags to the include file prctl.h in the
         tools directory.  This makes sure they are available for testing.
      
      2) add KSM prctl merge test to ksm_tests
      
         This adds the -t option to the ksm_tests program.  The -t flag
         selects whether ksm_tests uses madvise or prctl KSM merging.
      
      3) add two functions for debugging merge outcome for ksm_tests
      
         This adds two functions to report the metrics in /proc/self/ksm_stat
         and /sys/kernel/debug/mm/ksm. The debug output is enabled with the
         -d option.
      
      4) add KSM prctl test to ksm_functional_tests
      
         This adds a test to ksm_functional_tests that verifies that the
         prctl system call to enable / disable KSM works.
      
      5) add KSM fork test to ksm_functional_tests
      
         Add fork test to verify that the MMF_VM_MERGE_ANY flag is inherited
         by the child process.
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-4-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add new KSM process and sysfs knobs · d21077fb
      Stefan Roesch authored
      This adds the general_profit KSM sysfs knob and the process profit metric
      knobs to ksm_stat.
      
      1) expose general_profit metric
      
         The documentation mentions a general profit metric, but this
         metric is not calculated.  In addition, the formula depends on the
         size of internal structures, which makes it more difficult for an
         administrator to make the calculation.  Add the metric for a better
         user experience.
      
      2) document general_profit sysfs knob
      
      3) calculate ksm process profit metric
      
         The ksm documentation mentions the process profit metric and how to
         calculate it.  This adds the calculation of the metric.
      
      4) mm: expose ksm process profit metric in ksm_stat
      
         This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
         The documentation mentions the formula for the ksm process profit
         metric, but the kernel does not calculate it.  In addition, the
         formula depends on the size of internal structures.  So it makes
         sense to expose it.
      
      5) document new procfs ksm knobs
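      
      As far as the KSM documentation describes them, both profit metrics
      have the same shape: the memory saved by merged pages minus the memory
      spent on rmap_item metadata.  A minimal sketch of that arithmetic
      follows; the function name and all four parameters are hypothetical,
      and the rmap_item size must be supplied by the caller because it is an
      internal structure size.

```c
/*
 * Estimated KSM profit in bytes: memory saved by merged pages minus the
 * memory spent on rmap_item metadata.  rmap_item_size is an assumption
 * supplied by the caller, since the structure is internal to the kernel.
 */
static long long ksm_profit_bytes(long long merged_pages, long long page_size,
				  long long rmap_items, long long rmap_item_size)
{
	return merged_pages * page_size - rmap_items * rmap_item_size;
}
```

      The result can be negative when the metadata overhead outweighs the
      pages saved, which is the case where KSM is not paying for itself.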
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add new api to enable ksm per process · d7597f59
      Stefan Roesch authored
      Patch series "mm: process/cgroup ksm support", v9.
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      Use case 1:
        The madvise call is not available in the programming language.  An
        example of this is programs with forked workloads using a
        garbage-collected language without pointers.  In such a language
        madvise cannot be made available.
      
        In addition, the addresses of objects get moved around as they are
        garbage collected.  KSM sharing needs to be enabled "from the outside"
        for these types of workloads.
      
      Use case 2:
        The same interpreter can also be used for workloads where KSM brings
        no benefit or even has overhead.  We'd like to be able to enable KSM on
        a workload by workload basis.
      
      Use case 3:
        With the madvise call, sharing opportunities are only enabled for the
        current process: it is a workload-local decision.  A considerable number
        of sharing opportunities may exist across multiple workloads or jobs (if
        they are part of the same security domain).  Only a higher-level entity
        like a job scheduler or container can know for certain if it's running
        one or more instances of a job.  That job scheduler, however, doesn't
        have the necessary internal workload knowledge to make targeted madvise
        calls.
      
      Security concerns:
      
        In previous discussions security concerns have been brought up.  The
        problem is that an individual workload does not know what else is
        running on a machine.  Therefore it has to be very conservative about
        which memory areas can be shared.  However, if the system is dedicated
        to running multiple jobs within the same security domain, it's the job
        scheduler that knows that sharing can be safely enabled and is even
        desirable.
      
      Performance:
      
        Experiments with using UKSM have shown a capacity increase of around 20%.
      
        Here are the metrics from an Instagram workload (taken from a machine
        with 64GB main memory):
      
         full_scans: 445
         general_profit: 20158298048
         max_page_sharing: 256
         merge_across_nodes: 1
         pages_shared: 129547
         pages_sharing: 5119146
         pages_to_scan: 4000
         pages_unshared: 1760924
         pages_volatile: 10761341
         run: 1
         sleep_millisecs: 20
         stable_node_chains: 167
         stable_node_chains_prune_millisecs: 2000
         stable_node_dups: 2751
         use_zero_pages: 0
         zero_pages_sharing: 0
      
      After the service is running for 30 minutes to an hour, 4 to 5 million
      shared pages are common for this workload when using KSM.
      
      
      Detailed changes:
      
      1. New options for the prctl system call
         This patch series adds two new options to the prctl system call.
         The first one enables KSM at the process level and the second one
         queries the setting.
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
      3. Add general_profit metric
         The general_profit metric of KSM is specified in the documentation,
         but not calculated.  This adds the general profit metric to
         /sys/kernel/debug/mm/ksm.
      
      4. Add more metrics to ksm_stat
         This adds the process profit metric to /proc/<pid>/ksm_stat.
      
      5. Add more tests to ksm_tests and ksm_functional_tests
         This adds an option to specify the merge type to ksm_tests.
         This allows testing of both madvise and prctl KSM.
      
         It also adds two new tests to ksm_functional_tests: one to test
         the new prctl options and a fork test to verify that the KSM process
         setting is inherited by child processes.
      
      
      This patch (of 3):
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      1. New options for the prctl system call
      
         This patch series adds two new options to the prctl system call.
         The first one enables KSM at the process level and the second one
         queries the setting.
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
      
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
        1) Introduce new MMF_VM_MERGE_ANY flag
      
           This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
           is set, kernel samepage merging (ksm) gets enabled for all VMAs of a
           process.
      
        2) Setting VM_MERGEABLE on VMA creation
      
           When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
           VM_MERGEABLE flag will be set for this VMA.
      
        3) support disabling of ksm for a process
      
           This adds the ability to disable ksm for a process if ksm has been
           enabled for the process with prctl.
      
        4) add new prctl option to get and set ksm for a process
      
           This adds two new options to the prctl system call
           - enable ksm for all vmas of a process (if the vmas support it).
           - query if ksm has been enabled for a process.
      
      3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
      
         In the s390 architecture, when storage keys are used,
         MMF_VM_MERGE_ANY will be disabled.
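      
      From userspace, the two prctl options described above might be used as
      in the sketch below.  The text does not name the options; to my
      knowledge they are PR_SET_MEMORY_MERGE and PR_GET_MEMORY_MERGE, and the
      numeric fallbacks are assumptions to check against your linux/prctl.h.
      The wrapper function names are hypothetical.

```c
#include <sys/prctl.h>

/*
 * Assumed option numbers, used only when the installed headers predate
 * the new prctl options.  Verify against linux/prctl.h before relying
 * on them.
 */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Enable (non-zero) or disable (0) KSM for the calling process.
 * Returns 0 on success, -1 with errno set otherwise (for example on a
 * kernel without this feature).
 */
static int ksm_set_process_merge(int enable)
{
	return prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/* Query the setting: 1 if enabled, 0 if disabled, -1 with errno set on error. */
static int ksm_get_process_merge(void)
{
	return prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);
}
```

      A job scheduler could call ksm_set_process_merge(1) in the seed process
      before forking workers, relying on the inheritance described above.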
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinkers: fix debugfs file permissions · 2124f79d
      John Keeping authored
      The permissions for the files here are swapped as "count" is read-only and
      "scan" is write-only.  While this doesn't really matter as these
      permissions don't stop the files being opened for reading/writing as
      appropriate, they are shown by "ls -l" and are confusing.
      
      Link: https://lkml.kernel.org/r/20230418101906.3131303-1-john@metanate.com
      Signed-off-by: John Keeping <john@metanate.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: don't check VMA write permissions if the PTE/PMD indicates write permissions · f3ebdf04
      David Hildenbrand authored
      Staring at the comment "Recheck VMA as permissions can change since
      migration started" in remove_migration_pte() can result in confusion,
      because if the source PTE/PMD indicates write permissions, then there
      should be no need to check VMA write permissions when restoring migration
      entries or PTE-mapping a PMD.
      
      Commit d3cb8bf6 ("mm: migrate: Close race between migration completion
      and mprotect") introduced the maybe_mkwrite() handling in
      remove_migration_pte() in 2014, stating that a race between mprotect() and
      migration finishing would be possible, and that we could end up with a
      writable PTE that should be readable.
      
      However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
      and then walks the page tables to (a) set all present writable PTEs to
      read-only and (b) convert all writable migration entries to readable
      migration entries.  While walking the page tables and modifying the
      entries, migration code has to grab the PT locks to synchronize against
      concurrent page table modifications.
      
      Assuming migration would find a writable migration entry (while holding
      the PT lock) and replace it with a writable present PTE, surely mprotect()
      code didn't stumble over the writable migration entry yet (converting it
      into a readable migration entry) and would instead wait for the PT lock to
      convert the now present writable PTE into a read-only PTE.  As mprotect()
      didn't finish yet, the behavior is just like migration didn't happen: a
      writable PTE will be converted to a read-only PTE.
      
      So it's fine to rely on the writability information in the source PTE/PMD
      and not recheck against the VMA as long as we're holding the PT lock to
      synchronize with anyone who concurrently wants to downgrade write
      permissions (like mprotect()) by first adjusting vma->vm_flags /
      vma->vm_page_prot to then walk over the page tables to adjust the page
      table entries.
      
      Running test cases that should reveal such races -- mprotect(PROT_READ)
      racing with page migration or THP splitting -- for multiple hours did not
      reveal an issue with this cleanup.
      
      Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate_pages_batch: fix statistics for longterm pin retry · 851ae642
      Huang Ying authored
      In commit fd4a7ac3 ("mm: migrate: try again if THP split is failed due
      to page refcnt"), if the THP splitting fails due to page reference count,
      we will retry to improve the migration success rate.  But the failed
      splitting is counted as both a migration failure and a migration retry,
      which causes duplicated failure counting.  Fix this by undoing the
      failure counting if we decide to retry.  The patch is tested via
      failure injection.
      
      Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
      Fixes: fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: use helper function range_in_vma() · 686ea6e6
      ZhangPeng authored
      We can use range_in_vma() to check if dst_start, dst_start + len are
      within the dst_vma range.  Minor readability improvement.
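      
      For readers unfamiliar with the helper, the check it performs can be
      sketched in userspace as follows.  struct vma here is a simplified,
      hypothetical stand-in for the kernel's struct vm_area_struct, keeping
      only the two bounds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct vm_area_struct: covers [vm_start, vm_end). */
struct vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* True when vma is non-NULL and [start, end) lies entirely inside it. */
static bool range_in_vma(const struct vma *vma, unsigned long start,
			 unsigned long end)
{
	return vma && vma->vm_start <= start && end <= vma->vm_end;
}
```

      A call such as range_in_vma(dst_vma, dst_start, dst_start + len) then
      replaces two open-coded comparisons against the VMA bounds.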
      
      Link: https://lkml.kernel.org/r/20230417003919.930515-1-zhangpeng362@huawei.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/show_mem.c: use for_each_populated_zone() simplify code · 13215e8a
      Yajun Deng authored
      __show_mem() needs to iterate over all zones that have memory, so we can
      simplify the code by using for_each_populated_zone().
      
      Link: https://lkml.kernel.org/r/20230417035226.4013584-1-yajun.deng@linux.dev
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() · 4bf4f155
      Kefeng Wang authored
      Both functions had their argument renamed from page_list to folio_list
      when they were converted to use folios, but the declarations were not
      updated.  Correct them, and also move the reclaim_pages() declaration
      from swap.h to internal.h as it is only used in mm.
      
      Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: convert create_page_buffers to folio_create_buffers · c6c8c3e7
      Pankaj Raghav authored
      fs/buffer does not support large folios, as it assumes in many places
      that the folio size is the host page size.  This conversion is one step
      towards removing that assumption.  This conversion will also reduce
      calls to compound_head() if folio_create_buffers() calls
      folio_create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: add folio_create_empty_buffers helper · 8e2e1756
      Pankaj Raghav authored
      Folio version of create_empty_buffers().  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      It removes several calls to compound_head() as it works directly on folio
      compared to create_empty_buffers().  Hence, create_empty_buffers() has
      been modified to call folio_create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: add folio_alloc_buffers() helper · c71124a8
      Pankaj Raghav authored
      Folio version of alloc_page_buffers() helper.  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      alloc_page_buffers() has been modified to call folio_alloc_buffers() which
      adds one call to compound_head() but folio_alloc_buffers() removes one
      call to compound_head() compared to the existing alloc_page_buffers()
      implementation.
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: add folio_set_bh helper · 465e5e6a
      Pankaj Raghav authored
      Patch series "convert create_page_buffers to folio_create_buffers".
      
      One of the first kernel panics we hit when we try to increase the block
      size > 4k is inside create_page_buffers()[1].  Even though buffer.c
      functions do not support large folios (folios > PAGE_SIZE) at the moment,
      these changes are required when we want to remove that constraint.
      
      
      This patch (of 4):
      
      The folio version of set_bh_page().  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com
      Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add tests for RO pinning vs fork() · 760aee0b
      Peter Xu authored
      Add a test suite (with 10 more sub-tests) to cover RO pinning against
      fork() over uffd-wp.  It covers both:
      
        (1) Early CoW test in fork() when page pinned,
        (2) page unshare due to RO longterm pin.
      
      They are:
      
        Testing wp-fork-pin on anon... done
        Testing wp-fork-pin on shmem... done
        Testing wp-fork-pin on shmem-private... done
        Testing wp-fork-pin on hugetlb... done
        Testing wp-fork-pin on hugetlb-private... done
        Testing wp-fork-pin-with-event on anon... done
        Testing wp-fork-pin-with-event on shmem... done
        Testing wp-fork-pin-with-event on shmem-private... done
        Testing wp-fork-pin-with-event on hugetlb... done
        Testing wp-fork-pin-with-event on hugetlb-private... done
      
      CONFIG_GUP_TEST is needed, otherwise they'll be skipped:
      
        Testing wp-fork-pin on anon... skipped [reason: Possibly CONFIG_GUP_TEST missing or unprivileged]
      
      Note that the major test goal is on private memory, but it does no harm
      to also run all of them over shared because shared memory should work
      the same.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: rename COW_EXTRA_LIBS to IOURING_EXTRA_LIBS · 71fc41eb
      Peter Xu authored
      The macro and facility can be reused in other tests too.  Make it general.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: extend and rename uffd pagemap test · cff29458
      Peter Xu authored
      Extend it to all types of mem, and add one parallel test for when
      EVENT_FORK is enabled, where uffd-wp bits should be persisted rather than
      dropped.
      
      While at it, rename the test to "wp-fork" to better show what it means.
      The new test is called "wp-fork-with-event".
      
      Before:
      
              Testing pagemap on anon... done
      
      After:
      
              Testing wp-fork on anon... done
              Testing wp-fork on shmem... done
              Testing wp-fork on shmem-private... done
              Testing wp-fork on hugetlb... done
              Testing wp-fork on hugetlb-private... done
              Testing wp-fork-with-event on anon... done
              Testing wp-fork-with-event on shmem... done
              Testing wp-fork-with-event on shmem-private... done
              Testing wp-fork-with-event on hugetlb... done
              Testing wp-fork-with-event on hugetlb-private... done
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add a few options for uffd-unit-test · 21337f2a
      Peter Xu authored
      Namely:
      
        "-f": add a wildcard filter for tests to run
        "-l": list tests rather than running any
        "-h": help msg
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix uffd-wp bit lost when unsharing happens · 0f230bc2
      Peter Xu authored
      When we try to unshare a pinned page for a private hugetlb, uffd-wp bit
      can get lost during unsharing.
      
      When the above condition is met, one can lose the uffd-wp bit on the
      privately mapped hugetlb page.  It allows the page to be writable even
      if it should still be wr-protected.  I assume it can mean data loss.
      
      This should be very rare, only if an unsharing happened on a private
      hugetlb page with uffd-wp protected (e.g. in a child which shares the
      same page with parent with UFFD_FEATURE_EVENT_FORK enabled).
      
      When I wrote the reproducer (provided in the last patch) I needed to
      use the newest gup_test cmd introduced by David to trigger it because I
      don't even know another way to do a proper RO longterm pin.
      
      Besides that, it needs a bunch of other conditions all met:
      
              (1) hugetlb being mapped privately,
              (2) userfaultfd registered with WP and EVENT_FORK,
              (3) the user app fork()s, then,
              (4) RO longterm pin onto a wr-protected anonymous page.
      
      If it's not impossible to hit in production, I'd say it is extremely rare.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-3-peterx@redhat.com
      Fixes: 166f3ecc ("mm/hugetlb: hook page faults for uffd write protection")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0f230bc2
    • Peter Xu's avatar
      mm/hugetlb: fix uffd-wp during fork() · 5a2f8d22
      Peter Xu authored
      Patch series "mm/hugetlb: More fixes around uffd-wp vs fork() / RO pins",
      v2.
      
      
      This patch (of 6):
      
      There're a bunch of things that were wrong:
      
        - Reading uffd-wp bit from a swap entry should use pte_swp_uffd_wp()
          rather than huge_pte_uffd_wp().
      
        - When copying over a pte, we should drop uffd-wp bit when
          !EVENT_FORK (aka, when !userfaultfd_wp(dst_vma)).
      
        - When doing early CoW for private hugetlb (e.g. when the parent page was
          pinned), uffd-wp bit should be properly carried over if necessary.
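The three rules above can be illustrated with a small userspace model. This is a sketch under loud assumptions, not the kernel change: `toy_pte`, `toy_read_uffd_wp()`, `toy_copy_pte()` and `toy_early_cow()` are made-up stand-ins, and the two booleans only mimic the fact that a present entry and a swap entry keep the uffd-wp bit in different places.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only; these are not kernel types. */
enum toy_kind { TOY_PRESENT, TOY_SWAP };

struct toy_pte {
	enum toy_kind kind;
	bool present_wp;	/* bit as a present entry would encode it */
	bool swap_wp;		/* bit as a swap entry would encode it */
};

/* Rule 1: pick the accessor that matches the entry format. */
static bool toy_read_uffd_wp(const struct toy_pte *pte)
{
	return pte->kind == TOY_SWAP ? pte->swap_wp : pte->present_wp;
}

/*
 * Rule 2: when copying an entry to the child, drop the bit unless the
 * destination VMA is still uffd-wp registered (the EVENT_FORK case).
 */
static struct toy_pte toy_copy_pte(struct toy_pte src, bool dst_vma_uffd_wp)
{
	struct toy_pte dst = src;

	if (!dst_vma_uffd_wp) {
		dst.present_wp = false;
		dst.swap_wp = false;
	}
	return dst;
}

/*
 * Rule 3: an early CoW installs a fresh present entry for the child;
 * the bit is carried over only if the destination VMA wants it.
 */
static struct toy_pte toy_early_cow(struct toy_pte src, bool dst_vma_uffd_wp)
{
	struct toy_pte dst = { .kind = TOY_PRESENT };

	dst.present_wp = dst_vma_uffd_wp && toy_read_uffd_wp(&src);
	return dst;
}
```

In this model, reading a swap entry through the present-format field would silently miss the bit, which is the kind of mistake the first rule guards against.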
      
      No bug has been reported, probably because most people do not even care
      about these corner cases, but they are still bugs and can be exposed by the
      recently introduced unit tests, so fix all of them in one shot.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20230417195317.898696-2-peterx@redhat.com
      Fixes: bc70fbf2 ("mm/hugetlb: handle uffd-wp during fork()")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a2f8d22
    • Zqiang's avatar
      kasan: fix lockdep report invalid wait context · be41d814
      Zqiang authored
      Kernels built with the following options report the splat below while
      booting:
      
      CONFIG_SLUB=y
      CONFIG_DEBUG_LOCKDEP=y
      CONFIG_PROVE_LOCKING=y
      CONFIG_PROVE_RAW_LOCK_NESTING=y
      
      [    0.523115] [ BUG: Invalid wait context ]
      [    0.523315] 6.3.0-rc1-yocto-standard+ #739 Not tainted
      [    0.523649] -----------------------------
      [    0.523663] swapper/0/0 is trying to lock:
      [    0.523663] ffff888035611360 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x2e/0x1e0
      [    0.523663] other info that might help us debug this:
      [    0.523663] context-{2:2}
      [    0.523663] no locks held by swapper/0/0.
      [    0.523663] stack backtrace:
      [    0.523663] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc1-yocto-standard+ #739
      [    0.523663] Call Trace:
      [    0.523663]  <IRQ>
      [    0.523663]  dump_stack_lvl+0x64/0xb0
      [    0.523663]  dump_stack+0x10/0x20
      [    0.523663]  __lock_acquire+0x6c4/0x3c10
      [    0.523663]  lock_acquire+0x188/0x460
      [    0.523663]  put_cpu_partial+0x5a/0x1e0
      [    0.523663]  __slab_free+0x39a/0x520
      [    0.523663]  ___cache_free+0xa9/0xc0
      [    0.523663]  qlist_free_all+0x7a/0x160
      [    0.523663]  per_cpu_remove_cache+0x5c/0x70
      [    0.523663]  __flush_smp_call_function_queue+0xfc/0x330
      [    0.523663]  generic_smp_call_function_single_interrupt+0x13/0x20
      [    0.523663]  __sysvec_call_function+0x86/0x2e0
      [    0.523663]  sysvec_call_function+0x73/0x90
      [    0.523663]  </IRQ>
      [    0.523663]  <TASK>
      [    0.523663]  asm_sysvec_call_function+0x1b/0x20
      [    0.523663] RIP: 0010:default_idle+0x13/0x20
      [    0.523663] RSP: 0000:ffffffff83e07dc0 EFLAGS: 00000246
      [    0.523663] RAX: 0000000000000000 RBX: ffffffff83e1e200 RCX: ffffffff82a83293
      [    0.523663] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8119a6b1
      [    0.523663] RBP: ffffffff83e07dc8 R08: 0000000000000001 R09: ffffed1006ac0d66
      [    0.523663] R10: ffff888035606b2b R11: ffffed1006ac0d65 R12: 0000000000000000
      [    0.523663] R13: ffffffff83e1e200 R14: ffffffff84a7d980 R15: 0000000000000000
      [    0.523663]  default_idle_call+0x6c/0xa0
      [    0.523663]  do_idle+0x2e1/0x330
      [    0.523663]  cpu_startup_entry+0x20/0x30
      [    0.523663]  rest_init+0x152/0x240
      [    0.523663]  arch_call_rest_init+0x13/0x40
      [    0.523663]  start_kernel+0x331/0x470
      [    0.523663]  x86_64_start_reservations+0x18/0x40
      [    0.523663]  x86_64_start_kernel+0xbb/0x120
      [    0.523663]  secondary_startup_64_no_verify+0xe0/0xeb
      [    0.523663]  </TASK>
      
      put_cpu_partial() invokes local_lock_irqsave(), and here it is reached in
      IPI context.  With CONFIG_PROVE_RAW_LOCK_NESTING=y (so LD_WAIT_CONFIG is
      not equal to LD_WAIT_SPIN), acquiring the local_lock in IPI context
      triggers the above call trace.
      
      This commit therefore moves qlist_free_all() from hard-irq context to task
      context.
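The idea of moving the free out of interrupt context can be sketched as a two-stage pattern: the interrupt-side step only detaches objects onto a staging list, and a later task-side step does the actual freeing. The code below is a single-threaded userspace toy under that assumption, not the kasan implementation; every `toy_*` name is hypothetical, and the locking a real per-CPU staging list would need is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy singly linked list standing in for a quarantine list. */
struct toy_node {
	struct toy_node *next;
};

struct toy_list {
	struct toy_node *head;
};

/* Add one freshly allocated object; returns 0 on success, -1 on failure. */
static int toy_push(struct toy_list *list)
{
	struct toy_node *node = malloc(sizeof(*node));

	if (node == NULL)
		return -1;
	node->next = list->head;
	list->head = node;
	return 0;
}

/*
 * Stage 1, the part that would run in interrupt context: only move
 * nodes to the staging list, never call into the allocator.
 */
static void toy_detach(struct toy_list *quarantine, struct toy_list *staged)
{
	while (quarantine->head != NULL) {
		struct toy_node *node = quarantine->head;

		quarantine->head = node->next;
		node->next = staged->head;
		staged->head = node;
	}
}

/*
 * Stage 2, the part that would run in task context: free everything
 * that was staged and report how many objects were released.
 */
static size_t toy_free_all(struct toy_list *staged)
{
	size_t freed = 0;

	while (staged->head != NULL) {
		struct toy_node *node = staged->head;

		staged->head = node->next;
		free(node);
		freed++;
	}
	return freed;
}
```

The point of the split is that the first stage touches nothing that may take a sleeping-capable lock, so it is safe to run where such locks are not allowed.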
      
      Link: https://lkml.kernel.org/r/20230327120019.1027640-1-qiang1.zhang@intel.com
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be41d814
  2. 18 Apr, 2023 18 commits