1. 21 Apr, 2023 19 commits
    • mm: add new KSM process and sysfs knobs · d21077fb
      Stefan Roesch authored
      This adds the general_profit KSM sysfs knob and the process profit metric
      knobs to ksm_stat.
      
      1) expose general_profit metric
      
         The documentation mentions a general profit metric, but the kernel
         does not calculate it.  In addition, the formula depends on the size
         of internal structures, which makes it difficult for an administrator
         to do the calculation.  Add the metric for a better user experience.
      
      2) document general_profit sysfs knob
      
      3) calculate ksm process profit metric
      
         The ksm documentation mentions the process profit metric and how to
         calculate it.  This adds the calculation of the metric.
      
      4) mm: expose ksm process profit metric in ksm_stat
      
         This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
         The documentation mentions the formula for the ksm process profit
         metric, however it does not calculate it.  In addition the formula
         depends on the size of internal structures.  So it makes sense to
         expose it.
      
      5) document new procfs ksm knobs
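
      As a rough illustration of the calculation these knobs perform, the
      kernel documentation describes both profit metrics as approximately
      "merged pages * page size - rmap items * size of one rmap item".  The
      sketch below only mirrors that shape in userspace C; the function name
      and the idea of passing the structure size as a parameter are
      assumptions made for illustration, since the real size is
      sizeof(struct ksm_rmap_item) in the running kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): estimate KSM profit in bytes.
 *
 * merged_pages   - pages_sharing for the general metric, or the per-process
 *                  merged page count for the process metric
 * rmap_items     - number of rmap items KSM keeps for tracked pages
 * page_size      - size of one page in bytes
 * rmap_item_size - ASSUMPTION supplied by the caller; the kernel uses
 *                  sizeof(struct ksm_rmap_item), which userspace cannot see
 */
int64_t ksm_profit_bytes(int64_t merged_pages, int64_t rmap_items,
                         int64_t page_size, int64_t rmap_item_size)
{
    /* Memory saved by merging minus the metadata spent on tracking. */
    return merged_pages * page_size - rmap_items * rmap_item_size;
}
```

      A negative result means the tracking metadata costs more memory than
      merging saves, which is why exposing the value directly is easier for an
      administrator than computing it by hand.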
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add new api to enable ksm per process · d7597f59
      Stefan Roesch authored
      Patch series "mm: process/cgroup ksm support", v9.
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      Use case 1:
        The madvise call is not available in the programming language.
        Examples are programs with forked workloads written in a
        garbage-collected language without pointers.  In such a language
        madvise cannot be made available.

        In addition, the addresses of objects move around as they are
        garbage collected.  KSM sharing needs to be enabled "from the outside"
        for these types of workloads.
      
      Use case 2:
        The same interpreter can also be used for workloads where KSM brings
        no benefit or even has overhead.  We'd like to be able to enable KSM on
        a workload by workload basis.
      
      Use case 3:
        With the madvise call, sharing opportunities are only enabled for the
        current process: it is a workload-local decision.  A considerable number
        of sharing opportunities may exist across multiple workloads or jobs (if
        they are part of the same security domain).  Only a higher-level entity
        like a job scheduler or container can know for certain whether it is
        running one or more instances of a job.  That job scheduler, however,
        doesn't have the necessary internal workload knowledge to make targeted
        madvise calls.
      
      Security concerns:
      
        In previous discussions security concerns have been brought up.  The
        problem is that an individual workload does not know what else is
        running on a machine.  Therefore it has to be very conservative about
        which memory areas can be shared.  However, if the system is dedicated
        to running multiple jobs within the same security domain, it is the job
        scheduler that knows that sharing can be safely enabled and is even
        desirable.
      
      Performance:
      
        Experiments with using UKSM have shown a capacity increase of around 20%.
      
        Here are the metrics from an Instagram workload (taken from a machine
        with 64GB of main memory):
      
         full_scans: 445
         general_profit: 20158298048
         max_page_sharing: 256
         merge_across_nodes: 1
         pages_shared: 129547
         pages_sharing: 5119146
         pages_to_scan: 4000
         pages_unshared: 1760924
         pages_volatile: 10761341
         run: 1
         sleep_millisecs: 20
         stable_node_chains: 167
         stable_node_chains_prune_millisecs: 2000
         stable_node_dups: 2751
         use_zero_pages: 0
         zero_pages_sharing: 0
      
      After the service has been running for 30 minutes to an hour, 4 to 5
      million shared pages are common for this workload when using KSM.
      
      
      Detailed changes:
      
      1. New options for prctl system call
         This patch series adds two new options to the prctl system call.
         The first one enables KSM at the process level and the second one
         queries the setting.

         The setting will be inherited by child processes.

         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
      3. Add general_profit metric
         The general_profit metric of KSM is specified in the documentation,
         but not calculated.  This adds the general profit metric to
         /sys/kernel/mm/ksm.
      
      4. Add more metrics to ksm_stat
         This adds the process profit metric to /proc/<pid>/ksm_stat.
      
      5. Add more tests to ksm_tests and ksm_functional_tests
         This adds an option to specify the merge type to ksm_tests.
         This allows testing both madvise and prctl KSM.

         It also adds two new tests to ksm_functional_tests: one to test
         the new prctl options and a fork test to verify that the KSM process
         setting is inherited by child processes.
      
      
      This patch (of 3):
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      1. New options for prctl system call

         This patch series adds two new options to the prctl system call.
         The first one enables KSM at the process level and the second one
         queries the setting.
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
      
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
        1) Introduce new MMF_VM_MERGE_ANY flag
      
           This introduces the new MMF_VM_MERGE_ANY flag.  When this flag is
           set, kernel samepage merging (ksm) gets enabled for all vmas of a
           process.
      
        2) Setting VM_MERGEABLE on VMA creation
      
           When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
           VM_MERGEABLE flag will be set for this VMA.
      
        3) support disabling of ksm for a process
      
           This adds the ability to disable ksm for a process if ksm has been
           enabled for the process with prctl.
      
        4) add new prctl option to get and set ksm for a process
      
           This adds two new options to the prctl system call:
           - enable ksm for all vmas of a process (if the vmas support it).
           - query if ksm has been enabled for a process.
      
      3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
      
         On the s390 architecture, when storage keys are used, the
         MMF_VM_MERGE_ANY flag will be disabled.
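
      The two prctl options described above can be exercised from userspace
      roughly as follows.  This is a minimal sketch, not code from the patch:
      the commit text does not name the options, so PR_SET_MEMORY_MERGE and
      PR_GET_MEMORY_MERGE (with the fallback values 67 and 68) are taken from
      the upstream uapi header, and the two wrapper function names are
      hypothetical.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback definitions for libc headers that predate the new options.
 * Values follow the upstream include/uapi/linux/prctl.h. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* Hypothetical wrapper: enable (non-zero) or disable (0) KSM for the
 * calling process.  Returns 0 on success or a negative errno value. */
int ksm_process_set(int enable)
{
    if (prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/* Hypothetical wrapper: query the per-process KSM setting.
 * Returns 1 if enabled, 0 if not, or a negative errno value
 * (for example -EINVAL on a kernel without these options). */
int ksm_process_get(void)
{
    int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

    return ret == -1 ? -errno : ret;
}
```

      A job launcher could call ksm_process_set(1) in the seed process before
      forking workers, relying on the inheritance described above.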
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinkers: fix debugfs file permissions · 2124f79d
      John Keeping authored
      The permissions for the files here are swapped as "count" is read-only and
      "scan" is write-only.  While this doesn't really matter as these
      permissions don't stop the files being opened for reading/writing as
      appropriate, they are shown by "ls -l" and are confusing.
      
      Link: https://lkml.kernel.org/r/20230418101906.3131303-1-john@metanate.com
      Signed-off-by: John Keeping <john@metanate.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: don't check VMA write permissions if the PTE/PMD indicates write permissions · f3ebdf04
      David Hildenbrand authored
      Staring at the comment "Recheck VMA as permissions can change since
      migration started" in remove_migration_pte() can result in confusion,
      because if the source PTE/PMD indicates write permissions, then there
      should be no need to check VMA write permissions when restoring migration
      entries or PTE-mapping a PMD.
      
      Commit d3cb8bf6 ("mm: migrate: Close race between migration completion
      and mprotect") introduced the maybe_mkwrite() handling in
      remove_migration_pte() in 2014, stating that a race between mprotect() and
      migration finishing would be possible, and that we could end up with a
      writable PTE that should be readable.
      
      However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
      and then walks the page tables to (a) set all present writable PTEs to
      read-only and (b) convert all writable migration entries to readable
      migration entries.  While walking the page tables and modifying the
      entries, migration code has to grab the PT locks to synchronize against
      concurrent page table modifications.
      
      Assuming migration would find a writable migration entry (while holding
      the PT lock) and replace it with a writable present PTE, surely mprotect()
      code didn't stumble over the writable migration entry yet (converting it
      into a readable migration entry) and would instead wait for the PT lock to
      convert the now present writable PTE into a read-only PTE.  As mprotect()
      didn't finish yet, the behavior is just like migration didn't happen: a
      writable PTE will be converted to a read-only PTE.
      
      So it's fine to rely on the writability information in the source PTE/PMD
      and not recheck against the VMA as long as we're holding the PT lock to
      synchronize with anyone who concurrently wants to downgrade write
      permissions (like mprotect()) by first adjusting vma->vm_flags /
      vma->vm_page_prot to then walk over the page tables to adjust the page
      table entries.
      
      Running test cases that should reveal such races -- mprotect(PROT_READ)
      racing with page migration or THP splitting -- for multiple hours did not
      reveal an issue with this cleanup.
      
      Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate_pages_batch: fix statistics for longterm pin retry · 851ae642
      Huang Ying authored
      In commit fd4a7ac3 ("mm: migrate: try again if THP split is failed due
      to page refcnt"), if THP splitting fails due to the page reference count,
      we retry to improve the migration success rate.  But the failed split is
      counted both as a migration failure and as a migration retry, which
      causes duplicated failure counting.  This patch fixes that by undoing
      the failure counting if we decide to retry.  The patch was tested via
      failure injection.
      
      Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com
      Fixes: fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: use helper function range_in_vma() · 686ea6e6
      ZhangPeng authored
      We can use range_in_vma() to check if dst_start, dst_start + len are
      within the dst_vma range.  Minor readability improvement.
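
      For readers unfamiliar with the helper, the check it performs is that
      the requested range lies entirely inside the VMA.  The userspace sketch
      below mirrors that logic; struct toy_vma and toy_range_in_vma() are
      hypothetical stand-ins for the kernel's struct vm_area_struct and
      range_in_vma(), kept only to show the comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct vm_area_struct: covers [vm_start, vm_end). */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/* True if vma is non-NULL and [start, end) lies fully inside it. */
bool toy_range_in_vma(const struct toy_vma *vma,
                      unsigned long start, unsigned long end)
{
    return vma && vma->vm_start <= start && end <= vma->vm_end;
}
```

      In the userfaultfd case the caller would pass dst_start and
      dst_start + len as the range bounds.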
      
      Link: https://lkml.kernel.org/r/20230417003919.930515-1-zhangpeng362@huawei.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/show_mem.c: use for_each_populated_zone() simplify code · 13215e8a
      Yajun Deng authored
      __show_mem() needs to iterate over all zones that have memory; we can
      simplify the code by using for_each_populated_zone().
      
      Link: https://lkml.kernel.org/r/20230417035226.4013584-1-yajun.deng@linux.dev
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() · 4bf4f155
      Kefeng Wang authored
      Both functions had their argument renamed from page_list to folio_list
      when they were converted to use a folio, but their declarations were not
      updated; correct them.  Also move reclaim_pages() from swap.h to
      internal.h as it is only used in mm.
      
      Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: convert create_page_buffers to folio_create_buffers · c6c8c3e7
      Pankaj Raghav authored
      fs/buffer does not support large folios as there are many assumptions
      that the folio size is the host page size.  This conversion is one step
      towards removing that assumption.  This conversion also reduces calls to
      compound_head() if folio_create_buffers() calls
      folio_create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: add folio_create_empty_buffers helper · 8e2e1756
      Pankaj Raghav authored
      Folio version of create_empty_buffers().  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      It removes several calls to compound_head() as it works directly on folio
      compared to create_empty_buffers().  Hence, create_empty_buffers() has
      been modified to call folio_create_empty_buffers().
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: add folio_alloc_buffers() helper · c71124a8
      Pankaj Raghav authored
      Folio version of alloc_page_buffers() helper.  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      alloc_page_buffers() has been modified to call folio_alloc_buffers() which
      adds one call to compound_head() but folio_alloc_buffers() removes one
      call to compound_head() compared to the existing alloc_page_buffers()
      implementation.
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/buffer: add folio_set_bh helper · 465e5e6a
      Pankaj Raghav authored
      Patch series "convert create_page_buffers to folio_create_buffers".
      
      One of the first kernel panics we hit when we try to increase the block
      size > 4k is inside create_page_buffers()[1].  Even though the buffer.c
      functions do not support large folios (folios > PAGE_SIZE) at the moment,
      these changes are required when we want to remove that constraint.
      
      
      This patch (of 4):
      
      The folio version of set_bh_page().  This is required to convert
      create_page_buffers() to folio_create_buffers() later in the series.
      
      Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com
      Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add tests for RO pinning vs fork() · 760aee0b
      Peter Xu authored
      Add a test suite (with 10 more sub-tests) to cover RO pinning against
      fork() over uffd-wp.  It covers both:
      
        (1) Early CoW test in fork() when page pinned,
        (2) page unshare due to RO longterm pin.
      
      They are:
      
        Testing wp-fork-pin on anon... done
        Testing wp-fork-pin on shmem... done
        Testing wp-fork-pin on shmem-private... done
        Testing wp-fork-pin on hugetlb... done
        Testing wp-fork-pin on hugetlb-private... done
        Testing wp-fork-pin-with-event on anon... done
        Testing wp-fork-pin-with-event on shmem... done
        Testing wp-fork-pin-with-event on shmem-private... done
        Testing wp-fork-pin-with-event on hugetlb... done
        Testing wp-fork-pin-with-event on hugetlb-private... done
      
      CONFIG_GUP_TEST is needed, otherwise the tests are skipped:
      
        Testing wp-fork-pin on anon... skipped [reason: Possibly CONFIG_GUP_TEST missing or unprivileged]
      
      Note that the major test goal is private memory, but it does no harm to
      also run all of them over shared memory because shared memory should
      work the same.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: rename COW_EXTRA_LIBS to IOURING_EXTRA_LIBS · 71fc41eb
      Peter Xu authored
      The macro and facility can be reused in other tests too.  Make it general.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: extend and rename uffd pagemap test · cff29458
      Peter Xu authored
      Extend it to all types of memory, and add one parallel test for when
      EVENT_FORK is enabled, where uffd-wp bits should be persisted rather than
      dropped.

      While at it, rename the test to "wp-fork" to better show what it means.
      The new test is called "wp-fork-with-event".
      
      Before:
      
              Testing pagemap on anon... done
      
      After:
      
              Testing wp-fork on anon... done
              Testing wp-fork on shmem... done
              Testing wp-fork on shmem-private... done
              Testing wp-fork on hugetlb... done
              Testing wp-fork on hugetlb-private... done
              Testing wp-fork-with-event on anon... done
              Testing wp-fork-with-event on shmem... done
              Testing wp-fork-with-event on shmem-private... done
              Testing wp-fork-with-event on hugetlb... done
              Testing wp-fork-with-event on hugetlb-private... done
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add a few options for uffd-unit-test · 21337f2a
      Peter Xu authored
      Namely:
      
        "-f": add a wildcard filter for tests to run
        "-l": list tests rather than running any
        "-h": help msg
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix uffd-wp bit lost when unsharing happens · 0f230bc2
      Peter Xu authored
      When we try to unshare a pinned page for a private hugetlb, the uffd-wp
      bit can get lost during unsharing.

      When the above condition is met, one can lose the uffd-wp bit on the
      privately mapped hugetlb page.  It allows the page to be writable even if
      it should still be wr-protected.  I assume it can mean data loss.

      This should be very rare, only if an unsharing happened on a private
      hugetlb page with uffd-wp protected (e.g. in a child which shares the
      same page with the parent with UFFD_FEATURE_EVENT_FORK enabled).

      When I wrote the reproducer (provided in the last patch) I needed to
      use the newest gup_test cmd introduced by David to trigger it because I
      don't even know another way to do a proper RO longterm pin.
      
      Besides that, it needs a bunch of other conditions all met:
      
              (1) hugetlb being mapped privately,
              (2) userfaultfd registered with WP and EVENT_FORK,
              (3) the user app fork()s, then,
              (4) RO longterm pin onto a wr-protected anonymous page.
      
      If it is possible to hit in production at all, I'd say it is extremely
      rare.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-3-peterx@redhat.com
      Fixes: 166f3ecc ("mm/hugetlb: hook page faults for uffd write protection")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix uffd-wp during fork() · 5a2f8d22
      Peter Xu authored
      Patch series "mm/hugetlb: More fixes around uffd-wp vs fork() / RO pins",
      v2.
      
      
      This patch (of 6):
      
      There're a bunch of things that were wrong:
      
        - Reading uffd-wp bit from a swap entry should use pte_swp_uffd_wp()
          rather than huge_pte_uffd_wp().
      
        - When copying over a pte, we should drop uffd-wp bit when
          !EVENT_FORK (aka, when !userfaultfd_wp(dst_vma)).
      
        - When doing early CoW for private hugetlb (e.g. when the parent page was
          pinned), uffd-wp bit should be properly carried over if necessary.
      
      No bug reported probably because most people do not even care about these
      corner cases, but they are still bugs and can be exposed by the recent unit
      tests introduced, so fix all of them in one shot.
      
      Link: https://lkml.kernel.org/r/20230417195317.898696-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20230417195317.898696-2-peterx@redhat.com
      Fixes: bc70fbf2 ("mm/hugetlb: handle uffd-wp during fork()")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mika Penttilä <mpenttil@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: fix lockdep report invalid wait context · be41d814
      Zqiang authored
      A kernel built with the following options reports an invalid wait
      context while booting:
      
      CONFIG_SLUB=y
      CONFIG_DEBUG_LOCKDEP=y
      CONFIG_PROVE_LOCKING=y
      CONFIG_PROVE_RAW_LOCK_NESTING=y
      
      [    0.523115] [ BUG: Invalid wait context ]
      [    0.523315] 6.3.0-rc1-yocto-standard+ #739 Not tainted
      [    0.523649] -----------------------------
      [    0.523663] swapper/0/0 is trying to lock:
      [    0.523663] ffff888035611360 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x2e/0x1e0
      [    0.523663] other info that might help us debug this:
      [    0.523663] context-{2:2}
      [    0.523663] no locks held by swapper/0/0.
      [    0.523663] stack backtrace:
      [    0.523663] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc1-yocto-standard+ #739
      [    0.523663] Call Trace:
      [    0.523663]  <IRQ>
      [    0.523663]  dump_stack_lvl+0x64/0xb0
      [    0.523663]  dump_stack+0x10/0x20
      [    0.523663]  __lock_acquire+0x6c4/0x3c10
      [    0.523663]  lock_acquire+0x188/0x460
      [    0.523663]  put_cpu_partial+0x5a/0x1e0
      [    0.523663]  __slab_free+0x39a/0x520
      [    0.523663]  ___cache_free+0xa9/0xc0
      [    0.523663]  qlist_free_all+0x7a/0x160
      [    0.523663]  per_cpu_remove_cache+0x5c/0x70
      [    0.523663]  __flush_smp_call_function_queue+0xfc/0x330
      [    0.523663]  generic_smp_call_function_single_interrupt+0x13/0x20
      [    0.523663]  __sysvec_call_function+0x86/0x2e0
      [    0.523663]  sysvec_call_function+0x73/0x90
      [    0.523663]  </IRQ>
      [    0.523663]  <TASK>
      [    0.523663]  asm_sysvec_call_function+0x1b/0x20
      [    0.523663] RIP: 0010:default_idle+0x13/0x20
      [    0.523663] RSP: 0000:ffffffff83e07dc0 EFLAGS: 00000246
      [    0.523663] RAX: 0000000000000000 RBX: ffffffff83e1e200 RCX: ffffffff82a83293
      [    0.523663] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8119a6b1
      [    0.523663] RBP: ffffffff83e07dc8 R08: 0000000000000001 R09: ffffed1006ac0d66
      [    0.523663] R10: ffff888035606b2b R11: ffffed1006ac0d65 R12: 0000000000000000
      [    0.523663] R13: ffffffff83e1e200 R14: ffffffff84a7d980 R15: 0000000000000000
      [    0.523663]  default_idle_call+0x6c/0xa0
      [    0.523663]  do_idle+0x2e1/0x330
      [    0.523663]  cpu_startup_entry+0x20/0x30
      [    0.523663]  rest_init+0x152/0x240
      [    0.523663]  arch_call_rest_init+0x13/0x40
      [    0.523663]  start_kernel+0x331/0x470
      [    0.523663]  x86_64_start_reservations+0x18/0x40
      [    0.523663]  x86_64_start_kernel+0xbb/0x120
      [    0.523663]  secondary_startup_64_no_verify+0xe0/0xeb
      [    0.523663]  </TASK>
      
      put_cpu_partial() invokes local_lock_irqsave(), and here it runs in IPI
      context.  With CONFIG_PROVE_RAW_LOCK_NESTING=y, LD_WAIT_CONFIG is not
      equal to LD_WAIT_SPIN, so acquiring the local_lock in IPI context
      triggers the call trace above.
      
      This commit therefore moves qlist_free_all() from hard-irq context to task
      context.
      
      Link: https://lkml.kernel.org/r/20230327120019.1027640-1-qiang1.zhang@intel.com
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be41d814
  2. 18 Apr, 2023 21 commits
    • Longlong Xia's avatar
      mm: ksm: support hwpoison for ksm page · 4248d008
      Longlong Xia authored
      hwpoison_user_mappings() is updated to support ksm pages, and
      collect_procs_ksm() is added to collect processes when the error hits a
      ksm page.  The difference from collect_procs_anon() is that it also needs
      to traverse the rmap-item list on the stable node of the ksm page.  At the
      same time, add_to_kill_ksm() is added to handle ksm pages, and
      task_in_to_kill_list() is added to avoid adding the same tsk to the
      to_kill list more than once.  This is needed because, when scanning the
      list, if the pages that make up the ksm page all come from the same
      process, that process may be added repeatedly.
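
      The duplicate check described above can be sketched in userspace C.  This
      is only an illustration under simplifying assumptions: struct task and the
      singly linked struct to_kill are stand-ins for the kernel's task_struct
      and list_head based list, and add_to_kill_once() is a hypothetical helper
      name, not a function from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins; the kernel uses struct task_struct and list_head. */
struct task {
    int pid;
};

struct to_kill {
    struct to_kill *next;
    struct task *tsk;
};

/* Return true if tsk is already queued on the to_kill list. */
static bool task_in_to_kill_list(const struct to_kill *head,
                                 const struct task *tsk)
{
    for (const struct to_kill *tk = head; tk; tk = tk->next)
        if (tk->tsk == tsk)
            return true;
    return false;
}

/*
 * Hypothetical helper: queue tsk unless it is already present.
 * Returns 1 if added, 0 if skipped as a duplicate, -1 if allocation failed.
 */
static int add_to_kill_once(struct to_kill **head, struct task *tsk)
{
    struct to_kill *tk;

    if (task_in_to_kill_list(*head, tsk))
        return 0;

    tk = malloc(sizeof(*tk));
    if (!tk)
        return -1;

    tk->tsk = tsk;
    tk->next = *head;
    *head = tk;
    return 1;
}
```

      With this shape, a process that maps the same ksm page through several
      rmap items is queued once, so it would receive a single SIGBUS.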
      
      Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.com
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4248d008
    • Longlong Xia's avatar
      mm: memory-failure: refactor add_to_kill() · 4f775086
      Longlong Xia authored
      Patch series "mm: ksm: support hwpoison for ksm page", v2.
      
      Currently, ksm does not support hwpoison.  As ksm is being used more
      widely for deduplication at the system level, container level, and process
      level, supporting hwpoison for ksm has become increasingly important. 
      However, ksm pages were not processed by hwpoison in 2009 [1].
      
      The main method of implementation:
      
      1. Refactor add_to_kill() and add new add_to_kill_*() to better
         accommodate the handling of different types of pages.
      
      2. Add collect_procs_ksm() to collect processes when the error hits a
         ksm page.
      
      3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to
         the to_kill list.  
      
      4. Try_to_unmap ksm page (already supported).
      
      5. Handle related processes such as sending SIGBUS.
      
      Tested with poisoning to ksm page from
      1) different process
      2) one process
      
      and with/without memory_failure_early_kill set, the processes are killed
      as expected with the patchset.  
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=01e00f88
      
      
      This patch (of 2):
      
      page_address_in_vma() is used to find the user virtual address of a page
      in add_to_kill(), but it does not support ksm because page->index is
      unusable for ksm pages.  Add a ksm_addr parameter to add_to_kill() so
      that the caller can pass the address, and rename the function to
      __add_to_kill().  Also add add_to_kill_anon_file() to handle anonymous
      pages and file pages, and add_to_kill_fsdax() to handle fsdax pages.
      
      Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei.com
      Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei.com
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f775086
    • Jeff Xu's avatar
      selftests/memfd: fix test_sysctl · 3cc0c373
      Jeff Xu authored
      sysctl memfd_noexec is pid-namespaced, non-reservable, and inherited by
      the child process.
      
      Move the inheritance test from init ns to child ns, so init ns can keep
      the default value.
      
      Link: https://lkml.kernel.org/r/20230414022801.2545257-1-jeffxu@google.com
      Signed-off-by: Jeff Xu <jeffxu@google.com>
      Reported-by: kernel test robot <yujie.liu@intel.com>
      Link: https://lore.kernel.org/oe-lkp/202303312259.441e35db-yujie.liu@intel.com
      Tested-by: Yujie Liu <yujie.liu@intel.com>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3cc0c373
    • Chaitanya S Prakash's avatar
      selftests/mm: run hugetlb testcases of va switch · c025da0f
      Chaitanya S Prakash authored
      The va_high_addr_switch selftest is used to test mmap across 128TB
      boundary.  It divides the selftest cases into two main categories on the
      basis of size.  One set is used to create mappings that are multiples of
      PAGE_SIZE while the other creates mappings that are multiples of
      HUGETLB_SIZE.
      
      To run the hugetlb testcases, "--run-hugetlb" must be appended to the
      binary invocation, but the file used to run the test only invokes the
      binary, thereby completely skipping the hugetlb testcases.  Hence, the
      required statement has been added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-6-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c025da0f
    • Chaitanya S Prakash's avatar
      selftests/mm: configure nr_hugepages for arm64 · 2f489e2e
      Chaitanya S Prakash authored
      Arm64 has a default hugepage size of 512MB when CONFIG_ARM64_64K_PAGES=y
      is enabled.  While testing on arm64 platforms having up to 4PB of virtual
      address space, a minimum of 6 hugepages were required for all test cases
      to pass.  Support for this requirement has been added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-5-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2f489e2e
    • Chaitanya S Prakash's avatar
      selftests/mm: add platform independent in code comments · c2af2a41
      Chaitanya S Prakash authored
      The in-code comments for the selftest were written on the basis of the
      128TB switch, an architecture feature specific to PowerPC and x86
      platforms.  Keeping in mind the support added for arm64 platforms, which
      implement a 256TB switch, a more generic explanation has been provided.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-4-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2af2a41
    • Chaitanya S Prakash's avatar
      selftests/mm: rename va_128TBswitch to va_high_addr_switch · bbe16872
      Chaitanya S Prakash authored
      As the initial selftest only took the PowerPC and x86 architectures into
      consideration, a platform-independent naming convention is chosen now
      that support for arm64 is added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-3-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bbe16872
    • Chaitanya S Prakash's avatar
      selftests/mm: add support for arm64 platform on va switch · cd834afa
      Chaitanya S Prakash authored
      Patch series "selftests/mm: Implement support for arm64 on va".
      
      The va_128TBswitch selftest is designed and implemented for PowerPC and
      x86 architectures which support a 128TB switch, up to 256TB of virtual
      address space and hugepage sizes of 16MB and 2MB respectively.  Arm64
      platforms on the other hand support a 256TB switch, up to 4PB of virtual
      address space and a default hugepage size of 512MB when 64k pagesize is
      enabled.
      
      These architectural differences require introducing support for arm64
      platforms, after which a more generic naming convention is suggested.  The
      in code comments are amended to provide a more platform independent
      explanation of the working of the code and nr_hugepages are configured as
      required.  Finally, the file running the testcase is modified in order to
      prevent skipping of hugetlb testcases of va_high_addr_switch.
      
      
      This patch (of 5):
      
      Arm64 platforms have the ability to support 64KB pagesize, 512MB default
      hugepage size and up to 4PB of virtual address space.  The address switch
      occurs at 256TB as opposed to 128TB.  Hence, the necessary support has
      been added.
      
      Link: https://lkml.kernel.org/r/20230323105243.2807166-1-chaitanyas.prakash@arm.com
      Link: https://lkml.kernel.org/r/20230323105243.2807166-2-chaitanyas.prakash@arm.com
      Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cd834afa
    • Luca Vizzarro's avatar
      memfd: pass argument of memfd_fcntl as int · f7b8f70b
      Luca Vizzarro authored
      The interface for fcntl expects the argument passed for the command
      F_ADD_SEALS to be of type int.  The current code wrongly treats it as a
      long.  In order to avoid access to undefined bits, we should explicitly
      cast the argument to int.
      
      This commit changes the signature of all the related and helper functions
      so that they treat the argument as int instead of long.
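
      The narrowing can be illustrated with a small userspace sketch.  It
      assumes a 64-bit register that carries a 32-bit int argument whose upper
      bits are not guaranteed to be zero; seals_from_arg() is a hypothetical
      name used only for illustration and does not appear in the patch.

```c
#include <stdint.h>

/*
 * Illustration only: arg_reg stands in for a 64-bit register that carries a
 * 32-bit int argument.  Only the low 32 bits are meaningful; the upper bits
 * may hold anything.
 */
static int seals_from_arg(uint64_t arg_reg)
{
    /*
     * Narrow explicitly so the upper 32 bits never reach the seal logic.
     * Seal flags are small positive values, so the result fits in an int.
     */
    return (int)(uint32_t)arg_reg;
}
```

      Treating the value as long instead would let the undefined upper bits
      take part in later comparisons and flag checks.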
      
      Link: https://lkml.kernel.org/r/20230414152459.816046-5-Luca.Vizzarro@arm.com
      Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
      Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
      Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: David Laight <David.Laight@ACULAB.com>
      Cc: Mark Rutland <Mark.Rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f7b8f70b
    • Kalesh Singh's avatar
      mm: Multi-gen LRU: remove wait_event_killable() · 7f63cf2d
      Kalesh Singh authored
      Android 14 and later default to MGLRU [1] and field telemetry showed
      occasional long tail latency (>100ms) in the reclaim path.
      
      Tracing revealed priority inversion in the reclaim path.  In
      try_to_inc_max_seq(), when high priority tasks were blocked on
      wait_event_killable(), the preemption of the low priority task to call
      wake_up_all() caused those high priority tasks to wait longer than
      necessary.  In general, this problem is not different from others of its
      kind, e.g., one caused by mutex_lock().  However, it is specific to MGLRU
      because it introduced the new wait queue lruvec->mm_state.wait.
      
      The purpose of this new wait queue is to avoid the thundering herd
      problem.  If many direct reclaimers rush into try_to_inc_max_seq(), only
      one can succeed, i.e., the one to wake up the rest, and the rest who
      failed might cause premature OOM kills if they do not wait.  So far there
      is no evidence supporting this scenario, based on how often the wait has
      been hit.  And this begs the question how useful the wait queue is in
      practice.
      
      Based on Minchan's recommendation, which is in line with his commit
      6d4675e6 ("mm: don't be stuck to rmap lock on reclaim path") and the
      rest of the MGLRU code which also uses trylock when possible, remove the
      wait queue.
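
      The "do not wait, just back off" pattern can be modelled in userspace C.
      This is a loose sketch under stated assumptions, not the MGLRU code:
      struct aging_state, its busy flag and try_advance_seq() are hypothetical
      names, and a C11 atomic_flag stands in for whatever exclusion the kernel
      path actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of per-lruvec aging state. */
struct aging_state {
    atomic_flag busy;        /* set while one caller advances max_seq */
    unsigned long max_seq;
};

/*
 * Try to advance max_seq.  If another caller is already doing so, return
 * false immediately instead of sleeping on a wait queue.
 */
static bool try_advance_seq(struct aging_state *st)
{
    if (atomic_flag_test_and_set(&st->busy))
        return false;        /* someone else is advancing; do not wait */

    st->max_seq++;
    atomic_flag_clear(&st->busy);
    return true;
}
```

      Because losers return at once, a high priority caller is never parked
      behind a preempted low priority task that still has to issue the wakeup.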
      
      [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
      
      Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wei Wang <wvw@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f63cf2d
    • Yang Yang's avatar
      mm: workingset: update description of the source file · ed8f3f99
      Yang Yang authored
      The calculation of workingset size is the core logic of handling refault.
      It has been updated several times[1][2] since workingset.c was created[3],
      but the description was not updated accordingly, and this mismatch may
      confuse readers.  So we update the description to make it consistent
      with the code.
      
      [1] commit 34e58cac ("mm: workingset: let cache workingset challenge anon")
      [2] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      [3] commit a528910e ("mm: thrash detection-based file cache sizing")
      
      Link: https://lkml.kernel.org/r/202304131634494948454@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed8f3f99
    • Pavankumar Kondeti's avatar
      printk: export console trace point for kcsan/kasan/kfence/kmsan · 1f6ab566
      Pavankumar Kondeti authored
      The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules. 
      Since this tracepoint is not exported, these modules iterate over all
      available tracepoints to find the console trace point.  Export the trace
      point so that it can be directly used.
      
      Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.com
      Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f6ab566
    • Yosry Ahmed's avatar
      mm: vmscan: refactor updating current->reclaim_state · c7b23b68
      Yosry Ahmed authored
      During reclaim, we keep track of pages reclaimed from other means than
      LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
      which we stash a pointer to in current task_struct.
      
      However, we keep track of more than just reclaimed slab pages through
      this.  We also use it for clean file pages dropped through pruned inodes,
      and xfs buffer pages freed.  Rename reclaimed_slab to reclaimed, and add a
      helper function that wraps updating it through current, so that future
      changes to this logic are contained within include/linux/swap.h.
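
      A wrapper of this kind can be sketched as follows.  This is a userspace
      model under stated assumptions: current_reclaim_state stands in for
      current->reclaim_state, and account_reclaimed_pages() is a hypothetical
      name, since the message above does not name the helper.

```c
#include <stddef.h>

struct reclaim_state {
    unsigned long reclaimed;   /* pages freed outside LRU-based reclaim */
};

/* Stand-in for current->reclaim_state; NULL when the task is not reclaiming. */
static struct reclaim_state *current_reclaim_state;

/*
 * Hypothetical helper: callers report freed pages here instead of touching
 * the reclaim_state field directly, so the accounting policy lives in one
 * place.
 */
static void account_reclaimed_pages(unsigned long pages)
{
    if (current_reclaim_state)
        current_reclaim_state->reclaimed += pages;
}
```

      Slab, inode and filesystem code would then call the helper without
      needing to know how the counter is stored or named.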
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c7b23b68
    • Yosry Ahmed's avatar
      mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() · ef05e689
      Yosry Ahmed authored
      Move set_task_reclaim_state() near flush_reclaim_state() so that all
      helpers manipulating reclaim_state are in close proximity.
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-3-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef05e689
    • Yosry Ahmed's avatar
      mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim · 583c27a1
      Yosry Ahmed authored
      Patch series "Ignore non-LRU-based reclaim in memcg reclaim", v6.
      
      Upon running some proactive reclaim tests using memory.reclaim, we noticed
      some tests flaking where writing to memory.reclaim would be successful
      even though we did not reclaim the requested amount fully.  Looking
      further into it, I discovered that *sometimes* we overestimate the number
      of reclaimed pages in memcg reclaim.
      
      Reclaimed pages through other means than LRU-based reclaim are tracked
      through reclaim_state in struct scan_control, which is stashed in current
      task_struct.  These pages are added to the number of reclaimed pages
      through LRUs.  For memcg reclaim, these pages generally cannot be linked
      to the memcg under reclaim and can cause an overestimated count of
      reclaimed pages.  This short series tries to address that.
      
      Patch 1 ignores pages reclaimed outside of LRU reclaim in memcg reclaim. 
      The pages are uncharged anyway, so even if we end up under-reporting
      reclaimed pages we will still succeed in making progress during charging.
      
      Patches 2-3 are just refactoring.  Patch 2 moves the
      set_task_reclaim_state() helper next to flush_reclaim_state().  Patch 3
      adds a helper that wraps updating current->reclaim_state, and renames
      reclaim_state->reclaimed_slab to reclaim_state->reclaimed.
      
      
      This patch (of 3):
      
      We keep track of different types of reclaimed pages through
      reclaim_state->reclaimed_slab, and we add them to the reported number of
      reclaimed pages.  For non-memcg reclaim, this makes sense.  For memcg
      reclaim, we have no clue if those pages are charged to the memcg under
      reclaim.
      
      Slab pages are shared by different memcgs, so a freed slab page may have
      only been partially charged to the memcg under reclaim.  The same goes for
      clean file pages from pruned inodes (on highmem systems) or xfs buffer
      pages; there is currently no simple way to link them to the memcg under
      reclaim.
      
      Stop reporting those freed pages as reclaimed pages during memcg reclaim. 
      This should make the return value of writing to memory.reclaim more
      accurate, and may help reduce unnecessary reclaim retries during memcg
      charging.  Writing to memory.reclaim on the root memcg is considered as
      cgroup_reclaim(), but for this case we want to include any freed pages,
      so use the global_reclaim() check instead of !cgroup_reclaim().
      
      Generally, this should make the return value of
      try_to_free_mem_cgroup_pages() more accurate.  In some limited cases (e.g.
      freed a slab page that was mostly charged to the memcg under reclaim),
      the return value of try_to_free_mem_cgroup_pages() can be underestimated,
      but this should be fine.  The freed pages will be uncharged anyway, and we
      can charge the memcg the next time around as we usually do memcg reclaim
      in a retry loop.
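
      The accounting rule can be modelled in a few lines of userspace C.  This
      is a simplified sketch, not the kernel function: the global field is an
      assumed stand-in for the global_reclaim(sc) check, and reclaim_state is
      passed in explicitly rather than read from current.

```c
#include <stdbool.h>
#include <stddef.h>

struct reclaim_state {
    unsigned long reclaimed_slab;   /* pages freed outside LRU-based reclaim */
};

struct scan_control {
    unsigned long nr_reclaimed;     /* pages reported as reclaimed */
    bool global;                    /* stand-in for global_reclaim(sc) */
};

/*
 * Fold non-LRU reclaimed pages into the reported total only for global
 * reclaim.  For memcg reclaim they are ignored, because they cannot be
 * attributed to the memcg under reclaim.
 */
static void flush_reclaim_state(struct scan_control *sc,
                                struct reclaim_state *rs)
{
    if (rs && sc->global) {
        sc->nr_reclaimed += rs->reclaimed_slab;
        rs->reclaimed_slab = 0;
    }
}
```

      In the memcg case the reported count can only be too low, never too high,
      which is the direction the retry loop tolerates.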
      
      Link: https://lkml.kernel.org/r/20230413104034.1086717-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230413104034.1086717-2-yosryahmed@google.com
      Fixes: f2fe7b09 ("mm: memcg/slab: charge individual slab objects instead of pages")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      583c27a1
    • Alexander Potapenko's avatar
      mm: apply __must_check to vmap_pages_range_noflush() · d905ae2b
      Alexander Potapenko authored
      To prevent errors when vmap_pages_range_noflush() or
      __vmap_pages_range_noflush() silently fail (see the link below for an
      example), annotate them with __must_check so that the callers do not
      unconditionally assume the mapping succeeded.
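
      The effect of the annotation can be shown with a userspace sketch.  The
      must_check macro below is defined locally on top of the GCC/Clang
      warn_unused_result attribute; map_pages_checked(), setup_mapping() and
      MAX_PAGES are hypothetical names for illustration, not kernel functions.

```c
#include <errno.h>

/* Local equivalent of a must-check annotation (GCC and Clang). */
#define must_check __attribute__((warn_unused_result))

#define MAX_PAGES 16   /* hypothetical limit for this sketch */

/* Pretend mapping routine: fails when asked for too many pages. */
static must_check int map_pages_checked(unsigned int nr_pages)
{
    return nr_pages > MAX_PAGES ? -ENOMEM : 0;
}

/* A caller that handles the error instead of assuming success. */
static int setup_mapping(unsigned int nr_pages)
{
    int err = map_pages_checked(nr_pages);

    if (err)
        return err;    /* propagate the failure to our own caller */
    return 0;
}
```

      Calling map_pages_checked() as a bare statement would draw a compiler
      warning, which is how a silently ignored failure gets caught at build
      time.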
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-4-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d905ae2b
    • Alexander Potapenko's avatar
      mm: kmsan: apply __must_check to non-void functions · bb1508c2
      Alexander Potapenko authored
      Non-void KMSAN hooks may return error codes that indicate that KMSAN
      failed to reflect the changed memory state in the metadata (e.g.  it could
      not create the necessary memory mappings).  In such cases the callers
      should handle the errors to prevent the tool from using the inconsistent
      metadata in the future.
      
      We mark non-void hooks with __must_check so that error handling is not
      skipped.
      
      Link: https://lkml.kernel.org/r/20230413131223.4135168-3-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dipanjan Das <mail.dipanjan.das@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bb1508c2
    • Liu Shixin's avatar
      mm: hwpoison: support recovery from HugePage copy-on-write faults · 1cb9dc4b
      Liu Shixin authored
      Copy-on-write of hugetlb user pages with uncorrectable errors will result
      in a kernel crash.  This is because the copy is performed in kernel mode
      and in general we can not handle accessing memory with such errors while
      in kernel mode.  Commit a873dfe1 ("mm, hwpoison: try to recover from
      copy-on write faults") introduced the routine copy_mc_user_highpage() to
      gracefully handle copying of user pages with uncorrectable errors. 
      However, the separate hugetlb copy-on-write code paths were not modified
      as part of commit a873dfe1.
      
      Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so
      that they can also gracefully handle uncorrectable errors in user pages. 
      This involves changing the hugetlb specific routine
      copy_user_large_folio() from type void to int so that it can return an
      error.  Modify the hugetlb userfaultfd code in the same way so that it can
      return -EHWPOISON if it encounters an uncorrectable error.
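
      The void-to-int change can be sketched in userspace C.  This is a model
      under stated assumptions: struct subpage with a poisoned flag stands in
      for a page with an uncorrectable error, copy_subpage_mc() and
      copy_large_page() are hypothetical names, and the EHWPOISON fallback
      value is only used where the libc headers do not define it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifndef EHWPOISON
#define EHWPOISON 133   /* fallback for this sketch; normally from errno.h */
#endif

#define SUBPAGE_SIZE 64 /* tiny stand-in for PAGE_SIZE */

struct subpage {
    bool poisoned;                      /* models an uncorrectable error */
    unsigned char data[SUBPAGE_SIZE];
};

/* Copy one subpage; returns nonzero if the source cannot be read. */
static int copy_subpage_mc(struct subpage *dst, const struct subpage *src)
{
    if (src->poisoned)
        return -1;
    memcpy(dst->data, src->data, SUBPAGE_SIZE);
    return 0;
}

/*
 * Copy a large page made of nr subpages.  Returning int rather than void
 * lets the caller stop and report the error instead of crashing.
 */
static int copy_large_page(struct subpage *dst, const struct subpage *src,
                           size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (copy_subpage_mc(&dst[i], &src[i]))
            return -EHWPOISON;
    return 0;
}
```

      A fault handler built on this shape can release the partially filled
      destination and fail the fault for the affected process only.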
      
      Link: https://lkml.kernel.org/r/20230413131349.2524210-1-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1cb9dc4b
    • Yosry Ahmed's avatar
      memcg: page_cgroup_ino() get memcg from the page's folio · ec342603
      Yosry Ahmed authored
      In a kernel with added WARN_ON_ONCE(PageTail) in page_memcg_check(), we
      observed a warning from page_cgroup_ino() when reading /proc/kpagecgroup. 
      This warning was added to catch fragile reads of a page memcg.  Make
      page_cgroup_ino() get memcg from the page's folio using
      folio_memcg_check(): that gives it the correct memcg for each page of a
      folio, so is the right fix.
      
      Note that page_folio() is racy: the page's folio can change from under us,
      but the entire function is racy and documented as such.
      
      I dithered between the right fix and the safer "fix": it's unlikely but
      conceivable that some userspace has learnt that /proc/kpagecgroup gives no
      memcg on tail pages, and compensates for that in some (racy) way: so
      continuing to give no memcg on tails, without warning, might be safer.
      
      But hwpoison_filter_task(), the only other user of page_cgroup_ino(),
      persuaded me.  It looks as if it currently leaves out tail pages of the
      selected memcg, by mistake: whereas hwpoison_inject() uses compound_head()
      and expects the tails to be included.  So hwpoison testing coverage has
      probably been restricted by the wrong output from page_cgroup_ino() (if
      that memcg filter is used at all): in the short term, it might be safer
      not to enable wider coverage there, but long term we would regret that.
      
      This is based on a patch originally written by Hugh Dickins and retains
      most of the original commit log [1].
      
      The patch was changed to use folio_memcg_check(page_folio(page)) instead
      of page_memcg_check(compound_head(page)) based on discussions with Matthew
      Wilcox; where he stated that callers of page_memcg_check() should stop
      using it due to the ambiguity around tail pages -- instead they should use
      folio_memcg_check() and handle tail pages themselves.
      
      Link: https://lkml.kernel.org/r/20230412003451.4018887-1-yosryahmed@google.com
      Link: https://lore.kernel.org/linux-mm/20230313083452.1319968-1-yosryahmed@google.com/ [1]
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb_vmemmap: rename ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP · 0b376f1e
      Aneesh Kumar K.V authored
      We now use the ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to
      indicate both devdax and hugetlb vmemmap optimization support.  Hence
      rename it to the generic ARCH_WANT_OPTIMIZE_VMEMMAP.
      
      Link: https://lkml.kernel.org/r/20230412050025.84346-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Tarun Sahu <tsahu@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmemmap/devdax: fix kernel crash when probing devdax devices · 87a7ae75
      Aneesh Kumar K.V authored
      Commit 4917f55b ("mm/sparse-vmemmap: improve memory savings for
      compound devmaps") added support for using optimized vmemmap for devdax
      devices.  But how vmemmap mappings are created is architecture specific.
      For example, powerpc with hash translation doesn't have vmemmap mappings
      in the init_mm page table; instead they are bolted table entries in the
      hardware page table.
      
      vmemmap_populate_compound_pages(), used by the vmemmap optimization code,
      is not aware of these architecture-specific mappings.  Hence allow
      architectures to opt in to this feature.  I selected the architectures
      that support the HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting
      this feature.
      
      This patch fixes the below crash on ppc64.
      
      BUG: Unable to handle kernel data access on write at 0xc00c000100400038
      Faulting instruction address: 0xc000000001269d90
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in:
      CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
      Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
      NIP:  c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
      REGS: c000000003632c30 TRAP: 0300   Not tainted  (6.3.0-rc5-150500.34-default+)
      MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24842228  XER: 00000000
      CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
      ....
      NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
      LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
      Call Trace:
      [c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
      [c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
      [c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
      [c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
      [c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
      [c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
      [c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
      [c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
      [c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
      [c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
      [c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
      [c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
      [c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
      [c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
      [c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
      [c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
      [c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
      [c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
      [c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
      [c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
      [c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
      [c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
      [c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
      [c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
      [c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
      [c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
      [c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
      [c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
      [c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
      [c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
      [c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
      [c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
      
      Link: https://lkml.kernel.org/r/20230411142214.64464-1-aneesh.kumar@linux.ibm.com
      Fixes: 4917f55b ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Tarun Sahu <tsahu@linux.ibm.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>