- 21 Apr, 2023 25 commits
-
-
Linus Torvalds authored
Instead of having callers care about the mmap_min_addr logic for the lowest valid mapping address (and some of them getting it wrong), just move the logic into vm_unmapped_area() itself. One less thing for various architecture cases (and generic helpers) to worry about. We should really try to make much more of this be common code, but baby steps.. Without this, vm_unmapped_area() could return an address below mmap_min_addr (because some caller forgot about that). That then causes the mmap machinery to think it has found a workable address, but then later security_mmap_addr(addr) is unhappy about it and the mmap() returns with a nonsensical error (EPERM). The proper action is to either return ENOMEM (if the virtual address space is exhausted), or try to find another address (ie do a bottom-up search for free addresses after the top-down one failed). See commit 2afc745f ("mm: ensure get_unmapped_area() returns higher address than mmap_min_addr"), which fixed this for one call site (the generic arch_get_unmapped_area_topdown() fallback) but left other cases alone. Link: https://lkml.kernel.org/r/20230418214009.1142926-1-Liam.Howlett@oracle.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
Some architectures can have their hugetlb pages down at the lowest PTE level: their huge_pte_alloc() using pte_alloc_map(), but without any following pte_unmap(). Since none of these arches uses CONFIG_HIGHPTE, this is not seen as a problem at present; but would become a problem if forthcoming changes were to add an rcu_read_lock() into pte_offset_map(), with the rcu_read_unlock() expected in pte_unmap(). Similarly in their huge_pte_offset(): pte_offset_kernel() is good enough for that, but it's probably less confusing if we define pte_offset_huge() along with pte_alloc_huge(). Only define them without CONFIG_HIGHPTE: so there would be a build error to signal if ever more work is needed. For ease of development, define these now for 6.4-rc1, ahead of any use: then architectures can integrate patches using them, independent from mm. Link: https://lkml.kernel.org/r/ae9e7d98-8a3a-cfd9-4762-bcddffdf96cf@google.comSigned-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peng Zhang authored
In the case of reverse allocation, mas->index and mas->last do not point to the correct allocation range, which will cause users to get incorrect allocation results, so fix it. If the user does not use it in a specific way, this bug will not be triggered. This is a bug, but only VMA uses it now, the way VMA is used now will not trigger it. There is a possibility that a user will trigger it in the future. Also re-check whether the size is still satisfied after the lower bound was increased, which is a corner case and is incorrect in previous versions. Link: https://lkml.kernel.org/r/20230419093625.99201-1-zhangpeng.00@bytedance.com Fixes: 54a611b6 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Suren Baghdasaryan authored
If the page fault handler requests a retry, we will count the fault multiple times. This is a relatively harmless problem as the retry paths are not often requested, and the only user-visible problem is that the fault counter will be slightly higher than it should be. Nevertheless, userspace only took one fault, and should not see the fact that the kernel had to retry the fault multiple times. Move page fault accounting into mm_account_fault() and skip incomplete faults which will be accounted upon completion. Link: https://lkml.kernel.org/r/20230419175836.3857458-1-surenb@google.com Fixes: d065bd81 ("mm: retry page fault when blocking on disk transfer") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Peter Xu <peterx@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sergey Senozhatsky authored
zsmalloc pool can be compacted concurrently by many contexts, e.g. cc1 handle_mm_fault() do_anonymous_page() __alloc_pages_slowpath() try_to_free_pages() do_try_to_free_pages( lru_gen_shrink_node() shrink_slab() do_shrink_slab() zs_shrinker_scan() zs_compact() Pool compaction is currently (basically) single-threaded as it is performed under pool->lock. Having multiple compaction threads results in unnecessary contention, as each thread competes for pool->lock. This, in turn, affects all zsmalloc operations such as zs_malloc(), zs_map_object(), zs_free(), etc. Introduce the pool->compaction_in_progress atomic variable, which ensures that only one compaction context can run at a time. This reduces overall pool->lock contention in (corner) cases when many contexts attempt to shrink zspool simultaneously. Link: https://lkml.kernel.org/r/20230418074639.1903197-1-senozhatsky@chromium.org Fixes: c0547d0b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Stefan Roesch authored
This adds three new tests to the selftests for KSM. These tests use the new prctl API's to enable and disable KSM. 1) add new prctl flags to prctl header file in tools dir This adds the new prctl flags to the include file prct.h in the tools directory. This makes sure they are available for testing. 2) add KSM prctl merge test to ksm_tests This adds the -t option to the ksm_tests program. The -t flag allows to specify if it should use madvise or prctl ksm merging. 3) add two functions for debugging merge outcome for ksm_tests This adds two functions to report the metrics in /proc/self/ksm_stat and /sys/kernel/debug/mm/ksm. The debug output is enabled with the -d option. 4) add KSM prctl test to ksm_functional_tests This adds a test to the ksm_functional_test that verifies that the prctl system call to enable / disable KSM works. 5) add KSM fork test to ksm_functional_test Add fork test to verify that the MMF_VM_MERGE_ANY flag is inherited by the child process. Link: https://lkml.kernel.org/r/20230418051342.1919757-4-shr@devkernel.ioSigned-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Stefan Roesch authored
This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.ioSigned-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Stefan Roesch authored
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.ioSigned-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
John Keeping authored
The permissions for the files here are swapped as "count" is read-only and "scan" is write-only. While this doesn't really matter as these permissions don't stop the files being opened for reading/writing as appropriate, they are shown by "ls -l" and are confusing. Link: https://lkml.kernel.org/r/20230418101906.3131303-1-john@metanate.comSigned-off-by: John Keeping <john@metanate.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
Staring at the comment "Recheck VMA as permissions can change since migration started" in remove_migration_pte() can result in confusion, because if the source PTE/PMD indicates write permissions, then there should be no need to check VMA write permissions when restoring migration entries or PTE-mapping a PMD. Commit d3cb8bf6 ("mm: migrate: Close race between migration completion and mprotect") introduced the maybe_mkwrite() handling in remove_migration_pte() in 2014, stating that a race between mprotect() and migration finishing would be possible, and that we could end up with a writable PTE that should be readable. However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot and then walks the page tables to (a) set all present writable PTEs to read-only and (b) convert all writable migration entries to readable migration entries. While walking the page tables and modifying the entries, migration code has to grab the PT locks to synchronize against concurrent page table modifications. Assuming migration would find a writable migration entry (while holding the PT lock) and replace it with a writable present PTE, surely mprotect() code didn't stumble over the writable migration entry yet (converting it into a readable migration entry) and would instead wait for the PT lock to convert the now present writable PTE into a read-only PTE. As mprotect() didn't finish yet, the behavior is just like migration didn't happen: a writable PTE will be converted to a read-only PTE. So it's fine to rely on the writability information in the source PTE/PMD and not recheck against the VMA as long as we're holding the PT lock to synchronize with anyone who concurrently wants to downgrade write permissions (like mprotect()) by first adjusting vma->vm_flags / vma->vm_page_prot to then walk over the page tables to adjust the page table entries. Running test cases that should reveal such races -- mprotect(PROT_READ) racing with page migration or THP splitting -- for multiple hours did not reveal an issue with this cleanup. Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.comSigned-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Huang Ying authored
In commit fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt"), if the THP splitting fails due to page reference count, we will retry to improve migration successful rate. But the failed splitting is counted as migration failure and migration retry, which will cause duplicated failure counting. So, in this patch, this is fixed via undoing the failure counting if we decide to retry. The patch is tested via failure injection. Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com Fixes: fd4a7ac3 ("mm: migrate: try again if THP split is failed due to page refcnt") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
ZhangPeng authored
We can use range_in_vma() to check if dst_start, dst_start + len are within the dst_vma range. Minor readability improvement. Link: https://lkml.kernel.org/r/20230417003919.930515-1-zhangpeng362@huawei.comSigned-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yajun Deng authored
__show_mem() needs to iterate over all zones that have memory, we can simplify the code by using for_each_populated_zone(). Link: https://lkml.kernel.org/r/20230417035226.4013584-1-yajun.deng@linux.devSigned-off-by: Yajun Deng <yajun.deng@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Both of them change the arg from page_list to folio_list when convert them to use a folio, but not the declaration, let's correct it, also move the reclaim_pages() from swap.h to internal.h as it only used in mm. Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviwed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Pankaj Raghav authored
fs/buffer do not support large folios as there are many assumptions on the folio size to be the host page size. This conversion is one step towards removing that assumption. Also this conversion will reduce calls to compound_head() if folio_create_buffers() calls folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/20230417123618.22094-5-p.raghav@samsung.comSigned-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Pankaj Raghav authored
Folio version of create_empty_buffers(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. It removes several calls to compound_head() as it works directly on folio compared to create_empty_buffers(). Hence, create_empty_buffers() has been modified to call folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.comSigned-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Pankaj Raghav authored
Folio version of alloc_page_buffers() helper. This is required to convert create_page_buffers() to folio_create_buffers() later in the series. alloc_page_buffers() has been modified to call folio_alloc_buffers() which adds one call to compound_head() but folio_alloc_buffers() removes one call to compound_head() compared to the existing alloc_page_buffers() implementation. Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.comSigned-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Pankaj Raghav authored
Patch series "convert create_page_buffers to folio_create_buffers". One of the first kernel panic we hit when we try to increase the block size > 4k is inside create_page_buffers()[1]. Even though buffer.c function do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint. This patch (of 4): The folio version of set_bh_page(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.comSigned-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Add a test suite (with 10 more sub-tests) to cover RO pinning against fork() over uffd-wp. It covers both: (1) Early CoW test in fork() when page pinned, (2) page unshare due to RO longterm pin. They are: Testing wp-fork-pin on anon... done Testing wp-fork-pin on shmem... done Testing wp-fork-pin on shmem-private... done Testing wp-fork-pin on hugetlb... done Testing wp-fork-pin on hugetlb-private... done Testing wp-fork-pin-with-event on anon... done Testing wp-fork-pin-with-event on shmem... done Testing wp-fork-pin-with-event on shmem-private... done Testing wp-fork-pin-with-event on hugetlb... done Testing wp-fork-pin-with-event on hugetlb-private... done CONFIG_GUP_TEST needed or they'll be skipped. Testing wp-fork-pin on anon... skipped [reason: Possibly CONFIG_GUP_TEST missing or unprivileged] Note that the major test goal is on private memory, but no hurt to also run all of them over shared because shared memory should work the same. Link: https://lkml.kernel.org/r/20230417195317.898696-7-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
The macro and facility can be reused in other tests too. Make it general. Link: https://lkml.kernel.org/r/20230417195317.898696-6-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Extend it to all types of mem, meanwhile add one parallel test when EVENT_FORK is enabled, where uffd-wp bits should be persisted rather than dropped. Since at it, rename the test to "wp-fork" to better show what it means. Making the new test called "wp-fork-with-event". Before: Testing pagemap on anon... done After: Testing wp-fork on anon... done Testing wp-fork on shmem... done Testing wp-fork on shmem-private... done Testing wp-fork on hugetlb... done Testing wp-fork on hugetlb-private... done Testing wp-fork-with-event on anon... done Testing wp-fork-with-event on shmem... done Testing wp-fork-with-event on shmem-private... done Testing wp-fork-with-event on hugetlb... done Testing wp-fork-with-event on hugetlb-private... done Link: https://lkml.kernel.org/r/20230417195317.898696-5-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Namely: "-f": add a wildcard filter for tests to run "-l": list tests rather than running any "-h": help msg Link: https://lkml.kernel.org/r/20230417195317.898696-4-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
When we try to unshare a pinned page for a private hugetlb, uffd-wp bit can get lost during unsharing. When above condition met, one can lose uffd-wp bit on the privately mapped hugetlb page. It allows the page to be writable even if it should still be wr-protected. I assume it can mean data loss. This should be very rare, only if an unsharing happened on a private hugetlb page with uffd-wp protected (e.g. in a child which shares the same page with parent with UFFD_FEATURE_EVENT_FORK enabled). When I wrote the reproducer (provided in the last patch) I needed to use the newest gup_test cmd introduced by David to trigger it because I don't even know another way to do a proper RO longerm pin. Besides that, it needs a bunch of other conditions all met: (1) hugetlb being mapped privately, (2) userfaultfd registered with WP and EVENT_FORK, (3) the user app fork()s, then, (4) RO longterm pin onto a wr-protected anonymous page. If it's not impossible to hit in production I'd say extremely rare. Link: https://lkml.kernel.org/r/20230417195317.898696-3-peterx@redhat.com Fixes: 166f3ecc ("mm/hugetlb: hook page faults for uffd write protection") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Patch series "mm/hugetlb: More fixes around uffd-wp vs fork() / RO pins", v2. This patch (of 6): There're a bunch of things that were wrong: - Reading uffd-wp bit from a swap entry should use pte_swp_uffd_wp() rather than huge_pte_uffd_wp(). - When copying over a pte, we should drop uffd-wp bit when !EVENT_FORK (aka, when !userfaultfd_wp(dst_vma)). - When doing early CoW for private hugetlb (e.g. when the parent page was pinned), uffd-wp bit should be properly carried over if necessary. No bug reported probably because most people do not even care about these corner cases, but they are still bugs and can be exposed by the recent unit tests introduced, so fix all of them in one shot. Link: https://lkml.kernel.org/r/20230417195317.898696-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230417195317.898696-2-peterx@redhat.com Fixes: bc70fbf2 ("mm/hugetlb: handle uffd-wp during fork()") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Zqiang authored
For kernels built with the following options and booting CONFIG_SLUB=y CONFIG_DEBUG_LOCKDEP=y CONFIG_PROVE_LOCKING=y CONFIG_PROVE_RAW_LOCK_NESTING=y [ 0.523115] [ BUG: Invalid wait context ] [ 0.523315] 6.3.0-rc1-yocto-standard+ #739 Not tainted [ 0.523649] ----------------------------- [ 0.523663] swapper/0/0 is trying to lock: [ 0.523663] ffff888035611360 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x2e/0x1e0 [ 0.523663] other info that might help us debug this: [ 0.523663] context-{2:2} [ 0.523663] no locks held by swapper/0/0. [ 0.523663] stack backtrace: [ 0.523663] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc1-yocto-standard+ #739 [ 0.523663] Call Trace: [ 0.523663] <IRQ> [ 0.523663] dump_stack_lvl+0x64/0xb0 [ 0.523663] dump_stack+0x10/0x20 [ 0.523663] __lock_acquire+0x6c4/0x3c10 [ 0.523663] lock_acquire+0x188/0x460 [ 0.523663] put_cpu_partial+0x5a/0x1e0 [ 0.523663] __slab_free+0x39a/0x520 [ 0.523663] ___cache_free+0xa9/0xc0 [ 0.523663] qlist_free_all+0x7a/0x160 [ 0.523663] per_cpu_remove_cache+0x5c/0x70 [ 0.523663] __flush_smp_call_function_queue+0xfc/0x330 [ 0.523663] generic_smp_call_function_single_interrupt+0x13/0x20 [ 0.523663] __sysvec_call_function+0x86/0x2e0 [ 0.523663] sysvec_call_function+0x73/0x90 [ 0.523663] </IRQ> [ 0.523663] <TASK> [ 0.523663] asm_sysvec_call_function+0x1b/0x20 [ 0.523663] RIP: 0010:default_idle+0x13/0x20 [ 0.523663] RSP: 0000:ffffffff83e07dc0 EFLAGS: 00000246 [ 0.523663] RAX: 0000000000000000 RBX: ffffffff83e1e200 RCX: ffffffff82a83293 [ 0.523663] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8119a6b1 [ 0.523663] RBP: ffffffff83e07dc8 R08: 0000000000000001 R09: ffffed1006ac0d66 [ 0.523663] R10: ffff888035606b2b R11: ffffed1006ac0d65 R12: 0000000000000000 [ 0.523663] R13: ffffffff83e1e200 R14: ffffffff84a7d980 R15: 0000000000000000 [ 0.523663] default_idle_call+0x6c/0xa0 [ 0.523663] do_idle+0x2e1/0x330 [ 0.523663] cpu_startup_entry+0x20/0x30 [ 0.523663] rest_init+0x152/0x240 [ 0.523663] arch_call_rest_init+0x13/0x40 [ 0.523663] start_kernel+0x331/0x470 [ 0.523663] x86_64_start_reservations+0x18/0x40 [ 0.523663] x86_64_start_kernel+0xbb/0x120 [ 0.523663] secondary_startup_64_no_verify+0xe0/0xeb [ 0.523663] </TASK> The local_lock_irqsave() is invoked in put_cpu_partial() and happens in IPI context, due to the CONFIG_PROVE_RAW_LOCK_NESTING=y (the LD_WAIT_CONFIG not equal to LD_WAIT_SPIN), so acquire local_lock in IPI context will trigger above calltrace. This commit therefore moves qlist_free_all() from hard-irq context to task context. Link: https://lkml.kernel.org/r/20230327120019.1027640-1-qiang1.zhang@intel.comSigned-off-by: Zqiang <qiang1.zhang@intel.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 18 Apr, 2023 15 commits
-
-
Longlong Xia authored
hwpoison_user_mappings() is updated to support ksm pages, and add collect_procs_ksm() to collect processes when the error hit an ksm page. The difference from collect_procs_anon() is that it also needs to traverse the rmap-item list on the stable node of the ksm page. At the same time, add_to_kill_ksm() is added to handle ksm pages. And task_in_to_kill_list() is added to avoid duplicate addition of tsk to the to_kill list. This is because when scanning the list, if the pages that make up the ksm page all come from the same process, they may be added repeatedly. Link: https://lkml.kernel.org/r/20230414021741.2597273-3-xialonglong1@huawei.comSigned-off-by: Longlong Xia <xialonglong1@huawei.com> Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Longlong Xia authored
Patch series "mm: ksm: support hwpoison for ksm page", v2. Currently, ksm does not support hwpoison. As ksm is being used more widely for deduplication at the system level, container level, and process level, supporting hwpoison for ksm has become increasingly important. However, ksm pages were not processed by hwpoison in 2009 [1]. The main method of implementation: 1. Refactor add_to_kill() and add new add_to_kill_*() to better accommodate the handling of different types of pages. 2. Add collect_procs_ksm() to collect processes when the error hit an ksm page. 3. Add task_in_to_kill_list() to avoid duplicate addition of tsk to the to_kill list. 4. Try_to_unmap ksm page (already supported). 5. Handle related processes such as sending SIGBUS. Tested with poisoning to ksm page from 1) different process 2) one process and with/without memory_failure_early_kill set, the processes are killed as expected with the patchset. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ commit/?h=01e00f88 This patch (of 2): The page_address_in_vma() is used to find the user virtual address of page in add_to_kill(), but it doesn't support ksm due to the ksm page->index unusable, add an ksm_addr as parameter to add_to_kill(), let's the caller to pass it, also rename the function to __add_to_kill(), and adding add_to_kill_anon_file() for handling anonymous pages and file pages, adding add_to_kill_fsdax() for handling fsdax pages. Link: https://lkml.kernel.org/r/20230414021741.2597273-1-xialonglong1@huawei.com Link: https://lkml.kernel.org/r/20230414021741.2597273-2-xialonglong1@huawei.comSigned-off-by: Longlong Xia <xialonglong1@huawei.com> Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jeff Xu authored
sysctl memfd_noexec is pid-namespaced, non-reservable, and inherent to the child process. Move the inherence test from init ns to child ns, so init ns can keep the default value. Link: https://lkml.kernel.org/r/20230414022801.2545257-1-jeffxu@google.comSigned-off-by: Jeff Xu <jeffxu@google.com> Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/oe-lkp/202303312259.441e35db-yujie.liu@intel.comTested-by: Yujie Liu <yujie.liu@intel.com> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chaitanya S Prakash authored
The va_high_addr_switch selftest is used to test mmap across 128TB boundary. It divides the selftest cases into two main categories on the basis of size. One set is used to create mappings that are multiples of PAGE_SIZE while the other creates mappings that are multiples of HUGETLB_SIZE. In order to run the hugetlb testcases the binary must be appended with "--run-hugetlb" but the file that used to run the test only invokes the binary, thereby completely skipping the hugetlb testcases. Hence, the required statement has been added. Link: https://lkml.kernel.org/r/20230323105243.2807166-6-chaitanyas.prakash@arm.comSigned-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chaitanya S Prakash authored
Arm64 has a default hugepage size of 512MB when CONFIG_ARM64_64K_PAGES=y is enabled. While testing on arm64 platforms having up to 4PB of virtual address space, a minimum of 6 hugepages were required for all test cases to pass. Support for this requirement has been added. Link: https://lkml.kernel.org/r/20230323105243.2807166-5-chaitanyas.prakash@arm.comSigned-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chaitanya S Prakash authored
The in code comments for the selftest were made on the basis of 128TB switch, an architecture feature specific to PowerPc and x86 platforms. Keeping in mind the support added for arm64 platforms which implements a 256TB switch, a more generic explanation has been provided. Link: https://lkml.kernel.org/r/20230323105243.2807166-4-chaitanyas.prakash@arm.comSigned-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chaitanya S Prakash authored
As the initial selftest only took into consideration PowperPC and x86 architectures, on adding support for arm64, a platform independent naming convention is chosen. Link: https://lkml.kernel.org/r/20230323105243.2807166-3-chaitanyas.prakash@arm.comSigned-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Chaitanya S Prakash authored
Patch series "selftests/mm: Implement support for arm64 on va". The va_128TBswitch selftest is designed and implemented for PowerPC and x86 architectures which support a 128TB switch, up to 256TB of virtual address space and hugepage sizes of 16MB and 2MB respectively. Arm64 platforms on the other hand support a 256Tb switch, up to 4PB of virtual address space and a default hugepage size of 512MB when 64k pagesize is enabled. These architectural differences require introducing support for arm64 platforms, after which a more generic naming convention is suggested. The in code comments are amended to provide a more platform independent explanation of the working of the code and nr_hugepages are configured as required. Finally, the file running the testcase is modified in order to prevent skipping of hugetlb testcases of va_high_addr_switch. This patch (of 5): Arm64 platforms have the ability to support 64kb pagesize, 512MB default hugepage size and up to 4PB of virtual address space. The address switch occurs at 256TB as opposed to 128TB. Hence, the necessary support has been added. Link: https://lkml.kernel.org/r/20230323105243.2807166-1-chaitanyas.prakash@arm.com Link: https://lkml.kernel.org/r/20230323105243.2807166-2-chaitanyas.prakash@arm.comSigned-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Luca Vizzarro authored
The interface for fcntl expects the argument passed for the command F_ADD_SEALS to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int. This commit changes the signature of all the related and helper functions so that they treat the argument as int instead of long. Link: https://lkml.kernel.org/r/20230414152459.816046-5-Luca.Vizzarro@arm.comSigned-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kalesh Singh authored
Android 14 and later default to MGLRU [1] and field telemetry showed occasional long tail latency (>100ms) in the reclaim path. Tracing revealed priority inversion in the reclaim path. In try_to_inc_max_seq(), when high priority tasks were blocked on wait_event_killable(), the preemption of the low priority task to call wake_up_all() caused those high priority tasks to wait longer than necessary. In general, this problem is not different from others of its kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU because it introduced the new wait queue lruvec->mm_state.wait. The purpose of this new wait queue is to avoid the thundering herd problem. If many direct reclaimers rush into try_to_inc_max_seq(), only one can succeed, i.e., the one to wake up the rest, and the rest who failed might cause premature OOM kills if they do not wait. So far there is no evidence supporting this scenario, based on how often the wait has been hit. And this begs the question how useful the wait queue is in practice. Based on Minchan's recommendation, which is in line with his commit 6d4675e6 ("mm: don't be stuck to rmap lock on reclaim path") and the rest of the MGLRU code which also uses trylock when possible, remove the wait queue. [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Reported-by: Wei Wang <wvw@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Yang authored
The calculation of workingset size is the core logic of handling refault, it had been updated several times[1][2] after workingset.c was created[3]. But the description hadn't been updated accordingly, this mismatch may confuse the readers. So we update the description to make it consistent to the code. [1] commit 34e58cac ("mm: workingset: let cache workingset challenge anon") [2] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU") [3] commit a528910e ("mm: thrash detection-based file cache sizing") Link: https://lkml.kernel.org/r/202304131634494948454@zte.com.cnSigned-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Pavankumar Kondeti authored
The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules. Since this tracepoint is not exported, these modules iterate over all available tracepoints to find the console trace point. Export the trace point so that it can be directly used. Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.comSigned-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Marco Elver <elver@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yosry Ahmed authored
During reclaim, we keep track of pages reclaimed from other means than LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab, which we stash a pointer to in current task_struct. However, we keep track of more than just reclaimed slab pages through this. We also use it for clean file pages dropped through pruned inodes, and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a helper function that wraps updating it through current, so that future changes to this logic are contained within include/linux/swap.h. Link: https://lkml.kernel.org/r/20230413104034.1086717-4-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yosry Ahmed authored
Move set_task_reclaim_state() near flush_reclaim_state() so that all helpers manipulating reclaim_state are in close proximity. Link: https://lkml.kernel.org/r/20230413104034.1086717-3-yosryahmed@google.comSigned-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yosry Ahmed authored
Patch series "Ignore non-LRU-based reclaim in memcg reclaim", v6. Upon running some proactive reclaim tests using memory.reclaim, we noticed some tests flaking where writing to memory.reclaim would be successful even though we did not reclaim the requested amount fully Looking further into it, I discovered that *sometimes* we overestimate the number of reclaimed pages in memcg reclaim. Reclaimed pages through other means than LRU-based reclaim are tracked through reclaim_state in struct scan_control, which is stashed in current task_struct. These pages are added to the number of reclaimed pages through LRUs. For memcg reclaim, these pages generally cannot be linked to the memcg under reclaim and can cause an overestimated count of reclaimed pages. This short series tries to address that. Patch 1 ignores pages reclaimed outside of LRU reclaim in memcg reclaim. The pages are uncharged anyway, so even if we end up under-reporting reclaimed pages we will still succeed in making progress during charging. Patches 2-3 are just refactoring. Patch 2 moves set_reclaim_state() helper next to flush_reclaim_state(). Patch 3 adds a helper that wraps updating current->reclaim_state, and renames reclaim_state->reclaimed_slab to reclaim_state->reclaimed. This patch (of 3): We keep track of different types of reclaimed pages through reclaim_state->reclaimed_slab, and we add them to the reported number of reclaimed pages. For non-memcg reclaim, this makes sense. For memcg reclaim, we have no clue if those pages are charged to the memcg under reclaim. Slab pages are shared by different memcgs, so a freed slab page may have only been partially charged to the memcg under reclaim. The same goes for clean file pages from pruned inodes (on highmem systems) or xfs buffer pages, there is no simple way to currently link them to the memcg under reclaim. Stop reporting those freed pages as reclaimed pages during memcg reclaim. This should make the return value of writing to memory.reclaim, and may help reduce unnecessary reclaim retries during memcg charging. Writing to memory.reclaim on the root memcg is considered as cgroup_reclaim(), but for this case we want to include any freed pages, so use the global_reclaim() check instead of !cgroup_reclaim(). Generally, this should make the return value of try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g. freed a slab page that was mostly charged to the memcg under reclaim), the return value of try_to_free_mem_cgroup_pages() can be underestimated, but this should be fine. The freed pages will be uncharged anyway, and we can charge the memcg the next time around as we usually do memcg reclaim in a retry loop. Link: https://lkml.kernel.org/r/20230413104034.1086717-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230413104034.1086717-2-yosryahmed@google.com Fixes: f2fe7b09 ("mm: memcg/slab: charge individual slab objects instead of pages") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-