- 27 Sep, 2022 16 commits
-
-
Huang Ying authored
Patch series "migrate_pages(): fix several bugs in error path", v3. During review the code of migrate_pages() and build a test program for it. Several bugs in error path are identified and fixed in this series. Most patches are tested via - Apply error-inject.patch in Linux kernel - Compile test-migrate.c (with -lnuma) - Test with test-migrate.sh error-inject.patch, test-migrate.c, and test-migrate.sh are as below. It turns out that error injection is an important tool to fix bugs in error path. This patch (of 8): The return value of move_pages() syscall is incorrect when counting the remaining pages to be migrated. For example, for the following test program, " #define _GNU_SOURCE #include <stdbool.h> #include <stdio.h> #include <string.h> #include <stdlib.h> #include <errno.h> #include <fcntl.h> #include <sys/uio.h> #include <sys/mman.h> #include <sys/types.h> #include <unistd.h> #include <numaif.h> #include <numa.h> #ifndef MADV_FREE #define MADV_FREE 8 /* free pages only if memory pressure */ #endif #define ONE_MB (1024 * 1024) #define MAP_SIZE (16 * ONE_MB) #define THP_SIZE (2 * ONE_MB) #define THP_MASK (THP_SIZE - 1) #define ERR_EXIT_ON(cond, msg) \ do { \ int __cond_in_macro = (cond); \ if (__cond_in_macro) \ error_exit(__cond_in_macro, (msg)); \ } while (0) void error_msg(int ret, int nr, int *status, const char *msg) { int i; fprintf(stderr, "Error: %s, ret : %d, error: %s\n", msg, ret, strerror(errno)); if (!nr) return; fprintf(stderr, "status: "); for (i = 0; i < nr; i++) fprintf(stderr, "%d ", status[i]); fprintf(stderr, "\n"); } void error_exit(int ret, const char *msg) { error_msg(ret, 0, NULL, msg); exit(1); } int page_size; bool do_vmsplice; bool do_thp; static int pipe_fds[2]; void *addr; char *pn; char *pn1; void *pages[2]; int status[2]; void prepare() { int ret; struct iovec iov; if (addr) { munmap(addr, MAP_SIZE); close(pipe_fds[0]); close(pipe_fds[1]); } ret = pipe(pipe_fds); ERR_EXIT_ON(ret, "pipe"); addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); ERR_EXIT_ON(addr == MAP_FAILED, "mmap"); if (do_thp) { ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE); ERR_EXIT_ON(ret, "advise hugepage"); } pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK); pn1 = pn + THP_SIZE; pages[0] = pn; pages[1] = pn1; *pn = 1; if (do_vmsplice) { iov.iov_base = pn; iov.iov_len = page_size; ret = vmsplice(pipe_fds[1], &iov, 1, 0); ERR_EXIT_ON(ret < 0, "vmsplice"); } status[0] = status[1] = 1024; } void test_migrate() { int ret; int nodes[2] = { 1, 1 }; pid_t pid = getpid(); prepare(); ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 1, status, "move 1 page"); prepare(); ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 not mapped"); prepare(); *pn1 = 1; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages"); prepare(); *pn1 = 1; nodes[1] = 0; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 to node 0"); } int main(int argc, char *argv[]) { numa_run_on_node(0); page_size = getpagesize(); test_migrate(); fprintf(stderr, "\nMake page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); fprintf(stderr, "\nTest THP:\n"); do_thp = true; do_vmsplice = false; test_migrate(); fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); return 0; } " The output of the current kernel is, " Error: move 1 
page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 0, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 0, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 1, error: Success status: 1024 1024 " While the expected output is, " Error: move 1 page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 1, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 1, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 2, error: Success status: 1024 1024 " Fix this via correcting the remaining pages counting. With the fix, the output for the test program as above is expected. Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com Fixes: 5984fabb ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Yang authored
Once upon a time, we only supported accounting thrashing of the page cache. Then Joonsoo introduced workingset detection for anonymous pages and we gained the ability to account their thrashing too[1]. So let delayacct account thrashing of both the page cache and anonymous pages; this makes the code more consistent and simpler. [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU") Link: https://lkml.kernel.org/r/20220805033838.1714674-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Introduce a variable swap_migration_ad_supported to cache whether the arch supports swap migration A/D bits. Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally reference the other macro MAX_PHYSMEM_BITS, which is a function call on x86 (constant on all the rest of archs). It's safe to reference it in swapfile_init() because when reaching here we're already during initcalls level 4 so we must have initialized 5-level pgtable for x86_64 (right after early_identify_cpu() finishes). - start_kernel - setup_arch - early_cpu_init - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57) - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5)) - arch_call_rest_init - rest_init - kernel_init - kernel_init_freeable - do_basic_setup - do_initcalls --> calls swapfile_init() (initcall level 4) This should slightly speed up the migration swap entry handlings. Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
We used to have max_swapfile_size() fetching the maximum swapfile size per-arch. As the callers of max_swapfile_size() grow, this patch introduces a variable "swapfile_maximum_size" and caches the value of the old max_swapfile_size(), so that we don't need to calculate the value every time. Caching the value in swapfile_init() is safe because by the time we reach that phase we should have initialized all the relevant information. Here the major arch to take care of is x86, which defines the max swapfile size based on the L1TF mitigation. Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly when reaching swapfile_init(). As a reference, the code path looks like this for x86: - start_kernel - setup_arch - early_cpu_init - early_identify_cpu --> setup X86_BUG_L1TF - parse_early_param - l1tf_cmdline --> set l1tf_mitigation - check_bugs - l1tf_select_mitigation --> set l1tf_mitigation - arch_call_rest_init - rest_init - kernel_init - kernel_init_freeable - do_basic_setup - do_initcalls --> calls swapfile_init() (initcall level 4) The swapfile size only depends on the swp pte format on non-x86 archs, so caching it is safe there too. While at it, rename max_swapfile_size() to arch_max_swapfile_size() because an arch can define its own function, so it's more straightforward to have "arch_" as its prefix. In the meantime, export swapfile_maximum_size to replace the old usages of max_swapfile_size(). [peterx@redhat.com: declare arch_max_swapfile_size() in swapfile.h] Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
When page migration happens, we always ignore the young/dirty bit settings in the old pgtable, mark the page as old in the new page table using either pte_mkold() or pmd_mkold(), and keep the pte clean. That's fine functionally, but it's not friendly to page reclaim because the page being moved can be actively accessed during the procedure. Not to mention that hardware setting the young bit can bring quite some overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit. The same slowdown applies to the dirty bit when the memory is first written after the page migration happened. Actually we can easily remember the A/D bit configuration and recover the information after the page is migrated. To achieve it, define a new set of bits in the migration swap offset field to cache the A/D bits of the old pte. Then when removing/recovering the migration entry, we can recover the A/D bits even if the backing page changed. One thing to mention is that here we use max_swapfile_size() to detect how many swp offset bits we have, and we'll only enable this feature if we know the swp offset is big enough to store both the PFN value and the A/D bits. Otherwise the A/D bits are dropped like before. Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
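For illustration, a self-contained userspace model of the encoding idea; the field widths, bit positions, and macro names below are assumptions made for the sketch, not the kernel's actual swp offset layout:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative layout: low bits hold the PFN, two spare high bits cache A/D. */
    #define PFN_BITS   52
    #define PFN_MASK   ((1ULL << PFN_BITS) - 1)
    #define YOUNG_BIT  (1ULL << PFN_BITS)
    #define DIRTY_BIT  (1ULL << (PFN_BITS + 1))

    static uint64_t make_migration_offset(uint64_t pfn, bool young, bool dirty)
    {
            assert(pfn <= PFN_MASK);
            return pfn | (young ? YOUNG_BIT : 0) | (dirty ? DIRTY_BIT : 0);
    }

    int main(void)
    {
            uint64_t off = make_migration_offset(0x1234, true, false);

            /* Unpacking recovers the PFN and the cached A/D state. */
            printf("pfn=%#llx young=%d dirty=%d\n",
                   (unsigned long long)(off & PFN_MASK),
                   !!(off & YOUNG_BIT), !!(off & DIRTY_BIT));
            return 0;
    }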
-
Peter Xu authored
Carry over the dirty bit from pmd to pte when a huge pmd splits. It shouldn't be a correctness issue since when pmd_dirty() we'll have the page marked dirty anyway, however having dirty bit carried over helps the next initial writes of split ptes on some archs like x86. Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
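A minimal sketch of the idea in the pte-filling loop of a pmd split; the variable names and the surrounding context are assumed, this is not the exact hunk:

    /* Remember the huge pmd's dirty bit once, then apply it to every pte
     * that replaces the huge pmd. */
    bool dirty = pmd_dirty(old_pmd);

    for (i = 0; i < HPAGE_PMD_NR; i++) {
            pte_t entry = mk_pte(page + i, vma->vm_page_prot);

            if (dirty)
                    entry = pte_mkdirty(entry);
            /* ... young/write/soft-dirty handling elided ... */
    }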
-
Peter Xu authored
We've got a bunch of special swap entries that store a PFN inside the swap offset field. To fetch the PFN, normally the user just calls swp_offset() assuming that'll be the PFN. Add a helper swp_offset_pfn() to fetch the PFN instead. It fetches only the maximum possible length of a PFN on the host, and uses the BUILD_BUG_ON() in is_pfn_swap_entry() to check against MAX_PHYSMEM_BITS that the swap offset can always store the PFN properly. One reason to do so is that we never tried to sanitize whether the swap offset can really fit a PFN. In the meantime, this patch also prepares us for the future possibility of storing more information inside the swp offset field, so assuming "swp_offset(entry)" to be the PFN will soon no longer hold. Replace many of the swp_offset() callers with swp_offset_pfn() where appropriate. Note that many of the existing users are not candidates for the replacement, e.g.: (1) when the swap entry is not a pfn swap entry at all, or (2) when we want to keep the whole swp_offset but only change the swp type. For the latter, it can happen when fork() is triggered on a write-migration swap entry pte: we may want to only change the migration type from write->read but keep the rest, so it's not "fetching the PFN" but "changing the swap type only". They're left aside so that when there's more information within the swp offset it will be carried over naturally in those cases. While at it, drop hwpoison_entry_to_pfn() because that's exactly what the new swp_offset_pfn() is about. Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
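The helper's intent, sketched; SWP_PFN_MASK is assumed as the mask name here, and the real definition in swapops.h may differ in detail:

    /* Sketch only: return just the PFN portion of a pfn swap entry's
     * offset instead of assuming the whole offset is the PFN. */
    static inline unsigned long swp_offset_pfn(swp_entry_t entry)
    {
            VM_BUG_ON(!is_pfn_swap_entry(entry));
            return swp_offset(entry) & SWP_PFN_MASK;
    }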
-
Peter Xu authored
swapops.h contains quite a few layers of ifdef; some of the "else" and "endif" lines don't have a comment naming the macro, so it's hard to follow what they refer to. Add the comments. Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
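For illustration, the kind of annotation being added (using CONFIG_MIGRATION, one of the macros that header already tests):

    #ifdef CONFIG_MIGRATION
    /* ... migration swap entry helpers ... */
    #else  /* CONFIG_MIGRATION */
    /* ... no-op stubs ... */
    #endif /* CONFIG_MIGRATION */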
-
Peter Xu authored
Patch series "mm: Remember a/d bits for migration entries", v4. Problem ======= When migrating a page, right now we always mark the migrated page as old & clean. However that could lead to at least two problems: (1) We lost the real hot/cold information while we could have persisted. That information shouldn't change even if the backing page is changed after the migration, (2) There can be always extra overhead on the immediate next access to any migrated page, because hardware MMU needs cycles to set the young bit again for reads, and dirty bits for write, as long as the hardware MMU supports these bits. Many of the recent upstream works showed that (2) is not something trivial and actually very measurable. In my test case, reading 1G chunk of memory - jumping in page size intervals - could take 99ms just because of the extra setting on the young bit on a generic x86_64 system, comparing to 4ms if young set. This issue is originally reported by Andrea Arcangeli. Solution ======== To solve this problem, this patchset tries to remember the young/dirty bits in the migration entries and carry them over when recovering the ptes. We have the chance to do so because in many systems the swap offset is not really fully used. Migration entries use swp offset to store PFN only, while the PFN is normally not as large as swp offset and normally smaller. It means we do have some free bits in swp offset that we can use to store things like A/D bits, and that's how this series tried to approach this problem. max_swapfile_size() is used here to detect per-arch offset length in swp entries. We'll automatically remember the A/D bits when we find that we have enough swp offset field to keep both the PFN and the extra bits. Since max_swapfile_size() can be slow, the last two patches cache the results for it and also swap_migration_ad_supported as a whole. Known Issues / TODOs ==================== We still haven't taught madvise() to recognize the new A/D bits in migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon a migration entry. It's not clear yet on whether we should clear the A bit, or we should just drop the entry directly. We didn't teach idle page tracking on the new migration entries, because it'll need larger rework on the tree on rmap pgtable walk. However it should make it already better because before this patchset page will be old page after migration, so the series will fix potential false negative of idle page tracking when pages were migrated before observing. The other thing is migration A/D bits will not start to working for private device swap entries. The code is there for completeness but since private device swap entries do not yet have fields to store A/D bits, even if we'll persistent A/D across present pte switching to migration entry, we'll lose it again when the migration entry converted to private device swap entry. Tests ===== After the patchset applied, the immediate read access test [1] of above 1G chunk after migration can shrink from 99ms to 4ms. The test is done by moving 1G pages from node 0->1->0 then read it in page size jumps. The test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. Similar effect can also be measured when writting the memory the 1st time after migration. After applying the patchset, both initial immediate read/write after page migrated will perform similarly like before migration happened. Patch Layout ============ Patch 1-2: Cleanups from either previous versions or on swapops.h macros. 
Patch 3-4: Prepare for the introduction of migration A/D bits Patch 5: The core patch to remember young/dirty bit in swap offsets. Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast. [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c This patch (of 7): Replace all the magic "5" with the macro. Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
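A generic illustration of that cleanup; the macro names and bit positions below are assumptions made for the sketch, not the exact x86 definitions:

    /* Before, sketched: "5" appears as a bare magic number. */
    #define __swp_type(x)   ((x).val & ((1U << 5) - 1))

    /* After, sketched: the width is named once and reused everywhere. */
    #define SWP_TYPE_BITS   5
    #define __swp_type(x)   ((x).val & ((1U << SWP_TYPE_BITS) - 1))
    #define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)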
-
Miaohe Lin authored
1. Remove the meaningless comment in kill_proc(); it doesn't tell the reader anything. 2. Fix the wrong function name get_hwpoison_unless_zero(); it should be get_page_unless_zero(). 3. The gate keeper for free hwpoison pages has moved to check_new_page(); update the corresponding comment. Link: https://lkml.kernel.org/r/20220830123604.25763-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
PageTable can't be handled by memory_failure(). Filter it out explicitly in hwpoison_user_mappings(). This will also make code more consistent with the relevant check in unpoison_memory(). Link: https://lkml.kernel.org/r/20220830123604.25763-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
If vma->vm_mm != t->mm, there's no need to call page_mapped_in_vma() as add_to_kill() won't be called in this case. Move up the mm check to avoid possible unneeded calling to page_mapped_in_vma(). Link: https://lkml.kernel.org/r/20220830123604.25763-5-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also num_poisoned_pages_dec() can be killed as there's no caller now. Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
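A sketch of the batched update, assuming the existing atomic_long_t num_poisoned_pages counter; this is not the verbatim patch:

    /* Sketch: fold N decrements into one atomic subtraction. */
    static void num_poisoned_pages_sub(long nr_pages)
    {
            atomic_long_sub(nr_pages, &num_poisoned_pages);
    }

    /* Caller, sketched: count locally in the loop, then subtract once at the
     * end instead of calling num_poisoned_pages_dec() per page. */
    if (total_unpoisoned > 0)
            num_poisoned_pages_sub(total_unpoisoned);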
-
Miaohe Lin authored
It is preferable to use __PageMovable() to detect non-lru movable pages. That way we avoid bumping the page refcount via isolate_movable_page() for lru pages that merely happen to be isolated at the moment. Also, if pages become PageLRU just after they're checked but before we try to isolate them, isolate_lru_page() will be called to do the right work. [linmiaohe@huawei.com: fixes per Naoya Horiguchi] Link: https://lkml.kernel.org/r/1f7ee86e-7d28-0d8c-e0de-b7a5a94519e8@huawei.com Link: https://lkml.kernel.org/r/20220830123604.25763-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Patch series "A few cleanup patches for memory-failure". his series contains a few cleanup patches to use __PageMovable() to detect non-lru movable pages, use num_poisoned_pages_sub() to reduce multiple atomic ops overheads and so on. More details can be found in the respective changelogs. This patch (of 6): Use ClearPageHWPoison() instead of TestClearPageHWPoison() to clear page hwpoison flags to avoid unneeded full memory barrier overhead. Link: https://lkml.kernel.org/r/20220830123604.25763-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220830123604.25763-2-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Shi authored
The syzbot reported the below problem: BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067 addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563 file:(null) fault:0x0 mmap:0x0 read_folio:0x0 CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565 vm_normal_page+0x10c/0x2a0 mm/memory.c:636 hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199 madvise_collapse+0x481/0x910 mm/khugepaged.c:2433 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415 do_madvise mm/madvise.c:1428 [inline] __do_sys_madvise mm/madvise.c:1428 [inline] __se_sys_madvise mm/madvise.c:1426 [inline] __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f770ba87929 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929 RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000 RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000 R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000 Basically the test program does the below conceptually: 1. mmap 0x2000000 - 0x21000000 as anonymous region 2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap() actually remaps the pages with special PTEs 3. call MADV_COLLAPSE for 0x20000000 - 0x21000000 It actually triggered the below race: CPU A CPU B mmap 0x20000000 - 0x21000000 as anon madvise_collapse is called on this area Retrieve start and end address from the vma (NEVER updated later!) Collapsed the first 2M area and dropped mmap_lock Acquire mmap_lock mmap io_uring file at 0x20563000 Release mmap_lock Reacquire mmap_lock revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000 scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock scan the 3rd 2M area (start from 0x20400000) get into the vma created by io_uring The hend should be updated after MADV_COLLAPSE reacquire mmap_lock since the vma may be shrunk. We don't have to worry about shink from the other direction since it could be caught by hugepage_vma_revalidate(). Either no valid vma is found or the vma doesn't fit anymore. Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse") Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 26 Sep, 2022 12 commits
-
-
Andrew Morton authored
-
Kees Cook authored
The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed to skip any checks where the length is known at compile time as a reasonable heuristic to avoid "likely known-good" cases. However, it can only do this when the copy_*_user() helpers are, themselves, inline too. Using find_vmap_area() requires taking a spinlock. The check_object_size() helper can call find_vmap_area() when the destination is in vmap memory. If show_regs() is called in interrupt context, it will attempt a call to copy_from_user_nmi(), which may call check_object_size() and then find_vmap_area(). If something in normal context happens to be in the middle of calling find_vmap_area() (with the spinlock held), the interrupt handler will hang forever. The copy_from_user_nmi() call is actually being called with a fixed-size length, so check_object_size() should never have been called in the first place. Given the narrow constraints, just replace the __copy_from_user_inatomic() call with an open-coded version that calls only into the sanitizers and not check_object_size(), followed by a call to raw_copy_from_user(). [akpm@linux-foundation.org: no instrument_copy_from_user() in my tree...] Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@mail.gmail.com Fixes: 0aef499f ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Yu Zhao <yuzhao@google.com> Reported-by: Florian Lehner <dev@der-flo.net> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Florian Lehner <dev@der-flo.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
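A sketch of the open-coded copy; the helper name below is made up for illustration, and the real patch also keeps the sanitizer instrumentation, which is omitted here:

    /* Sketch: a fixed-size NMI-safe copy that never calls
     * check_object_size(), so it cannot reach find_vmap_area() and
     * deadlock on the vmap spinlock from interrupt context. */
    static unsigned long copy_nmi_sketch(void *to, const void __user *from,
                                         unsigned long n)
    {
            return raw_copy_from_user(to, from, n);
    }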
-
Zi Yan authored
set_migratetype_isolate() does not allow isolating MIGRATE_CMA pageblocks unless it is used for CMA allocation. isolate_single_pageblock() did not have the same behavior when it is used together with set_migratetype_isolate() in start_isolate_page_range(). This allows alloc_contig_range() with migratetype other than MIGRATE_CMA, like MIGRATE_MOVABLE (used by alloc_contig_pages()), to isolate first and last pageblock but fail the rest. The failure leads to changing migratetype of the first and last pageblock to MIGRATE_MOVABLE from MIGRATE_CMA, corrupting the CMA region. This can happen during gigantic page allocations. Like Doug said here: https://lore.kernel.org/linux-mm/a3363a52-883b-dcd1-b77f-f2bb378d6f2d@gmail.com/T/#u, for gigantic page allocations, the user would notice no difference, since the allocation on CMA region will fail as well as it did before. But it might hurt the performance of device drivers that use CMA, since CMA region size decreases. Fix it by passing migratetype into isolate_single_pageblock(), so that set_migratetype_isolate() used by isolate_single_pageblock() will prevent the isolation happening. Link: https://lkml.kernel.org/r/20220914023913.1855924-1-zi.yan@sent.com Fixes: b2c9e2fb ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Doug Berger <opendmb@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Shuai Xue authored
The GHES code calls memory_failure_queue() from IRQ context to queue work into a workqueue and schedule it on the current CPU. The work is then processed by a kworker in memory_failure_work_func(), which calls memory_failure(). When a page is already poisoned, commit a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address") made memory_failure() call kill_accessing_process(), which: - holds the mmap lock of current->mm - does a pagetable walk to find the error virtual address - and sends SIGBUS to the current process with the error info. However, the mm of a kworker is not valid, resulting in a null-pointer dereference. So check the mm when killing the accessing process. [akpm@linux-foundation.org: remove unrelated whitespace alteration] Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com Fixes: a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Bixuan Cui <cuibixuan@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
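One possible shape of such a check, sketched inside kill_accessing_process(); the exact placement and return value in the real patch may differ:

    static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
                                      int flags)
    {
            /* A kworker processing the GHES queue has no mm to walk; bail
             * out instead of dereferencing a NULL mm below.
             * (Return value chosen for illustration only.) */
            if (!p->mm)
                    return -EFAULT;

            mmap_read_lock(p->mm);
            /* ... page table walk and SIGBUS delivery ... */
    }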
-
Doug Berger authored
With gigantic pages it may not be true that struct page structures are contiguous across the entire gigantic page. The nth_page macro is used here in place of direct pointer arithmetic to correct for this. Mike said: : This error could cause addressing exceptions. However, this is only : possible in configurations where CONFIG_SPARSEMEM && : !CONFIG_SPARSEMEM_VMEMMAP. Such a configuration option is rare and : unknown to be the default anywhere. Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support") Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
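The idiom, sketched:

    /* With CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP the struct pages
     * of a gigantic page need not be virtually contiguous, so index the
     * memmap with nth_page() instead of pointer arithmetic. */
    subpage = nth_page(head, i);    /* rather than: subpage = head + i; */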
-
Maurizio Lombardi authored
A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE. In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; In this case, the cache will be smaller than the fragment, causing memory corruptions. Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL. Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com Fixes: b63ae8ca ("mm/net: Rename and move page fragment handling from net/ to mm/") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Chen Lin <chen45464546@163.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
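A sketch of the guard; the variable names are illustrative, and the real check sits in the refill path of the page_frag allocator:

    /* The refill may have fallen back from order-3 to order-0 under
     * memory pressure; never hand out a fragment larger than the cache
     * that is supposed to back it. */
    if (unlikely(fragsz > cache_size))
            return NULL;    /* caller of page_frag_alloc() sees failure */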
-
Sergei Antonov authored
Running this test program on ARMv4 a few times (sometimes just once) reproduces the bug. int main() { unsigned i; char paragon[SIZE]; void* ptr; memset(paragon, 0xAA, SIZE); ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0); if (ptr == MAP_FAILED) return 1; printf("ptr = %p\n", ptr); for (i=0;i<10000;i++){ memset(ptr, 0xAA, SIZE); if (memcmp(ptr, paragon, SIZE)) { printf("Unexpected bytes on iteration %u!!!\n", i); break; } } munmap(ptr, SIZE); } In the "ptr" buffer there appear runs of zero bytes which are aligned by 16 and their lengths are multiple of 16. Linux v5.11 does not have the bug, "git bisect" finds the first bad commit: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths") Before the commit update_mmu_cache() was called during a call to filemap_map_pages() as well as finish_fault(). After the commit finish_fault() lacks it. Bring back update_mmu_cache() to finish_fault() to fix the bug. Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more closely reproduce the code of alloc_set_pte() function that existed before the commit. On many platforms update_mmu_cache() is nop: x86, see arch/x86/include/asm/pgtable ARMv6+, see arch/arm/include/asm/tlbflush.h So, it seems, few users ran into this bug. Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Sergei Antonov <saproj@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
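The restored call, sketched against the finish_fault() code of that time; treat the exact surrounding lines as an approximation:

    if (likely(pte_none(*vmf->pte))) {
            do_set_pte(vmf, page, vmf->address);
            /* restored: let the arch update its MMU cache for the new pte
             * (a no-op on e.g. x86, required on e.g. ARMv4) */
            update_mmu_cache(vma, vmf->address, vmf->pte);
            ret = 0;
    } else {
            /* only update the TLB when bailing out with VM_FAULT_NOPAGE */
            update_mmu_tlb(vma, vmf->address, vmf->pte);
            ret = VM_FAULT_NOPAGE;
    }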
-
Christoph Hellwig authored
If no frontswap module (i.e. zswap) was registered, frontswap_ops will be NULL. In such situation, swapon crashes with the following stack trace: Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a4fab000 [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] SMP Modules linked in: zram fsl_dpaa2_eth pcs_lynx phylink ahci_qoriq crct10dif_ce ghash_ce sbsa_gwdt fsl_mc_dpio nvme lm90 nvme_core at803x xhci_plat_hcd rtc_fsl_ftm_alarm xgmac_mdio ahci_platform i2c_imx ip6_tables ip_tables fuse Unloaded tainted modules: cppc_cpufreq():1 CPU: 10 PID: 761 Comm: swapon Not tainted 6.0.0-rc2-00454-g22100432cf14 #1 Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : frontswap_init+0x38/0x60 lr : __do_sys_swapon+0x8a8/0x9f4 sp : ffff80000969bcf0 x29: ffff80000969bcf0 x28: ffff37bee0d8fc00 x27: ffff80000a7f5000 x26: fffffcdefb971e80 x25: ffffaba797453b90 x24: 0000000000000064 x23: ffff37c1f209d1a8 x22: ffff37bee880e000 x21: ffffaba797748560 x20: ffff37bee0d8fce4 x19: ffffaba797748488 x18: 0000000000000014 x17: 0000000030ec029a x16: ffffaba795a479b0 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000001 x11: ffff37c63c0aba18 x10: 0000000000000000 x9 : ffffaba7956b8c88 x8 : ffff80000969bcd0 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000000 x3 : ffffaba79730f000 x2 : ffff37bee0d8fc00 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: frontswap_init+0x38/0x60 __do_sys_swapon+0x8a8/0x9f4 __arm64_sys_swapon+0x28/0x3c invoke_syscall+0x78/0x100 el0_svc_common.constprop.0+0xd4/0xf4 do_el0_svc+0x38/0x4c el0_svc+0x34/0x10c el0t_64_sync_handler+0x11c/0x150 el0t_64_sync+0x190/0x194 Code: d000e283 910003fd f9006c41 f946d461 (f9400021) ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20220909130829.3262926-1-hch@lst.de Fixes: 1da0d94a ("frontswap: remove support for multiple ops") Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Naoya Horiguchi authored
NULL pointer dereference is triggered when calling thp split via debugfs on the system with offlined memory blocks. With debug option enabled, the following kernel messages are printed out: page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: unmovable page page:000000007d7ab72e is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1248! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 ... RIP: 0010:split_huge_pages_write+0xcf4/0xe30 This shows that page_to_nid() in page_zone() is unexpectedly called for an offlined memmap. Use pfn_to_online_page() to get struct page in PFN walker. Link: https://lkml.kernel.org/r/20220908041150.3430269-1-naoya.horiguchi@linux.dev Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e8] Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
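The walker pattern, sketched:

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            /* pfn_valid() is not enough here: the section may be offline
             * with an uninitialized memmap, so ask for an online page. */
            struct page *page = pfn_to_online_page(pfn);

            if (!page)
                    continue;
            /* ... consider the huge page containing @page for a split ... */
    }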
-
Minchan Kim authored
MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from isolate_lru_page below. Fix it by checking PageLRU in advance. ------------[ cut here ]------------ trying to isolate tail page WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140 Modules linked in: CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:isolate_lru_page+0x130/0x140 Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/ Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: 韩天ç` <hantianshuo@iie.ac.cn> Suggested-by: Yang Shi <shy828301@gmail.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
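The shape of the added check, sketched against the page-based madvise code of that time:

    /* Only pages that are actually on an LRU list can be isolated; skip
     * anything else instead of warning in isolate_lru_page(). */
    if (!PageLRU(page) || isolate_lru_page(page))
            continue;
    list_add(&page->lru, &page_list);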
-
Yang Shi authored
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will move to use RCU instead of disabling local interrupts in fast-GUP. Using an IPI is the old-styled way of serializing against fast-GUP although it still works as expected now. And fast-GUP now fixed the potential race with THP collapse by checking whether PMD is changed or not. So IPI broadcast in radix pmd collapse flush is not necessary anymore. But it is still needed for hash TLB. Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.comSuggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yang Shi authored
Since general RCU GUP fast was introduced in commit 2667f50e ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases, it only handles traditional IPI-based GUP-fast correctly. On architectures that send an IPI broadcast on TLB flush, it works as expected. But on the architectures that do not use IPI to broadcast TLB flush, it may have the below race: CPU A CPU B THP collapse fast GUP gup_pmd_range() <-- see valid pmd gup_pte_range() <-- work on pte pmdp_collapse_flush() <-- clear pmd and flush __collapse_huge_page_isolate() check page pinned <-- before GUP bump refcount pin the page check PTE <-- no change __collapse_huge_page_copy() copy data to huge page ptep_clear() install huge pmd for the huge page return the stale page discard the stale page The race can be fixed by checking whether PMD is changed or not after taking the page pin in fast GUP, just like what it does for PTE. If the PMD is changed it means there may be parallel THP collapse, so GUP should back off. Also update the stale comment about serializing against fast GUP in khugepaged. Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()") Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
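A sketch of the added check in the fast-GUP pte walker; the helper names follow mm/gup.c of that era, but the exact hunk may differ:

    /* After grabbing the pin, re-read the pmd: if it changed under us, a
     * parallel THP collapse may have copied and freed the page table, so
     * drop the pin and back off. */
    if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
            gup_put_folio(folio, 1, flags);
            goto pte_unmap;
    }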
-
- 12 Sep, 2022 12 commits
-
-
David Hildenbrand authored
commit 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") made sure that when PageAnonExclusive() has to be cleared during temporary unmapping of a page, that the PTE is cleared/invalidated and that the TLB is flushed. What we want to achieve in all cases is that we cannot end up with a pin on an anonymous page that may be shared, because such pins would be unreliable and could result in memory corruptions when the mapped page and the pin go out of sync due to a write fault. That TLB flush handling was inspired by an outdated comment in mm/ksm.c:write_protect_page(), which similarly required the TLB flush in the past to synchronize with GUP-fast. However, ever since general RCU GUP fast was introduced in commit 2667f50e ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases -- it only handles traditional IPI-based GUP-fast correctly. Peter Xu (thankfully) questioned whether that TLB flush is really required. On architectures that send an IPI broadcast on TLB flush, it works as expected. To synchronize with RCU GUP-fast properly, we're conceptually fine, however, we have to enforce a certain memory order and are missing memory barriers. Let's document that, avoid the TLB flush where possible and use proper explicit memory barriers where required. We shouldn't really care about the additional memory barriers here, as we're not on extremely hot paths -- and we're getting rid of some TLB flushes. We use a smp_mb() pair for handling concurrent pinning and a smp_rmb()/smp_wmb() pair for handling the corner case of only temporary PTE changes but permanent PageAnonExclusive changes. One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to convert an exclusive anonymous page to a KSM page, and that page is already mapped write-protected (-> no PTE change) would be: Thread 0 (KSM) Thread 1 (GUP-fast) (B1) Read the PTE # (B2) skipped without FOLL_WRITE (A1) Clear PTE smp_mb() (A2) Check pinned (B3) Pin the mapped page smp_mb() (A3) Clear PageAnonExclusive smp_wmb() (A4) Restore PTE (B4) Check if the PTE changed smp_rmb() (B5) Check PageAnonExclusive Thread 1 will properly detect that PageAnonExclusive was cleared and back off. Note that we don't need a memory barrier between checking if the page is pinned and clearing PageAnonExclusive, because stores are not speculated. The possible issues due to reordering are of theoretical nature so far and attempts to reproduce the race failed. Especially the "no PTE change" case isn't the common case, because we'd need an exclusive anonymous page that's mapped R/O and the PTE is clean in KSM code -- and using KSM with page pinning isn't extremely common. Further, the clear+TLB flush we used for now implies a memory barrier. So the problematic missing part should be the missing memory barrier after pinning but before checking if the PTE changed. 
Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Nadav Amit <namit@vmware.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Christoph von Recklinghausen <crecklin@redhat.com> Cc: Don Dutile <ddutile@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christophe JAILLET authored
Use 'percpu_ref_tryget_live_rcu()' instead of 'percpu_ref_tryget_live()' to save a few cycles when it is known that the rcu lock is already taken/released. Link: https://lkml.kernel.org/r/9ef1562a1975371360f3e263856e9f1c5749b656.1662136782.git.christophe.jaillet@wanadoo.frSigned-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
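The calling pattern, sketched:

    rcu_read_lock();
    /* Already inside an RCU read-side section, so the _rcu variant can skip
     * the rcu_read_lock()/unlock() pair that percpu_ref_tryget_live() would
     * otherwise do internally. */
    if (percpu_ref_tryget_live_rcu(ref)) {
            /* ... use the ref-protected object ... */
            percpu_ref_put(ref);
    }
    rcu_read_unlock();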
-
Kefeng Wang authored
Drop an unneeded comment and blank line, adjust a variable, and, most importantly, delete the BUG_ON(). The page passed into __isolate_free_page() from compaction, page_isolation and page_reporting is always a buddy page, and the callers also check the return value; BUG_ON() is too drastic a measure, so remove it. Link: https://lkml.kernel.org/r/20220901015043.189276-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liu Shixin authored
No caller cares about the return value of create_object(), so make it return void. Link: https://lkml.kernel.org/r/20220901023007.3471887-1-liushixin2@huawei.comSigned-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Tarun Sahu authored
Commit d2d6cba5d6623 ("selftest: vm: remove orphaned references to local_config.{h,mk}") took care of removing orphaned references. This commit removes local_config from .gitignore. Parent patch commit 69007f156ba ("Kselftests: remove support of libhugetlbfs from kselftests") Link: https://lkml.kernel.org/r/20220901092315.33619-1-tsahu@linux.ibm.comSigned-off-by: Tarun Sahu <tsahu@linux.ibm.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
If CONFIG_SYSFS and CONFIG_SYSCTL are both undefined, hugetlb doesn't work now as there's no way to set the maximum number of huge pages. Make sure at least one of the above configs is defined to make hugetlb work as expected. Link: https://lkml.kernel.org/r/20220901120030.63318-11-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
When code reaches here, an invalid page would already have been accessed if the huge pte were none. So this BUG_ON(huge_pte_none()) is meaningless; remove it. Link: https://lkml.kernel.org/r/20220901120030.63318-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
The SetHPageVmemmapOptimized() called here seems unnecessary as it's assumed to be set when calling this function. But it's indeed cleared by above set_page_private(page, 0). Add a comment to avoid possible future confusion. Link: https://lkml.kernel.org/r/20220901120030.63318-9-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Fold hugetlbfs_pagecache_page() into its sole caller to remove some duplicated code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-8-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
We can pass NULL to kobj_to_hstate() directly when nid is unused to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-7-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
Use helper huge_pte_lock and pmd_lock to simplify the code. No functional change intended. Link: https://lkml.kernel.org/r/20220901120030.63318-6-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Miaohe Lin authored
It's better to use sizeof() to get the array size instead of manual calculation. Minor readability improvement. Link: https://lkml.kernel.org/r/20220901120030.63318-5-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-