- 16 Oct, 2023 5 commits
-
-
Kefeng Wang authored
In preparation for large folio numa balancing, make mpol_misplaced() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
In preparation for large folio numa balancing, make numa_migrate_prep() to take a folio, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Numa balancing only try to migrate non-compound page in do_numa_page(), use a folio in it to save several compound_head calls, note we use folio_estimated_sharers(), it is enough to check the folio sharers since only normal page is handled, if large folio numa balancing is supported, a precise folio sharers check would be used, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-4-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Use a folio in do_huge_pmd_numa_page(), reduce three page_folio() calls to one, no functional change intended. Link: https://lkml.kernel.org/r/20230921074417.24004-3-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Kefeng Wang authored
Patch series "mm: convert numa balancing functions to use a folio", v2. do_numa_pages() only handles non-compound pages, and only PMD-mapped THPs are handled in do_huge_pmd_numa_page(). But a large, PTE-mapped folio will be supported so let's convert more numa balancing functions to use/take a folio in preparation for that, no functional change intended for now. This patch (of 6): The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio(), which allow them to completely replace the struct page variables with struct folio variables. Link: https://lkml.kernel.org/r/20230921074417.24004-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230921074417.24004-2-wangkefeng.wang@huawei.comSigned-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 06 Oct, 2023 13 commits
-
-
Minjie Du authored
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index() in filemap_map_pages(). Link: https://lkml.kernel.org/r/20230921081535.3398-1-duminjie@vivo.comSigned-off-by: Minjie Du <duminjie@vivo.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
Add some tests to cover the new PR_MDWE_NO_INHERIT flag of the PR_SET_MDWE prctl. Check that: - it can't be set without PR_SET_MDWE - MDWE flags can't be unset - when set, PR_SET_MDWE doesn't propagate to children Link: https://lkml.kernel.org/r/20230828150858.393570-7-revest@chromium.orgSigned-off-by: Florent Revest <revest@chromium.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that the process doesn't want MDWE protection to propagate to children. To implement this no-inherit mode, the tag in current->mm->flags must be absent from MMF_INIT_MASK. This means that the encoding for "MDWE but without inherit" is different in the prctl than in the mm flags. This leads to a bit of bit-mangling in the prctl implementation. Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.orgSigned-off-by: Florent Revest <revest@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
Defining a prctl flag as an int is a footgun because on a 64 bit machine and with a variadic implementation of prctl (like in musl and glibc), when used directly as a prctl argument, it can get casted to long with garbage upper bits which would result in unexpected behaviors. This patch changes the constant to an unsigned long to eliminate that possibilities. This does not break UAPI. I think that a stable backport would be "nice to have": to reduce the chances that users build binaries that could end up with garbage bits in their MDWE prctl arguments. We are not aware of anyone having yet encountered this corner case with MDWE prctls but a backport would reduce the likelihood it happens, since this sort of issues has happened with other prctls. But If this is perceived as a backporting burden, I suppose we could also live without a stable backport. Link: https://lkml.kernel.org/r/20230828150858.393570-5-revest@chromium.org Fixes: b507808e ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Florent Revest <revest@chromium.org> Suggested-by: Alexey Izbyshev <izbyshev@ispras.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
Invalid prctls return a negative code and set errno. It's good practice to check that errno is set as expected. Link: https://lkml.kernel.org/r/20230828150858.393570-4-revest@chromium.orgSigned-off-by: Florent Revest <revest@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
I checked with the original author, the mmap_FIXED test case wasn't properly tested and fails. Currently, it maps two consecutive (non overlapping) pages and expects the second mapping to be denied by MDWE but these two pages have nothing to do with each other so MDWE is actually out of the picture here. What the test actually intended to do was to remap a virtual address using MAP_FIXED. However, this operation unmaps the existing mapping and creates a new one so the va is backed by a new page and MDWE is again out of the picture, all remappings should succeed. This patch keeps the test case to make it clear that this situation is expected to work: MDWE shouldn't block a MAP_FIXED replacement. Link: https://lkml.kernel.org/r/20230828150858.393570-3-revest@chromium.org Fixes: 4cf1fe34 ("kselftest: vm: add tests for memory-deny-write-execute") Signed-off-by: Florent Revest <revest@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ayush Jain <ayush.jain3@amd.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Florent Revest authored
Patch series "MDWE without inheritance", v4. Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags current with a flag that prevents pages that were previously not executable from becoming executable. This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK) At Google, we've been using a somewhat similar downstream patch for a few years now. To make the adoption of this feature easier, we've had it support a mode in which the W^X flag does not propagate to children. For example, this is handy if a C process which wants W^X protection suspects it could start children processes that would use a JIT. I'd like to align our features with the upstream prctl. This series proposes a new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It sets a different flag in current that is not in MMF_INIT_MASK and which does not propagate. As part of looking into MDWE, I also fixed a couple of things in the MDWE test. The background for this was discussed in these threads: v1: https://lore.kernel.org/all/66900d0ad42797a55259061f757beece@ispras.ru/ v2: https://lore.kernel.org/all/d7e3749c-a718-df94-92af-1cb0fecab772@redhat.com/ This patch (of 6): Fix tabs/spaces inconsistency in the mdwe test. Link: https://lkml.kernel.org/r/20230828150858.393570-1-revest@chromium.org Link: https://lkml.kernel.org/r/20230828150858.393570-2-revest@chromium.orgSigned-off-by: Florent Revest <revest@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ayush Jain <ayush.jain3@amd.com> Cc: Greg Thelen <gthelen@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: Topi Miettinen <toiwoton@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
liwenyu authored
The current memory reclaim delay statistics only count the direct memory reclaim of the task in do_try_to_free_pages(). In systems with NUMA open, some tasks occasionally experience slower response times, but the total count of reclaim does not increase, using ftrace can show that node_reclaim has occurred. The memory reclaim occurring in get_page_from_freelist() is also due to heavy memory load. To get the impact of tasks in memory reclaim, this patch adds the statistics of the memory reclaim delay statistics for __node_reclaim(). Link: https://lkml.kernel.org/r/181C946095F0252B+7cc60eca-1abf-4502-aad3-ffd8ef89d910@ex.bilibili.comSigned-off-by: Wen Yu Li <wenyuli@ex.bilibili.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: <wangyun@bilibili.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jann Horn authored
Document what mmu_notifier_invalidate_range_start_nonblock() is for. Also add a __must_check annotation to signal that callers must bail out if a notifier vetoes the operation. Link: https://lkml.kernel.org/r/20230918201832.265108-1-jannh@google.comSigned-off-by: Jann Horn <jannh@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liu Shixin authored
Since commit b25806dc("mm: memcontrol: deprecate swapaccounting=0 mode") do_memsw_account() is synonymous with !cgroup_subsys_on_dfl(memory_cgrp_subsys), It always equals true in memcg1_stat_format(). Remove the unused code. Link: https://lkml.kernel.org/r/20230915105845.3199656-3-liushixin2@huawei.comSigned-off-by: Liu Shixin <liushixin2@huawei.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liu Shixin authored
Patch series "Expose swapcache stat for memcg v1", v2. Since commit b6038942 ("mm: memcg: add swapcache stat for memcg v2") adds swapcache stat for the cgroup v2, it seems there is no reason to hide it in memcg v1. Conversely, with swapcached it is more accurate to evaluate the available memory for memcg. Link: https://lkml.kernel.org/r/20230915105845.3199656-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20230915105845.3199656-2-liushixin2@huawei.comSigned-off-by: Liu Shixin <liushixin2@huawei.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Baolin Wang authored
Recently, we found that cross-die access to pagetable pages on ARM64 machines can cause performance fluctuations in our business. Currently, there are no PMU events available to track this situation on our ARM64 machines, so accurate pagetable accounting can help to analyze this issue, but now the PUD level pagetable accounting is missed. So introduce pagetable_pud_ctor/dtor() to help to get accurate PUD pagetable accounting, as well as converting the architectures which use generic PUD pagetable allocation to add corresponding PUD pagetable accounting. Moreover this patch will mark the PUD level pagetable with PG_table flag, which will help to do sanity validation in unpoison_memory(). On my testing machine, I can see more pagetables statistics after the patch with page-types tool: Before patch: flags page-count MB symbolic-flags long-symbolic-flags 0x0000000004000000 27326 106 __________________________g_________________ pgtable After patch: 0x0000000004000000 27541 107 __________________________g_________________ pgtable Link: https://lkml.kernel.org/r/876c71c03a7e69c17722a690e3225a4f7b172fb2.1695017383.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Li Zhijian authored
First found this typo as reviewing memory tier code. Fix it by sed like: $ sed -i 's/sibiling/sibling/g' $(git grep -l sibiling) so the acpi one will be corrected as well. Link: https://lkml.kernel.org/r/20230802092856.819328-1-lizhijian@cn.fujitsu.comSigned-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 04 Oct, 2023 22 commits
-
-
Yin Fengwei authored
Current kernel only lock base size folio during mlock syscall. Add large folio support with following rules: - Only mlock large folio when it's in VM_LOCKED VMA range and fully mapped to page table. fully mapped folio is required as if folio is not fully mapped to a VM_LOCKED VMA, if system is in memory pressure, page reclaim is allowed to pick up this folio, split it and reclaim the pages which are not in VM_LOCKED VMA. - munlock will apply to the large folio which is in VMA range or cross the VMA boundary. This is required to handle the case that the large folio is mlocked, later the VMA is split in the middle of large folio. Link: https://lkml.kernel.org/r/20230918073318.1181104-4-fengwei.yin@intel.comSigned-off-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yin Fengwei authored
If large folio is in the range of VM_LOCKED VMA, it should be mlocked to avoid being picked by page reclaim. Which may split the large folio and then mlock each pages again. Mlock this kind of large folio to prevent them being picked by page reclaim. For the large folio which cross the boundary of VM_LOCKED VMA or not fully mapped to VM_LOCKED VMA, we'd better not to mlock it. So if the system is under memory pressure, this kind of large folio will be split and the pages ouf of VM_LOCKED VMA can be reclaimed. Ideally, for large folio, we should mlock it when the large folio is fully mapped to VMA and munlock it if any page are unmampped from VMA. But it's not easy to detect whether the large folio is fully mapped to VMA in some cases (like add/remove rmap). So we update mlock_vma_folio() and munlock_vma_folio() to mlock/munlock the folio according to vma->vm_flags. Let caller to decide whether they should call these two functions. For add rmap, only mlock normal 4K folio and postpone large folio handling to page reclaim phase. It is possible to reuse page table iterator to detect whether folio is fully mapped or not during page reclaim phase. For remove rmap, invoke munlock_vma_folio() to munlock folio unconditionly because rmap makes folio not fully mapped to VMA. Link: https://lkml.kernel.org/r/20230918073318.1181104-3-fengwei.yin@intel.comSigned-off-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yin Fengwei authored
Patch series "support large folio for mlock", v3. Yu mentioned at [1] about the mlock() can't be applied to large folio. I leant the related code and here is my understanding: - For RLIMIT_MEMLOCK related, there is no problem. Because the RLIMIT_MEMLOCK statistics is not related underneath page. That means underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK statistics collection which is always correct. - For keeping the page in RAM, there is no problem either. At least, during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit set in vm_flags, the folio will be kept whatever the folio is mlocked or not. So the function of mlock for large folio works. But it's not optimized because the page reclaim needs scan these large folio and may split them. This series identified the large folio for mlock to four types: - The large folio is in VM_LOCKED range and fully mapped to the range - The large folio is in the VM_LOCKED range but not fully mapped to the range - The large folio cross VM_LOCKED VMA boundary - The large folio cross last level page table boundary For the first type, we mlock large folio so page reclaim will skip it. For the second/third type, we don't mlock large folio. As the pages not mapped to VM_LOACKED range are mapped to none VM_LOCKED range, if system is in memory pressure situation, the large folio can be picked by page reclaim and split. Then the pages not mapped to VM_LOCKED range can be reclaimed. For the fourth type, we don't mlock large folio because locking one page table lock can't prevent the part in another last level page table being unmapped. Thanks to Ryan for pointing this out. To check whether the folio is fully mapped to the range, PTEs needs be checked to see whether the page of folio is associated. Which needs take page table lock and is heavy operation. So far, the only place needs this check is madvise and page reclaim. These functions already have their own PTE iterator. patch1 introduce API to check whether large folio is in VMA range. patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support large folio mlock/munlock. patch3 make mlock/munlock syscall support large folio. Yu also mentioned a race which can make folio unevictable after munlock during RFC v2 discussion [3]: We decided that race issue didn't block this series based on: - That race issue was not introduced by this series - We had a looks-ok fix for that race issue. Need to wait for mlock_count fixing patch as Yosry Ahmed suggested [4] [1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/ [3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/ This patch (of 3): folio_in_range() will be used to check whether the folio is mapped to specific VMA and whether the mapping address of folio is in the range. Also a helper function folio_within_vma() to check whether folio is in the range of vma based on folio_in_range(). Link: https://lkml.kernel.org/r/20230918073318.1181104-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230918073318.1181104-2-fengwei.yin@intel.comSigned-off-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jinjie Ruan authored
When CONFIG_DAMON_KUNIT_TEST=y and making CONFIG_DEBUG_KMEMLEAK=y and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected. The damon_ctx which is allocated by kzalloc() in damon_new_ctx() in damon_test_ops_registration() and damon_test_set_attrs() are not freed. So use damon_destroy_ctx() to free it. After applying this patch, the following memory leak is never detected unreferenced object 0xffff2b49c6968800 (size 512): comm "kunit_try_catch", pid 350, jiffies 4294895294 (age 557.028s) hex dump (first 32 bytes): 88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00 ................ 00 87 93 03 00 00 00 00 0a 00 00 00 00 00 00 00 ................ backtrace: [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368 [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290 [<00000000b5f89cef>] kmalloc_trace+0x40/0x164 [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4 [<00000000daf6227b>] damon_test_ops_registration+0x34/0x328 [<00000000559c4801>] kunit_try_run_case+0x50/0xac [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c [<000000003c3e9211>] kthread+0x124/0x130 [<0000000028f85bdd>] ret_from_fork+0x10/0x20 unreferenced object 0xffff2b49c1a9cc00 (size 512): comm "kunit_try_catch", pid 356, jiffies 4294895306 (age 557.000s) hex dump (first 32 bytes): 88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 ................ backtrace: [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368 [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290 [<00000000b5f89cef>] kmalloc_trace+0x40/0x164 [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4 [<00000000058495c4>] damon_test_set_attrs+0x30/0x1a8 [<00000000559c4801>] kunit_try_run_case+0x50/0xac [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c [<000000003c3e9211>] kthread+0x124/0x130 [<0000000028f85bdd>] ret_from_fork+0x10/0x20 Link: https://lkml.kernel.org/r/20230918120951.2230468-3-ruanjinjie@huawei.com Fixes: d1836a3b ("mm/damon/core-test: initialise context before test in damon_test_set_attrs()") Fixes: 4f540f5a ("mm/damon/core-test: add a kunit test case for ops registration") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jinjie Ruan authored
Patch series "mm/damon/core-test: Fix memory leaks in core-test", v3. There are a few memory leaks in core-test which are detected by kmemleak. This patchset fixes the issues. This patch (of 2): When CONFIG_DAMON_KUNIT_TEST=y and making CONFIG_DEBUG_KMEMLEAK=y and CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the below memory leak is detected. The damon_region which is allocated by kmem_cache_alloc() in damon_new_region() in damon_test_regions() and damon_test_update_monitoring_result() are not freed. So for damon_test_regions(), replace damon_del_region() call with damon_destroy_region() so that it calls both damon_del_region() and damon_free_region(), the latter will free the damon_region. For damon_test_update_monitoring_result(), call damon_free_region() to free it. After applying this patch, the following memory leak is never detected. unreferenced object 0xffff2b49c3edc000 (size 56): comm "kunit_try_catch", pid 338, jiffies 4294895280 (age 557.084s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 49 2b ff ff ............I+.. backtrace: [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368 [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284 [<000000008603f022>] damon_new_region+0x28/0x54 [<00000000a3b8c64e>] damon_test_regions+0x38/0x270 [<00000000559c4801>] kunit_try_run_case+0x50/0xac [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c [<000000003c3e9211>] kthread+0x124/0x130 [<0000000028f85bdd>] ret_from_fork+0x10/0x20 unreferenced object 0xffff2b49c5b20000 (size 56): comm "kunit_try_catch", pid 354, jiffies 4294895304 (age 556.988s) hex dump (first 32 bytes): 03 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 96 00 00 00 49 2b ff ff ............I+.. backtrace: [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368 [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284 [<000000008603f022>] damon_new_region+0x28/0x54 [<00000000ca019f80>] damon_test_update_monitoring_result+0x18/0x34 [<00000000559c4801>] kunit_try_run_case+0x50/0xac [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c [<000000003c3e9211>] kthread+0x124/0x130 [<0000000028f85bdd>] ret_from_fork+0x10/0x20 Link: https://lkml.kernel.org/r/20230918120951.2230468-1-ruanjinjie@huawei.com Link: https://lkml.kernel.org/r/20230918120951.2230468-2-ruanjinjie@huawei.com Fixes: 17ccae8b ("mm/damon: add kunit tests") Fixes: f4c978b6 ("mm/damon/core-test: add a test for damon_update_monitoring_results()") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jianguo Bao authored
Change to use new address space operation dirty_folio(). Link: https://lkml.kernel.org/r/20230917-trycontrib1-v1-1-db22630b8839@gmail.com Fixes: 6f31a5a2 ("fs: Add aops->dirty_folio") Signed-off-by: Jianguo Bau <roidinev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Update DAMON ABI document for the newly added DAMON sysfs file for DAMOS apply intervals (apply_interval_us file). Link: https://lkml.kernel.org/r/20230916020945.47296-10-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Update DAMON usage document's DAMON sysfs interface section for the newly added DAMOS apply intervals support (apply_interval_us file). Link: https://lkml.kernel.org/r/20230916020945.47296-9-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Update DAMON selftests to test existence of the file for reading/writing DAMOS apply interval under each scheme directory. Link: https://lkml.kernel.org/r/20230916020945.47296-8-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Update DAMON sysfs interface to support DAMOS apply intervals by adding a new file, 'apply_interval_us' in each scheme directory. Users can set and get the interval for each scheme in microseconds by writing to and reading from the file. Link: https://lkml.kernel.org/r/20230916020945.47296-7-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Update DAMON design doc to explain about DAMOS apply intervals. Link: https://lkml.kernel.org/r/20230916020945.47296-6-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
DAMON-based operation schemes are applied for every aggregation interval. That was mainly because schemes were using nr_accesses, which be complete to be used for every aggregation interval. However, the schemes are now using nr_accesses_bp, which is updated for each sampling interval in a way that reasonable to be used. Therefore, there is no reason to apply schemes for each aggregation interval. The unnecessary alignment with aggregation interval was also making some use cases of DAMOS tricky. Quotas setting under long aggregation interval is one such example. Suppose the aggregation interval is ten seconds, and there is a scheme having CPU quota 100ms per 1s. The scheme will actually uses 100ms per ten seconds, since it cannobe be applied before next aggregation interval. The feature is working as intended, but the results might not that intuitive for some users. This could be fixed by updating the quota to 1s per 10s. But, in the case, the CPU usage of DAMOS could look like spikes, and would actually make a bad effect to other CPU-sensitive workloads. Implement a dedicated timing interval for each DAMON-based operation scheme, namely apply_interval. The interval will be sampling interval aligned, and each scheme will be applied for its apply_interval. The interval is set to 0 by default, and it means the scheme should use the aggregation interval instead. This avoids old users getting any behavioral difference. Link: https://lkml.kernel.org/r/20230916020945.47296-5-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
damos_before_apply tracepoint is exposing access rate of DAMON regions using nr_accesses field of regions, which was actually used by DAMOS in the past. However, it has changed to use nr_accesses_bp instead. Update the tracepoint to expose the value that DAMOS is really using. Note that it doesn't expose the value as is in the basis point, but after converting it to the natural number by dividing it by 10,000. Therefore this change doesn't make user-visible behavioral differences. Link: https://lkml.kernel.org/r/20230916020945.47296-4-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
DAMON sysfs interface exposes access rate of each region via DAMOS tried regions directory. For this, the nr_accesses field of the region is used. DAMOS was actually using nr_accesses in the past, but it uses nr_accesses_bp now. Use the value that it is really using as the source. Note that this doesn't expose nr_accesses_bp as is (in basis point), but after converting it to the natural number by dividing the value by 10,000. Hence there is no behavioral change from users' perspective. Link: https://lkml.kernel.org/r/20230916020945.47296-3-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
Patch series "mm/damon: implement DAMOS apply intervals". DAMON-based operation schemes are applied for every aggregation interval. That is mainly because schemes are using nr_accesses, which be complete to be used for every aggregation interval. This makes some DAMOS use cases be tricky. Quota setting under long aggregation interval is one such example. Suppose the aggregation interval is ten seconds, and there is a scheme having CPU quota 100ms per 1s. The scheme will actually uses 100ms per ten seconds, since it cannobe be applied before next aggregation interval. The feature is working as intended, but the results might not that intuitive for some users. This could be fixed by updating the quota to 1s per 10s. But, in the case, the CPU usage of DAMOS could look like spikes, and actually make a bad effect to other CPU-sensitive workloads. Also, with such huge aggregation interval, users may want schemes to be applied more frequently. DAMON provides nr_accesses_bp, which is updated for each sampling interval in a way that reasonable to be used. By using that instead of nr_accesses, DAMOS can have its own time interval and mitigate abovely mentioned issues. This patchset makes DAMOS schemes to use nr_accesses_bp instead of nr_accesses, and have their own timing intervals. Also update DAMOS tried regions sysfs files and DAMOS before_apply tracepoint to use the new data as their source. Note that the interval is zero by default, and it is interpreted to use the aggregation interval instead. This avoids making user-visible behavioral changes. Patches Seuqeunce ----------------- The first patch (patch 1/9) makes DAMOS uses nr_accesses_bp instead of nr_accesses, and following two patches (patches 2/9 and 3/9) updates DAMON sysfs interface for DAMOS tried regions and the DAMOS before_apply tracespoint to use nr_accesses_bp instead of nr_accesses, respectively. The following two patches (patches 4/9 and 5/9) implements the scheme-specific apply interval for DAMON kernel API users and update the design document for the new feature. Finally, the following four patches (patches 6/9, 7/9, 8/9 and 9/9) add support of the feature in DAMON sysfs interface, add a simple selftest test case, and document the new file on the usage and the ABI documents, repsectively. This patch (of 9): DAMON provides nr_accesses_bp, which becomes same to nr_accesses * 10000 for every aggregation interval, but updated every sampling interval with a reasonable accuracy. Since DAMON-based operation schemes are applied in every aggregation interval using nr_accesses, using nr_accesses_bp instead will make no difference to users. Meanwhile, it allows DAMOS to apply the schemes in a time interval that less than the aggregation interval. It could be useful and more flexible for some cases. Do it. Link: https://lkml.kernel.org/r/20230916020945.47296-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230916020945.47296-2-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Convert the callers to expect a folio and remove the unnecesary conversion back to a struct page. Link: https://lkml.kernel.org/r/20230824141325.2704553-4-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Anything found on a linked list threaded through ->lru is guaranteed to be a folio as the compound_head found in a tail page overlaps the ->lru member of struct page. So we can pull folios directly off these lists no matter whether pages or folios were added to the list. Link: https://lkml.kernel.org/r/20230824141325.2704553-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Patch series "Small hugetlb cleanups", v2. Some trivial folio conversions This patch (of 3): update_and_free_hugetlb_folio puts the memory on hpage_freelist as a folio so we can take it off the list as a folio. Link: https://lkml.kernel.org/r/20230824141325.2704553-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230824141325.2704553-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Usama Arif authored
The new boot flow when it comes to initialization of gigantic pages is as follows: - At boot time, for a gigantic page during __alloc_bootmem_hugepage, the region after the first struct page is marked as noinit. - This results in only the first struct page to be initialized in reserve_bootmem_region. As the tail struct pages are not initialized at this point, there can be a significant saving in boot time if HVO succeeds later on. - Later on in the boot, the head page is prepped and the first HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized. - HVO is attempted. If it is not successful, then the rest of the tail struct pages are initialized. If it is successful, no more tail struct pages need to be initialized saving significant boot time. The WARN_ON for increased ref count in gather_bootmem_prealloc was changed to a VM_BUG_ON. This is OK as there should be no speculative references this early in boot process. The VM_BUG_ON's are there just in case such code is introduced. [akpm@linux-foundation.org: make it nicer for 80 cols] Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.comSigned-off-by: Usama Arif <usama.arif@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Usama Arif authored
For reserved memory regions marked with this flag, reserve_bootmem_region is not called during memmap_init_reserved_pages. This can be used to avoid struct page initialization for regions which won't need them, for e.g. hugepages with Hugepage Vmemmap Optimization enabled. Link: https://lkml.kernel.org/r/20230913105401.519709-4-usama.arif@bytedance.comSigned-off-by: Usama Arif <usama.arif@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Usama Arif authored
This allows setting flags to both memblock types and is in preparation for setting flags (for e.g. to not initialize struct pages) on reserved memory region. [usama.arif@bytedance.com: add missing argument definition] Link: https://lkml.kernel.org/r/20230918090657.220463-1-usama.arif@bytedance.com Link: https://lkml.kernel.org/r/20230913105401.519709-3-usama.arif@bytedance.comSigned-off-by: Usama Arif <usama.arif@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Usama Arif authored
Patch series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO", v5. This series moves the boot time initialization of tail struct pages of a gigantic page to later on in the boot. Only the HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized at the start. If HVO is successful, then no more tail struct pages need to be initialized. For a 1G hugepage, this series avoid initialization of 262144 - 63 = 262081 struct pages per hugepage. When tested on a 512G system (allocating 500 1G hugepages), the kexec-boot times with DEFERRED_STRUCT_PAGE_INIT enabled are: - with patches, HVO enabled: 1.32 seconds - with patches, HVO disabled: 2.15 seconds - without patches, HVO enabled: 3.90 seconds - without patches, HVO disabled: 3.58 seconds This represents an approximately 70% reduction in boot time and will significantly reduce server downtime when using a large number of gigantic pages. This patch (of 4): If tail page prep and initialization is skipped, then the "start" page will not contain the correct nid. Use the nid from first vmemap page. Link: https://lkml.kernel.org/r/20230913105401.519709-1-usama.arif@bytedance.com Link: https://lkml.kernel.org/r/20230913105401.519709-2-usama.arif@bytedance.comSigned-off-by: Usama Arif <usama.arif@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-