- 11 Dec, 2023 40 commits
-
-
Muchun Song authored
The 8782fb61 ("mm: pagewalk: Fix race between unmap and page walker") introduces an assertion to walk_page_range_novma() to make all the users of page table walker is safe. However, the race only exists for walking the user page tables. And it is ridiculous to hold a particular user mmap write lock against the changes of the kernel page tables. So only assert at least mmap read lock when walking the kernel page tables. And some users matching this case could downgrade to a mmap read lock to relief the contention of mmap lock of init_mm, it will be nicer in hugetlb (only holding mmap read lock) in the next patch. Link: https://lkml.kernel.org/r/20231127084645.27017-2-songmuchun@bytedance.comSigned-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in swapfile.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/swapfile.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231127155452.586387-1-fabio.maria.de.francesco@linux.intel.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in zswap.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/zswap.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231127160058.586446-1-fabio.maria.de.francesco@linux.intel.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Cc: Ira Weiny <ira.weiny@intel.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Yong Wang authored
When the system is under oom, it prints out the RSS information of each process. However, we don't know the size of rss_anon, rss_file, and rss_shmem. To distinguish the memory occupied by anonymous or file mappings or shmem, could help us identify the root cause of the oom. So this patch adds RSS details, which refers to the /proc/<pid>/status[1]. It can help us know more about process memory usage. Example of oom including the new rss_* fields: [ 1630.902466] Tasks state (memory values in pages): [ 1630.902870] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name [ 1630.903619] [ 149] 0 149 486 288 0 288 0 36864 0 0 ash [ 1630.904210] [ 156] 0 156 153531 153345 153345 0 0 1269760 0 0 mm_test [1] commit 8cee852e ("mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status"). Link: https://lkml.kernel.org/r/202311231840181856667@zte.com.cnSigned-off-by: Yong Wang <wang.yong12@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andy Shevchenko authored
kernel-doc is not happy about documentation in list_lru.h: list_lru.h:90: warning: Function parameter or member 'lru' not described in 'list_lru_add' list_lru.h:90: warning: Excess function parameter 'list_lru' description in 'list_lru_add' list_lru.h:90: warning: No description found for return value of 'list_lru_add' list_lru.h:103: warning: Function parameter or member 'lru' not described in 'list_lru_del' list_lru.h:103: warning: Excess function parameter 'list_lru' description in 'list_lru_del' list_lru.h:103: warning: No description found for return value of 'list_lru_del' list_lru.h:116: warning: No description found for return value of 'list_lru_count_one' list_lru.h:168: warning: No description found for return value of 'list_lru_walk_one' list_lru.h:185: warning: No description found for return value of 'list_lru_walk_one_irq' Fix the documentation accordingly. While at it, fix the references to the parameters in functions inside the long descriptions, on which the above script is not complaining (yet?). Link: https://lkml.kernel.org/r/20231123172320.2434780-1-andriy.shevchenko@linux.intel.comSigned-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
This is a bug found not by any report but only by code observations. When GUP sees a devpmd/devpud and if page==NULL is returned, it means a fault is probably required. Here falling through when page==NULL can cause unexpected behavior. Fix both cases by catching the page==NULL cases with no_page_table(). Link: https://lkml.kernel.org/r/20231123180222.1048297-1-peterx@redhat.com Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings") Fixes: 080dbb61 ("mm/follow_page_mask: split follow_page_mask to smaller functions.") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Charan Teja Kalla authored
__alloc_pages_direct_reclaim() is called from slowpath allocation where high atomic reserves can be unreserved after there is a progress in reclaim and yet no suitable page is found. Later should_reclaim_retry() gets called from slow path allocation to decide if the reclaim needs to be retried before OOM kill path is taken. should_reclaim_retry() checks the available(reclaimable + free pages) memory against the min wmark levels of a zone and returns: a) true, if it is above the min wmark so that slow path allocation will do the reclaim retries. b) false, thus slowpath allocation takes oom kill path. should_reclaim_retry() can also unreserves the high atomic reserves **but only after all the reclaim retries are exhausted.** In a case where there are almost none reclaimable memory and free pages contains mostly the high atomic reserves but allocation context can't use these high atomic reserves, makes the available memory below min wmark levels hence false is returned from should_reclaim_retry() leading the allocation request to take OOM kill path. This can turn into a early oom kill if high atomic reserves are holding lot of free memory and unreserving of them is not attempted. (early)OOM is encountered on a VM with the below state: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB local_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH) 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB Per above log, the free memory of ~7MB exist in the high atomic reserves is not freed up before falling back to oom kill path. Fix it by trying to unreserve the high atomic reserves in should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback to oom kill path. Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Charan Teja Kalla authored
Highatomic reserves are set to roughly 1% of zone for maximum and a pageblock size for minimum. Encountered a system with the below configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB On such systems, even a single pageblock makes highatomic reserves are set to ~8% of the zone memory. This high value can easily exert pressure on the zone. Per discussion with Michal and Mel, it is not much useful to reserve the memory for highatomic allocations on such small systems[1]. Since the minimum size for high atomic reserves is always going to be a pageblock size and if 1% of zone managed pages is going to be below pageblock size, don't reserve memory for high atomic allocations. Thanks Michal for this suggestion[2]. Since no memory is being reserved for high atomic allocations and if respective allocation failures are seen, this patch can be reverted. [1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/ [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/ Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.comSigned-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Charan Teja Kalla authored
Patch series "mm: page_alloc: fixes for high atomic reserve caluculations", v3. The state of the system where the issue exposed shown in oom kill logs: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH) 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB From the above, it is seen that ~16MB of memory reserved for high atomic reserves against the expectation of 1% reserves which is fixed in the 1st patch. Don't reserve the high atomic page blocks if 1% of zone memory size is below a pageblock size. This patch (of 2): reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages of a zone, which is used for the high order atomic allocations. It uses the below calculation to reserve: static void reserve_highatomic_pageblock(struct page *page, ....) { ....... max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages; if (zone->nr_reserved_highatomic >= max_managed) goto out; zone->nr_reserved_highatomic += pageblock_nr_pages; set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); out: .... } Since we are always appending the 1% of zone managed pages count to pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the nr_reserved_highatomic is incremented/decremented in pageblock sizes. Encountered a system(actually a VM running on the Linux kernel) with the below zone configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB The existing calculations making it to reserve the 8MB(with pageblock size of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount of memory can easily exert memory pressure in the system thus may lead into unnecessary reclaims till unreserving of high atomic reserves. Since high atomic reserves are managed in pageblock size granules, as MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for high atomic reserves as, minimum is pageblock size , maximum is approximately 1% of the zone managed pages. Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.comSigned-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Peter Xu authored
Hugetlb vmemmap default option (HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON) is a sub-option to hugetlbfs, but it shows in the same level as hugetlbfs itself, under "Pesudo filesystems". Make the vmemmap option a sub-option to hugetlbfs, by changing hugetlbfs into a menuconfig. When moving it, fix a typo 'v' spot by Randy. Link: https://lkml.kernel.org/r/20231124151902.1075697-1-peterx@redhat.comSigned-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Serge Semin authored
Based on the init_unavailable_range() method and it's callee semantics no multi-line info messages are intended to be printed to the console. Thus append the '\n' symbol to the respective info string. Link: https://lkml.kernel.org/r/20231122182419.30633-7-fancer.lancer@gmail.comSigned-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Serge Semin authored
Besides of the already described reasons the pages backended memory holes might be persistent due to having memory mapped IO spaces behind those ranges in the framework of flatmem kernel config. Add such note to the init_unavailable_range() method kdoc in order to point out to one more reason of having the function executed for such regions. [fancer.lancer@gmail.com: update per Mike] Link: https://lkml.kernel.org/r/20231202111855.18392-1-fancer.lancer@gmail.com Link: https://lkml.kernel.org/r/20231122182419.30633-6-fancer.lancer@gmail.comSigned-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Gordeev authored
Rename ptdesc _refcount field to __page_refcount similar to the other unused page fields. Link: https://lkml.kernel.org/r/982bdc652ba79a606c3d01c905766e7e076b3315.1700594815.git.agordeev@linux.ibm.comSigned-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Vishal Moola <vishal.moola@gmail.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alexander Gordeev authored
Patch series "minor ptdesc updates", v3. This patch (of 2): Since commit d08d4e7c ("s390/mm: use full 4KB page for 2KB PTE") there is no fragmented page tracking on s390. Fix the corresponding comments. Link: https://lkml.kernel.org/r/cover.1700594815.git.agordeev@linux.ibm.com Link: https://lkml.kernel.org/r/2eead241f3a45bed26c7911cf66bded1e35670b8.1700594815.git.agordeev@linux.ibm.comSigned-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
SeongJae Park authored
damon_split_region_at() should set access rate related fields of the resulting regions same. It may forgotten, and actually there was the mistake before. Test it with the unit test case for the function. Link: https://lkml.kernel.org/r/20231119171529.66863-2-sj@kernel.orgSigned-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Juntong Deng authored
Currently free meta can only be stored in object if the object is not smaller than free meta. After the improvement, when the object is smaller than free meta and SLUB DEBUG is not enabled, it is possible to store part of the free meta in the object, reducing the increased size of the red zone. Example: free meta size: 16 bytes alloc meta size: 16 bytes object size: 8 bytes optimal redzone size (object_size <= 64): 16 bytes Before improvement: actual redzone size = alloc meta size + free meta size = 32 bytes After improvement: actual redzone size = alloc meta size + (free meta size - object size) = 24 bytes [juntong.deng@outlook.com: make kasan_metadata_size() adapt to the improved free meta storage] Link: https://lkml.kernel.org/r/VI1P193MB0752675D6E0A2D16CE656F8299BAA@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Link: https://lkml.kernel.org/r/VI1P193MB0752DE2CCD9046B5FED0AA8E99B5A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COMSigned-off-by: Juntong Deng <juntong.deng@outlook.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings do not rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142836.7219-1-fabio.maria.de.francesco@linux.intel.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings don't rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142640.7077-1-fabio.maria.de.francesco@linux.intel.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_{folio,page}. Therefore, replace kmap_atomic() with kmap_local_page in __wp_page_copy_user(). kmap_atomic() disables preemption in !PREEMPT_RT kernels and unconditionally disables also page-faults. My limited knowledge of the implementation of __wp_page_copy_user() makes me think that the latter side effect is still needed here, but kmap_local_page() is implemented not to disable page-faults. So, in addition to the conversion to local mapping, add explicit pagefault_disable() / pagefault_enable() between mapping and un-mapping. Link: https://lkml.kernel.org/r/20231120142418.6977-1-fmdefrancesco@gmail.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio M. De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in calc_checksum(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In calc_checksum(), the block of code between the mapping and un-mapping does not depend on the above-mentioned side effects of kmap_aatomic(), so that a mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120141855.6761-1-fmdefrancesco@gmail.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Fabio De Francesco authored
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in memcmp_pages(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In memcmp_pages(), the block of code between the mapping and un-mapping does not depend on the above-mentioned side effects of kmap_aatomic(), so that mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120141554.6612-1-fmdefrancesco@gmail.comSigned-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Sumanth Korikkar authored
vmem_altmap_free() and vmem_altmap_offset() could be utlized without CONFIG_ZONE_DEVICE enabled. For example, mm/memory_hotplug.c:__add_pages() relies on that. The altmap is no longer restricted to ZONE_DEVICE handling, but instead depends on CONFIG_SPARSEMEM_VMEMMAP. When CONFIG_SPARSEMEM_VMEMMAP is disabled, these functions are defined as inline stubs, ensuring compatibility with configurations that do not use sparsemem vmemmap. Without it, lkp reported the following: ld: arch/x86/mm/init_64.o: in function `remove_pagetable': init_64.c:(.meminit.text+0xfc7): undefined reference to `vmem_altmap_free' Link: https://lkml.kernel.org/r/20231120145354.308999-4-sumanthk@linux.ibm.comSigned-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311180545.VeyRXEDq-lkp@intel.com/Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
KMSAN is frequently used in fuzzing scenarios and thus saves a lot of stack traces. As KMSAN does not support evicting stack traces from the stack depot, the stack depot capacity might be reached quickly with large stack records. Adjust the maximum number of stack depot pools for this case. The average size of a stack trace saved into the stack depot is ~16 frames. Thus, adjust the maximum pools number accordingly to keep the maximum number of stack traces that can be saved into the stack depot similar to the one that was allowed before the stack trace eviction changes. Link: https://lkml.kernel.org/r/301a115cf7ce8ddb42ef6de9151c2bb76ba728fc.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Evict alloc/free stack traces from the stack depot for Generic KASAN once they are evicted from the quaratine. For auxiliary stack traces, evict the oldest stack trace once a new one is saved (KASAN only keeps references to the last two). Also evict all saved stack traces on krealloc. To avoid double-evicting and mis-evicting stack traces (in case KASAN's metadata was corrupted), reset KASAN's per-object metadata that stores stack depot handles when the object is initialized and when it's evicted from the quarantine. Note that stack_depot_put is no-op if the handle is 0. Link: https://lkml.kernel.org/r/5cef104d9b842899489b4054fe8d1339a71acee0.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
When both KASAN and slub_debug are enabled, when a free object is being prepared in setup_object, slub_debug poisons the object data before KASAN initializes its per-object metadata. Right now, in setup_object, KASAN only initializes the alloc metadata, which is always stored outside of the object. slub_debug is aware of this and it skips poisoning and checking that memory area. However, with the following patch in this series, KASAN also starts initializing its free medata in setup_object. As this metadata might be stored within the object, this initialization might overwrite the slub_debug poisoning. This leads to slub_debug reports. Thus, skip checking slub_debug poisoning of the object data area that overlaps with the in-object KASAN free metadata. Also make slub_debug poisoning of tail kmalloc redzones more precise when KASAN is enabled: slub_debug can still poison and check the tail kmalloc allocation area that comes after the KASAN free metadata. Link: https://lkml.kernel.org/r/20231122231202.121277-1-andrey.konovalov@linux.devSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Make tag-based KASAN modes evict stack traces from the stack depot once they are evicted from the stack ring. Internally, pass STACK_DEPOT_FLAG_GET to stack_depot_save_flags (via kasan_save_stack) to increment the refcount when saving a new entry to stack ring and call stack_depot_put when removing an entry from stack ring. Link: https://lkml.kernel.org/r/b4773e5c1b0b9df6826ec0b65c1923feadfa78e5.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Check the object size when looking up entries in the stack ring. If the size of the object for which a report is being printed does not match the size of the object for which a stack trace has been saved in the stack ring, the saved stack trace is irrelevant. Link: https://lkml.kernel.org/r/68c6948175aadd7e7e7deea61725103d64a4528f.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Remove the atomic accesses to entry fields in save_stack_info and kasan_complete_mode_report_info for tag-based KASAN modes. These atomics are not required, as the read/write lock prevents the entries from being read (in kasan_complete_mode_report_info) while being written (in save_stack_info) and the try_cmpxchg prevents the same entry from being rewritten (in save_stack_info) in the unlikely case of wrapping during writing. Link: https://lkml.kernel.org/r/29f59126d9845c5257b6c29cd7ad113b16f19f47.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Add stack_depot_put, a function that decrements the reference counter on a stack record and removes it from the stack depot once the counter reaches 0. Internally, when removing a stack record, the function unlinks it from the hash table bucket and returns to the freelist. With this change, the users of stack depot can call stack_depot_put when keeping a stack trace in the stack depot is not needed anymore. This allows avoiding polluting the stack depot with irrelevant stack traces and thus have more space to store the relevant ones before the stack depot reaches its capacity. Link: https://lkml.kernel.org/r/1d1ad5692ee43d4fc2b3fd9d221331d30b36123f.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Add a reference counter for how many times a stack records has been added to stack depot. Add a new STACK_DEPOT_FLAG_GET flag to stack_depot_save_flags that instructs the stack depot to increment the refcount. Do not yet decrement the refcount; this is implemented in one of the following patches. Do not yet enable any users to use the flag to avoid overflowing the refcount. This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/a3fc14a2359d019d2a008d4ff8b46a665371ffee.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Change the bool can_alloc argument of __stack_depot_save to a u32 argument that accepts a set of flags. The following patch will add another flag to stack_depot_save_flags besides the existing STACK_DEPOT_FLAG_CAN_ALLOC. Also rename the function to stack_depot_save_flags, as __stack_depot_save is a cryptic name, Link: https://lkml.kernel.org/r/645fa15239621eebbd3a10331e5864b718839512.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Make KMSAN use stack_depot_save instead of __stack_depot_save, as it always passes true to __stack_depot_save as the last argument. Link: https://lkml.kernel.org/r/18092240699efdc6acd78b51e41ea782953e6c8d.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Switch stack_record to use list_head for links in the hash table and in the freelist. This will allow removing entries from the hash table buckets. This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/4787d9a584cd33433d9ee1846b17fa3d3e1987ad.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Currently, stack depot uses the following locking scheme: 1. Lock-free accesses when looking up a stack record, which allows to have multiple users to look up records in parallel; 2. Spinlock for protecting the stack depot pools and the hash table when adding a new record. For implementing the eviction of stack traces from stack depot, the lock-free approach is not going to work anymore, as we will need to be able to also remove records from the hash table. Convert the spinlock into a read/write lock, and drop the atomic accesses, as they are no longer required. Looking up stack traces is now protected by the read lock and adding new records - by the write lock. One of the following patches will add a new function for evicting stack records, which will be protected by the write lock as well. With this change, multiple users can still look up records in parallel. This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/9f81ffcc4bb422ebb6326a65a770bf1918634cbb.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Instead of using the global pool_offset variable to find a free slot when storing a new stack record, mainlain a freelist of free slots within the allocated stack pools. A global next_stack variable is used as the head of the freelist, and the next field in the stack_record struct is reused as freelist link (when the record is not in the freelist, this field is used as a link in the hash table). This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/b9e4c79955c2121b69301778643b203d3fb09ccc.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Instead of using the last pointer in stack_pools for storing the pointer to a new pool (which does not yet store any stack records), use a new new_pool variable. This a purely code readability change: it seems more logical to store the pointer to a pool with a special meaning in a dedicated variable. Link: https://lkml.kernel.org/r/448bc18296c16bef95cb3167697be6583dcc8ce3.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Rename next_pool_required to new_pool_required. This a purely code readability change: the following patch will change stack depot to store the pointer to the new pool in a separate variable, and "new" seems like a more logical name. Link: https://lkml.kernel.org/r/fd7cd6c6eb250c13ec5d2009d75bb4ddd1470db9.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Split code in depot_alloc_stack and depot_init_pool into 3 functions: 1. depot_keep_next_pool that keeps preallocated memory for the next pool if required. 2. depot_update_pools that moves on to the next pool if there's no space left in the current pool, uses preallocated memory for the new current pool if required, and calls depot_keep_next_pool otherwise. 3. depot_alloc_stack that calls depot_update_pools and then allocates a stack record as before. This makes it somewhat easier to follow the logic of depot_alloc_stack and also serves as a preparation for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/71fb144d42b701fcb46708d7f4be6801a4a8270e.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Drop smp_load_acquire from next_pool_required in depot_init_pool, as both depot_init_pool and the all smp_store_release's to this variable are executed under the stack depot lock. Also simplify and clean up comments accompanying the use of atomic accesses in the stack depot code. Link: https://lkml.kernel.org/r/c118ef044d8db80248d9e1f14592c72e8429e9d9.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
Instead of storing stack records in stack depot pools one right after another, use fixed-sized slots. Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows to select the size of the slot in frames. Use 64 as the default value, which is the maximum stack trace size both KASAN and KMSAN use right now. Also add descriptions for other stack depot Kconfig options. This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/dce7d030a99ff61022509665187fac45b0827298.1700502145.git.andreyknvl@google.comSigned-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-