- 24 Feb, 2021 40 commits
-
-
Matthew Wilcox (Oracle) authored
Use AOP_TRUNCATED_PAGE to indicate that no error occurred, but the page we looked up is no longer valid. In this case, the reference to the page will have been removed; if we hit any other error, the caller will release the reference. Link: https://lkml.kernel.org/r/20210122160140.223228-12-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
By moving the iocb flag checks to the caller, we can pass the file and the page index instead of the iocb. It never needed the iter. By passing the pagevec, we can return an errno (or AOP_TRUNCATED_PAGE) instead of an ERR_PTR. Link: https://lkml.kernel.org/r/20210122160140.223228-11-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Make this function more generic by passing the file instead of the iocb. Check in the callers whether we should call readpage or not. Also make it return an errno / 0 / AOP_TRUNCATED_PAGE, and make calling put_page() the caller's responsibility. Link: https://lkml.kernel.org/r/20210122160140.223228-10-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
The readpage operation can block in many (most?) filesystems, so we should punt to a work queue instead of calling it. This was the last caller of lock_page_for_iocb(), so remove it. Link: https://lkml.kernel.org/r/20210122160140.223228-9-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
The previous patch removed wait_on_page_locked_async(), so inline __wait_on_page_locked_async into __lock_page_async(). Link: https://lkml.kernel.org/r/20210122160140.223228-8-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
For page splitting to succeed, the thread asking to split the page has to be the only one with a reference to the page. Calling wait_on_page_locked() while holding a reference to the page will effectively prevent this from happening with sufficient threads waiting on the same page. Use put_and_wait_on_page_locked() to sleep without holding a reference to the page, then retry the page lookup after the page is unlocked. Since we now get the page lock a little earlier in filemap_update_page(), we can eliminate a number of duplicate checks. The original intent (commit ebded027 ("avoid unnecessary calls to lock_page when waiting for IO to complete during a read")) behind getting the page lock later was to avoid re-locking the page after it has been brought uptodate by another thread. We still avoid that because we go through the normal lookup path again after the winning thread has brought the page uptodate. Link: https://lkml.kernel.org/r/20210122160140.223228-7-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
This is prep work for the next patch, but I think at least one of the current callers would prefer a killable sleep to an uninterruptible one. Link: https://lkml.kernel.org/r/20210122160140.223228-6-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Add filemap_get_read_batch() which returns the head pages which represent a contiguous array of bytes in the file. It also stops when encountering a page marked as Readahead or !Uptodate (but does return that page) so it can be handled appropriately by filemap_get_pages(). That lets us remove the loop in filemap_get_pages() and check only the last page. Link: https://lkml.kernel.org/r/20210122160140.223228-5-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Using a pagevec lets us keep the pages and the number of pages together which simplifies a lot of the calling conventions. Link: https://lkml.kernel.org/r/20210122160140.223228-4-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Increasing the batch size runs into diminishing returns. It's probably better to make, eg, three calls to filemap_get_pages() than it is to call into kmalloc(). Link: https://lkml.kernel.org/r/20210122160140.223228-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Patch series "Refactor generic_file_buffered_read", v5. This is a combination of Christoph's work to refactor generic_file_buffered_read() and some of my large-page support which was disrupted by Kent's refactoring of generic_file_buffered_read. This patch (of 18): The recent split of generic_file_buffered_read() created some very long function names which are hard to distinguish from each other. Rename as follows: generic_file_buffered_read_readpage -> filemap_read_page generic_file_buffered_read_pagenotuptodate -> filemap_update_page generic_file_buffered_read_no_cached_page -> filemap_create_page generic_file_buffered_read_get_pages -> filemap_get_pages Link: https://lkml.kernel.org/r/20210122160140.223228-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210122160140.223228-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Pavel Begunkov authored
Currently, if I/O is enqueued for async execution direct paths of generic_file_{read,write}_iter() will always revert the iter. There are no users expecting that, and that is also costly. Leave iterators as is on -EIOCBQUEUED. Link: https://lkml.kernel.org/r/f5247b60e7abbd2ff850cd108491f53a2e0c501a.1610207781.git.asml.silence@gmail.comSigned-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Baolin Wang authored
Since commit 74d60958 ("page cache: Add and replace pages using the XArray") was merged, the replace_page_cache_page() can not fail and always return 0, we can remove the redundant return value and void it. Moreover remove the unused gfp_mask. Link: https://lkml.kernel.org/r/609c30e5274ba15d8b90c872fd0d8ac437a9b2bb.1610071401.git.baolin.wang@linux.alibaba.comSigned-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miaohe Lin authored
Commit 108bcc96 ("mm: add & use zone_end_pfn() and zone_spans_pfn()") introduced the helper zone_end_pfn() to calculate the zone end pfn. But pagetypeinfo_showmixedcount_print forgot to use it. And the initialization of local variable pfn is duplicated, remove one. Link: https://lkml.kernel.org/r/20210123070538.5861-1-linmiaohe@huawei.comSigned-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
Currently the basic tests just validate various page table transformations after starting with vm_get_page_prot(VM_READ|VM_WRITE|VM_EXEC) protection. Instead scan over the entire protection_map[] for better coverage. It also makes sure that all these basic page table tranformations checks hold true irrespective of the starting protection value for the page table entry. There is also a slight change in the debug print format for basic tests to capture the protection value it is being tested with. The modified output looks something like [pte_basic_tests ]: Validating PTE basic () [pte_basic_tests ]: Validating PTE basic (read) [pte_basic_tests ]: Validating PTE basic (write) [pte_basic_tests ]: Validating PTE basic (read|write) [pte_basic_tests ]: Validating PTE basic (exec) [pte_basic_tests ]: Validating PTE basic (read|exec) [pte_basic_tests ]: Validating PTE basic (write|exec) [pte_basic_tests ]: Validating PTE basic (read|write|exec) [pte_basic_tests ]: Validating PTE basic (shared) [pte_basic_tests ]: Validating PTE basic (read|shared) [pte_basic_tests ]: Validating PTE basic (write|shared) [pte_basic_tests ]: Validating PTE basic (read|write|shared) [pte_basic_tests ]: Validating PTE basic (exec|shared) [pte_basic_tests ]: Validating PTE basic (read|exec|shared) [pte_basic_tests ]: Validating PTE basic (write|exec|shared) [pte_basic_tests ]: Validating PTE basic (read|write|exec|shared) This adds a missing argument 'struct mm_struct *' in pud_basic_tests() test . This never got exposed before as PUD based THP is available only on X86 platform where mm_pmd_folded(mm) call gets macro replaced without requiring the mm_struct i.e __is_defined(__PAGETABLE_PMD_FOLDED). Link: https://lkml.kernel.org/r/1611137241-26220-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Reviewed-by: Steven Price <steven.price@arm.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
Patch series "mm/debug_vm_pgtable: Some minor updates", v3. This series contains some cleanups and new test suggestions from Catalin from an earlier discussion. https://lore.kernel.org/linux-mm/20201123142237.GF17833@gaia/ This patch (of 2): This adds validation tests for dirtiness after write protect conversion for each page table level. There are two new separate test types involved here. The first test ensures that a given page table entry does not become dirty after pxx_wrprotect(). This is important for platforms like arm64 which transfers and drops the hardware dirty bit (!PTE_RDONLY) to the software dirty bit while making it an write protected one. This test ensures that no fresh page table entry could be created with hardware dirty bit set. The second test ensures that a given page table entry always preserve the dirty information across pxx_wrprotect(). This adds two previously missing PUD level basic tests and while here fixes pxx_wrprotect() related typos in the documentation file. Link: https://lkml.kernel.org/r/1611137241-26220-1-git-send-email-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/1611137241-26220-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
The memcg_data is only valid on the head page, not the tail pages. Change the format and location of the printout within the dump to match the other parts of struct page better. Link: https://lkml.kernel.org/r/20210114190200.1894484-1-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhiyuan Dai authored
Add whitespace to fix coding style issues, improve code reading. Link: https://lkml.kernel.org/r/1612847403-5594-1-git-send-email-daizhiyuan@phytium.com.cnSigned-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
The boot param and config determine the value of memcg_sysfs_enabled, which is unused since commit 10befea9 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") as there are no per-memcg kmem caches anymore. Link: https://lkml.kernel.org/r/20210127124745.7928-1-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
In deactivate_slab() we currently move all but one objects on the cpu freelist to the page freelist one by one using the costly cmpxchg_double() operation. Then we unfreeze the page while moving the last object on page freelist, with a final cmpxchg_double(). This can be optimized to avoid the cmpxchg_double() per object. Just count the objects on cpu freelist (to adjust page->inuse properly) and also remember the last object in the chain. Then splice page->freelist to the last object and effectively add the whole cpu freelist to page->freelist while unfreezing the page, with a single cmpxchg_double(). Link: https://lkml.kernel.org/r/20210115183543.15097-1-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
SLAB has been using get/put_online_cpus() around creating, destroying and shrinking kmem caches since 95402b38 ("cpu-hotplug: replace per-subsystem mutexes with get_online_cpus()") in 2008, which is supposed to be replacing a private mutex (cache_chain_mutex, called slab_mutex today) with system-wide mechanism, but in case of SLAB it's in fact used in addition to the existing mutex, without explanation why. SLUB appears to have avoided the cpu hotplug lock initially, but gained it due to common code unification, such as 20cea968 ("mm, sl[aou]b: Move kmem_cache_create mutex handling to common code"). Regardless of the history, checking if the hotplug lock is actually needed today suggests that it's not, and therefore it's better to avoid this system-wide lock and the ordering this imposes wrt other locks (such as slab_mutex). Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache() protected by slab_mutex, and cpu hotplug callbacks that also take the slab_mutex, which is also taken by the common slab function that currently also take the hotplug lock. Thus the slab_mutex protection should be sufficient. Also per-cpu array caches are allocated for each possible cpu, so not affected by their online/offline state. In SLUB we have for_each_online_cpu() in functions that show statistics and are already unprotected today, as racing with hotplug is not harmful. Otherwise SLUB relies on percpu allocator. The slub_cpu_dead() hotplug callback takes the slab_mutex. To sum up, this patch removes get/put_online_cpus() calls from slab as it should be safe without further adjustments. Link: https://lkml.kernel.org/r/20210113131634.3671-4-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
Since commit 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for SLAB and SLUB when creating, destroying or shrinking a cache. It is quite a heavy lock and it's best to avoid it if possible, as we had several issues with lockdep complaining about ordering in the past, see e.g. e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()"). The problem scenario in 03afc0e2 (solved by the memory hotplug lock) can be summarized as follows: while there's slab_mutex synchronizing new kmem cache creation and SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(), we may miss creation of kmem_cache_node for the hotplugged node in the new kmem cache, because the hotplug callback doesn't yet see the new cache, and cache creation in init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the N_NORMAL_MEMORY nodemask, which however may not yet include the new node, as that happens only later after the MEM_GOING_ONLINE callback. Instead of using get/put_online_mems(), the problem can be solved by SLUB maintaining its own nodemask of nodes for which it has allocated the per-node kmem_cache_node structures. This nodemask would generally mirror the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's control in its memory hotplug callbacks under the slab_mutex. This patch adds such nodemask and its handling. Commit 03afc0e2 mentiones "issues like [the one above]", but there don't appear to be further issues. All the paths (shared for SLAB and SLUB) taking the memory hotplug locks are also taking the slab_mutex, except kmem_cache_shrink() where 03afc0e2 replaced slab_mutex with get/put_online_mems(). We however cannot simply restore slab_mutex in kmem_cache_shrink(), as SLUB can enters the function from a write to sysfs 'shrink' file, thus holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested within slab_mutex. But on closer inspection we don't actually need to protect kmem_cache_shrink() from hotplug callbacks: While SLUB's __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node added in parallel hotplug is not fatal, and parallel hotremove does not free kmem_cache_node's anymore after the previous patch, so use-after free cannot happen. The per-node shrinking itself is protected by n->list_lock. Same is true for SLAB, and SLOB is no-op. SLAB also doesn't need the memory hotplug locking, which it only gained by 03afc0e2 through the shared paths in slab_common.c. Its memory hotplug callbacks are also protected by slab_mutex against races with these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new node is already set there during the MEM_GOING_ONLINE callback, so no special care is needed for SLAB. As such, this patch removes all get/put_online_mems() usage by the slab subsystem. Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
Patch series "mm, slab, slub: remove cpu and memory hotplug locks". Some related work caused me to look at how we use get/put_mems_online() and get/put_online_cpus() during kmem cache creation/descruction/shrinking, and realize that it should be actually safe to remove all of that with rather small effort (as e.g. Michal Hocko suspected in some of the past discussions already). This has the benefit to avoid rather heavy locks that have caused locking order issues already in the past. So this is the result, Patches 2 and 3 remove memory hotplug and cpu hotplug locking, respectively. Patch 1 is due to realization that in fact some races exist despite the locks (even if not removed), but the most sane solution is not to introduce more of them, but rather accept some wasted memory in scenarios that should be rare anyway (full memory hot remove), as we do the same in other contexts already. This patch (of 3): Commit e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()") has fixed a problematic locking order by removing the memory hotplug lock get/put_online_mems() from show_slab_objects(). During the discussion, it was argued [1] that this is OK, because existing slabs on the node would prevent a hotremove to proceed. That's true, but per-node kmem_cache_node structures are not necessarily allocated on the same node and may exist even without actual slab pages on the same node. Any path that uses get_node() directly or via for_each_kmem_cache_node() (such as show_slab_objects()) can race with freeing of kmem_cache_node even with the !NULL check, resulting in use-after-free. To that end, commit e4f8e513 argues in a comment that: * We don't really need mem_hotplug_lock (to hold off * slab_mem_going_offline_callback) here because slab's memory hot * unplug code doesn't destroy the kmem_cache->node[] data. While it's true that slab_mem_going_offline_callback() doesn't free the kmem_cache_node, the later callback slab_mem_offline_callback() actually does, so the race and use-after-free exists. Not just for show_slab_objects() after commit e4f8e513, but also many other places that are not under slab_mutex. And adding slab_mutex locking or other synchronization to SLUB paths such as get_any_partial() would be bad for performance and error-prone. The easiest solution is therefore to make the abovementioned comment true and stop freeing the kmem_cache_node structures, accepting some wasted memory in the full memory node removal scenario. Analogically we also don't free hotremoved pgdat as mentioned in [1], nor the similar per-node structures in SLAB. Importantly this approach will not block the hotremove, as generally such nodes should be movable in order to succeed hotremove in the first place, and thus the GFP_KERNEL allocated kmem_cache_node will come from elsewhere. [1] https://lore.kernel.org/linux-mm/20190924151147.GB23050@dhcp22.suse.cz/ Link: https://lkml.kernel.org/r/20210113131634.3671-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210113131634.3671-2-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Qian Cai <cai@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Berg authored
If kmemleak is enabled, it uses a kmem cache for its own objects. These objects are used to hold information kmemleak uses, including a stack trace. If slub_debug is also turned on, each of them has *another* stack trace, so the overhead adds up, and on my tests (on ARCH=um, admittedly) 2/3rds of the allocations end up being doing the stack tracing. Turn off SLAB_STORE_USER if SLAB_NOLEAKTRACE was given, to avoid storing the essentially same data twice. Link: https://lkml.kernel.org/r/20210113215114.d94efa13ba30.I117b6764e725b3192318bbcf4269b13b709539ae@changeidSigned-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhiyuan Dai authored
Fix some coding style issues, improve code reading. Adds whitespace to clearly separate the parameters. Link: https://lkml.kernel.org/r/1612841499-32166-1-git-send-email-daizhiyuan@phytium.com.cnSigned-off-by: Zhiyuan Dai <daizhiyuan@phytium.com.cn> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nikolay Borisov authored
This argument hasn't been used since e153362a ("slub: Remove objsize check in kmem_cache_flags()") so simply remove it. Link: https://lkml.kernel.org/r/20210126095733.974665-1-nborisov@suse.comSigned-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jacob Wen authored
Currently, a trace record generated by the RCU core is as below. ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f3b49a66 It doesn't tell us what the RCU core has freed. This patch adds the slab name to trace_kmem_cache_free(). The new format is as follows. ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000037f79c8d name=dentry ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f78cb7b5 name=sock_inode_cache ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000018768985 name=pool_workqueue ... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=000000006a6cb484 name=radix_tree_node We can use it to understand what the RCU core is going to free. For example, some users maybe interested in when the RCU core starts freeing reclaimable slabs like dentry to reduce memory pressure. Link: https://lkml.kernel.org/r/20201216072804.8838-1-jian.w.wen@oracle.comSigned-off-by: Jacob Wen <jian.w.wen@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alexey Dobriyan authored
[akpm@linux-foundation.org: update inode_operations.tmpfile] Link: http://lkml.kernel.org/r/20190206073349.GA15311@avx2Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Delete duplicate words in fs/*.c. The doubled words that are being dropped are: that, be, the, in, and, for Link: https://lkml.kernel.org/r/20201224052810.25315-1-rdunlap@infradead.orgSigned-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jiapeng Chong authored
Fix the following coccicheck warnings: fs/ocfs2/refcounttree.c:981:16-18: WARNING !A || A && B is equivalent to !A || B. Link: https://lkml.kernel.org/r/1612235424-80367-1-git-send-email-jiapeng.chong@linux.alibaba.comSigned-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dan Carpenter authored
The error handling in this function frees "reg" but it is still on the "o2hb_all_regions" list so it will lead to a use after freew. Joseph Qi points out that we need to clear the bit in the "o2hb_region_bitmap" as well Link: https://lkml.kernel.org/r/YBk4M6HUG8jB/jc7@mwanda Fixes: 1cf257f5 ("ocfs2: fix memory leak") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
guozh authored
There are some definitions which is not used anymore in OCFS2 module, so as to be removed. Link: https://lkml.kernel.org/r/2021011916182284700534@chinatelecom.cnSigned-off-by: Guozhonghua <guozh88@chinatelecom.cn> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yi Li authored
iput handles NULL pointers gracefully, so there's no need to check the pointer before the call. Link: https://lkml.kernel.org/r/20201231040535.4091761-1-yili@winhong.comSigned-off-by: Yi Li <yili@winhong.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Rustam Kovhaev authored
Mounting a corrupted filesystem with NTFS resulted in a kernel crash. We should check for valid STANDARD_INFORMATION attribute offset and length before trying to access it Link: https://lkml.kernel.org/r/20210217155930.1506815-1-rkovhaev@gmail.com Link: https://syzkaller.appspot.com/bug?extid=c584225dabdea2f71969Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> Reported-by: syzbot+c584225dabdea2f71969@syzkaller.appspotmail.com Tested-by: syzbot+c584225dabdea2f71969@syzkaller.appspotmail.com Acked-by: Anton Altaparmakov <anton@tuxera.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Drop the repeated words "the" and "in" in comments. Link: https://lkml.kernel.org/r/20210125194937.24627-1-rdunlap@infradead.orgSigned-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Colin Ian King authored
Here are some of the more common spelling mistakes and typos that I've found while fixing up spelling mistakes in the kernel since September 2020 Link: https://lkml.kernel.org/r/20210210124318.55082-1-colin.king@canonical.comSigned-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
dingsenjie authored
Increase "allocted" and "exeeds" spelling error check. Link: https://lkml.kernel.org/r/20210127081919.1928-1-dingsenjie@163.comSigned-off-by: dingsenjie <dingsenjie@yulong.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
zuoqilin authored
Increase exeeds spelling error check. Link: https://lkml.kernel.org/r/20210127060049.915-1-zuoqilin1@163.comSigned-off-by: zuoqilin <zuoqilin@yulong.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
tangchunyou authored
Increase maping spelling error check. Link: https://lkml.kernel.org/r/20210121092125.2663-1-tangchunyou@163.comSigned-off-by: WenZhang <zhangwen@yulong.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Since CONFIG_EXPERIMENTAL was removed in 2013, go ahead and drop it from any defconfig files. Link: https://lkml.kernel.org/r/20210115010011.29483-1-rdunlap@infradead.org Fixes: 3d374d09 ("final removal of CONFIG_EXPERIMENTAL") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Brian Cain <bcain@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-