- 10 Feb, 2023 9 commits
-
-
Liam R. Howlett authored
Add wrappers for the maple tree to the vma iterator. This will provide type safety at compile time. Link: https://lkml.kernel.org/r/20230120162650.984577-8-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
When mas_prev() does not find anything, set the state to MAS_NONE. Handle the MAS_NONE in mas_find() like a MAS_START. Link: https://lkml.kernel.org/r/20230120162650.984577-7-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: <syzbot+502859d610c661e56545@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
If an invalidated maple state is encountered during write, reset the maple state to MAS_START. This will result in a re-walk of the tree to the correct location for the write. Link: https://lore.kernel.org/all/20230107020126.1627-1-sj@kernel.org/ Link: https://lkml.kernel.org/r/20230120162650.984577-6-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
Add a testcase to ensure the iterator detects bad states on modifications and does what the user expects Link: https://lkml.kernel.org/r/20230120162650.984577-5-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
When iterating, a user may operate on the tree and cause the maple state to be altered and left in an unintuitive state. Detect this scenario and correct it by setting to the limit and invalidating the state. Link: https://lkml.kernel.org/r/20230120162650.984577-4-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
Ensure the node isn't dead after reading the node end. Link: https://lkml.kernel.org/r/20230120162650.984577-3-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Liam R. Howlett authored
Patch series "VMA tree type safety and remove __vma_adjust()", v4. This patchset does two things: 1. Clean up, including removal of __vma_adjust() and 2. Extends the VMA iterator API to provide type safety to the VMA operations using the maple tree, as requested by Linus [1]. It also addresses another issue of usability brought up by Linus about needing to modify the maple state within the loops. The maple state has been replaced by the VMA iterator and the iterator is now modified within the MM code so the caller should not need to worry about doing the work themselves when tree modifications occur. This brought up a potential inconsistency of the iterator state and what the user expects, so the inconsistency is addressed to keep the VMA iterator safe for use after the looping over a VMA range. This is addressed in patch 3 ("maple_tree: Reduce user error potential") and 4 ("test_maple_tree: Test modifications while iterating"). While cleaning up the state, the duplicate locking code in mm/mmap.c introduced by the maple tree has been address by abstracting it to two functions: vma_prepare() and vma_complete(). These abstractions allowed for a much simpler __vma_adjust(), which eventually leads to the removal of the __vma_adjust() function by placing the logic into the vma_merge() function itself. 1. https://lore.kernel.org/linux-mm/CAHk-=wg9WQXBGkNdKD2bqocnN73rDswuWsavBB7T-tekykEn_A@mail.gmail.com/ This patch (of 49): Add a function that will zero out the maple state struct and set some basic defaults. Link: https://lkml.kernel.org/r/20230120162650.984577-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230120162650.984577-2-Liam.Howlett@oracle.comSigned-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
If we have a HIGHMEM system with a large folio, 'offset' may be larger than PAGE_SIZE, and so min_t will cap at 'len' instead of the intended end-of-page. That can overflow into the next page which is likely to be unmapped and fault, but could theoretically copy the wrong data. Link: https://lkml.kernel.org/r/Y919vmSrtAgsf6K3@casper.infradead.org Fixes: 00cdf760 ("mm: add memcpy_from_file_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
David Hildenbrand authored
We're masking with the number of type bits instead of the type mask, which is obviously wrong. Link: https://lkml.kernel.org/r/39fd91e3-c93b-23c6-afc6-cbe473bb0ca9@redhat.com Fixes: 20aae9ef ("arm/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 03 Feb, 2023 31 commits
-
-
Matthew Wilcox (Oracle) authored
This is just a conversion to the folio API. While there are some nods towards supporting multi-page folios in here, the blocks array is still sized for one page's worth of blocks, and there are other assumptions such as the blocks_per_page variable. [willy@infradead.org: fix accidentally-triggering WARN_ON_ONCE] Link: https://lkml.kernel.org/r/Y9kuaBgXf9lKJ8b0@casper.infradead.org Link: https://lkml.kernel.org/r/20230126201255.1681189-3-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
Patch series "Convert writepage_t to use a folio". More folioisation. I split out the mpage work from everything else because it completely dominated the patch, but some implementations I just converted outright. This patch (of 2): We always write back an entire folio, but that's currently passed as the head page. Convert all filesystems that use write_cache_pages() to expect a folio instead of a page. Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Matthew Wilcox (Oracle) authored
This is the equivalent of memcpy_from_page(). It differs in that it takes the position in a file instead of offset in a folio, it accepts the total number of bytes to be copied (instead of the number of bytes to be copied from this folio) and it returns how many bytes were copied from the folio, rather than making the caller calculate that and then checking if the caller got it right. [akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/20230126201552.1681588-1-willy@infradead.orgSigned-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
The ->rw_page method is a special purpose bypass of the usual bio handling path that is limited to single-page reads and writes and synchronous which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio an synchronously wait on it based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page instead. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all feastures are properly supported by all drivers, e.g. right now ->rw_page bypassed cgroup writeback entirely. [akpm@linux-foundation.org: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Split the block device case from swap_readpage into a separate helper, following the abstraction for file based swap. Link: https://lkml.kernel.org/r/20230125133436.447864-7-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
__swap_writepage always returns 0. Link: https://lkml.kernel.org/r/20230125133436.447864-6-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Optimize the synchronous swap in case by using an on-stack bio instead of allocating one using bio_alloc. Link: https://lkml.kernel.org/r/20230125133436.447864-5-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Split the block device case from swap_readpage into a separate helper, following the abstraction for file based swap and frontswap. Link: https://lkml.kernel.org/r/20230125133436.447864-4-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
swap_readpage always returns 0, and no caller checks the return value. [akpm@linux-foundation.org: fix void-returning swap_readpage() stub, per Keith] Link: https://lkml.kernel.org/r/20230125133436.447864-3-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Patch series "remove ->rw_page". This series removes the ->rw_page block_device_operation, which is an old and clumsy attempt at a simple read/write fast path for the block layer. It isn't actually used by the fastest block layer operations that we support (polled I/O through io_uring), but only used by the mpage buffered I/O helpers which are some of the slowest I/O we have and do not make any difference there at all, and zram which is a block device abused to duplicate the zram functionality. Given that zram is heavily used we need to make sure there is a good replacement for synchronous I/O, so this series adds a new flag for drivers that complete I/O synchronously and uses that flag to use on-stack bios and synchronous submission for them in the swap code. This patch (of 7): These are micro-optimizations for synchronous I/O, which do not matter compared to all the other inefficiencies in the legacy buffer_head based mpage code. Link: https://lkml.kernel.org/r/20230125133436.447864-1-hch@lst.de Link: https://lkml.kernel.org/r/20230125133436.447864-2-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Move the VM_FLUSH_RESET_PERMS to the caller and rename the function to better describe what it is doing. Link: https://lkml.kernel.org/r/20230121071051.1143058-11-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
vunmap only needs to find and free the vmap_area and vm_strut, so open code that there and merge the rest of the code into vfree. Link: https://lkml.kernel.org/r/20230121071051.1143058-10-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
All these checks apply to the free_vm_area interface as well, so move them to the common routine. Link: https://lkml.kernel.org/r/20230121071051.1143058-9-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Use the common helper to find and remove a vmap_area instead of open coding it. Link: https://lkml.kernel.org/r/20230121071051.1143058-8-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
__remove_vm_area is the only part of va_remove_mappings that requires a vmap_area. Move the call out to the caller and only pass the vm_struct to va_remove_mappings. Link: https://lkml.kernel.org/r/20230121071051.1143058-7-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
This adds an extra, never taken, in_interrupt() branch, but will allow to cut down the maze of vfree helpers. Link: https://lkml.kernel.org/r/20230121071051.1143058-6-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Move these two functions around a bit to avoid forward declarations. Link: https://lkml.kernel.org/r/20230121071051.1143058-5-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on from vfree if called from interrupt context so that the extra low-level helper can be avoided. Link: https://lkml.kernel.org/r/20230121071051.1143058-4-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
__vfree is a subset of vfree that just skips a few checks, and which is only used by vfree and an error cleanup path. Fold __vfree into vfree and switch the only other caller to call vfree() instead. Link: https://lkml.kernel.org/r/20230121071051.1143058-3-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Christoph Hellwig authored
Patch series "cleanup vfree and vunmap". This little series untangles the vfree and vunmap code path a bit. This patch (of 10): VM_FLUSH_RESET_PERMS is just for use with vmalloc as it is tied to freeing the underlying pages. Link: https://lkml.kernel.org/r/20230121071051.1143058-1-hch@lst.de Link: https://lkml.kernel.org/r/20230121071051.1143058-2-hch@lst.deSigned-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mel Gorman authored
Commit 7efc3b72 ("mm/compaction: fix set skip in fast_find_migrateblock") address an issue where a pageblock selected by fast_find_migrateblock() was ignored. Unfortunately, the same fix resulted in numerous reports of khugepaged or kcompactd stalling for long periods of time or consuming 100% of CPU. Tracing showed that there was a lot of rescanning between a small subset of pageblocks because the conditions for marking the block skip are not met. The scan is not reaching the end of the pageblock because enough pages were isolated but none were migrated successfully. Eventually it circles back to the same block. Pageblock skip tracking tries to minimise both latency and excessive scanning but tracking exactly when a block is fully scanned requires an excessive amount of state. This patch forcibly rescans a pageblock when all isolated pages fail to migrate even though it could be for transient reasons such as page writeback or page dirty. This will sometimes migrate too many pages but pageblocks will be marked skip and forward progress will be made. "Usemen" from the mmtests configuration workload-usemem-stress-numa-compact was used to stress compaction. The compaction trace events were recorded using a 6.2-rc5 kernel that includes commit 7efc3b72 and count of unique ranges were measured. The top 5 ranges were 3076 range=(0x10ca00-0x10cc00) 3076 range=(0x110a00-0x110c00) 3098 range=(0x13b600-0x13b800) 3104 range=(0x141c00-0x141e00) 11424 range=(0x11b600-0x11b800) While this workload is very different than what the bugs reported, the pattern of the same subset of blocks being repeatedly scanned is observed. At one point, *only* the range range=(0x11b600 ~ 0x11b800) was scanned for 2 seconds. 14 seconds passed between the first migration-related event and the last. With the series applied including this patch, the top 5 ranges were 1 range=(0x11607e-0x116200) 1 range=(0x116200-0x116278) 1 range=(0x116278-0x116400) 1 range=(0x116400-0x116424) 1 range=(0x116424-0x116600) Only unique ranges were scanned and the time between the first migration-related event was 0.11 milliseconds. Link: https://lkml.kernel.org/r/20230125134434.18017-5-mgorman@techsingularity.net Fixes: 7efc3b72 ("mm/compaction: fix set skip in fast_find_migrateblock") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mel Gorman authored
cc->finish_pageblock is set when the current pageblock should be rescanned but fast_find_migrateblock can select an alternative block. Disable fast_find_migrateblock when the current pageblock scan should be completed. Link: https://lkml.kernel.org/r/20230125134434.18017-4-mgorman@techsingularity.netSigned-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mel Gorman authored
If a page has been captured then draining is unnecssary so check first for a captured page. Link: https://lkml.kernel.org/r/20230125134434.18017-3-mgorman@techsingularity.netSigned-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mel Gorman authored
Patch series "Fix excessive CPU usage during compaction". Commit 7efc3b72 ("mm/compaction: fix set skip in fast_find_migrateblock") fixed a problem where pageblocks found by fast_find_migrateblock() were ignored. Unfortunately there were numerous bug reports complaining about high CPU usage and massive stalls once 6.1 was released. Due to the severity, the patch was reverted by Vlastimil as a short-term fix[1] to -stable. The underlying problem for each of the bugs is suspected to be the repeated scanning of the same pageblocks. This series should guarantee forward progress even with commit 7efc3b72. More information is in the changelog for patch 4. [1] http://lore.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz This patch (of 4): The rescan field was not well named albeit accurate at the time. Rename the field to finish_pageblock to indicate that the remainder of the pageblock should be scanned regardless of COMPACT_CLUSTER_MAX. The intent is that pageblocks with transient failures get marked for skipping to avoid revisiting the same pageblock. Link: https://lkml.kernel.org/r/20230125134434.18017-2-mgorman@techsingularity.netSigned-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jongwoo Han authored
Link: https://lkml.kernel.org/r/20230125180847.4542-1-jongwooo.han@gmail.comSigned-off-by: Jongwoo Han <jongwooo.han@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrey Konovalov authored
The implementation of page_alloc poisoning sampling assumed that tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations. However, this is no longer the case since commit 70c248ac ("mm: kasan: Skip unpoisoning of user pages"). This leads to kernel crashes when MTE-enabled userspace mappings are used with Hardware Tag-Based KASAN enabled. Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook(). Also clarify and fix related comments. [andreyknvl@google.com: update comment] Link: https://lkml.kernel.org/r/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/24ea20c1b19c2b4b56cf9f5b354915f8dbccfc77.1674592496.git.andreyknvl@google.com Fixes: 44383cef ("kasan: allow sampling page_alloc allocations for HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Peter Collingbourne <pcc@google.com> Tested-by: Peter Collingbourne <pcc@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Mike Rapoport authored
W=1 build with clangs complains: mm/sparse.c:347:27: warning: unused function 'pgdat_to_phys' [-Wunused-function] static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat) ^ 1 warning generated. pgdat_to_phys() is only used by functions defined when CONFIG_MEMORY_HOTREMOVE=y. Move pgdat_to_phys() under #ifdef CONFIG_MEMORY_HOTREMOVE to make clang happy. Link: https://lkml.kernel.org/r/20230121101151.1703292-1-rppt@kernel.orgSigned-off-by: Mike Rapoport <rppt@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/all/202301210155.1E5zABb5-lkp@intel.com Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hyeonggon Yoo authored
When allocating a high-order page, separate allocation timestamp is recorded for each sub-page resulting in different timestamp values between them. This behavior is not consistent with the behavior when recording free timestamp and caused confusion when analyzing memory dumps. Record single timestamp for the entire allocation, aligning with the behavior for free timestamps. Link: https://lkml.kernel.org/r/20230121165054.520507-1-42.hyeyoo@gmail.comSigned-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jiaqi Yan authored
Add documentation for memory_failure's per NUMA node sysfs entries Link: https://lkml.kernel.org/r/20230120034622.2698268-4-jiaqiyan@google.comSigned-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jiaqi Yan authored
Right before memory_failure finishes its handling, accumulate poisoned page's resolution counters to pglist_data's memory_failure_stats, so as to update the corresponding sysfs entries. Tested: 1) Start an application to allocate memory buffer chunks 2) Convert random memory buffer addresses to physical addresses 3) Inject memory errors using EINJ at chosen physical addresses 4) Access poisoned memory buffer and recover from SIGBUS 5) Check counter values under /sys/devices/system/node/node*/memory_failure/* Link: https://lkml.kernel.org/r/20230120034622.2698268-3-jiaqiyan@google.comSigned-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jiaqi Yan authored
Patch series "Introduce per NUMA node memory error statistics", v2. Background ========== In the RFC for Kernel Support of Memory Error Detection [1], one advantage of software-based scanning over hardware patrol scrubber is the ability to make statistics visible to system administrators. The statistics include 2 categories: * Memory error statistics, for example, how many memory error are encountered, how many of them are recovered by the kernel. Note these memory errors are non-fatal to kernel: during the machine check exception (MCE) handling kernel already classified MCE's severity to be unnecessary to panic (but either action required or optional). * Scanner statistics, for example how many times the scanner have fully scanned a NUMA node, how many errors are first detected by the scanner. The memory error statistics are useful to userspace and actually not specific to scanner detected memory errors, and are the focus of this patchset. Motivation ========== Memory error stats are important to userspace but insufficient in kernel today. Datacenter administrators can better monitor a machine's memory health with the visible stats. For example, while memory errors are inevitable on servers with 10+ TB memory, starting server maintenance when there are only 1~2 recovered memory errors could be overreacting; in cloud production environment maintenance usually means live migrate all the workload running on the server and this usually causes nontrivial disruption to the customer. Providing insight into the scope of memory errors on a system helps to determine the appropriate follow-up action. In addition, the kernel's existing memory error stats need to be standardized so that userspace can reliably count on their usefulness. Today kernel provides following memory error info to userspace, but they are not sufficient or have disadvantages: * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though * ras:memory_failure_event: only available after explicitly enabled * /dev/mcelog provides many useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs * kernel logs: userspace needs to process log text Exposing memory error stats is also a good start for the in-kernel memory error detector. Today the data source of memory error stats are either direct memory error consumption, or hardware patrol scrubber detection (either signaled as UCNA or SRAO). Once in-kernel memory scanner is implemented, it will be the main source as it is usually configured to scan memory DIMMs constantly and faster than hardware patrol scrubber. How Implemented =============== As Naoya pointed out [2], exposing memory error statistics to userspace is useful independent of software or hardware scanner. Therefore we implement the memory error statistics independent of the in-kernel memory error detector. It exposes the following per NUMA node memory error counters: /sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. This approach can be easier to extend for future use cases than /proc/meminfo, trace event, and log. The following math holds for the statistics: * total = recovered + ignored + failed + delayed These memory error stats are reset during machine boot. The 1st commit introduces these sysfs entries. The 2nd commit populates memory error stats every time memory_failure attempts memory error recovery. The 3rd commit adds documentations for introduced stats. [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6 This patch (of 3): Today kernel provides following memory error info to userspace, but each has its own disadvantage * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though * ras:memory_failure_event: only available after explicitly enabled * /dev/mcelog provides many useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs * kernel logs: userspace needs to process log text Exposes per NUMA node memory error stats as sysfs entries: /sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. The following math holds for the statistics: * total = recovered + ignored + failed + delayed Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.comSigned-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-