1. 12 Dec, 2022 24 commits
    • MIPS&LoongArch&NIOS2: adjust prototypes of p?d_init() · 22c4e804
      Feiyang Chen authored
      Patch series "mm/sparse-vmemmap: Generalise helpers and enable for
      LoongArch", v14.
      
      This series enables sparse-vmemmap for LoongArch.  LoongArch cannot use
      the generic helpers directly because MIPS and LoongArch need to call
      pgd_init()/pud_init()/pmd_init() when populating page tables.  So we
      adjust the prototypes of p?d_init() so that the generic helpers can call
      them, then enable sparse-vmemmap with the generic helpers, and finally
      generalise vmemmap_populate_hugepages() for ARM64, X86 and LoongArch.
      
      
      This patch (of 4):
      
      We are preparing to add sparse vmemmap support to LoongArch.  MIPS and
      LoongArch need to call pgd_init()/pud_init()/pmd_init() when populating
      page tables, so adjust their prototypes so that the generic helpers can
      call them.
      
      NIOS2 declares pmd_init() but doesn't use it, so just remove the
      declaration to avoid build errors.
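      As an illustration of the prototype change, a minimal sketch (based on
      the MIPS declarations; the exact argument lists are assumptions, not the
      literal diff):
      
        /* Before: addresses passed as unsigned long, MIPS-specific shape. */
        void pgd_init(unsigned long page);
        void pud_init(unsigned long addr, unsigned long pagetable);
        void pmd_init(unsigned long addr, unsigned long pagetable);
        
        /* After: take a pointer to the table, so generic helpers can call them. */
        void pgd_init(void *addr);
        void pud_init(void *addr);
        void pmd_init(void *addr);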
      
      Link: https://lkml.kernel.org/r/20221027125253.3458989-1-chenhuacai@loongson.cn
      Link: https://lkml.kernel.org/r/20221027125253.3458989-2-chenhuacai@loongson.cn
      Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
      Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Xuefeng Li <lixuefeng@loongson.cn>
      Cc: Xuerui Wang <kernel@xen0n.name>
      Cc: Min Zhou <zhoumin@loongson.cn>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: allow using __msan_instrument_asm_store() inside runtime · 85716a80
      Alexander Potapenko authored
      In certain cases (e.g.  when handling a softirq)
      __msan_instrument_asm_store(&var, sizeof(var)) may be called from
      within the KMSAN runtime, but later the value of @var is used with
      !kmsan_in_runtime(), leading to false positives.
      
      Because kmsan_internal_unpoison_memory() doesn't take locks, it should be
      fine to call it without kmsan_in_runtime() checks, which fixes the
      mentioned false positives.
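      A minimal sketch of the resulting shape (simplified; the real function
      also validates the address range, so treat this as an assumption about
      the structure rather than the exact patch):
      
        void __msan_instrument_asm_store(void *addr, uintptr_t size)
        {
                if (!kmsan_enabled)
                        return;
                /*
                 * kmsan_internal_unpoison_memory() takes no locks, so calling
                 * it here without a kmsan_in_runtime() bail-out is fine.
                 */
                kmsan_internal_unpoison_memory(addr, size, /*checked=*/false);
        }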
      
      Link: https://lkml.kernel.org/r/20221128094541.2645890-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lockdep: allow instrumenting lockdep.c with KMSAN · 1e8e4a7c
      Alexander Potapenko authored
      Lockdep and KMSAN used to play badly together, causing deadlocks when
      KMSAN instrumentation of lockdep.c called lockdep functions recursively.
      
      It looks like this is no longer the case, and a kernel can run (albeit
      more slowly) with both KMSAN and lockdep enabled.  This patch should fix
      false positives on wq_head->lock->dep_map, which KMSAN used to consider
      uninitialized because lockdep.c was not instrumented.
      
      Link: https://lore.kernel.org/lkml/Y3b9AAEKp2Vr3e6O@sol.localdomain/
      Link: https://lkml.kernel.org/r/20221128094541.2645890-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Eric Biggers <ebiggers@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • include/linux/pgtable.h: remove redundant pte variable · d3a89233
      zhang songyi authored
      Return the value from ptep_get_and_clear_full() directly instead of
      storing it in a redundant local variable first.
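      The pattern being cleaned up, as a sketch (the enclosing helper name
      here is hypothetical, not the function touched by the patch):
      
        static inline pte_t example_clear(struct mm_struct *mm, unsigned long addr,
                                          pte_t *ptep, int full)
        {
                /* Before: pte_t pte = ptep_get_and_clear_full(mm, addr, ptep, full);
                 *         return pte;
                 * After:  return the result directly.
                 */
                return ptep_get_and_clear_full(mm, addr, ptep, full);
        }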
      
      Link: https://lkml.kernel.org/r/202211282107437343474@zte.com.cn
      Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/fadvise: use LLONG_MAX instead of -1 for eof · 3cd629e5
      Brian Foster authored
      generic_fadvise() sets endbyte = -1 to specify end of file (i.e.  if
      length == 0 is passed from userspace).  Most other callers to
      filemap_fdatawrite_range() use LLONG_MAX for this purpose, particularly if
      they also call fdatawait_range() (which requires end >= start).  For
      example, sync_file_range(), vfs_fsync() (where the range is passed down
      through per-fs ->fsync() callbacks), filemap_flush(), etc. 
      generic_fadvise() does not currently wait on writeback, but fix the call
      up to be consistent with other callers.
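      A sketch of the change in generic_fadvise() (close to the code in
      mm/fadvise.c, but treat variable names and context as approximate):
      
        endbyte = (u64)offset + (u64)len;
        if (!len || endbyte < len)
                endbyte = LLONG_MAX;    /* was: endbyte = -1; */
        else
                endbyte--;              /* inclusive */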
      
      Link: https://lkml.kernel.org/r/20221128155632.3950447-3-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • filemap: skip write and wait if end offset precedes start · feeb9b26
      Brian Foster authored
      Patch series "filemap: skip write and wait if end offset precedes start",
      v2.
      
      A fix for the odd write and wait behavior described in the patch 1 commit
      log.  Technically patch 1 could simply remove the check rather than lift
      it into the callers, but this seemed a bit more user friendly to me. 
      Patch 2 is appended after observation that fadvise() interacted poorly
      with the v1 patch.  This is no longer a problem with v2, making patch 2
      purely a cleanup.
      
      This series survived both fstests and ltp regression runs without
      observable problems.  I had (end < start) warning checks in each relevant
      function, with fadvise() being the only caller that triggered them.  That
      said, I dropped the warnings after testing because there seemed too much
      potential for noise from the various other callers.
      
      
      This patch (of 2):
      
      A call to file[map]_write_and_wait_range() with an end offset that
      precedes the start offset but happens to land in the same page can trigger
      writeback submission but fails to wait on the submitted page.  Writeback
      submission occurs because __filemap_fdatawrite_range() passes both offsets
      down into write_cache_pages(), which rounds down to page indexes before it
      starts processing writeback.  However, __filemap_fdatawait_range()
      immediately returns if the byte-granular end offset precedes the start
      offset.
      
      This behavior was observed in the form of unpredictable latency from a
      frequent write and wait call with incorrect parameters.  The behavior gave
      the impression that the fdatawait path might occasionally fail to wait on
      writeback, but further investigation showed the latency was from
      write_cache_pages() waiting on writeback state to clear for a page already
      under writeback.  Therefore, this indicated that fdatawait actually never
      waits on writeback in this particular situation.
      
      The byte granular check in __filemap_fdatawait_range() goes all the way
      back to the old wait_on_page_writeback() helper.  It originally used page
      offsets and so would have waited in this problematic case.  That changed
      to byte granularity file offsets in commit 94004ed7 ("kill
      wait_on_page_writeback_range"), which subtly changed this behavior.  The
      check itself has become somewhat redundant since the error checking code
      that used to follow the wait loop (at the time of the aforementioned
      commit) has now been removed and lifted into the higher level callers.
      
      Therefore, we can restore historical fdatawait behavior by simply removing
      the check.  Since the current fdatawait behavior has been in place for
      quite some time and is consistent with other interfaces that use file
      offsets, instead lift the check into the file[map]_write_and_wait_range()
      helpers to provide consistent behavior between the write and wait.
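      A minimal sketch of where the check ends up (greatly simplified -- the
      real helper also folds in mapping error flags):
      
        int filemap_write_and_wait_range(struct address_space *mapping,
                                         loff_t lstart, loff_t lend)
        {
                if (lend < lstart)
                        return 0;   /* check lifted out of __filemap_fdatawait_range() */
        
                /* submit writeback for [lstart, lend], then wait for it */
                return __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL) ?:
                       filemap_fdatawait_range(mapping, lstart, lend);
        }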
      
      Link: https://lkml.kernel.org/r/20221128155632.3950447-1-bfoster@redhat.com
      Link: https://lkml.kernel.org/r/20221128155632.3950447-2-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: implement writeback mechanism for zsmalloc · 9997bc01
      Nhat Pham authored
      This commit adds the writeback mechanism for zsmalloc, analogous to the
      zbud allocator.  Zsmalloc will attempt to determine the coldest zspage
      (i.e., least recently used) in the pool, and attempt to write back all the
      stored compressed objects via the pool's evict handler.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-7-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: add zpool_ops field to zs_pool to store evict handlers · bd0fded2
      Nhat Pham authored
      This adds a new field to zs_pool to store evict handlers for writeback,
      analogous to the zbud allocator.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-6-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order · 64f768c6
      Nhat Pham authored
      This helps determine the coldest zspages as candidates for writeback.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-5-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks · c0547d0b
      Nhat Pham authored
      Currently, zsmalloc has a hierarchy of locks, which includes a pool-level
      migrate_lock, and a lock for each size class.  We have to obtain both
      locks in the hotpath in most cases anyway, except for zs_malloc.  This
      exception will no longer exist when we introduce an LRU into the zs_pool
      for the new writeback functionality - we will need to obtain a pool-level
      lock to synchronize LRU handling even in zs_malloc.
      
      In preparation for zsmalloc writeback, consolidate these locks into a
      single pool-level lock, which drastically reduces the complexity of
      synchronization in zsmalloc.
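      Schematically, the consolidation looks like this (a sketch of the
      intent; constant names come from zsmalloc.c, but the exact layout is
      not the patch):
      
        struct size_class {
                /* spinlock_t lock;  -- removed: no more per-class lock */
                struct list_head fullness_list[NR_ZS_FULLNESS];
                int size;
        };
        
        struct zs_pool {
                /* rwlock_t migrate_lock;  -- removed */
                spinlock_t lock;        /* single pool-level lock */
                struct size_class *size_class[ZS_SIZE_CLASSES];
        };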
      
      We have also benchmarked the lock consolidation to see the performance
      effect of this change on zram.
      
      First, we ran a synthetic FS workload on a server machine with 36 cores
      (same machine for all runs), using
      
      fs_mark  -d  ../zram1mnt  -s  100000  -n  2500  -t  32  -k
      
      before and after for btrfs and ext4 on zram (FS usage is 80%).
      
      Here is the result (unit is file/second):
      
      With lock consolidation (btrfs):
      Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028
      
      Without lock consolidation (btrfs):
      Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665
      
      With lock consolidation (ext4):
      Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668
      
      Without lock consolidation (ext4)
      Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469
      
      As you can see, we observe a 0.3% regression for btrfs, and a 0.9%
      regression for ext4. This is a small, barely measurable difference in my
      opinion.
      
      For a more realistic scenario, we also tried building the kernel on zram.
      Here is the time it takes (in seconds):
      
      With lock consolidation (btrfs):
      real
      Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159
      user
      Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656
      sys
      Average: 521.4, Median: 522.0, Stddev: 1.51657508881031
      
      Without lock consolidation (btrfs):
      real
      Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756
      user
      Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023
      sys
      Average: 520.6, Median: 521.0, Stddev: 1.140175425099138
      
      With lock consolidation (ext4):
      real
      Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951
      user
      Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307
      sys
      Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317
      
      Without lock consolidation (ext4)
      real
      Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159
      user
      Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523
      sys
      Average: 520.4, Median: 520.0, Stddev: 1.140175425099138
      
      The difference is entirely within the noise of a typical run on zram. 
      This hardly justifies the complexity of maintaining both the pool lock and
      the class lock.  In fact, for writeback, we would need to introduce yet
      another lock to prevent data races on the pool's LRU, further complicating
      the lock handling logic.  IMHO, it is just better to collapse all of these
      into a single pool-level lock.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zpool: clean out dead code · 6a05aa30
      Johannes Weiner authored
      There is a lot of provision for flexibility that isn't actually needed or
      used.  Zswap (the only zpool user) always passes zpool_ops with an .evict
      method set.  The backends that reclaim only do so for zswap, so they can
      also directly call zpool_ops without indirection or checks.
      
      Finally, there is no need to check the retries parameter and bail with
      -EINVAL in the reclaim function, when that's called just a few lines below
      with a hard-coded 8.  There is no need to duplicate the evictable and
      sleep_mapped attrs from the driver in zpool_ops.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-3-nphamcs@gmail.com
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zswap: fix writeback lock ordering for zsmalloc · 6b3379e8
      Johannes Weiner authored
      Patch series "Implement writeback for zsmalloc", v7.
      
      Unlike other zswap allocators such as zbud or z3fold, zsmalloc currently
      lacks the writeback mechanism.  This means that when the zswap pool is
      full, it will simply reject further allocations, and the pages will be
      written directly to swap.
      
      This series of patches implements writeback for zsmalloc. When the zswap
      pool becomes full, zsmalloc will attempt to evict all the compressed
      objects in the least-recently used zspages.
      
      
      This patch (of 6):
      
      zswap's customary lock order is tree->lock before pool->lock, because the
      tree->lock protects the entries' refcount, and the free callbacks in the
      backends acquire their respective pool locks to dispatch the backing
      object.  zsmalloc's map callback takes the pool lock, so zswap must not
      grab the tree->lock while a handle is mapped.  This currently only happens
      during writeback, which isn't implemented for zsmalloc.  In preparation
      for it, move the tree->lock section out of the mapped-entry section.
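      A sketch of the reordering in the writeback path (simplified
      pseudocode; the decompress step and exact call sites are assumptions):
      
        /* After: finish with the mapped object first, then take tree->lock. */
        src = zpool_map_handle(pool, handle, ZPOOL_MM_RO); /* zsmalloc takes its pool lock */
        decompress_into_page(src, page);                   /* hypothetical helper */
        zpool_unmap_handle(pool, handle);                  /* pool lock released */
        
        spin_lock(&tree->lock);                            /* only now touch tree state */
        zswap_entry_put(tree, entry);
        spin_unlock(&tree->lock);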
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20221128191616.1261026-2-nphamcs@gmail.com
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: fix madvise_pageout for private file mappings · fd3b1bc3
      Pavankumar Kondeti authored
      When MADV_PAGEOUT is called on a private file mapping VMA region, we bail
      out early if the process is neither the owner of nor write-capable on the
      file.  However, this VMA may have both private/shared clean pages and
      private dirty pages.  The opportunity to page out the private dirty pages
      (anon pages) is missed.  Fix this behavior by letting pageout of private
      file mappings proceed further, and perform the file access check along
      with PageAnon() during the page walk.
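      A sketch of the reworked check inside the page walk (the helper name is
      an assumption; only the shape of the logic is intended):
      
        /* Decided once per VMA: may this caller page out file-backed pages? */
        bool file_pageout_ok = !vma->vm_file || can_do_file_pageout(vma);
        
        /* Then, for each page during the walk: */
        if (!PageAnon(page) && !file_pageout_ok)
                continue;   /* private dirty (anon) pages are still paged out */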
      
      We observe ~10% improvement in zram usage, thus leaving more available
      memory on a 4GB RAM system running Android.
      
      [quic_pkondeti@quicinc.com: v2]
        Link: https://lkml.kernel.org/r/1669962597-27724-1-git-send-email-quic_pkondeti@quicinc.com
      Link: https://lkml.kernel.org/r/1667971116-12900-1-git-send-email-quic_pkondeti@quicinc.com
      Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: add tracepoint to collapse_file() · 4c9473e8
      Gautam Menghani authored
      "mm_khugepaged_collapse_file" for capturing is_shmem.
      Currently, is_shmem is not being captured. Capturing is_shmem is useful
      as it can indicate if tmpfs is being used as a backing store instead of
      persistent storage. Add the tracepoint in collapse_file() named
      "mm_khugepaged_collapse_file" for capturing is_shmem.
      
      [gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt]
        Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com
      Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	[tracing]
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: remove FOLL_MIGRATION · f7355e99
      David Hildenbrand authored
      Fortunately, the last user (KSM) is gone, so let's just remove this rather
      special code from generic GUP handling -- especially because KSM never
      required the PMD handling as KSM only deals with individual base pages.
      
      [akpm@linux-foundation.org: fix merge snafu]
      Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: convert break_ksm() to use walk_page_range_vma() · d7c0e68d
      David Hildenbrand authored
      FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually,
      there is not even a need to wait for the migration to finish; we only
      want to know if we're dealing with a KSM page.
      
      Using follow_page() just to identify a KSM page overcomplicates GUP code. 
      Let's use walk_page_range_vma() instead, because we don't actually care
      about the page itself, we only need to know a single property -- no need
      to even grab a reference.
      
      So, get rid of follow_page() usage such that we can get rid of
      FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
      the future.
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s).  I
      don't think we particularly care for now.
      
      Interestingly, the benchmark reduction is due to the single callback. 
      Adding a second callback (e.g., pud_entry()) reduces the benchmark by
      another 100-200 MiB/s.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagewalk: add walk_page_range_vma() · e07cda5f
      David Hildenbrand authored
      Let's add walk_page_range_vma(), which is similar to walk_page_vma();
      however, it is only interested in a subset of the VMA range.
      
      To be used in KSM code to stop using follow_page() next.
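      The expected signature, as a sketch modelled on the existing helpers in
      include/linux/pagewalk.h (an assumption, not necessarily the final API):
      
        int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
                                unsigned long end, const struct mm_walk_ops *ops,
                                void *private);
      
      That is, like walk_page_vma(), but restricted to [start, end) within the
      given VMA.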
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE · 6cce3314
      David Hildenbrand authored
      Let's stop breaking COW via a fake write fault and let's use
      FAULT_FLAG_UNSHARE instead.  This avoids any wrong side effects of the
      fake write fault, such as mapping the PTE writable and marking the pte
      dirty/softdirty.
      
      Consequently, we no longer trigger a fake write fault, and COW is broken
      without any such side effects.
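      The core of the change in break_ksm() is roughly the following (a
      sketch, not the exact diff; the flag combination on the old call is an
      assumption):
      
        /* Before: pretend this is a write fault just to break COW. */
        ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, NULL);
        
        /* After: explicitly request unsharing, without dirty/writable side effects. */
        ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE, NULL);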
      
      Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
      page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
      will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with
      0 instead of actually breaking COW.
      
      For now, the KSM unmerge tests can trigger that:
          $ sudo ./ksm_functional_tests
          TAP version 13
          1..3
          # [RUN] test_unmerge
          ok 1 Pages were unmerged
          # [RUN] test_unmerge_discarded
          ok 2 Pages were unmerged
          # [RUN] test_unmerge_uffd_wp
          not ok 3 Pages were unmerged
          Bail out! 1 out of 3 tests failed
          # Planned tests != run tests (2 != 3)
          # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
      
      The warning in dmesg also indicates this wrong handling:
          [  230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
          [  230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
          [  230.110124] Hardware name: [...]
          [  230.117775] Call Trace:
          [  230.120227]  <TASK>
          [  230.122334]  dump_stack_lvl+0x44/0x5c
          [  230.126010]  handle_userfault.cold+0x14/0x19
          [  230.130281]  ? tlb_finish_mmu+0x65/0x170
          [  230.134207]  ? uffd_wp_range+0x65/0xa0
          [  230.137959]  ? _raw_spin_unlock+0x15/0x30
          [  230.141972]  ? do_wp_page+0x50/0x590
          [  230.145551]  __handle_mm_fault+0x9f5/0xf50
          [  230.149652]  ? mmput+0x1f/0x40
          [  230.152712]  handle_mm_fault+0xb9/0x2a0
          [  230.156550]  break_ksm+0x141/0x180
          [  230.159964]  unmerge_ksm_pages+0x60/0x90
          [  230.163890]  ksm_madvise+0x3c/0xb0
          [  230.167295]  do_madvise.part.0+0x10c/0xeb0
          [  230.171396]  ? do_syscall_64+0x67/0x80
          [  230.175157]  __x64_sys_madvise+0x5a/0x70
          [  230.179082]  do_syscall_64+0x58/0x80
          [  230.182661]  ? do_syscall_64+0x67/0x80
          [  230.186413]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This is primarily a fix for KSM+userfaultfd-wp, however, the fake write
      fault was always questionable.  As this fix is not easy to backport and
      it's not very critical, let's not cc stable.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com
      Fixes: 529b930b ("userfaultfd: wp: hook userfault handler to write protection fault")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove VM_FAULT_WRITE · cb8d8633
      David Hildenbrand authored
      All users -- GUP and KSM -- are gone, let's just remove it.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE · 58f595c6
      David Hildenbrand authored
      Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
      remaining user of VM_FAULT_WRITE.  As we also want to stop triggering a
      fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
      GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
      (including KSM pages), let's stop relying on VM_FAULT_WRITE.
      
      Let's rework break_ksm() to not rely on the return value of
      handle_mm_fault() anymore to figure out whether COW-breaking was
      successful.  Simply perform another follow_page() lookup to verify the
      result.
      
      While this makes break_ksm() slightly less efficient, we can simplify
      handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without
      introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE.
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010
      MiB/s).
      
      I don't think that we particularly care about that performance drop when
      unmerging.  If it ever turns out to be an actual performance issue, we can
      think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
      it simple for now.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add test to measure MADV_UNMERGEABLE performance · 5036880e
      David Hildenbrand authored
      Let's add a test to measure performance of KSM breaking not triggered via
      COW, but triggered by disabling KSM on an area filled with KSM pages via
      MADV_UNMERGEABLE.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagewalk: don't trigger test_walk() in walk_page_vma() · c31783ee
      David Hildenbrand authored
      As Peter points out, the caller passes a single VMA and can just do that
      check itself.
      
      And in fact, no existing users rely on test_walk() getting called.  So
      let's just remove it and make the implementation slightly more efficient.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add KSM unmerge tests · 93fb70aa
      David Hildenbrand authored
      Patch series "mm/ksm: break_ksm() cleanups and fixes", v2.
      
      This series cleans up and fixes break_ksm().  In summary, we no longer use
      fake write faults to break COW but instead FAULT_FLAG_UNSHARE.  Further,
      we move away from using follow_page() --- that we can hopefully remove
      completely at one point --- and use new walk_page_range_vma() instead.
      
      Fortunately, we can get rid of VM_FAULT_WRITE and FOLL_MIGRATION in common
      code now.
      
      Extend the existing ksm tests with an unmerge benchmark and some new
      unmerge tests.
      
      Also, add a selftest to measure MADV_UNMERGEABLE performance.  In my setup
      (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance
      on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a
      performance degradation of ~6% -- 7% (old: ~5250 MiB/s, new: ~4900 MiB/s).
      I don't think we particularly care for now, but it's good to be aware of
      the implication.
      
      
      This patch (of 9):
      
      Let's add three unmerge tests (MADV_UNMERGEABLE unmerging all pages in the
      range).
      
      test_unmerge(): basic unmerge tests
      test_unmerge_discarded(): have some pte_none() entries in the range
      test_unmerge_uffd_wp(): protect the merged pages using uffd-wp
      
      ksm_tests.c currently contains a mixture of benchmarks and tests, whereby
      each test is carried out by executing the ksm_tests binary with specific
      parameters.  Let's add new ksm_functional_tests.c that performs multiple,
      smaller functional tests all at once.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20221021101141.84170-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: enable running select groups of tests · 85463321
      Joel Savitz authored
      Our memory management kernel CI testing at Red Hat uses the VM
      selftests and we have run into two problems:
      
      First, our LTP tests overlap with the VM selftests.
      
      We want to avoid unhelpful redundancy in our testing practices.
      
      Second, we have observed the current run_vmtests.sh to report overall
      failure/ambiguous results in the case that a machine lacks the necessary
      hardware to perform one or more of the tests. E.g. ksm tests that
      require more than one numa node.
      
      We want to be able to run the vm selftests suitable to particular hardware.
      
      Add the ability to run one or more groups of vm tests via run_vmtests.sh
      instead of simply all-or-none in order to solve these problems.
      
      Preserve existing default behavior of running all tests when the script
      is invoked with no arguments.
      
      Documentation of test groups is included in the patch as follows:
      
          # ./run_vmtests.sh [ -h || --help ]
      
          usage: ./tools/testing/selftests/vm/run_vmtests.sh [ -h | -t "<categories>"]
            -t: specify specific categories of tests to run
            -h: display this message
      
          The default behavior is to run all tests.
      
          Alternatively, specific groups of tests can be run by passing a string
          to the -t argument containing one or more of the following categories
          separated by spaces:
          - mmap
      	    tests for mmap(2)
          - gup_test
      	    tests for gup using gup_test interface
          - userfaultfd
      	    tests for  userfaultfd(2)
          - compaction
      	    a test for the patch "Allow compaction of unevictable pages"
          - mlock
      	    tests for mlock(2)
          - mremap
      	    tests for mremap(2)
          - hugevm
      	    tests for very large virtual address space
          - vmalloc
      	    vmalloc smoke tests
          - hmm
      	    hmm smoke tests
          - madv_populate
      	    test memadvise(2) MADV_POPULATE_{READ,WRITE} options
          - memfd_secret
      	    test memfd_secret(2)
          - process_mrelease
      	    test process_mrelease(2)
          - ksm
      	    ksm tests that do not require >=2 NUMA nodes
          - ksm_numa
      	    ksm tests that require >=2 NUMA nodes
          - pkey
      	    memory protection key tests
          - soft_dirty
          	    test soft dirty page bit semantics
          - anon_cow
                  test anonymous copy-on-write semantics
          example: ./run_vmtests.sh -t "hmm mmap ksm"
      
      Link: https://lkml.kernel.org/r/20221018231222.1884715-1-jsavitz@redhat.com
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 Dec, 2022 10 commits
    • Andrew Morton · 3b910105
    • memcg: fix possible use-after-free in memcg_write_event_control() · 4a7ba45b
      Tejun Heo authored
      memcg_write_event_control() accesses the dentry->d_name of the specified
      control fd to route the write call.  As a cgroup interface file can't be
      renamed, it's safe to access d_name as long as the specified file is a
      regular cgroup file.  Also, as these cgroup interface files can't be
      removed before the directory, it's safe to access the parent too.
      
      Prior to 347c4a87 ("memcg: remove cgroup_event->cft"), there was a
      call to __file_cft() which verified that the specified file is a regular
      cgroupfs file before further accesses.  The cftype pointer returned from
      __file_cft() was no longer necessary and the commit inadvertently dropped
      the file type check with it allowing any file to slip through.  With the
      invariants broken, the d_name and parent accesses can now race against
      renames and removals of arbitrary files and cause use-after-frees.
      
      Fix the bug by resurrecting the file type check in __file_cft().  Now that
      cgroupfs is implemented through kernfs, checking the file operations needs
      to go through a layer of indirection.  Instead, let's check the superblock
      and dentry type.
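      A sketch of the restored check (close in spirit to the fix, but the
      exact variable names and error path are assumptions):
      
        /* The control fd must be a regular cgroupfs file before d_name/parent are trusted. */
        struct dentry *cdentry = cfile.file->f_path.dentry;
        
        if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
                ret = -EINVAL;
                goto out_put_cfile;
        }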
      
      Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
      Fixes: 347c4a87 ("memcg: remove cgroup_event->cft")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jann Horn <jannh@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>	[3.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update Muchun Song's email · a501788a
      Muchun Song authored
      I'm moving to the @linux.dev account.  Map my old addresses and update
      them to my new address.
      
      Link: https://lkml.kernel.org/r/20221208115548.85244-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: fix gup_pud_range() for dax · fcd0ccd8
      John Starks authored
      For a dax pud, pud_huge() returns true on x86, so gup_pud_range() works as
      long as hugetlb is configured.  However, dax doesn't depend on hugetlb.
      Commit 414fd080 ("mm/gup: fix gup_pmd_range() for dax") fixed
      devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as
      well.
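      The fix mirrors the earlier PMD fix; schematically, in gup_pud_range()
      (a sketch of the intent, not the literal diff):
      
        /* Treat devmap-backed huge PUDs like other huge PUDs. */
        if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
                if (!gup_huge_pud(pud, pudp, addr, next, flags, pages, nr))
                        return 0;
        }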
      
      This fixes the below kernel panic:
      
      general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP
      	< snip >
      Call Trace:
      <TASK>
      get_user_pages_fast+0x1f/0x40
      iov_iter_get_pages+0xc6/0x3b0
      ? mempool_alloc+0x5d/0x170
      bio_iov_iter_get_pages+0x82/0x4e0
      ? bvec_alloc+0x91/0xc0
      ? bio_alloc_bioset+0x19a/0x2a0
      blkdev_direct_IO+0x282/0x480
      ? __io_complete_rw_common+0xc0/0xc0
      ? filemap_range_has_page+0x82/0xc0
      generic_file_direct_write+0x9d/0x1a0
      ? inode_update_time+0x24/0x30
      __generic_file_write_iter+0xbd/0x1e0
      blkdev_write_iter+0xb4/0x150
      ? io_import_iovec+0x8d/0x340
      io_write+0xf9/0x300
      io_issue_sqe+0x3c3/0x1d30
      ? sysvec_reschedule_ipi+0x6c/0x80
      __io_queue_sqe+0x33/0x240
      ? fget+0x76/0xa0
      io_submit_sqes+0xe6a/0x18d0
      ? __fget_light+0xd1/0x100
      __x64_sys_io_uring_enter+0x199/0x880
      ? __context_tracking_enter+0x1f/0x70
      ? irqentry_exit_to_user_mode+0x24/0x30
      ? irqentry_exit+0x1d/0x30
      ? __context_tracking_exit+0xe/0x70
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x61/0xcb
      RIP: 0033:0x7fc97c11a7be
      	< snip >
      </TASK>
      ---[ end trace 48b2e0e67debcaeb ]---
      RIP: 0010:internal_get_user_pages_fast+0x340/0x990
      	< snip >
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      
      Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com
      Fixes: 414fd080 ("mm/gup: fix gup_pmd_range() for dax")
      Signed-off-by: John Starks <jostarks@microsoft.com>
      Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mmap: fix do_brk_flags() modifying obviously incorrect VMAs · 6c28ca64
      Liam Howlett authored
      Add more sanity checks to the VMA that do_brk_flags() will expand.  Ensure
      the VMA matches basic merge requirements within the function before
      calling can_vma_merge_after().
      
      Drop the duplicate checks from vm_brk_flags() since they will be enforced
      later.
      
      The old code would expand file VMAs on brk(), which is functionally
      wrong and also dangerous in terms of locking because the brk() path
      isn't designed for file VMAs and therefore doesn't lock the file
      mapping.  Checking can_vma_merge_after() ensures that new anonymous
      VMAs can't be merged into file VMAs.
      
      See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com
      Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Suggested-by: Jann Horn <jannh@google.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit · 630dc25e
      David Hildenbrand authored
      We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
      store a physical address.
      
      On a 64bit system, both are 64bit wide.  However, on a 32bit system, the
      latter might be 64bit wide.  This is, for example, the case on x86 with
      PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
      32bit.
      
      The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
      that case, and assumes that the maximum PFN is limited by an 32bit
      phys_addr_t.  This implies, that SWP_PFN_BITS will currently only be able
      to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.
      
      Let's rely on the number of bits in phys_addr_t instead, but make sure to
      not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
      is_pfn_swap_entry() unhappy.  Note that swp_entry_t is effectively an
      unsigned long and the maximum swap offset shares that value with the swap
      type.
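      Given the sizeof(phys_addr_t)/min_t() notes further down, the
      definition presumably ends up along these lines (a sketch, not the
      verbatim patch):
      
        /* Cap PFN bits by what phys_addr_t can address and by the swap-offset space. */
        #define SWP_PFN_BITS    min_t(int, \
                                      sizeof(phys_addr_t) * BITS_PER_BYTE - PAGE_SHIFT, \
                                      SWP_TYPE_SHIFT)
        #define SWP_PFN_MASK    (BIT(SWP_PFN_BITS) - 1)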
      
      For example, on an 8 GiB x86 PAE system with a kernel config based on
      Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
      fail removing migration entries (remove_migration_ptes()), because
      mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
      swp_offset_pfn() wrongly masks off PFN bits.  For example,
      split_huge_page_to_list()->...->remap_page() will leave migration entries
      in place and continue to unlock the page.
      
      Later, when we stumble over these migration entries (e.g., via
      /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
      migration entries shouldn't exist anymore and the page was unlocked.
      
      [   33.067591] kernel BUG at include/linux/swapops.h:497!
      [   33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      [   33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G            E      6.1.0-rc8+ #16
      [   33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      [   33.067606] EIP: pagemap_pmd_range+0x644/0x650
      [   33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
      [   33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
      [   33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
      [   33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
      [   33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
      [   33.067625] Call Trace:
      [   33.067628]  ? madvise_free_pte_range+0x720/0x720
      [   33.067632]  ? smaps_pte_range+0x4b0/0x4b0
      [   33.067634]  walk_pgd_range+0x325/0x720
      [   33.067637]  ? mt_find+0x1d6/0x3a0
      [   33.067641]  ? mt_find+0x1d6/0x3a0
      [   33.067643]  __walk_page_range+0x164/0x170
      [   33.067646]  walk_page_range+0xf9/0x170
      [   33.067648]  ? __kmem_cache_alloc_node+0x2a8/0x340
      [   33.067653]  pagemap_read+0x124/0x280
      [   33.067658]  ? default_llseek+0x101/0x160
      [   33.067662]  ? smaps_account+0x1d0/0x1d0
      [   33.067664]  vfs_read+0x90/0x290
      [   33.067667]  ? do_madvise.part.0+0x24b/0x390
      [   33.067669]  ? debug_smp_processor_id+0x12/0x20
      [   33.067673]  ksys_pread64+0x58/0x90
      [   33.067675]  __ia32_sys_ia32_pread64+0x1b/0x20
      [   33.067680]  __do_fast_syscall_32+0x4c/0xc0
      [   33.067683]  do_fast_syscall_32+0x29/0x60
      [   33.067686]  do_SYSENTER_32+0x15/0x20
      [   33.067689]  entry_SYSENTER_32+0x98/0xf1
      
      Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
      readable and consistent.
      
      [david@redhat.com: rely on sizeof(phys_addr_t) and min_t() instead]
        Link: https://lkml.kernel.org/r/20221206105737.69478-1-david@redhat.com
      [david@redhat.com: use "int" for comparison, as we're only comparing numbers < 64]
        Link: https://lkml.kernel.org/r/1f157500-2676-7cef-a84e-9224ed64e540@redhat.com
      Link: https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tmpfs: fix data loss from failed fallocate · 44bcabd7
      Hugh Dickins authored
      Fix tmpfs data loss when the fallocate system call is interrupted by a
      signal, or fails for some other reason.  The partial folio handling in
      shmem_undo_range() forgot to consider this unfalloc case, and was liable
      to erase or truncate out data which had already been committed earlier.
      
      It turns out that none of the partial folio handling there is appropriate
      for the unfalloc case, which just wants to proceed to removal of whole
      folios: which find_get_entries() provides, even when partially covered.
      
      Original patch by Rui Wang.
      
      Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/
      Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com
      Fixes: b9a8a419 ("truncate,shmem: Handle truncates that split large folios")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Guoqi Chen <chenguoqic@163.com>
        Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/
      Cc: Rui Wang <kernel@hev.cc>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kselftests: cgroup: update kmem test precision tolerance · de16d6e4
      Michal Hocko authored
      1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64") has changed
      the batch size while this test case has been left behind.  This has led
      to a test failure reported by the kernel test robot:
      not ok 2 selftests: cgroup: test_kmem # exit=1
      
      Update the tolerance for the pcp charges to reflect the
      MEMCG_CHARGE_BATCH change to fix this.
      
      [akpm@linux-foundation.org: update comments, per Roman]
      Link: https://lkml.kernel.org/r/Y4m8Unt6FhWKC6IH@dhcp22.suse.cz
      Fixes: 1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: kernel test robot <yujie.liu@intel.com>
        Link: https://lore.kernel.org/oe-lkp/202212010958.c1053bd3-yujie.liu@intel.com
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Tested-by: Yujie Liu <yujie.liu@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: do not BUG_ON missing brk mapping, because userspace can unmap it · f5ad5083
      Jason A. Donenfeld authored
      The following program will trigger the BUG_ON that this patch removes,
      because the user can munmap() mm->brk:
      
        #include <sys/syscall.h>
        #include <sys/mman.h>
        #include <assert.h>
        #include <unistd.h>
      
        static void *brk_now(void)
        {
          return (void *)syscall(SYS_brk, 0);
        }
      
        static void brk_set(void *b)
        {
          assert(syscall(SYS_brk, b) != -1);
        }
      
        int main(int argc, char *argv[])
        {
          void *b = brk_now();
          brk_set(b + 4096);
          assert(munmap(b - 4096, 4096 * 2) == 0);
          brk_set(b);
          return 0;
        }
      
      Compile that with musl, since glibc actually uses brk(), and then
      execute it, and it'll hit this splat:
      
        kernel BUG at mm/mmap.c:229!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 12 PID: 1379 Comm: a.out Tainted: G S   U             6.1.0-rc7+ #419
        RIP: 0010:__do_sys_brk+0x2fc/0x340
        Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
        RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
        RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
        RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
        R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
        R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
        FS:  0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
        PKRU: 55555554
        Call Trace:
         <TASK>
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        RIP: 0033:0x400678
        Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
        RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
        RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
        RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
        RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
        R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
        R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
         </TASK>
      
      Instead, just return the old brk value if the original mapping has been
      removed.
      
      [akpm@linux-foundation.org: fix changelog, per Liam]
      Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
      Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update Matti Vaittinen's email address · 38f1d4ae
      Matti Vaittinen authored
      The email backend used by ROHM keeps labeling patches as spam.  This can
      result in missing the patches.
      
      Switch my mail address from a company mail to a personal one.
      
      Link: https://lkml.kernel.org/r/8f4498b66fedcbded37b3b87e0c516e659f8f583.1669912977.git.mazziesaccount@gmail.com
      Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
      Suggested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Ben Widawsky <bwidawsk@kernel.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Vasily Averin <vasily.averin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 30 Nov, 2022 6 commits