  1. 08 Feb, 2024 1 commit
  2. 29 Dec, 2023 1 commit
  3. 07 Dec, 2023 1 commit
    • mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range() · b2f557a2
      Jiexun Wang authored
      I conducted real-time testing and observed that
      madvise_cold_or_pageout_pte_range() causes significant latency under
      memory pressure, which can be effectively reduced by adding cond_resched()
      within the loop.
      
      I tested on the LicheePi 4A board using cyclictest for latency testing
      and Ftrace for latency tracing.  The board uses the TH1520 processor and
      has 8GB of memory.  The kernel version is 6.5.0 with the PREEMPT_RT
      patch applied.
      
      The script I tested is as follows:
      
      echo wakeup_rt > /sys/kernel/tracing/current_tracer
      echo 1 > /sys/kernel/tracing/tracing_on
      echo 0 > /sys/kernel/tracing/tracing_max_latency
      stress-ng --vm 8 --vm-bytes 2G &
      cyclictest --mlockall --smp --priority=99 --distance=0 --duration=30m
      echo 0 > /sys/kernel/tracing/tracing_on
      cat /sys/kernel/tracing/trace 
      
      The tracing results before modification are as follows:
      
      # tracer: wakeup_rt
      #
      # wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00003-g999d221864bf
      # --------------------------------------------------------------------
      # latency: 2552 us, #6/6, CPU#3 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)
      #    -----------------
      #    | task: cyclictest-196 (uid:0 nice:0 policy:1 rt_prio:99)
      #    -----------------
      #
      #                    _--------=> CPU#
      #                   / _-------=> irqs-off/BH-disabled
      #                  | / _------=> need-resched
      #                  || / _-----=> need-resched-lazy
      #                  ||| / _----=> hardirq/softirq
      #                  |||| / _---=> preempt-depth
      #                  ||||| / _--=> preempt-lazy-depth
      #                  |||||| / _-=> migrate-disable
      #                  ||||||| /     delay
      #  cmd     pid     |||||||| time  |   caller
      #     \   /        ||||||||  \    |    /
      stress-n-206       3dn.h512    2us :      206:120:R   + [003]     196:  0:R cyclictest
      stress-n-206       3dn.h512    7us : <stack trace>
       => __ftrace_trace_stack
       => __trace_stack
       => probe_wakeup
       => ttwu_do_activate
       => try_to_wake_up
       => wake_up_process
       => hrtimer_wakeup
       => __hrtimer_run_queues
       => hrtimer_interrupt
       => riscv_timer_interrupt
       => handle_percpu_devid_irq
       => generic_handle_domain_irq
       => riscv_intc_irq
       => handle_riscv_irq
       => do_irq
      stress-n-206       3dn.h512    9us#: 0
      stress-n-206       3d...3.. 2544us : __schedule
      stress-n-206       3d...3.. 2545us :      206:120:R ==> [003]     196:  0:R cyclictest
      stress-n-206       3d...3.. 2551us : <stack trace>
       => __ftrace_trace_stack
       => __trace_stack
       => probe_wakeup_sched_switch
       => __schedule
       => preempt_schedule
       => migrate_enable
       => rt_spin_unlock
       => madvise_cold_or_pageout_pte_range
       => walk_pgd_range
       => __walk_page_range
       => walk_page_range
       => madvise_pageout
       => madvise_vma_behavior
       => do_madvise
       => sys_madvise
       => do_trap_ecall_u
       => ret_from_exception
      
      The tracing results after modification are as follows:
      
      # tracer: wakeup_rt
      #
      # wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00004-gca3876fc69a6-dirty
      # --------------------------------------------------------------------
      # latency: 1689 us, #6/6, CPU#0 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)
      #    -----------------
      #    | task: cyclictest-217 (uid:0 nice:0 policy:1 rt_prio:99)
      #    -----------------
      #
      #                    _--------=> CPU#
      #                   / _-------=> irqs-off/BH-disabled
      #                  | / _------=> need-resched
      #                  || / _-----=> need-resched-lazy
      #                  ||| / _----=> hardirq/softirq
      #                  |||| / _---=> preempt-depth
      #                  ||||| / _--=> preempt-lazy-depth
      #                  |||||| / _-=> migrate-disable
      #                  ||||||| /     delay
      #  cmd     pid     |||||||| time  |   caller
      #     \   /        ||||||||  \    |    /
      stress-n-232       0dn.h413    1us+:      232:120:R   + [000]     217:  0:R cyclictest
      stress-n-232       0dn.h413   12us : <stack trace>
       => __ftrace_trace_stack
       => __trace_stack
       => probe_wakeup
       => ttwu_do_activate
       => try_to_wake_up
       => wake_up_process
       => hrtimer_wakeup
       => __hrtimer_run_queues
       => hrtimer_interrupt
       => riscv_timer_interrupt
       => handle_percpu_devid_irq
       => generic_handle_domain_irq
       => riscv_intc_irq
       => handle_riscv_irq
       => do_irq
      stress-n-232       0dn.h413   19us#: 0
      stress-n-232       0d...3.. 1671us : __schedule
      stress-n-232       0d...3.. 1676us+:      232:120:R ==> [000]     217:  0:R cyclictest
      stress-n-232       0d...3.. 1687us : <stack trace>
       => __ftrace_trace_stack
       => __trace_stack
       => probe_wakeup_sched_switch
       => __schedule
       => preempt_schedule
       => migrate_enable
       => free_unref_page_list
       => release_pages
       => free_pages_and_swap_cache
       => tlb_batch_pages_flush
       => tlb_flush_mmu
       => unmap_page_range
       => unmap_vmas
       => unmap_region
       => do_vmi_align_munmap.constprop.0
       => do_vmi_munmap
       => __vm_munmap
       => sys_munmap
       => do_trap_ecall_u
       => ret_from_exception
      
      After the modification, the cause of maximum latency is no longer
      madvise_cold_or_pageout_pte_range(), so this modification can reduce the
      latency caused by madvise_cold_or_pageout_pte_range().
      
      
      Currently the madvise_cold_or_pageout_pte_range() function exhibits
      significant latency under memory pressure, which can be effectively
      reduced by adding cond_resched() within the loop.
      
      When the batch_count reaches SWAP_CLUSTER_MAX, we reschedule
      the task to ensure fairness and avoid long lock holding times.
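      
      The shape of the change, as a minimal sketch (batch_count comes from the
      commit text; the surrounding loop, lock and label names are assumptions
      here, not the exact hunk):
      
              if (++batch_count == SWAP_CLUSTER_MAX) {
                      batch_count = 0;
                      if (need_resched()) {
                              pte_unmap_unlock(start_pte, ptl);
                              cond_resched();
                              goto restart;   /* re-take the PTE lock */
                      }
              }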
      
      Link: https://lkml.kernel.org/r/85363861af65fac66c7a98c251906afc0d9c8098.1695291046.git.wangjiexun@tinylab.org
      Signed-off-by: Jiexun Wang <wangjiexun@tinylab.org>
      Cc: Zhangjin Wu <falcon@tinylab.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b2f557a2
  4. 18 Oct, 2023 2 commits
    • mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
      Lorenzo Stoakes authored
      Patch series "permit write-sealed memfd read-only shared mappings", v4.
      
      The man page for fcntl() describing memfd file seals states the following
      about F_SEAL_WRITE:-
      
          Furthermore, trying to create new shared, writable memory-mappings via
          mmap(2) will also fail with EPERM.
      
      With emphasis on 'writable'.  It turns out that currently the kernel
      simply disallows all new shared memory mappings for a memfd with
      F_SEAL_WRITE applied, rendering this documentation inaccurate.
      
      This matters because users are therefore unable to obtain a shared mapping
      to a memfd after write sealing altogether, which limits their usefulness. 
      This was reported in the discussion thread [1] originating from a bug
      report [2].
      
      This is a product of both using the struct address_space->i_mmap_writable
      atomic counter to determine whether writing may be permitted, and the
      kernel adjusting this counter when any VM_SHARED mapping is performed and
      more generally implicitly assuming VM_SHARED implies writable.
      
      It seems sensible that we should only update this counter if VM_MAYWRITE
      is specified, i.e.  if it is possible that this mapping could at any
      point be written to.
      
      If we do so then all we need to do to permit write seals to function as
      documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
      this functionality already exists for F_SEAL_FUTURE_WRITE - we can
      therefore simply adapt this logic to do the same for F_SEAL_WRITE.
      
      We then hit a chicken and egg situation in mmap_region() where the check
      for VM_MAYWRITE occurs before we are able to clear this flag.  To work
      around this, perform this check after we invoke call_mmap(), with careful
      consideration of error paths.
      
      Thanks to Andy Lutomirski for the suggestion!
      
      [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
      [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
      
      
      This patch (of 3):
      
      There is a general assumption that VMAs with the VM_SHARED flag set are
      writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
      case.
      
      Update those checks which affect the struct address_space->i_mmap_writable
      field to explicitly test for this by introducing
      [vma_]is_shared_maywrite() helper functions.
      
      This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
      that the VMA cannot be written to.
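      
      A minimal sketch of what such helpers can look like (the exact
      definitions added to the headers may differ):
      
              static inline bool is_shared_maywrite(vm_flags_t vm_flags)
              {
                      return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
                             (VM_SHARED | VM_MAYWRITE);
              }
      
              static inline bool vma_is_shared_maywrite(struct vm_area_struct *vma)
              {
                      return is_shared_maywrite(vma->vm_flags);
              }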
      
      Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e8e17ee9
    • mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al. · 94d7d923
      Lorenzo Stoakes authored
      mprotect() and other functions which change VMA parameters over a range
      each employ a pattern of:-
      
      1. Attempt to merge the range with adjacent VMAs.
      2. If this fails, and the range spans a subset of the VMA, split it
         accordingly.
      
      This is open-coded and duplicated in each case. Also in each case most of
      the parameters passed to vma_merge() remain the same.
      
      Create a new function, vma_modify(), which abstracts this operation,
      accepting only those parameters which can be changed.
      
      To avoid the mess of invoking each function call with unnecessary
      parameters, create inline wrapper functions for each of the modify
      operations, parameterised only by what is required to perform the action.
      
      We can also significantly simplify the logic - by returning the VMA if we
      split (or merged VMA if we do not) we no longer need specific handling for
      merge/split cases in any of the call sites.
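      
      Illustratively, a call site such as mprotect_fixup() then reduces to
      something of this shape (the wrapper name and argument list here are an
      assumption based on the per-operation wrappers described above, not the
      exact kernel signature):
      
              vma = vma_modify_flags(vmi, *pprev, vma, start, end, newflags);
              if (IS_ERR(vma))
                      return PTR_ERR(vma);
              /* vma now covers exactly [start, end), merged or split as needed */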
      
      Note that the userfaultfd_release() case works even though it does not
      split VMAs - since start is set to vma->vm_start and end is set to
      vma->vm_end, the split logic does not trigger.
      
      In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start
      - vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this
      instance, this invocation will remain unchanged.
      
      We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
      vma_merge() correctly ensures that flags remain the same, something that is
      already checked in is_mergeable_vma() and elsewhere, and in any case is not
      specific to mprotect().
      
      Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94d7d923
  5. 04 Oct, 2023 1 commit
  6. 24 Aug, 2023 3 commits
    • swap: remove remnants of polling from read_swap_cache_async · b243dcbf
      Suren Baghdasaryan authored
      Patch series "Per-VMA lock support for swap and userfaults", v7.
      
      When per-VMA locks were introduced in [1] several types of page faults
      would still fall back to mmap_lock to keep the patchset simple.  Among
      them are swap and userfault pages.  The main reason for skipping those
      cases was the fact that mmap_lock could be dropped while handling these
      faults and that required additional logic to be implemented.  Implement
      the mechanism to allow per-VMA locks to be dropped for these cases.
      
      First, change handle_mm_fault to drop per-VMA locks when returning
      VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
      mmap_lock is handled.  Then change folio_lock_or_retry to accept vm_fault
      and return vm_fault_t which simplifies later patches.  Finally allow swap
      and uffd page faults to be handled under per-VMA locks by dropping per-VMA
      and retrying, the same way it's done under mmap_lock.  Naturally, once VMA
      lock is dropped that VMA should be assumed unstable and can't be used.
      
      
      This patch (of 6):
      
      Commit [1] introduced IO polling support during swapin to reduce swap
      read latency for block devices that can be polled.  However, a later
      commit [2] removed polling support.  Therefore it seems safe to remove
      the do_poll parameter from read_swap_cache_async and always call
      swap_readpage with synchronous=false, waiting for IO completion in
      folio_lock_or_retry.
      
      [1] commit 23955622 ("swap: add block io poll in swapin path")
      [2] commit 9650b453 ("block: ignore RWF_HIPRI hint for sync dio")
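      
      At call sites the change is roughly the following (argument names here
      are illustrative, not a specific hunk from the patch):
      
              /* before */
              page = read_swap_cache_async(entry, gfp, vma, addr, do_poll, &splug);
              /* after: do_poll is gone; swap_readpage() always runs with
               * synchronous=false and callers wait in folio_lock_or_retry() */
              page = read_swap_cache_async(entry, gfp, vma, addr, &splug);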
      
      Link: https://lkml.kernel.org/r/20230630211957.1341547-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20230630211957.1341547-2-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b243dcbf
    • madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check · 0e0e9bd5
      Yin Fengwei authored
      Commit 98b211d6 ("madvise: convert madvise_free_pte_range() to use a
      folio") replaced the page_mapcount() with folio_mapcount() to check
      whether the folio is shared by other mapping.
      
      It's not correct for large folios.  folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  That means it's not 100% correct, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise(), but the correct behavior is that the THP should be split and
      then processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long term fix from David is ready.
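      
      The check in madvise_free_pte_range() changes along these lines
      (illustrative sketch, not the exact hunk):
      
              /* before: total mapcount of the whole large folio */
              if (folio_mapcount(folio) != 1)
                      goto out;
              /* after: an estimated number of sharers is good enough here */
              if (folio_estimated_sharers(folio) != 1)
                      goto out;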
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com
      Fixes: 98b211d6 ("madvise: convert madvise_free_pte_range() to use a folio")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0e0e9bd5
    • madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check · 2f406263
      Yin Fengwei authored
      
      Patch series "don't use mapcount() to check large folio sharing", v2.
      
      In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(),
      folio_mapcount() is used to check whether the folio is shared.  But it's
      not correct, as folio_mapcount() returns the total mapcount of a large
      folio.
      
      Use folio_estimated_sharers() here as the estimated number is enough.
      
      This patchset fixes the following case: a user space application calls
      madvise() with MADV_FREE, MADV_COLD or MADV_PAGEOUT for a specific
      address range that has THPs mapped into it.  Without the patchset, the
      THPs are skipped.  With the patchset, the THPs are split and handled
      accordingly.
      
      David reported that the cow selftest skips some cases because
      MADV_PAGEOUT skips THPs:
      https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.com/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81
      and I confirmed that this patchset makes them work again.
      
      
      This patch (of 3):
      
      Commit 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range()
      to use folios") replaced the page_mapcount() with folio_mapcount() to
      check whether the folio is shared by other mapping.
      
      It's not correct for large folios.  folio_mapcount() returns the total
      mapcount of a large folio, which is not suitable for detecting whether
      the folio is shared.
      
      Use folio_estimated_sharers(), which returns an estimated number of
      sharers.  That means it's not 100% correct, but it should be OK for the
      madvise case here.
      
      The user-visible effect is that the THP is skipped when the user calls
      madvise(), but the correct behavior is that the THP should be split and
      then processed.
      
      NOTE: this change is a temporary fix to reduce the user-visible effects
      before the long term fix from David is ready.
      
      Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com
      Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com
      Fixes: 07e8c82b ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios")
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Reviewed-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2f406263
  7. 21 Aug, 2023 2 commits
  8. 18 Aug, 2023 2 commits
    • mm: make PTE_MARKER_SWAPIN_ERROR more general · af19487f
      Axel Rasmussen authored
      Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
      v4.
      
      This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
      for a detailed description of the feature.
      
      
      This patch (of 8):
      
      Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
      UFFDIO_POISON, so make various preparations for that:
      
      First, rename it to just PTE_MARKER_POISONED.  The "SWAPIN" can be
      confusing since we're going to re-use it for something not really related
      to swap.  This can be particularly confusing for things like hugetlbfs,
      which doesn't support swap whatsoever.  Also rename various related
      helper functions.
      
      Next, fix pte marker copying for hugetlbfs.  Previously, it would WARN on
      seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap. 
      But, since we're going to re-use it, we want it to go ahead and copy it
      just like non-hugetlbfs memory does today.  Since the code to do this is
      more complicated now, pull it out into a helper which can be re-used in
      both places.  While we're at it, also make it slightly more explicit in
      its handling of e.g.  uffd wp markers.
      
      For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
      error entry, return VM_FAULT_HWPOISON.  For most cases this change doesn't
      matter, e.g.  a userspace program would receive a SIGBUS either way.  But
      for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
      box, instead of giving a SIGBUS to the hypervisor and requiring it to
      somehow inject an MCE.
      
      Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
      VM_FAULT_HWPOISON_LARGE in such cases.  Note that this can't happen today
      because the lack of swap support means we'll never end up with such a PTE
      anyway, but this behavior will be needed once such entries *can* show up
      via UFFDIO_POISON.
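      
      The rename itself is the mechanical part; sketched roughly (the exact
      bit value in include/linux/swapops.h is an assumption here):
      
              /* before */
              #define PTE_MARKER_SWAPIN_ERROR        BIT(1)
              /* after: also set explicitly via UFFDIO_POISON, not just on swapin errors */
              #define PTE_MARKER_POISONED            BIT(1)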
      
      Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: T.J. Alumbaugh <talumbau@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      af19487f
    • mm: madvise: fix uneven accounting of psi · 20c897ea
      Charan Teja Kalla authored
      A folio turns into a Workingset during:
      1) shrink_active_list() placing the folio from active to inactive list.
      2) When a workingset transition is happening during the folio refault.
      
      And when Workingset is set on a folio, PSI for memory can be accounted
      during a) That folio is being reclaimed and b) Refault of that folio,
      for usual reclaims.
      
      This accounting of PSI for memory is not consistent for reclaim +
      refault operation between usual reclaim and madvise(COLD/PAGEOUT) which
      deactivate or proactively reclaim a folio:
      a) A folio started at inactive and moved to active as part of accesses.
      Workingset is absent on the folio thus refault of it when reclaimed
      through MADV_PAGEOUT operation doesn't account for PSI.
      
      b) When the same folio transition from inactive->active and then to
      inactive through shrink_active_list(). Workingset is set on the folio
      thus refault of it when reclaimed through MADV_PAGEOUT operation
      accounts for PSI.
      
      c) When the same folio is part of active list directly as a result of
      folio refault and this was a workingset folio prior to eviction.
      Workingset is set on the folio thus the refault of it when reclaimed
      through MADV_PAGEOUT/MADV_COLD operation accounts for PSI.
      
      d) MADV_COLD transfers the folio from active list to inactive
      list. Such folios may not have the Workingset thus refault operation on
      such folio doesn't account for PSI.
      
      As said above, a refault caused by MADV_PAGEOUT on a folio accounts for
      memory PSI in b) and c) but not in a).  A refault caused by the reclaim
      of a folio on which MADV_COLD is performed accounts for memory PSI in c)
      but not in d).  These behaviours are inconsistent w.r.t. the usual
      reclaim + refault operation.  Make this PSI accounting always consistent
      by turning a folio into a workingset one whenever it leaves the active
      list.  Also, accounting PSI for a folio whenever it leaves the active
      list as part of a MADV_COLD/PAGEOUT operation helps users tell whether
      they are operating on the proper folios[1].
      
      [1] https://lore.kernel.org/all/20230605180013.GD221380@cmpxchg.org/
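      
      Conceptually the fix amounts to something like the following when a
      folio leaves the active list (the flag helpers are real accessors, but
      the exact hook the patch uses is not shown here and may differ):
      
              if (folio_test_active(folio)) {
                      folio_clear_active(folio);
                      /* a later refault will now be charged to memory PSI */
                      folio_set_workingset(folio);
              }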
      
      Link: https://lkml.kernel.org/r/1688393201-11135-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Suggested-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: Sai Manobhiram Manapragada <quic_smanapra@quicinc.com>
      Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      20c897ea
  9. 19 Jun, 2023 3 commits
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
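      
      A typical hunk produced by the rule looks like this (illustrative, not
      taken from a specific file):
      
              /* before: plain C dereference of the page table entry */
              pte_t ptent = *pte;
              /* after: READ_ONCE() semantics, overridable by the architecture */
              pte_t ptent = ptep_get(pte);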
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c33c7948
    • mm/madvise: clean up force_shm_swapin_readahead() · 179d3e4f
      Hugh Dickins authored
      Some nearby MADV_WILLNEED cleanup unrelated to pte_offset_map_lock(). 
      shmem_swapin_range() is a better name than force_shm_swapin_readahead(). 
      Fix unimportant off-by-one on end_index.  Call the swp_entry_t "entry"
      rather than "swap": either is okay, but entry is the name used elsewhere
      in mm/madvise.c.  Do not assume GFP_HIGHUSER_MOVABLE: that's right for
      anon swap, but shmem should take gfp from mapping.  Pass the actual vma
      and address to read_swap_cache_async(), in case a NUMA mempolicy applies. 
      lru_add_drain() at outer level, like madvise_willneed()'s other branch.
      
      Link: https://lkml.kernel.org/r/67e18875-ffb3-ec27-346-f350e07bed87@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      179d3e4f
    • mm/madvise: clean up pte_offset_map_lock() scans · f3cd4ab0
      Hugh Dickins authored
      Came here to make madvise's several pte_offset_map_lock() scans advance to
      next extent on failure, and remove superfluous pmd_trans_unstable() and
      pmd_none_or_trans_huge_or_clear_bad() calls.  But also did some nearby
      cleanup.
      
      swapin_walk_pmd_entry(): don't name an address "index"; don't drop the
      lock after every pte, only when calling out to read_swap_cache_async().
      
      madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(): prefer
      "start_pte" for pointer, orig_pte usually denotes a saved pte value; leave
      lazy MMU mode before unlocking; merge the success and failure paths after
      split_folio().
      
      Link: https://lkml.kernel.org/r/cc4d9a88-9da6-362-50d9-6735c2b125c6@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3cd4ab0
  10. 18 Apr, 2023 1 commit
  11. 30 Mar, 2023 1 commit
  12. 16 Mar, 2023 1 commit
  13. 20 Feb, 2023 1 commit
  14. 10 Feb, 2023 4 commits
  15. 03 Feb, 2023 2 commits
  16. 19 Jan, 2023 4 commits
  17. 12 Jan, 2023 1 commit
  18. 12 Dec, 2022 2 commits
  19. 30 Nov, 2022 2 commits
    • mm: anonymous shared memory naming · d09e8ca6
      Pasha Tatashin authored
      Since commit 9a10064f ("mm: add a field to store names for private
      anonymous memory"), a name can be set for private anonymous memory, but
      not for shared anonymous memory.  However, naming shared anonymous
      memory is just as useful for tracking purposes.
      
      Extend the functionality to be able to set names for shared anon.
      
      There are two ways to create anonymous shared memory, using memfd or
      directly via mmap():
      1. fd = memfd_create(...)
         mem = mmap(..., MAP_SHARED, fd, ...)
      2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
      
      In both cases the anonymous shared memory is created the same way by
      mapping an unlinked file on tmpfs.
      
      The memfd way allows giving a name to anonymous shared memory, but it is
      not useful when parts of the shared memory require distinct names.
      
      Example use case: The VMM maps VM memory as anonymous shared memory (not
      private because VMM is sandboxed and drivers are running in their own
      processes).  However, the VM tells back to the VMM how parts of the memory
      are actually used by the guest, how each of the segments should be backed
      (i.e.  4K pages, 2M pages), and some other information about the segments.
      The naming allows us to monitor the effective memory footprint for each
      of these segments from the host without looking inside the guest.
      
      Sample output:
        /* Create shared anonymous segment */
        anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        /* Name the segment: "MY-NAME" */
        rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                   anon_shmem, SIZE, "MY-NAME");
      
      cat /proc/<pid>/maps (and smaps):
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
      
      If the segment is not named, the output is:
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
      
      Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: xu xin <cgel.zte@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d09e8ca6
    • madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Mike Kravetz authored
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications".  Since there are no user
      visible effects, this third patch is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
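      
      The madvise side then becomes roughly (a sketch of the call shape
      described above, not necessarily the exact final code):
      
              static int madvise_dontneed_single_vma(struct vm_area_struct *vma,
                                                     unsigned long start, unsigned long end)
              {
                      zap_page_range_single(vma, start, end - start, NULL);
                      return 0;
              }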
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      21b85b09
  20. 25 Nov, 2022 1 commit
    • use less confusing names for iov_iter direction initializers · de4eda9d
      Al Viro authored
      READ/WRITE proved to be actively confusing - the meanings are
      "data destination, as used with read(2)" and "data source, as
      used with write(2)", but people keep interpreting those as
      "we read data from it" and "we write data to it", i.e. exactly
      the wrong way.
      
      Call them ITER_DEST and ITER_SOURCE - at least that is harder
      to misinterpret...
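      
      At call sites the change is a pure rename, e.g. (illustrative, not a
      specific hunk from the patch):
      
              /* before: READ here means "this iter is a data destination" */
              iov_iter_init(&iter, READ, &iov, 1, count);
              /* after: the intent is spelled out */
              iov_iter_init(&iter, ITER_DEST, &iov, 1, count);
      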
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      de4eda9d
  21. 28 Oct, 2022 1 commit
    • mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs · 8ebe0a5e
      Rik van Riel authored
      A common use case for hugetlbfs is for the application to create
      memory pools backed by huge pages, which then get handed over to
      some malloc library (eg. jemalloc) for further management.
      
      That malloc library may be doing MADV_DONTNEED calls on memory
      that is no longer needed, expecting those calls to happen on
      PAGE_SIZE boundaries.
      
      However, currently the MADV_DONTNEED code rounds up any such
      requests to HPAGE_PMD_SIZE boundaries.  This leads to undesired
      outcomes when jemalloc expects a 4kB MADV_DONTNEED, but 2MB of
      memory gets zeroed out instead.
      
      Use of pre-built shared libraries means that user code does not
      always know the page size of every memory arena in use.
      
      Avoid unexpected data loss with MADV_DONTNEED by rounding up
      only to PAGE_SIZE (in do_madvise), and rounding down to huge
      page granularity.
      
      That way programs will only get as much memory zeroed out as
      they requested.
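      
      A conceptual sketch of the rounding described above (the helpers are
      real hugetlb accessors, but this is not the actual hunk from the patch):
      
              if (is_vm_hugetlb_page(vma)) {
                      /* zap at most what was asked for: trim the end down to a
                       * huge page boundary instead of rounding the range up */
                      end = ALIGN_DOWN(end, huge_page_size(hstate_vma(vma)));
                      if (start >= end)
                              return 0;
              }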
      
      Link: https://lkml.kernel.org/r/20221021192805.366ad573@imladris.surriel.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ebe0a5e
  22. 03 Oct, 2022 1 commit
  23. 27 Sep, 2022 1 commit
  24. 26 Sep, 2022 1 commit