1. 13 May, 2022 40 commits
    • mm/hugetlb: handle uffd-wp during fork() · bc70fbf2
      Peter Xu authored
      Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(),
      because for uffd-wp it's the dst vma that matters when deciding how we
      should treat uffd-wp protected ptes.
      
      We should recognize pte markers during fork and do the pte copy if needed.
      
      [lkp@intel.com: vma_needs_copy can be static]
        Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
      Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc70fbf2
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Peter Xu authored
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
      unmapping a hugetlb range" is (IMHO): we should never reach a state where a
      page fault could erroneously fault in a page-cache page that was
      wr-protected as writable, even for an extremely short period.  That
      could happen if e.g.  we passed ZAP_FLAG_DROP_MARKER when
      hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
      faults after that call and before remove_inode_hugepages() is executed,
      the page cache can be mapped writable again in that small racy window,
      which can cause unexpected data to be overwritten.
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • mm/hugetlb: allow uffd wr-protect none ptes · 60dfaad6
      Peter Xu authored
      Teach the hugetlbfs code to wr-protect none ptes in case the page cache
      exists for that pte.  Meanwhile we also need to be able to recognize a
      uffd-wp marker pte and remove it for uffd_wp_resolve.
      
      While at it, introduce a variable "psize" to replace all references to the
      huge page size fetcher.
      
      Link: https://lkml.kernel.org/r/20220405014912.14815-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      60dfaad6
    • mm/hugetlb: handle pte markers in page faults · c64e912c
      Peter Xu authored
      Allow the hugetlb code to handle pte markers just like none ptes.  It's
      mostly there already; we just need to make sure we don't assume
      hugetlb_no_page() only handles none ptes, so when detecting a pte change we
      should use pte_same() rather than pte_none().  We need to pass in the
      old_pte to do the comparison.
      
      Check the original pte to see whether it's a pte marker, if it is, we
      should recover uffd-wp bit on the new pte to be installed, so that the
      next write will be trapped by uffd.
      
      Link: https://lkml.kernel.org/r/20220405014909.14761-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c64e912c
    • mm/hugetlb: handle UFFDIO_WRITEPROTECT · 5a90d5a1
      Peter Xu authored
      This starts from passing cp_flags into hugetlb_change_protection() so
      hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
      
      huge_pte_clear_uffd_wp() is introduced to handle the case where the
      UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
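      
      For illustration, a minimal userspace sketch (not part of the commit) of
      driving write protection over a registered range through the userfaultfd
      uAPI; it assumes the uffd was created with UFFD_FEATURE_PAGEFAULT_FLAG_WP
      and the range registered with UFFDIO_REGISTER_MODE_WP:
      
          #include <linux/userfaultfd.h>
          #include <sys/ioctl.h>
      
          /* Write-protect (or un-protect) `len' bytes at `addr' via uffd. */
          static int uffd_wp_range(int uffd, void *addr, unsigned long len, int protect)
          {
                  struct uffdio_writeprotect wp = {
                          .range = { .start = (unsigned long)addr, .len = len },
                          .mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
                  };
      
                  return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
          }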
      
      Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a90d5a1
    • mm/hugetlb: take care of UFFDIO_COPY_MODE_WP · 6041c691
      Peter Xu authored
      Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
      stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is requested with
      UFFDIO_COPY.
      
      Hugetlb pages are only managed by hugetlbfs, so we're safe even without
      setting dirty bit in the huge pte if the page is installed as read-only. 
      However we'd better still keep the dirty bit set for a read-only
      UFFDIO_COPY pte (when UFFDIO_COPY_MODE_WP bit is set), not only to match
      what we do with shmem, but also because the page does contain dirty data
      that the kernel just copied from the userspace.
      
      Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6041c691
    • mm/hugetlb: hook page faults for uffd write protection · 166f3ecc
      Peter Xu authored
      Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp
      faults.
      
      We do this slightly earlier than hugetlb_cow() so that we can avoid taking
      some extra locks that we definitely don't need.
      
      Link: https://lkml.kernel.org/r/20220405014901.14590-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      166f3ecc
    • mm/hugetlb: introduce huge pte version of uffd-wp helpers · 229f3fa7
      Peter Xu authored
      They will be used in the follow-up patches to check/set/clear the uffd-wp
      bit of a huge pte.
      
      So far they reuse all the small pte helpers.  Archs can override these
      versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the
      future.
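      
      A sketch of what such generic fallbacks look like (illustrative; the macro
      and helper names follow the __HAVE_ARCH_HUGE_PTE_UFFD_WP* convention
      described above and simply reuse the small pte helpers):
      
          #ifndef __HAVE_ARCH_HUGE_PTE_UFFD_WP
          static inline int huge_pte_uffd_wp(pte_t pte)
          {
                  return pte_uffd_wp(pte);
          }
          #endif
      
          #ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
          static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
          {
                  return pte_mkuffd_wp(pte);
          }
          #endif
      
          #ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
          static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
          {
                  return pte_clear_uffd_wp(pte);
          }
          #endif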
      
      Link: https://lkml.kernel.org/r/20220405014858.14531-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      229f3fa7
    • mm/shmem: handle uffd-wp during fork() · c56d1b62
      Peter Xu authored
      Normally we skip the page copy at fork() for VM_SHARED shmem, but we can't
      skip it anymore if uffd-wp is enabled on the dst vma.  This should only
      happen when the src uffd has UFFD_FEATURE_EVENT_FORK enabled on a uffd-wp
      shmem vma, so that VM_UFFD_WP will be propagated onto the dst vma too; then
      we should copy the pgtables with the uffd-wp bit and pte markers, because
      this information will be lost otherwise.
      
      Since the condition checks will become even more complicated for deciding
      "whether a vma needs to copy the pgtable during fork()", introduce a
      helper vma_needs_copy() for it, so everything will be clearer.
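      
      A condensed sketch of the decision the helper encapsulates (simplified
      from the description above; the real function in mm/memory.c may differ
      in details):
      
          /* Does dst_vma need its page tables copied at fork() time? */
          static bool vma_needs_copy(struct vm_area_struct *dst_vma,
                                     struct vm_area_struct *src_vma)
          {
                  /* uffd-wp state lives in the pgtable; it must be copied. */
                  if (userfaultfd_wp(dst_vma))
                          return true;
      
                  /* Raw pfn/mixed mappings cannot be refaulted from a file. */
                  if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                          return true;
      
                  /* Anonymous memory always needs the CoW copy. */
                  if (src_vma->anon_vma)
                          return true;
      
                  /* Plain shared file mappings can simply be refaulted. */
                  return false;
          }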
      
      Link: https://lkml.kernel.org/r/20220405014855.14468-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c56d1b62
    • mm/shmem: allows file-back mem to be uffd wr-protected on thps · 019c2d8b
      Peter Xu authored
      We don't have "huge" version of pte markers, instead when necessary we
      split the thp.
      
      However splitting the thp is not enough, because a file-backed thp is
      handled totally differently compared to anonymous thps: rather than doing a
      real split, the thp pmd will simply get cleared in
      __split_huge_pmd_locked().
      
      That is not enough if e.g.  a thp covers the range [0, 2M) but we want to
      wr-protect a small page residing in the [4K, 8K) range, because after
      __split_huge_pmd() returns, there will be a none pmd, and
      change_pmd_range() will just skip it right after the split.
      
      Here we leverage the previously introduced change_pmd_prepare() macro so
      that we'll populate the pmd with a pgtable page after the pmd split (in
      which process the pmd will be cleared for cases like shmem).  Then
      change_pte_range() will do all the rest for us by installing the uffd-wp
      pte marker at any none pte that we'd like to wr-protect.
      
      Link: https://lkml.kernel.org/r/20220405014852.14413-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      019c2d8b
    • mm/shmem: allow uffd wr-protect none pte for file-backed mem · fe2567eb
      Peter Xu authored
      File-backed memory differs from anonymous memory in that even if the pte
      is missing, the data could still reside either in the file or in the
      page/swap cache.  So when wr-protecting a pte, we need to consider none
      ptes too.
      
      We do that by installing the uffd-wp pte markers when necessary.  So when
      there's a future write to the pte, the fault handler will go the special
      path to first fault-in the page as read-only, then report to userfaultfd
      server with the wr-protect message.
      
      On the other hand, when unprotecting a page, it's also possible that the
      pte got unmapped but replaced by the special uffd-wp marker.  Then we'll
      need to be able to recover a uffd-wp pte marker into a none pte, so that
      the next access to the page will fault in correctly as usual.
      
      Special care needs to be taken throughout the change_protection_range()
      process.  Since we now allow the user to wr-protect a none pte, we need to
      be able to pre-populate the page table entries if we see (!anonymous &&
      MM_CP_UFFD_WP) requests; otherwise change_protection_range() will always
      skip when the pgtable entry does not exist.
      
      For example, the pgtable can be missing for a whole chunk of 2M pmd, but
      the page cache can exist for the 2M range.  When we want to wr-protect one
      4K page within the 2M pmd range, we need to pre-populate the pgtable and
      install the pte marker showing that we want to get a message and block the
      thread when the page cache of that 4K page is written.  Without
      pre-populating the pmd, change_protection() will simply skip that whole
      pmd.
      
      Note that this patch only covers the small pages (pte level) but not
      covering any of the transparent huge pages yet.  That will be done later,
      and this patch will be a preparation for it too.
      
      Link: https://lkml.kernel.org/r/20220405014850.14352-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fe2567eb
    • mm/shmem: persist uffd-wp bit across zapping for file-backed · 999dad82
      Peter Xu authored
      File-backed memory is prone to being unmapped at any time.  It means all
      information in the pte will be dropped, including the uffd-wp flag.
      
      To persist the uffd-wp flag, we'll use the pte markers.  This patch
      teaches the zap code to understand uffd-wp and know when to keep or drop
      the uffd-wp bit.
      
      Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
      don't want to persist such information, for example, when destroying the
      whole vma, or punching a hole in a shmem file.  For all other cases we
      should never drop the uffd-wp bit, or the wr-protect information will get
      lost.
      
      The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
      memory.c because it'll be further referenced in hugetlb files later.
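      
      Schematically, the zap path's choice looks like the sketch below
      (maybe_keep_uffd_wp() is a hypothetical condensation for illustration;
      only zap_drop_file_uffd_wp()/ZAP_FLAG_DROP_MARKER follow the patch):
      
          /* Is this zap allowed to drop uffd-wp markers? */
          static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
          {
                  return details && (details->zap_flags & ZAP_FLAG_DROP_MARKER);
          }
      
          static void maybe_keep_uffd_wp(struct mm_struct *mm,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, pte_t *pte,
                                         bool was_uffd_wp,
                                         struct zap_details *details)
          {
                  if (!userfaultfd_wp(vma) || !was_uffd_wp)
                          return;
                  if (zap_drop_file_uffd_wp(details))
                          return;         /* caller wants the bit gone */
                  /* Otherwise leave a marker so the wr-protect state survives. */
                  set_pte_at(mm, addr, pte, make_pte_marker(PTE_MARKER_UFFD_WP));
          }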
      
      Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      999dad82
    • mm/shmem: handle uffd-wp special pte in page fault handler · 9c28a205
      Peter Xu authored
      File-backed memory is prone to unmap/swap so the ptes are always
      unstable, because they can be easily faulted back later using the page
      cache.  This could lead to uffd-wp getting lost when unmapping or swapping
      out such memory.  One example is shmem.  PTE markers are needed to store
      that information.
      
      This patch prepares for that by teaching the page fault handler to
      recognize uffd-wp pte markers before they get installed elsewhere.
      
      The handling of uffd-wp pte markers is similar to missing fault, it's just
      that we'll handle this "missing fault" when we see the pte markers,
      meanwhile we need to make sure the marker information is kept during
      processing the fault.
      
      This is a slow path of uffd-wp handling, because zapping of wr-protected
      shmem ptes should be rare.  So far it should only trigger in two
      conditions:
      
        (1) When trying to punch holes in shmem_fallocate(), there is an
            optimization to zap the pgtables before evicting the page.
      
        (2) When swapping out shmem pages.
      
      Because of this, the page fault handling is simplified too by not sending
      the wr-protect message in the 1st page fault; instead the page will be
      installed read-only, so the uffd-wp message will be generated in the next
      fault, which will trigger the do_wp_page() path of the general uffd-wp
      handling.
      
      Disable fault-around for all uffd-wp registered ranges for extra safety
      just like uffd-minor fault, and clean the code up.
      
      Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c28a205
    • mm/shmem: take care of UFFDIO_COPY_MODE_WP · 8ee79edf
      Peter Xu authored
      Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then apply
      the UFFD_WP bit properly when the UFFDIO_COPY on shmem is requested with
      UFFDIO_COPY_MODE_WP.  wp_copy finally lands in mfill_atomic_install_pte().
      
      Note: we must do pte_wrprotect() if !writable in
      mfill_atomic_install_pte(), as mk_pte() could return a writable pte (e.g.,
      when VM_SHARED on a shmem file).
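      
      A minimal userspace sketch of the new mode (assuming the uffd is already
      registered over the shmem mapping in missing+wp mode; error handling
      omitted):
      
          #include <linux/userfaultfd.h>
          #include <sys/ioctl.h>
      
          /* Resolve a missing fault: copy one page in, keep it wr-protected. */
          static int uffd_copy_wp(int uffd, void *dst, void *src, unsigned long len)
          {
                  struct uffdio_copy copy = {
                          .dst  = (unsigned long)dst,
                          .src  = (unsigned long)src,
                          .len  = len,
                          .mode = UFFDIO_COPY_MODE_WP,
                  };
      
                  return ioctl(uffd, UFFDIO_COPY, &copy);
          }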
      
      Link: https://lkml.kernel.org/r/20220405014841.14185-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ee79edf
    • mm/uffd: PTE_MARKER_UFFD_WP · 1db9dbc2
      Peter Xu authored
      This patch introduces the 1st user of pte marker: the uffd-wp marker.
      
      When the pte marker is installed with the uffd-wp bit set, it means this
      pte was wr-protected by uffd.
      
      We will use this special pte to arm the ptes that got either unmapped or
      swapped out for a file-backed region that was previously wr-protected. 
      This special pte could trigger a page fault just like swap entries.
      
      This idea is greatly inspired by Hugh and Andrea in the discussion, which
      is referenced in the links below.
      
      Some helpers are introduced to detect whether a swap pte is uffd
      wr-protected.  With the pte marker introduced, a swap pte can be
      wr-protected in two forms: either it is a normal swap pte that has
      _PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP
      set.
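      
      A sketch of the combined check described above (helper names approximate
      those added by the series):
      
          /*
           * Is this non-present pte wr-protected by uffd?  It can either be a
           * real swap pte carrying _PAGE_SWP_UFFD_WP, or a pte marker carrying
           * PTE_MARKER_UFFD_WP.
           */
          static inline bool pte_swp_uffd_wp_any(pte_t pte)
          {
                  if (!is_swap_pte(pte))
                          return false;
                  if (pte_swp_uffd_wp(pte))       /* normal swap pte */
                          return true;
                  if (pte_marker_uffd_wp(pte))    /* uffd-wp pte marker */
                          return true;
                  return false;
          }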
      
      [peterx@redhat.com: fixup]
        Link: https://lkml.kernel.org/r/YkzKiM8tI4+qOfXF@xz-m1.local
      Link: https://lore.kernel.org/lkml/20201126222359.8120-1-peterx@redhat.com/
      Link: https://lore.kernel.org/lkml/20201130230603.46187-1-peterx@redhat.com/
      Link: https://lkml.kernel.org/r/20220405014838.14131-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1db9dbc2
    • mm: check against orig_pte for finish_fault() · f46f2ade
      Peter Xu authored
      This patch allows do_fault() to trigger on !pte_none() cases too.  This
      prepares for the pte markers to be handled by do_fault() just like none
      pte.
      
      To achieve this, instead of unconditionally checking against pte_none() in
      finish_fault(), note that we may hit the case where orig_pte was some pte
      marker, and what we want to do then is replace the pte marker with a valid
      pte entry.  So if orig_pte was set, we'd want to check the current *pte
      (under the pgtable lock) against orig_pte rather than against a none pte.
      
      Right now there's no solid way to safely reference orig_pte because when
      pmd is not allocated handle_pte_fault() will not initialize orig_pte, so
      it's not safe to reference it.
      
      There's another solution, proposed before this patch, to do pte_clear()
      on vmf->orig_pte for the pmd==NULL case; however it turns out it'll break
      arm32 [1] because arm32 may assume that a pte_t* pointer always resides in
      a real ram32 pgtable, not in any kernel stack variable.
      
      To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be
      set along with orig_pte when there is valid orig_pte, or it'll be cleared
      when orig_pte was not initialized.
      
      It'll be updated every time we call handle_pte_fault(), so e.g.  if a page
      fault retry happened it'll be properly updated along with orig_pte.
      
      [1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/
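      
      Condensed, the check in finish_fault() then reads roughly like the sketch
      below (simplified from the description above, not the verbatim patch):
      
          /* Is vmf->orig_pte usable for comparison at all? */
          static bool vmf_orig_pte_valid(struct vm_fault *vmf)
          {
                  return vmf->flags & FAULT_FLAG_ORIG_PTE_VALID;
          }
      
          /* Did the pte change under us since the fault started? */
          static bool vmf_pte_changed(struct vm_fault *vmf)
          {
                  if (vmf_orig_pte_valid(vmf))
                          return !pte_same(*vmf->pte, vmf->orig_pte);
      
                  return !pte_none(*vmf->pte);
          }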
      
      [akpm@linux-foundation.org: coding-style cleanups]
      [peterx@redhat.com: fix crash reported by Marek]
        Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f46f2ade
    • mm: teach core mm about pte markers · 5c041f5d
      Peter Xu authored
      This patch still does not use pte markers in any way, however it teaches
      the core mm about the pte marker idea.
      
      For example, handle_pte_marker() is introduced; it will parse and handle
      all the pte marker faults.
      
      Many of the changes are more about adding comments - so that we know
      there's the possibility of a pte marker showing up, and why we don't need
      special code for those cases.
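      
      The dispatch can be pictured roughly as below (a sketch only; uffd-wp
      marker handling is added by later patches in the series):
      
          static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
          {
                  swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
                  unsigned long marker = pte_marker_get(entry);
      
                  /* A marker pte with no bit set should never exist. */
                  if (WARN_ON_ONCE(!marker))
                          return VM_FAULT_SIGBUS;
      
                  /* No marker type is handled yet at this point of the series. */
                  return VM_FAULT_SIGBUS;
          }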
      
      [peterx@redhat.com: userfaultfd.c needs swapops.h]
        Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5c041f5d
    • mm: introduce PTE_MARKER swap entry · 679d1033
      Peter Xu authored
      Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.
      
      
      Overview
      ========
      
      Userfaultfd-wp anonymous support was merged two years ago.  There're quite
      a few applications that have started to leverage this capability, either to
      take snapshots of user-app memory, or to use it for fully user-controlled
      swapping.
      
      This series tries to complete the uffd-wp feature so as to cover all the
      RAM-based memory types.  So far uffd-wp is the only feature still missing
      this coverage (uffd-missing & uffd-minor mode already have it).
      
      One major reason to do so is that anonymous pages do not always satisfy
      the needs of applications, and there're growing numbers of users of shmem
      and hugetlbfs, either for sharing purposes (e.g., sharing guest mem
      between the hypervisor process and a device emulation process, shmem local
      live migration for upgrades), or for the performance of better tlb hit
      rates.
      
      All this means that if a uffd-wp app wants to switch to any of these memory
      types, it'll stop working.  I think it's worthwhile to have the kernel
      cover all these aspects.
      
      This series chose to protect pages at the pte level, not the page level.
      
      One major reason is safety.  I have no idea how we could make it safe if
      any uffd-privileged app could wr-protect a page that any other application
      can use.  It would mean this app could block any process, potentially for
      any length of time it wants.
      
      The other reason is that it aligns very well with not only the anonymous
      uffd-wp solution, but also uffd as a whole.  For example, userfaultfd is
      implemented fundamentally based on VMAs.  We set flags on VMAs showing the
      status of uffd tracking.  A per-page based protection solution would cross
      that VMA-based foundation, and it could simply end up too far away from
      what's called userfaultfd.
      
      PTE markers
      ===========
      
      The patchset is based on the idea called PTE markers.  It was discussed in
      one of the mm alignment sessions, proposed starting from v6, and this is
      the 2nd version of the series using the PTE marker idea.
      
      A PTE marker is a new type of swap entry that is only applicable to
      file-backed memories like shmem and hugetlbfs.  It's used to persist some
      pte-level information even if the original present ptes in the pgtable are
      zapped.
      
      Logically pte markers can store more than uffd-wp information, but so far
      only one bit is used for uffd-wp purpose.  When the pte marker is
      installed with uffd-wp bit set, it means this pte is wr-protected by uffd.
      
      It solves the problem where e.g.  file-backed memory mapped ptes get
      zapped for any reason (e.g.  thp split, or swap-out): we can still keep the
      wr-protect information in the ptes.  Then when the page fault triggers
      again, we'll know this pte is wr-protected so we can treat the pte the
      same as a normal uffd wr-protected pte.
      
      The extra information is encoded into the swap entry, or swp_offset to be
      explicit, with the swp_type being PTE_MARKER.  So far uffd-wp only uses
      one bit out of the swap entry, the rest bits of swp_offset are still
      reserved for other purposes.
      
      There're two configs to enable/disable PTE markers:
      
        CONFIG_PTE_MARKER
        CONFIG_PTE_MARKER_UFFD_WP
      
      We can set !PTE_MARKER to completely disable all the PTE markers, along
      with uffd-wp support.  I made two configs so we can also enable PTE markers
      but disable file-backed uffd-wp, for other purposes.  At the end of the
      current series, I'll enable CONFIG_PTE_MARKER by default, but that patch is
      standalone and if anyone worries about having it on by default, we can also
      consider turning it off by dropping that oneliner patch.  So far I don't
      see a huge risk in doing so, so I kept that patch.
      
      In most cases, PTE markers should be treated as none ptes.  That is
      because, unlike most of the other swap entry types, there's no PFN or block
      offset information encoded into PTE markers, only some extra well-defined
      bits showing the status of the pte.  These bits should only be used as
      extra data when servicing an upcoming page fault, and then we behave as if
      it's a none pte.
      
      I did spend a lot of time observing all the pte_none() users this time.
      It is indeed a challenge because there're a lot, and I hope I didn't miss
      a single one of them where we should take care of pte markers.  Luckily, I
      don't think they'll need to be considered in many cases, for example: boot
      code, arch code (especially non-x86), kernel-only page handling (e.g.
      CPA), or device driver code dealing with pure PFN mappings.
      
      I introduced pte_none_mostly() in this series for when we need to handle
      pte markers the same as none ptes; the "mostly" is another way of writing
      "either a none pte or a pte marker".
      
      I didn't replace pte_none() to cover pte markers for below reasons:
      
        - Only very rarely will a pte_none() caller need to handle pte markers.
          E.g., all the kernel pages do not require knowledge of pte markers.  So
          we don't pollute the major use cases.
      
        - Unconditionally changing pte_none() semantics could confuse people,
          because pte_none() has existed for such a long time.
      
        - Unconditionally changing pte_none() semantics could make pte_none()
          slower even if in many cases pte markers do not exist.
      
        - There're cases where we'd like to handle pte markers differently from
          pte_none(), so a full replacement is also impossible.  E.g. khugepaged should
          still treat pte markers as normal swap ptes rather than none ptes, because
          pte markers will always need a fault-in to merge the marker with a valid
          pte.  Or the smap code will need to parse PTE markers not none ptes.
      
      Patch Layout
      ============
      
      Introducing PTE marker and uffd-wp bit in PTE marker:
      
        mm: Introduce PTE_MARKER swap entry
        mm: Teach core mm about pte markers
        mm: Check against orig_pte for finish_fault()
        mm/uffd: PTE_MARKER_UFFD_WP
      
      Adding support for shmem uffd-wp:
      
        mm/shmem: Take care of UFFDIO_COPY_MODE_WP
        mm/shmem: Handle uffd-wp special pte in page fault handler
        mm/shmem: Persist uffd-wp bit across zapping for file-backed
        mm/shmem: Allow uffd wr-protect none pte for file-backed mem
        mm/shmem: Allows file-back mem to be uffd wr-protected on thps
        mm/shmem: Handle uffd-wp during fork()
      
      Adding support for hugetlbfs uffd-wp:
      
        mm/hugetlb: Introduce huge pte version of uffd-wp helpers
        mm/hugetlb: Hook page faults for uffd write protection
        mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
        mm/hugetlb: Handle UFFDIO_WRITEPROTECT
        mm/hugetlb: Handle pte markers in page faults
        mm/hugetlb: Allow uffd wr-protect none ptes
        mm/hugetlb: Only drop uffd-wp special pte if required
        mm/hugetlb: Handle uffd-wp during fork()
      
      Misc handling on the rest mm for uffd-wp file-backed:
      
        mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
        mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
      
      Enabling of uffd-wp on file-backed memory:
      
        mm/uffd: Enable write protection for shmem & hugetlbfs
        mm: Enable PTE markers by default
        selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
      
      Tests
      =====
      
      - Compile test on x86_64 and aarch64 on different configs
      - Kernel selftests
      - uffd-test [0]
      - Umapsort [1,2] test for shmem/hugetlb, with swap on/off
      
      [0] https://github.com/xzpeter/clibs/tree/master/uffd-test
      [1] https://github.com/xzpeter/umap-apps/tree/peter
      [2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs
      
      
      This patch (of 23):
      
      Introduce a new swap entry type called PTE_MARKER.  It can be installed
      for any pte that maps file-backed memory when the pte is temporarily
      zapped, so as to maintain per-pte information.
      
      The information kept in the pte is called a "marker".  Here we define the
      marker as "unsigned long" just to match pgoff_t, however it will only
      work if it still fits in swp_offset(), which is e.g.  currently 58 bits on
      x86_64.
      
      A new config CONFIG_PTE_MARKER is introduced too; it's by default off.  A
      bunch of helpers are defined altogether to service the rest of the pte
      marker code.
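      
      To make the encoding concrete, a sketch of how a marker rides in a swap
      entry and how the "mostly none" check looks (names follow the series;
      treat as illustrative, not verbatim):
      
          typedef unsigned long pte_marker;       /* sized to match pgoff_t */
      
          #define PTE_MARKER_UFFD_WP      BIT(0)
          #define PTE_MARKER_MASK         (PTE_MARKER_UFFD_WP)
      
          static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
          {
                  /* the marker bits live in swp_offset, type is PTE_MARKER */
                  return swp_entry(SWP_PTE_MARKER, marker);
          }
      
          static inline bool is_pte_marker(pte_t pte)
          {
                  return is_swap_pte(pte) &&
                         is_pte_marker_entry(pte_to_swp_entry(pte));
          }
      
          /* "Either a none pte or a pte marker". */
          static inline int pte_none_mostly(pte_t pte)
          {
                  return pte_none(pte) || is_pte_marker(pte);
          }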
      
      [peterx@redhat.com: fixup]
        Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local
      Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      679d1033
    • mm/page_alloc: cache the result of node_dirty_ok() · 8a87d695
      Wonhyuk Yang authored
      To spread dirty pages, nodes are checked for whether they have reached the
      dirty limit using the expensive node_dirty_ok().  To reduce the frequency
      of calling node_dirty_ok(), the last node that hit the dirty limit can be
      cached.
      
      Instead of caching only the node, caching both the node and its
      node_dirty_ok() status can further reduce the number of calls to
      node_dirty_ok().
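      
      Schematically, the zonelist walk then caches both pieces of state
      (illustrative sketch of the pattern inside get_page_from_freelist(), not
      the exact diff):
      
          struct pglist_data *last_pgdat = NULL;
          bool last_pgdat_dirty_ok = false;
      
          for_each_zone_zonelist_nodemask(zone, z, zonelist, highest_zoneidx, nodemask) {
                  if (ac->spread_dirty_pages) {
                          if (last_pgdat != zone->zone_pgdat) {
                                  last_pgdat = zone->zone_pgdat;
                                  last_pgdat_dirty_ok = node_dirty_ok(zone->zone_pgdat);
                          }
                          if (!last_pgdat_dirty_ok)
                                  continue;       /* reuse the cached verdict */
                  }
                  /* ... remaining per-zone checks ... */
          }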
      
      [akpm@linux-foundation.org: rename last_pgdat_dirty_limit to last_pgdat_dirty_ok]
      Link: https://lkml.kernel.org/r/20220430011032.64071-1-vvghjk1234@gmail.com
      Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Donghyeok Kim <dthex5d@gmail.com>
      Cc: JaeSang Yoo <jsyoo5b@gmail.com>
      Cc: Jiyoup Kim <lakroforce@gmail.com>
      Cc: Ohhoon Kwon <ohkwon1043@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8a87d695
    • Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter · 81a84182
      SeongJae Park authored
      This commit documents the new DAMON_RECLAIM parameter, 'commit_inputs', in
      its usage document.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-15-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      81a84182
    • mm/damon/reclaim: support online inputs update · e035c280
      SeongJae Park authored
      DAMON_RECLAIM reads the user input parameters only when it starts.  To
      allow more efficient online tuning, this commit implements a new input
      parameter called 'commit_inputs'.  Writing true to the parameter makes
      DAMON_RECLAIM read the input parameters again.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-14-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e035c280
    • Docs/{ABI,admin-guide}/damon: Update for 'state' sysfs file input keyword, 'commit' · adc286e6
      SeongJae Park authored
      This commit documents the newly added 'state' sysfs file input keyword,
      'commit', which allows online tuning of DAMON contexts.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-13-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      adc286e6
    • mm/damon/sysfs: support online inputs update · da878780
      SeongJae Park authored
      Currently, DAMON sysfs interface doesn't provide a way for adjusting DAMON
      input parameters while it is turned on.  Therefore, users who want to
      reconfigure DAMON need to stop DAMON and restart.  This means all the
      monitoring results that accumulated so far, which could be useful, should
      be flushed.  This would be inefficient for many cases.
      
      For example, let's suppose a sysadmin was running a DAMON-based Operation
      Scheme to find memory regions not accessed for more than 5 mins and page
      out those regions.  If it turns out the 5 mins threshold was too long and
      therefore the sysadmin wants to reduce it to 4 mins, the sysadmin should
      turn off DAMON, restart it, and wait for at least 4 more minutes so that
      DAMON can find the cold memory regions, even though at the time of shutdown
      DAMON already knew of regions that had not been accessed for 4 mins.
      
      This commit makes the DAMON sysfs interface support online DAMON input
      parameter updates by adding a new input keyword, 'commit', for the 'state'
      DAMON sysfs file.  Writing the keyword to the 'state' file while the
      corresponding kdamond is running makes the kdamond read the sysfs file
      values again and update the DAMON context.
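      
      From userspace, the online update amounts to rewriting the parameter
      files and then writing 'commit' to the kdamond's 'state' file; a minimal
      C sketch (assuming the default DAMON sysfs location and kdamond 0):
      
          #include <fcntl.h>
          #include <string.h>
          #include <unistd.h>
      
          /* Ask the running kdamond 0 to re-read its sysfs inputs. */
          static int damon_sysfs_commit(void)
          {
                  const char *p = "/sys/kernel/mm/damon/admin/kdamonds/0/state";
                  int fd = open(p, O_WRONLY);
      
                  if (fd < 0)
                          return -1;
                  if (write(fd, "commit", strlen("commit")) < 0) {
                          close(fd);
                          return -1;
                  }
                  return close(fd);
          }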
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-12-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      da878780
    • mm/damon/sysfs: update schemes stat in the kdamond context · 01538719
      SeongJae Park authored
      Only '->kdamond' and '->kdamond_stop' are protected by 'kdamond_lock' of
      'struct damon_ctx'.  All other DAMON context internal data items are
      recommended to be accessed in DAMON callbacks, or under some additional
      synchronizations.  But, DAMON sysfs is accessing the schemes stat under
      'kdamond_lock'.
      
      This is not a big issue, as the read values are not used anywhere inside
      the kernel, but it would be better fixed.  This commit moves the reads to
      the DAMON callback context, which is the place supposed to be used for that
      purpose.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-11-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      01538719
    • mm/damon/sysfs: use enum for 'state' input handling · 3cbab4ca
      SeongJae Park authored
      DAMON sysfs 'state' file handling code is using string literals in both
      'state_show()' and 'state_store()'.  This makes the code error prone and
      inflexible for future extensions.
      
      To improve the situation, this commit defines possible input strings and
      'enum' for identifying each input keyword only once, and refactors the
      code to reuse those.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3cbab4ca
    • mm/damon/sysfs: reuse damon_set_regions() for regions setting · 97d482f4
      SeongJae Park authored
      'damon_set_regions()' is general enough that it can also be used for only
      creating regions.  This commit makes the DAMON sysfs interface reuse the
      function rather than keeping two implementations for the same purpose.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      97d482f4
    • mm/damon/sysfs: move targets setup code to a separated function · 74bd8b7d
      SeongJae Park authored
      This commit separates DAMON sysfs interface's monitoring context targets
      setup code to a new function for better readability.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      74bd8b7d
    • mm/damon/sysfs: prohibit multiple physical address space monitoring targets · 0a890a9f
      SeongJae Park authored
      Having multiple targets for physical address space monitoring makes no
      sense.  This commit prohibits such a ridiculous DAMON context setup by
      making the DAMON context build function check for, and return an error on,
      that case.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a890a9f
    • mm/damon/vaddr: remove damon_va_apply_three_regions() · dae0087a
      SeongJae Park authored
      'damon_va_apply_three_regions()' is just a wrapper of its general version,
      'damon_set_regions()'.  This commit replaces the wrapper calls to directly
      call the general version.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dae0087a
    • mm/damon/vaddr: move 'damon_set_regions()' to core · d0723bc0
      SeongJae Park authored
      This commit moves 'damon_set_regions()' from vaddr to core, as it is aimed
      to be used by not only 'vaddr' but also other parts of DAMON.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0723bc0
    • mm/damon/vaddr: generalize damon_va_apply_three_regions() · af3f18f6
      SeongJae Park authored
      'damon_va_apply_three_regions()' is for adjusting address ranges to fit
      into three discontiguous ranges.  The function can be generalized to an
      arbitrary number of discontiguous ranges and reused in the future, such as
      for arbitrary online region updates.  For such future usage, this commit
      introduces a generalized version of the function called
      'damon_set_regions()'.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      af3f18f6
    • mm/damon/core: finish kdamond as soon as any callback returns an error · abacd635
      SeongJae Park authored
      When the 'after_sampling()' or 'after_aggregation()' DAMON callbacks
      return an error, kdamond still runs the remaining loop once.  It makes
      little sense to run the remaining part when something has already gone
      wrong; the context might be corrupted or hold invalid data.  This commit
      therefore makes kdamond skip the remaining work and finish immediately in
      these cases.
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      abacd635
    • mm/damon/core: add a new callback for watermarks checks · 6e74d2bf
      SeongJae Park authored
      Patch series "mm/damon: Support online tuning".
      
      The effects of DAMON and DAMON-based Operation Schemes highly depend on
      the configuration.  Wrong configurations could even result in unexpected
      efficiency degradations.  For finding the best configuration, repeated
      incremental configuration changes and result measurements, in other
      words, online tuning, could be helpful.
      
      Nevertheless, the DAMON kernel API supports only restricted online tuning.
      Worse yet, the sysfs-based DAMON user interface doesn't support online
      tuning at all.  DAMON_RECLAIM also doesn't support online tuning.
      
      This patchset makes the DAMON kernel API, the DAMON sysfs interface, and
      DAMON_RECLAIM support online tuning.
      
      Sequence of patches
      -------------------
      
      The first two patches enhance DAMON online tuning for kernel API users.
      Specifically, patch 1 lets kernel API users do DAMON online tuning without
      restriction, and patch 2 makes error handling easier.
      
      Following seven patches (patches 3-9) refactor code for better readability
      and easier reuse of code fragments that will be useful for online tuning
      support.
      
      Patch 10 introduces DAMON callback based user request handling structure
      for DAMON sysfs interface, and patch 11 enables DAMON online tuning via
      DAMON sysfs interface.  Documentation patch (patch 12) for usage of it
      follows.
      
      Patch 13 enables online tuning of DAMON_RECLAIM and finally patch 14
      documents the DAMON_RECLAIM online tuning usage.
      
      
      This patch (of 14):
      
      For updating the input parameters of running DAMON contexts, DAMON kernel
      API users can use the contexts' callbacks, as they are the safe place for
      context-internal data accesses.  When the context has DAMON-based operation
      schemes and all schemes are deactivated due to their watermarks, however,
      DAMON does nothing but watermarks checks.  As a result, no callbacks will
      be invoked, and therefore the kernel API users cannot update the input
      parameters, including monitoring attributes, DAMON-based operation
      schemes, and watermarks.
      
      To let users easily update such DAMON input parameters in such a case,
      this commit adds a new callback, 'after_wmarks_check()'.  It will be
      called after each watermarks check.  Users can do the online input
      parameter update in the callback even while the schemes are deactivated.
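      
      For a kernel API user, wiring the new callback up looks roughly like the
      sketch below (against the in-kernel DAMON API of that time; treat field
      and type names as assumptions drawn from the description):
      
          #include <linux/damon.h>
      
          /* Called after every watermarks check, even while schemes sleep. */
          static int my_after_wmarks_check(struct damon_ctx *ctx)
          {
                  /*
                   * A safe point to apply pending online-tuning requests,
                   * e.g. update monitoring attributes or scheme parameters.
                   */
                  return 0;       /* non-zero makes kdamond finish */
          }
      
          static void my_setup(struct damon_ctx *ctx)
          {
                  ctx->callback.after_wmarks_check = my_after_wmarks_check;
          }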
      
      Link: https://lkml.kernel.org/r/20220429160606.127307-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6e74d2bf
    • selftest/vm: test that mremap fails on non-existent vma · 99947153
      Niels Dossche authored
      Add a regression test that validates that mremap fails for vmas that
      don't exist.
      
      Link: https://lkml.kernel.org/r/20220427224439.23828-3-dossche.niels@gmail.com
      Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      99947153
    • mm/rmap: Fix typos in comments · dd062302
      Adrian Huang authored
      Fix spelling/grammar mistakes in comments.
      
      Link: https://lkml.kernel.org/r/20220428061522.666-1-adrianhuang0701@gmail.com
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dd062302
    • mm/swapops: make is_pmd_migration_entry more strict · b304c6f0
      Hongchen Zhang authored
      A pmd migration entry should first be a swap pmd, so use is_swap_pmd(pmd)
      instead of !pmd_present(pmd).
      
      On the other hand, some architectures (MIPS for example) may misjudge a
      pmd_none entry as a pmd migration entry.
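      
      With the change, the check reads roughly as follows (sketch of the
      stricter form described above):
      
          static inline int is_pmd_migration_entry(pmd_t pmd)
          {
                  /*
                   * A migration entry must first be a swap pmd; !pmd_present()
                   * alone could also match e.g. a pmd_none entry on MIPS.
                   */
                  return is_swap_pmd(pmd) &&
                         is_migration_entry(pmd_to_swp_entry(pmd));
          }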
      
      Link: https://lkml.kernel.org/r/1651131333-6386-1-git-send-email-zhanghongchen@loongson.cn
      Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b304c6f0
    • mmap locking API: fix missed mmap_sem references in comments · 5b449489
      Florian Rommel authored
      Commit c1e8d7c6 ("mmap locking API: convert mmap_sem comments") missed
      replacing some references of mmap_sem by mmap_lock due to misspelling
      (mm_sem instead of mmap_sem).
      
      Link: https://lkml.kernel.org/r/20220503113333.214124-1-mail@florommel.de
      Signed-off-by: Florian Rommel <mail@florommel.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b449489
    • mm: make minimum slab alignment a runtime property · d949a815
      Peter Collingbourne authored
      When CONFIG_KASAN_HW_TAGS is enabled we currently increase the minimum
      slab alignment to 16.  This happens even if MTE is not supported in
      hardware or disabled via kasan=off, which creates an unnecessary memory
      overhead in those cases.  Eliminate this overhead by making the minimum
      slab alignment a runtime property and only aligning to 16 if KASAN is
      enabled at runtime.
      
      On a DragonBoard 845c (non-MTE hardware) with a kernel built with
      CONFIG_KASAN_HW_TAGS, waiting for quiescence after a full Android boot I
      see the following Slab measurements in /proc/meminfo (median of 3
      reboots):
      
      Before: 169020 kB
      After:  167304 kB
      
      [akpm@linux-foundation.org: make slab alignment type `unsigned int' to avoid casting]
      Link: https://linux-review.googlesource.com/id/I752e725179b43b144153f4b6f584ceb646473ead
      Link: https://lkml.kernel.org/r/20220427195820.1716975-2-pcc@google.com
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d949a815
    • printk: stop including cache.h from printk.h · 534aa1dc
      Peter Collingbourne authored
      An inclusion of cache.h in printk.h was added in 2014 in commit
      c28aa1f0 ("printk/cache: mark printk_once test variable
      __read_mostly") in order to bring in the definition of __read_mostly.  The
      usage of __read_mostly was later removed in commit 3ec25826 ("printk:
      Tie printk_once / printk_deferred_once into .data.once for reset") which
      made the inclusion of cache.h unnecessary, so remove it.
      
      We have a small amount of code that depended on the inclusion of cache.h
      from printk.h; fix that code to include the appropriate header.
      
      This fixes a circular inclusion on arm64 (linux/printk.h -> linux/cache.h
      -> asm/cache.h -> linux/kasan-enabled.h -> linux/static_key.h ->
      linux/jump_label.h -> linux/bug.h -> asm/bug.h -> linux/printk.h) that
      would otherwise be introduced by the next patch.
      
      Build tested using {allyesconfig,defconfig} x {arm64,x86_64}.
      
      Link: https://linux-review.googlesource.com/id/I8fd51f72c9ef1f2d6afd3b2cbc875aa4792c1fba
      Link: https://lkml.kernel.org/r/20220427195820.1716975-1-pcc@google.com
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      534aa1dc
    • mm: rmap: use flush_cache_range() to flush cache for hugetlb pages · dfc7ab57
      Baolin Wang authored
      Currently we use flush_cache_page() to flush the cache for anonymous
      hugetlb pages when unmapping or migrating a hugetlb page mapping, but
      flush_cache_page() only handles a PAGE_SIZE range on some architectures
      (like arm32, arc and so on), which can cause potential cache issues.
      Thus change to flush_cache_range() to cover the whole size of a
      hugetlb page.
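      
      The fix amounts to flushing the full huge-page extent rather than a
      single PAGE_SIZE page; schematically (a sketch of the pattern in the
      rmap walkers, not the exact diff):
      
          if (folio_test_hugetlb(folio)) {
                  /* Cover the whole huge page, not just one PAGE_SIZE chunk. */
                  flush_cache_range(vma, range.start, range.end);
          } else {
                  flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
          }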
      
      Link: https://lkml.kernel.org/r/dc903b378d1e2d26bbbe85409ab9d009631f175c.1651056365.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dfc7ab57