1. 10 May, 2022 11 commits
    • David Hildenbrand's avatar
      mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages · 78fbe906
      David Hildenbrand authored
      The basic question we would like to have a reliable and efficient answer
      to is: is this anonymous page exclusive to a single process or might it be
      shared?  We need that information for ordinary/single pages, hugetlb
      pages, and possibly each subpage of a THP.
      
      Introduce a way to mark an anonymous page as exclusive, with the ultimate
      goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
      lose consistency with the pages mapped into the page table, resulting in
      reported memory corruptions.
      
      Most pageflags already have semantics for anonymous pages, however,
      PG_mappedtodisk should never apply to pages in the swapcache, so let's
      reuse that flag.
      
      As PG_has_hwpoisoned also uses that flag on the second tail page of a
      compound page, convert it to PG_error instead, which is marked as
      PF_NO_TAIL, so never used for tail pages.
      
      Use custom page flag modification functions such that we can do additional
      sanity checks.  The semantics we'll put into some kernel doc in the future
      are:
      
      "
        PG_anon_exclusive is *usually* only expressive in combination with a
        page table entry. Depending on the page table entry type it might
        store the following information:
      
             Is what's mapped via this page table entry exclusive to the
             single process and can be mapped writable without further
             checks? If not, it might be shared and we might have to COW.
      
        For now, we only expect PTE-mapped THPs to make use of
        PG_anon_exclusive in subpages. For other anonymous compound
        folios (i.e., hugetlb), only the head page is logically mapped and
        holds this information.
      
        For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
        set on the head page. When replacing the PMD by a page table full
        of PTEs, PG_anon_exclusive, if set on the head page, will be set on
        all tail pages accordingly. Note that converting from a PTE-mapping
        to a PMD mapping using the same compound page is currently not
        possible and consequently doesn't require care.
      
        If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
        it should only pin if the relevant PG_anon_exclusive is set. In that
        case, the pin will be fully reliable and stay consistent with the pages
        mapped into the page table, as the bit cannot get cleared (e.g., by
        fork(), KSM) while the page is pinned. For anonymous pages that
        are mapped R/W, PG_anon_exclusive can be assumed to always be set
        because such pages cannot possibly be shared.
      
        The page table lock protecting the page table entry is the primary
        synchronization mechanism for PG_anon_exclusive; GUP-fast that does
        not take the PT lock needs special care when trying to clear the
        flag.
      
        Page table entry types and PG_anon_exclusive:
        * Present: PG_anon_exclusive applies.
        * Swap: the information is lost. PG_anon_exclusive was cleared.
        * Migration: the entry holds this information instead.
                     PG_anon_exclusive was cleared.
        * Device private: PG_anon_exclusive applies.
        * Device exclusive: PG_anon_exclusive applies.
        * HW Poison: PG_anon_exclusive is stale and not changed.
      
        If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
        not allowed and the flag will stick around until the page is freed
        and folio->mapping is cleared.
      "
      
      We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
      zapping) of page table entries, page freeing code will handle that when
      also invalidate page->mapping to not indicate PageAnon() anymore.  Letting
      information about exclusivity stick around will be an important property
      when adding sanity checks to unpinning code.
      
      Note that we properly clear the flag in free_pages_prepare() via
      PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
      so there is no need to manually clear the flag.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      78fbe906
    • David Hildenbrand's avatar
      mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() · 50053941
      David Hildenbrand authored
      We can already theoretically fail to unmap (still having page_mapped()) in
      case arch_unmap_one() fails, which can happen on sparc.  Failures to unmap
      are handled gracefully, just as if there are other references on the
      target page: freezing the refcount in split_huge_page_to_list() will fail
      if still mapped and we'll simply remap.
      
      In commit 504e070d ("mm: thp: replace DEBUG_VM BUG with VM_WARN when
      unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE,
      let's get rid of it completely now.
      
      This is a preparation for making try_to_migrate() fail on anonymous pages
      with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more
      frequently.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-11-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      50053941
    • David Hildenbrand's avatar
      mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively · 6c54dc6c
      David Hildenbrand authored
      We want to mark anonymous pages exclusive, and when using
      page_move_anon_rmap() we know that we are the exclusive user, as properly
      documented.  This is a preparation for marking anonymous pages exclusive
      in page_move_anon_rmap().
      
      In both instances, we're holding page lock and are sure that we're the
      exclusive owner (page_count() == 1).  hugetlb already properly uses
      page_move_anon_rmap() in the write fault handler.
      
      Note that in case of a PTE-mapped THP, we'll only end up calling this
      function if the whole THP is only referenced by the single PTE mapping a
      single subpage (page_count() == 1); consequently, it's fine to modify the
      compound page mapping inside page_move_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-10-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6c54dc6c
    • David Hildenbrand's avatar
      mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() · 40f2bbf7
      David Hildenbrand authored
      New anonymous pages are always mapped natively: only THP/khugepaged code
      maps a new compound anonymous page and passes "true".  Otherwise, we're
      just dealing with simple, non-compound pages.
      
      Let's give the interface clearer semantics and document these.  Remove the
      PageTransCompound() sanity check from page_add_new_anon_rmap().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      40f2bbf7
    • David Hildenbrand's avatar
      mm/rmap: pass rmap flags to hugepage_add_anon_rmap() · 28c5209d
      David Hildenbrand authored
      Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for
      page_add_anon_rmap() now.  RMAP_COMPOUND is implicit for hugetlb pages and
      ignored.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      28c5209d
    • David Hildenbrand's avatar
      mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      David Hildenbrand authored
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we want to
      more often also pass RMAP_EXCLUSIVE in follow up patches when detecting
      that an anonymous page is exclusive: for example, when restoring an
      anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f1e2db12
    • David Hildenbrand's avatar
      mm/rmap: convert RMAP flags to a proper distinct rmap_t type · 14f9135d
      David Hildenbrand authored
      We want to pass the flags to more than one anon rmap function, getting rid
      of special "do_page_add_anon_rmap()".  So let's pass around a distinct
      __bitwise type and refine documentation.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      14f9135d
    • David Hildenbrand's avatar
      mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() · fb3d824d
      David Hildenbrand authored
      ...  and move the special check for pinned pages into
      page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages
      via a new pageflag, clearing it only after making sure that there are no
      GUP pins on the anonymous page.
      
      We really only care about pins on anonymous pages, because they are prone
      to getting replaced in the COW handler once mapped R/O.  For !anon pages
      in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about
      that, at least not that I could come up with an example.
      
      Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we
      know we're dealing with anonymous pages.  Also, drop the handling of
      pinned pages from copy_huge_pud() and add a comment if ever supporting
      anonymous pages on the PUD level.
      
      This is a preparation for tracking exclusivity of anonymous pages in the
      rmap code, and disallowing marking a page shared (-> failing to duplicate)
      if there are GUP pins on a page.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fb3d824d
    • David Hildenbrand's avatar
      mm/memory: slightly simplify copy_present_pte() · b51ad4f8
      David Hildenbrand authored
      Let's move the pinning check into the caller, to simplify return code
      logic and prepare for further changes: relocating the
      page_needs_cow_for_dma() into rmap handling code.
      
      While at it, remove the unused pte parameter and simplify the comments a
      bit.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b51ad4f8
    • David Hildenbrand's avatar
      mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() · 623a1ddf
      David Hildenbrand authored
      Let's do it just like copy_page_range(), taking the seqlock and making
      sure the mmap_lock is held in write mode.
      
      This allows for add a VM_BUG_ON to page_needs_cow_for_dma() and properly
      synchronizes concurrent fork() with GUP-fast of hugetlb pages, which will
      be relevant for further changes.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      623a1ddf
    • David Hildenbrand's avatar
      mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
      David Hildenbrand authored
      Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
      
      This series is the result of the discussion on the previous approach [2]. 
      More information on the general COW issues can be found there.  It is
      based on latest linus/master (post v5.17, with relevant core-MM changes
      for v5.18-rc1).
      
      This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
      on an anonymous page and COW logic fails to detect exclusivity of the page
      to then replacing the anonymous page by a copy in the page table: The GUP
      pin lost synchronicity with the pages mapped into the page tables.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 3):
      "
        3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
      
        page_maybe_dma_pinned() is used to check if a page may be pinned for
        DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
        tolerable, false negatives are problematic: pages that are pinned for
        DMA must not be added to the swapcache.  If it happens, the (now pinned)
        page could be faulted back from the swapcache into page tables
        read-only.  Future write-access would detect the pinning and COW the
        page, losing synchronicity.  For the interested reader, this is nicely
        documented in feb889fb ("mm: don't put pinned pages into the swap
        cache").
      
        Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
        cases and can result in a violation of the documented semantics: giving
        false negatives because of the race.
      
        There are cases where we call it without properly taking a per-process
        sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
        one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
        handle, there is especially one rmap case (shrink_page_list) that's hard
        to fix: in the rmap world, we're not limited to a single process.
      
        The shrink_page_list() issue is really subtle.  If we race with
        someone pinning a page, we can trigger the same issue as in the FOLL_GET
        case.  See the detail section at the end of this mail on a discussion
        how bad this can bite us with VFIO or other FOLL_PIN user.
      
        It's harder to reproduce, but I managed to modify the O_DIRECT
        reproducer to use io_uring fixed buffers [15] instead, which ends up
        using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
        similarly trigger a loss of synchronicity and consequently a memory
        corruption.
      
        Again, the root issue is that a write-fault on a page that has
        additional references results in a COW and thereby a loss of
        synchronicity and consequently a memory corruption if two parties
        believe they are referencing the same page.
      "
      
      This series makes GUP pins (R/O and R/W) on anonymous pages fully
      reliable, especially also taking care of concurrent pinning via GUP-fast,
      for example, also fully fixing an issue reported regarding NUMA balancing
      [4] recently.  While doing that, it further reduces "unnecessary COWs",
      especially when we don't fork()/KSM and don't swapout, and fixes the COW
      security for hugetlb for FOLL_PIN.
      
      In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
      anonymous page is exclusive.  Exclusive anonymous pages that are mapped
      R/O can directly be mapped R/W by the COW logic in the write fault
      handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
      first have to be marked shared -- which will fail if there are GUP pins on
      the page.  GUP is only allowed to take a pin on anonymous pages that are
      exclusive.  The PT lock is the primary mechanism to synchronize
      modifications of PG_anon_exclusive.  We synchronize against GUP-fast
      either via the src_mm->write_protect_seq (during fork()) or via
      clear/invalidate+flush of the relevant page table entry.
      
      Special care has to be taken about swap, migration, and THPs (whereby a
      PMD-mapping can be converted to a PTE mapping and we have to track
      information for subpages).  Besides these, we let the rmap code handle
      most magic.  For reliable R/O pins of anonymous pages, we need
      FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however,
      it's now 100% mapcount free and I further simplified it a bit.
      
        #1 is a fix
        #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
        #11 introduces PG_anon_exclusive
        #12 uses PG_anon_exclusive and make R/W pins of anonymous pages
         reliable
        #13 is a preparation for reliable R/O pins
        #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins
         make R/O pins of anonymous pages reliable
        #16 adds sanity check when (un)pinning anonymous pages
      
      [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
      
      
      This patch (of 17):
      
      In case arch_unmap_one() fails, we already did a swap_duplicate().  let's
      undo that properly via swap_free().
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
      Fixes: ca827d55 ("mm, swap: Add infrastructure for saving page metadata on swap")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarKhalid Aziz <khalid.aziz@oracle.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      322842ea
  2. 29 Apr, 2022 29 commits