1. 25 Mar, 2022 40 commits
    • Kees Cook's avatar
      selftests: kselftest framework: provide "finished" helper · 25fd2d41
      Kees Cook authored
      Instead of having each time that wants to use ksft_exit() have to figure
      out the internals of kselftest.h, add the helper ksft_finished() that
      makes sure the passes, xfails, and skips are equal to the test plan count.
      
      Link: https://lkml.kernel.org/r/20220201013717.2464392-1-keescook@chromium.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25fd2d41
    • Johannes Weiner's avatar
      mm: madvise: MADV_DONTNEED_LOCKED · 9457056a
      Johannes Weiner authored
      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
      and MCL_ONFAULT allowing to mlock without populating, there are valid use
      cases for depopulating locked ranges as well.
      
      Users mlock memory to protect secrets.  There are allocators for secure
      buffers that want in-use memory generally mlocked, but cleared and
      invalidated memory to give up the physical pages.  This could be done with
      explicit munlock -> mlock calls on free -> alloc of course, but that adds
      two unnecessary syscalls, heavy mmap_sem write locks, vma splits and
      re-merges - only to get rid of the backing pages.
      
      Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are
      okay with on-demand initial population.  It seems valid to selectively
      free some memory during the lifetime of such a process, without having to
      mess with its overall policy.
      
      Why add a separate flag? Isn't this a pretty niche usecase?
      
      - MADV_DONTNEED has been bailing on locked vmas forever. It's at least
        conceivable that someone, somewhere is relying on mlock to protect
        data from perhaps broader invalidation calls. Changing this behavior
        now could lead to quiet data corruption.
      
      - It also clarifies expectations around MADV_FREE and maybe
        MADV_REMOVE. It avoids the situation where one quietly behaves
        different than the others. MADV_FREE_LOCKED can be added later.
      
      - The combination of mlock() and madvise() in the first place is
        probably niche. But where it happens, I'd say that dropping pages
        from a locked region once they don't contain secrets or won't page
        anymore is much saner than relying on mlock to protect memory from
        speculative or errant invalidation calls. It's just that we can't
        change the default behavior because of the two previous points.
      
      Given that, an explicit new flag seems to make the most sense.
      
      [hannes@cmpxchg.org: fix mips build]
      
      Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9457056a
    • Mauricio Faria de Oliveira's avatar
      mm: fix race between MADV_FREE reclaim and blkdev direct IO read · 6c8e2a25
      Mauricio Faria de Oliveira authored
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if the buffers have been called madvise(MADV_FREE)
      on earlier (this is discussed below) due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped; similarly to the page is not freed
      per shrink_page_list()/page_ref_freeze()).
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
      which should not or be rare, the page refcount should be greater than
      mapcount: the head page is referenced by tail pages.  That also prevents
      checking the head `page` then incorrectly call page_remove_rmap(subpage)
      for a tail page, that isn't even in the shrink_page_list()'s page_list (an
      effect of split huge pmd/pmvw), as it might happen today in this unlikely
      scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seem to
      assume that 'writes' are memory access from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly)..  plus the kernel fix implementation for the corner case of the
      largely 'non-atomic write' encompassed by a direct IO read operation, is
      relatively simple; and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: default avatarMauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c8e2a25
    • Anshuman Khandual's avatar
      mm: generalize ARCH_HAS_FILTER_PGPROT · 24e988c7
      Anshuman Khandual authored
      ARCH_HAS_FILTER_PGPROT config has duplicate definitions on platforms that
      subscribe it.  Instead make it a generic config option which can be
      selected on applicable platforms when required.
      
      Link: https://lkml.kernel.org/r/1643004823-16441-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24e988c7
    • Hugh Dickins's avatar
      mm: unmap_mapping_range_tree() with i_mmap_rwsem shared · 2c865995
      Hugh Dickins authored
      Revert 48ec833b ("Revert "mm/memory.c: share the i_mmap_rwsem"") to
      reinstate c8475d14 ("mm/memory.c: share the i_mmap_rwsem"): the
      unmap_mapping_range family of functions do the unmapping of user pages
      (ultimately via zap_page_range_single) without modifying the interval tree
      itself, and unmapping races are necessarily guarded by page table lock,
      thus the i_mmap_rwsem should be shared in unmap_mapping_pages() and
      unmap_mapping_folio().
      
      Commit 48ec833b was intended as a short-term measure, allowing the
      other shared lock changes into 3.19 final, before investigating three
      trinity crashes, one of which had been bisected to commit c8475d14:
      
      [1] https://lkml.org/lkml/2014/11/14/342
      https://lore.kernel.org/lkml/5466142C.60100@oracle.com/
      [2] https://lkml.org/lkml/2014/12/22/213
      https://lore.kernel.org/lkml/549832E2.8060609@oracle.com/
      [3] https://lkml.org/lkml/2014/12/9/741
      https://lore.kernel.org/lkml/5487ACC5.1010002@oracle.com/
      
      Two of those were Bad page states: free_pages_prepare() found PG_mlocked
      still set - almost certain to have been fixed by 4.4 commit b87537d9
      ("mm: rmap use pte lock not mmap_sem to set PageMlocked").  The NULL deref
      on rwsem in [2]: unclear, only happened once, not bisected to c8475d14.
      
      No change to the i_mmap_lock_write() around __unmap_hugepage_range_final()
      in unmap_single_vma(): IIRC that's a special usage, helping to serialize
      hugetlbfs page table sharing, not to be dabbled with lightly.  No change
      to other uses of i_mmap_lock_write() by hugetlbfs.
      
      I am not aware of any significant gains from the concurrency allowed by
      this commit: it is submitted more to resolve an ancient misunderstanding.
      
      Link: https://lkml.kernel.org/r/e4a5e356-6c87-47b2-3ce8-c2a95ae84e20@google.comSigned-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c865995
    • Hugh Dickins's avatar
      mm: warn on deleting redirtied only if accounted · 566d3362
      Hugh Dickins authored
      filemap_unaccount_folio() has a WARN_ON_ONCE(folio_test_dirty(folio)).  It
      is good to warn of late dirtying on a persistent filesystem, but late
      dirtying on tmpfs can only lose data which is expected to be thrown away;
      and it's a pity if that warning comes ONCE on tmpfs, then hides others
      which really matter.  Make it conditional on mapping_cap_writeback().
      
      Cleanup: then folio_account_cleaned() no longer needs to check that for
      itself, and so no longer needs to know the mapping.
      
      Link: https://lkml.kernel.org/r/b5a1106c-7226-a5c6-ad41-ad4832cae1f@google.comSigned-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      566d3362
    • David Hildenbrand's avatar
      mm/huge_memory: remove stale locking logic from __split_huge_pmd() · 7f760917
      David Hildenbrand authored
      Let's remove the stale logic that was required for reuse_swap_page().
      
      [akpm@linux-foundation.org: simplification, per Yang Shi]
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-10-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f760917
    • David Hildenbrand's avatar
      mm/huge_memory: remove stale page_trans_huge_mapcount() · 55c62fa7
      David Hildenbrand authored
      All users are gone, let's remove it.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-9-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55c62fa7
    • David Hildenbrand's avatar
      mm/swapfile: remove stale reuse_swap_page() · 03104c2c
      David Hildenbrand authored
      All users are gone, let's remove it.  We'll let SWP_STABLE_WRITES stick
      around for now, as it might come in handy in the near future.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-8-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03104c2c
    • David Hildenbrand's avatar
      mm/khugepaged: remove reuse_swap_page() usage · 363106c4
      David Hildenbrand authored
      reuse_swap_page() currently indicates if we can write to an anon page
      without COW.  A COW is required if the page is shared by multiple
      processes (either already mapped or via swap entries) or if there is
      concurrent writeback that cannot tolerate concurrent page modifications.
      
      However, in the context of khugepaged we're not actually going to write to
      a read-only mapped page, we'll copy the page content to our newly
      allocated THP and map that THP writable.  All we have to make sure is that
      the read-only mapped page we're about to copy won't get reused by another
      process sharing the page, otherwise, page content would get modified.  But
      that is already guaranteed via multiple mechanisms (e.g., holding a
      reference, holding the page lock, removing the rmap after copying the
      page).
      
      The swapcache handling was introduced in commit 10359213 ("mm:
      incorporate read-only pages into transparent huge pages") and it sounds
      like it merely wanted to mimic what do_swap_page() would do when trying to
      map a page obtained via the swapcache writable.
      
      As that logic is unnecessary, let's just remove it, removing the last user
      of reuse_swap_page().
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-7-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      363106c4
    • David Hildenbrand's avatar
      mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() · 3bff7e3f
      David Hildenbrand authored
      We currently have a different COW logic for anon THP than we have for
      ordinary anon pages in do_wp_page(): the effect is that the issue reported
      in CVE-2020-29374 is currently still possible for anon THP: an unintended
      information leak from the parent to the child.
      
      Let's apply the same logic (page_count() == 1), with similar optimizations
      to remove additional references first as we really want to avoid
      PTE-mapping the THP and copying individual pages best we can.
      
      If we end up with a page that has page_count() != 1, we'll have to PTE-map
      the THP and fallback to do_wp_page(), which will always copy the page.
      
      Note that KSM does not apply to THP.
      
      I. Interaction with the swapcache and writeback
      
      While a THP is in the swapcache, the swapcache holds one reference on each
      subpage of the THP.  So with PageSwapCache() set, we expect as many
      additional references as we have subpages.  If we manage to remove the THP
      from the swapcache, all these references will be gone.
      
      Usually, a THP is not split when entered into the swapcache and stays a
      compound page.  However, try_to_unmap() will PTE-map the THP and use PTE
      swap entries.  There are no PMD swap entries for that purpose,
      consequently, we always only swapin subpages into PTEs.
      
      Removing a page from the swapcache can fail either when there are
      remaining swap entries (in which case COW is the right thing to do) or if
      the page is currently under writeback.
      
      Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
      possible only in corner cases, for example, if try_to_unmap() failed after
      adding the page to the swapcache.  However, it's comparatively easy to
      handle.
      
      As we have to fully unmap a THP before starting writeback, and swapin is
      always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in
      the swapcache that is under writeback.  This should at least leave
      writeback out of the picture.
      
      II. Interaction with GUP references
      
      Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
      will result in PTE-mapping the THP on a write fault.  Similar to ordinary
      anon pages, do_wp_page() will have to copy sub-pages and result in a
      disconnect between the GUP references and the pages actually mapped into
      the page tables.  To improve the situation in the future, we'll need
      additional handling to mark anonymous pages as definitely exclusive to a
      single process, only allow GUP pins on exclusive anon pages, and disallow
      sharing of exclusive anon pages with GUP pins e.g., during fork().
      
      III. Interaction with references from LRU pagevecs
      
      There is no need to try draining the (local) LRU pagevecs in case we would
      stumble over a !PageLRU() page: folio_add_lru() and friends will always
      flush the affected pagevec after adding a compound page to it immediately
      -- pagevec_add_and_need_flush() always returns "true" for them.  Note that
      the LRU pagevecs will hold a reference on the compound page for a very
      short time, between adding the page to the pagevec and draining it
      immediately afterwards.
      
      IV. Interaction with speculative/temporary references
      
      Similar to ordinary anon pages, other speculative/temporary references on
      the THP, for example, from the pagecache or page migration code, will
      disallow exclusive reuse of the page.  We'll have to PTE-map the THP.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-6-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3bff7e3f
    • David Hildenbrand's avatar
      mm: streamline COW logic in do_swap_page() · c145e0b4
      David Hildenbrand authored
      Currently we have a different COW logic when:
      * triggering a read-fault to swapin first and then trigger a write-fault
        -> do_swap_page() + do_wp_page()
      * triggering a write-fault to swapin
        -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()
      
      The COW logic in do_swap_page() is different than our reuse logic in
      do_wp_page().  The COW logic in do_wp_page() -- page_count() == 1 -- makes
      currently sure that we certainly don't have a remaining reference, e.g.,
      via GUP, on the target page we want to reuse: if there is any unexpected
      reference, we have to copy to avoid information leaks.
      
      As do_swap_page() behaves differently, in environments with swap enabled
      we can currently have an unintended information leak from the parent to
      the child, similar as known from CVE-2020-29374:
      
      	1. Parent writes to anonymous page
      	-> Page is mapped writable and modified
      	2. Page is swapped out
      	-> Page is unmapped and replaced by swap entry
      	3. fork()
      	-> Swap entries are copied to child
      	4. Child pins page R/O
      	-> Page is mapped R/O into child
      	5. Child unmaps page
      	-> Child still holds GUP reference
      	6. Parent writes to page
      	-> Page is reused in do_swap_page()
      	-> Child can observe changes
      
      Exchanging 2. and 3. should have the same effect.
      
      Let's apply the same COW logic as in do_wp_page(), conditionally trying to
      remove the page from the swapcache after freeing the swap entry, however,
      before actually mapping our page.  We can change the order now that we use
      try_to_free_swap(), which doesn't care about the mapcount, instead of
      reuse_swap_page().
      
      To handle references from the LRU pagevecs, conditionally drain the local
      LRU pagevecs when required, however, don't consider the page_count() when
      deciding whether to drain to keep it simple for now.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-5-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c145e0b4
    • David Hildenbrand's avatar
      mm: slightly clarify KSM logic in do_swap_page() · 84d60fdd
      David Hildenbrand authored
      Let's make it clearer that KSM might only have to copy a page in case we
      have a page in the swapcache, not if we allocated a fresh page and
      bypassed the swapcache.  While at it, add a comment why this is usually
      necessary and merge the two swapcache conditions.
      
      [akpm@linux-foundation.org: fix comment, per David]
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-4-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84d60fdd
    • David Hildenbrand's avatar
      mm: optimize do_wp_page() for fresh pages in local LRU pagevecs · d4c47097
      David Hildenbrand authored
      For example, if a page just got swapped in via a read fault, the LRU
      pagevecs might still hold a reference to the page.  If we trigger a write
      fault on such a page, the additional reference from the LRU pagevecs will
      prohibit reusing the page.
      
      Let's conditionally drain the local LRU pagevecs when we stumble over a
      !PageLRU() page.  We cannot easily drain remote LRU pagevecs and it might
      not be desirable performance-wise.  Consequently, this will only avoid
      copying in some cases.
      
      Add a simple "page_count(page) > 3" check first but keep the
      "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to
      minimize cases where we remove a page from the swapcache but won't be able
      to reuse it, for example, because another process has it mapped R/O, to
      not affect reclaim.
      
      We cannot easily handle the following cases and we will always have to
      copy:
      
      (1) The page is referenced in the LRU pagevecs of other CPUs. We really
          would have to drain the LRU pagevecs of all CPUs -- most probably
          copying is much cheaper.
      
      (2) The page is already PageLRU() but is getting moved between LRU
          lists, for example, for activation (e.g., mark_page_accessed()),
          deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
          drain mostly unconditionally, which might be bad performance-wise.
          Most probably this won't happen too often in practice.
      
      Note that there are other reasons why an anon page might temporarily not
      be PageLRU(): for example, compaction and migration have to isolate LRU
      pages from the LRU lists first (isolate_lru_page()), moving them to
      temporary local lists and clearing PageLRU() and holding an additional
      reference on the page.  In that case, we'll always copy.
      
      This change seems to be fairly effective with the reproducer [1] shared by
      Nadav, as long as writeback is done synchronously, for example, using
      zram.  However, with asynchronous writeback, we'll usually fail to free
      the swapcache because the page is still under writeback: something we
      cannot easily optimize for, and maybe it's not really relevant in
      practice.
      
      [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d4c47097
    • David Hildenbrand's avatar
      mm: optimize do_wp_page() for exclusive pages in the swapcache · 53a05ad9
      David Hildenbrand authored
      Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.
      
      This series attempts to optimize and streamline the COW logic for ordinary
      anon pages and THP anon pages, fixing two remaining instances of
      CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
      can leak from a parent process to a child process via anonymous pages
      shared during fork().
      
      This issue, including other related COW issues, has been summarized in [2]:
      
       "1. Observing Memory Modifications of Private Pages From A Child Process
      
        Long story short: process-private memory might not be as private as you
        think once you fork(): successive modifications of private memory
        regions in the parent process can still be observed by the child
        process, for example, by smart use of vmsplice()+munmap().
      
        The core problem is that pinning pages readable in a child process, such
        as done via the vmsplice system call, can result in a child process
        observing memory modifications done in the parent process the child is
        not supposed to observe. [1] contains an excellent summary and [2]
        contains further details. This issue was assigned CVE-2020-29374 [9].
      
        For this to trigger, it's required to use a fork() without subsequent
        exec(), for example, as used under Android zygote. Without further
        details about an application that forks less-privileged child processes,
        one cannot really say what's actually affected and what's not -- see the
        details section the end of this mail for a short sshd/openssh analysis.
      
        While commit 17839856 ("gup: document and work around "COW can break
        either way" issue") fixed this issue and resulted in other problems
        (e.g., ptrace on pmem), commit 09854ba9 ("mm: do_wp_page()
        simplification") re-introduced part of the problem unfortunately.
      
        The original reproducer can be modified quite easily to use THP [3] and
        make the issue appear again on upstream kernels. I modified it to use
        hugetlb [4] and it triggers as well. The problem is certainly less
        severe with hugetlb than with THP; it merely highlights that we still
        have plenty of open holes we should be closing/fixing.
      
        Regarding vmsplice(), the only known workaround is to disallow the
        vmsplice() system call ... or disable THP and hugetlb. But who knows
        what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
        the end, it's a more generic issue"
      
      This security issue was first reported by Jann Horn on 27 May 2020 and it
      currently affects anonymous pages during swapin, anonymous THP and hugetlb.
      This series tackles anonymous pages during swapin and anonymous THP:
      
       - do_swap_page() for handling COW on PTEs during swapin directly
      
       - do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
         faults
      
      With this series, we'll apply the same COW logic we have in do_wp_page()
      to all swappable anon pages: don't reuse (map writable) the page in
      case there are additional references (page_count() != 1). All users of
      reuse_swap_page() are remove, and consequently reuse_swap_page() is
      removed.
      
      In general, we're struggling with the following COW-related issues:
      
      (1) "missed COW": we miss to copy on write and reuse the page (map it
          writable) although we must copy because there are pending references
          from another process to this page. The result is a security issue.
      
      (2) "wrong COW": we copy on write although we wouldn't have to and
          shouldn't: if there are valid GUP references, they will become out
          of sync with the pages mapped into the page table. We fail to detect
          that such a page can be reused safely, especially if never more than
          a single process mapped the page. The result is an intra process
          memory corruption.
      
      (3) "unnecessary COW": we copy on write although we wouldn't have to:
          performance degradation and temporary increases swap+memory
          consumption can be the result.
      
      While this series fixes (1) for swappable anon pages, it tries to reduce
      reported cases of (3) first as good and easy as possible to limit the
      impact when streamlining.  The individual patches try to describe in
      which cases we will run into (3).
      
      This series certainly makes (2) worse for THP, because a THP will now
      get PTE-mapped on write faults if there are additional references, even
      if there was only ever a single process involved: once PTE-mapped, we'll
      copy each and every subpage and won't reuse any subpage as long as the
      underlying compound page wasn't split.
      
      I'm working on an approach to fix (2) and improve (3): PageAnonExclusive
      to mark anon pages that are exclusive to a single process, allow GUP
      pins only on such exclusive pages, and allow turning exclusive pages
      shared (clearing PageAnonExclusive) only if there are no GUP pins.  Anon
      pages with PageAnonExclusive set never have to be copied during write
      faults, but eventually during fork() if they cannot be turned shared.
      The improved reuse logic in this series will essentially also be the
      logic to reset PageAnonExclusive.  This work will certainly take a
      while, but I'm planning on sharing details before having code fully
      ready.
      
      #1-#5 can be applied independently of the rest. #6-#9 are mostly only
      cleanups related to reuse_swap_page().
      
      Notes:
      * For now, I'll leave hugetlb code untouched: "unnecessary COW" might
        easily break existing setups because hugetlb pages are a scarce resource
        and we could just end up having to crash the application when we run out
        of hugetlb pages. We have to be very careful and the security aspect with
        hugetlb is most certainly less relevant than for unprivileged anon pages.
      * Instead of lru_add_drain() we might actually just drain the lru_add list
        or even just remove the single page of interest from the lru_add list.
        This would require a new helper function, and could be added if the
        conditional lru_add_drain() turn out to be a problem.
      * I extended the test case already included in [1] to also test for the
        newly found do_swap_page() case. I'll send that out separately once/if
        this part was merged.
      
      [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      This patch (of 9):
      
      Liang Zhang reported [1] that the current COW logic in do_wp_page() is
      sub-optimal when it comes to swap+read fault+write fault of anonymous
      pages that have a single user, visible via a performance degradation in
      the redis benchmark.  Something similar was previously reported [2] by
      Nadav with a simple reproducer.
      
      After we put an anon page into the swapcache and unmapped it from a single
      process, that process might read that page again and refault it read-only.
      If that process then writes to that page, the process is actually the
      exclusive user of the page, however, the COW logic in do_co_page() won't
      be able to reuse it due to the additional reference from the swapcache.
      
      Let's optimize for pages that have been added to the swapcache but only
      have an exclusive user.  Try removing the swapcache reference if there is
      hope that we're the exclusive user.
      
      We will fail removing the swapcache reference in two scenarios:
      (1) There are additional swap entries referencing the page: copying
          instead of reusing is the right thing to do.
      (2) The page is under writeback: theoretically we might be able to reuse
          in some cases, however, we cannot remove the additional reference
          and will have to copy.
      
      Note that we'll only try removing the page from the swapcache when it's
      highly likely that we'll be the exclusive owner after removing the page
      from the swapache.  As we're about to map that page writable and redirty
      it, that should not affect reclaim but is rather the right thing to do.
      
      Further, we might have additional references from the LRU pagevecs, which
      will force us to copy instead of being able to reuse.  We'll try handling
      such references for some scenarios next.  Concurrent writeback cannot be
      handled easily and we'll always have to copy.
      
      While at it, remove the superfluous page_mapcount() check: it's
      implicitly covered by the page_count() for ordinary anon pages.
      
      [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
      [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.comSigned-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarLiang Zhang <zhangliang5@huawei.com>
      Reported-by: default avatarNadav Amit <nadav.amit@gmail.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53a05ad9
    • Miaohe Lin's avatar
      mm/huge_memory: make is_transparent_hugepage() static · 562beb72
      Miaohe Lin authored
      It's only used inside the huge_memory.c now. Don't export it and make
      it static. We can thus reduce the size of huge_memory.o a bit.
      
      Without this patch:
         text	   data	    bss	    dec	    hex	filename
        32319	   2965	      4	  35288	   89d8	mm/huge_memory.o
      
      With this patch:
         text	   data	    bss	    dec	    hex	filename
        32042	   2957	      4	  35003	   88bb	mm/huge_memory.o
      
      Link: https://lkml.kernel.org/r/20220302082145.12028-1-linmiaohe@huawei.comSigned-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      562beb72
    • Mike Kravetz's avatar
      userfaultfd/selftests: enable hugetlb remap and remove event testing · 9ae8f2b8
      Mike Kravetz authored
      With MADV_DONTNEED support added to hugetlb mappings, mremap testing can
      also be enabled for hugetlb.
      
      Modify the tests to use madvise MADV_DONTNEED and MADV_REMOVE instead of
      fallocate hole puch for releasing hugetlb pages.
      
      Link: https://lkml.kernel.org/r/20220215002348.128823-4-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ae8f2b8
    • Mike Kravetz's avatar
      selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test · c4b6cb88
      Mike Kravetz authored
      Now that MADV_DONTNEED support for hugetlb is enabled, add corresponding
      tests.  MADV_REMOVE has been enabled for some time, but no tests exist so
      add them as well.
      
      Link: https://lkml.kernel.org/r/20220215002348.128823-3-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4b6cb88
    • Mike Kravetz's avatar
      mm: enable MADV_DONTNEED for hugetlb mappings · 90e7e7f5
      Mike Kravetz authored
      Patch series "Add hugetlb MADV_DONTNEED support", v3.
      
      Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
      testing.  However, mremap support was recently added in commit
      550a7d60 ("mm, hugepages: add mremap() support for hugepage backed
      vma").  While attempting to enable mremap support in the test, it was
      discovered that the mremap test indirectly depends on MADV_DONTNEED.
      
      madvise does not allow MADV_DONTNEED for hugetlb mappings.  However, that
      is primarily due to the check in can_madv_lru_vma().  By simply removing
      the check and adding huge page alignment, MADV_DONTNEED can be made to
      work for hugetlb mappings.
      
      Do note that there is no compelling use case for adding this support.
      This was discussed in the RFC [1].  However, adding support makes sense as
      it is fairly trivial and brings hugetlb functionality more in line with
      'normal' memory.
      
      After enabling support, add selftest for MADV_DONTNEED as well as
      MADV_REMOVE.  Then update userfaultfd selftest.
      
      If new functionality is accepted, then madvise man page will be updated to
      indicate hugetlb is supported.  It will also be updated to clarify what
      happens to the passed length argument.
      
      This patch (of 3):
      
      MADV_DONTNEED is currently disabled for hugetlb mappings.  This certainly
      makes sense in shared file mappings as the pagecache maintains a reference
      to the page and it will never be freed.  However, it could be useful to
      unmap and free pages in private mappings.  In addition, userfaultfd minor
      fault users may be able to simplify code by using MADV_DONTNEED.
      
      The primary thing preventing MADV_DONTNEED from working on hugetlb
      mappings is a check in can_madv_lru_vma().  To allow support for hugetlb
      mappings create and use a new routine madvise_dontneed_free_valid_vma()
      that allows hugetlb mappings in this specific case.
      
      For normal mappings, madvise requires the start address be PAGE aligned
      and rounds up length to the next multiple of PAGE_SIZE.  Do similarly for
      hugetlb mappings: require start address be huge page size aligned and
      round up length to the next multiple of huge page size.  Use the new
      madvise_dontneed_free_valid_vma routine to check alignment and round up
      length/end.  zap_page_range requires this alignment for hugetlb vmas
      otherwise we will hit BUGs.
      
      Link: https://lkml.kernel.org/r/20220215002348.128823-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220215002348.128823-2-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90e7e7f5
    • Andrey Konovalov's avatar
      kasan: disable LOCKDEP when printing reports · c32caa26
      Andrey Konovalov authored
      If LOCKDEP detects a bug while KASAN is printing a report and if
      panic_on_warn is set, KASAN will not be able to finish.  Disable LOCKDEP
      while KASAN is printing a report.
      
      See https://bugzilla.kernel.org/show_bug.cgi?id=202115 for an example
      of the issue.
      
      Link: https://lkml.kernel.org/r/c48a2a3288200b07e1788b77365c2f02784cfeb4.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c32caa26
    • Andrey Konovalov's avatar
      kasan: move and hide kasan_save_enable/restore_multi_shot · 80207910
      Andrey Konovalov authored
       - Move kasan_save_enable/restore_multi_shot() declarations to
         mm/kasan/kasan.h, as there is no need for them to be visible outside
         of KASAN implementation.
      
       - Only define and export these functions when KASAN tests are enabled.
      
       - Move their definitions closer to other test-related code in report.c.
      
      Link: https://lkml.kernel.org/r/6ba637333b78447f027d775f2d55ab1a40f63c99.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80207910
    • Andrey Konovalov's avatar
      kasan: reorder reporting functions · 865bfa28
      Andrey Konovalov authored
      Move print_error_description()'s, report_suppressed()'s, and
      report_enabled()'s definitions to improve the logical order of function
      definitions in report.c.
      
      No functional changes.
      
      Link: https://lkml.kernel.org/r/82aa926c411e00e76e97e645a551ede9ed0c5e79.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      865bfa28
    • Andrey Konovalov's avatar
      kasan: respect KASAN_BIT_REPORTED in all reporting routines · c068664c
      Andrey Konovalov authored
      Currently, only kasan_report() checks the KASAN_BIT_REPORTED and
      KASAN_BIT_MULTI_SHOT flags.
      
      Make other reporting routines check these flags as well.
      
      Also add explanatory comments.
      
      Note that the current->kasan_depth check is split out into
      report_suppressed() and only called for kasan_report().
      
      Link: https://lkml.kernel.org/r/715e346b10b398e29ba1b425299dcd79e29d58ce.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c068664c
    • Andrey Konovalov's avatar
      kasan: add comment about UACCESS regions to kasan_report · 795b760f
      Andrey Konovalov authored
      Add a comment explaining why kasan_report() is the only reporting function
      that uses user_access_save/restore().
      
      Link: https://lkml.kernel.org/r/1201ca3c2be42c7bd077c53d2e46f4a51dd1476a.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      795b760f
    • Andrey Konovalov's avatar
      kasan: rename kasan_access_info to kasan_report_info · c965cdd6
      Andrey Konovalov authored
      Rename kasan_access_info to kasan_report_info, as the latter name better
      reflects the struct's purpose.
      
      Link: https://lkml.kernel.org/r/158a4219a5d356901d017352558c989533a0782c.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c965cdd6
    • Andrey Konovalov's avatar
      kasan: move and simplify kasan_report_async · bb2f967c
      Andrey Konovalov authored
      Place kasan_report_async() next to the other main reporting routines.
      Also simplify printed information.
      
      Link: https://lkml.kernel.org/r/52d942ef3ffd29bdfa225bbe8e327bc5bda7ab09.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb2f967c
    • Andrey Konovalov's avatar
      kasan: call print_report from kasan_report_invalid_free · 31c65110
      Andrey Konovalov authored
      Call print_report() in kasan_report_invalid_free() instead of calling
      printing functions directly.  Compared to the existing implementation of
      kasan_report_invalid_free(), print_report() makes sure that the buggy
      address has metadata before printing it.
      
      The change requires adding a report type field into kasan_access_info and
      using it accordingly.
      
      kasan_report_async() is left as is, as using print_report() will only
      complicate the code.
      
      Link: https://lkml.kernel.org/r/9ea6f0604c5d2e1fb28d93dc6c44232c1f8017fe.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31c65110
    • Andrey Konovalov's avatar
      kasan: merge __kasan_report into kasan_report · be8631a1
      Andrey Konovalov authored
      Merge __kasan_report() into kasan_report().  The code is simple enough to
      be readable without the __kasan_report() helper.
      
      Link: https://lkml.kernel.org/r/c8a125497ef82f7042b3795918dffb81a85a878e.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be8631a1
    • Andrey Konovalov's avatar
      kasan: restructure kasan_report · b3bb1d70
      Andrey Konovalov authored
      Restructure kasan_report() to make reviewing the subsequent patches
      easier.
      
      Link: https://lkml.kernel.org/r/ca28042889858b8cc4724d3d4378387f90d7a59d.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3bb1d70
    • Andrey Konovalov's avatar
      kasan: simplify kasan_find_first_bad_addr call sites · b9132800
      Andrey Konovalov authored
      Move the addr_has_metadata() check into kasan_find_first_bad_addr().
      
      Link: https://lkml.kernel.org/r/a49576f7a23283d786ba61579cb0c5057e8f0b9b.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9132800
    • Andrey Konovalov's avatar
      kasan: split out print_report from __kasan_report · 9d7b7dd9
      Andrey Konovalov authored
      Split out the part of __kasan_report() that prints things into
      print_report().  One of the subsequent patches makes another error handler
      use print_report() as well.
      
      Includes lower-level changes:
      
       - Allow addr_has_metadata() accepting a tagged address.
      
       - Drop the const qualifier from the fields of kasan_access_info to
         avoid excessive type casts.
      
       - Change the type of the address argument of __kasan_report() and
         end_report() to void * to reduce the number of type casts.
      
      Link: https://lkml.kernel.org/r/9be3ed99dd24b9c4e1c4a848b69a0c6ecefd845e.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d7b7dd9
    • Andrey Konovalov's avatar
      kasan: move disable_trace_on_warning to start_report · 0a6e8a07
      Andrey Konovalov authored
      Move the disable_trace_on_warning() call, which enables the
      /proc/sys/kernel/traceoff_on_warning interface for KASAN bugs, to
      start_report(), so that it functions for all types of KASAN reports.
      
      Link: https://lkml.kernel.org/r/7c066c5de26234ad2cebdd931adfe437f8a95d58.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a6e8a07
    • Andrey Konovalov's avatar
      kasan: move update_kunit_status to start_report · a260d281
      Andrey Konovalov authored
      Instead of duplicating calls to update_kunit_status() in every error
      report routine, call it once in start_report().  Pass the sync flag as an
      additional argument to start_report().
      
      Link: https://lkml.kernel.org/r/cae5c845a0b6f3c867014e53737cdac56b11edc7.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a260d281
    • Andrey Konovalov's avatar
      kasan: check CONFIG_KASAN_KUNIT_TEST instead of CONFIG_KUNIT · 49d9977a
      Andrey Konovalov authored
      Check the more specific CONFIG_KASAN_KUNIT_TEST config option when
      defining things related to KUnit-compatible KASAN tests instead of
      CONFIG_KUNIT.
      
      Also put the kunit_kasan_status definition next to the definitons of other
      KASAN-related structs.
      
      Link: https://lkml.kernel.org/r/223592d38d2a601a160a3b2b3d5a9f9090350e62.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49d9977a
    • Andrey Konovalov's avatar
      kasan: simplify kasan_update_kunit_status() and call sites · 3784c299
      Andrey Konovalov authored
       - Rename kasan_update_kunit_status() to update_kunit_status() (the
         function is static).
      
       - Move the IS_ENABLED(CONFIG_KUNIT) to the function's definition
         instead of duplicating it at call sites.
      
       - Obtain and check current->kunit_test within the function.
      
      Link: https://lkml.kernel.org/r/dac26d811ae31856c3d7666de0b108a3735d962d.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3784c299
    • Andrey Konovalov's avatar
      kasan: simplify async check in end_report() · 476b1dc2
      Andrey Konovalov authored
      Currently, end_report() does not call trace_error_report_end() for bugs
      detected in either async or asymm mode (when kasan_async_fault_possible()
      returns true), as the address of the bad access might be unknown.
      
      However, for asymm mode, the address is known for faults triggered by read
      operations.
      
      Instead of using kasan_async_fault_possible(), simply check that the addr
      is not NULL when calling trace_error_report_end().
      
      Link: https://lkml.kernel.org/r/1c8ce43f97300300e62c941181afa2eb738965c5.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      476b1dc2
    • Andrey Konovalov's avatar
      kasan: print basic stack frame info for SW_TAGS · 1e0f611f
      Andrey Konovalov authored
      Software Tag-Based mode tags stack allocations when CONFIG_KASAN_STACK
      is enabled. Print task name and id in reports for stack-related bugs.
      
      [andreyknvl@google.com: include linux/sched/task_stack.h]
        Link: https://lkml.kernel.org/r/d7598f11a34ed96e508f7640fa038662ed2305ec.1647099922.git.andreyknvl@google.com
      
      Link: https://lkml.kernel.org/r/029aaa87ceadde0702f3312a34697c9139c9fb53.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e0f611f
    • Andrey Konovalov's avatar
      kasan: improve stack frame info in reports · 16347c31
      Andrey Konovalov authored
       - Print at least task name and id for reports affecting allocas
         (get_address_stack_frame_info() does not support them).
      
       - Capitalize first letter of each sentence.
      
      Link: https://lkml.kernel.org/r/aa613f097c12f7b75efb17f2618ae00480fb4bc3.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16347c31
    • Andrey Konovalov's avatar
      kasan: rearrange stack frame info in reports · 0f9b35f3
      Andrey Konovalov authored
       - Move printing stack frame info before printing page info.
      
       - Add object_is_on_stack() check to print_address_description() and add
         a corresponding WARNING to kasan_print_address_stack_frame(). This
         looks more in line with the rest of the checks in this function and
         also allows to avoid complicating code logic wrt line breaks.
      
       - Clean up comments related to get_address_stack_frame_info().
      
      Link: https://lkml.kernel.org/r/1ee113a4c111df97d168c820b527cda77a3cac40.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f9b35f3
    • Andrey Konovalov's avatar
      kasan: more line breaks in reports · 038fd2b4
      Andrey Konovalov authored
      Add a line break after each part that describes the buggy address.
      Improves readability of reports.
      
      Link: https://lkml.kernel.org/r/8682c4558e533cd0f99bdb964ce2fe741f2a9212.1646237226.git.andreyknvl@google.comSigned-off-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      038fd2b4