1. 30 Nov, 2022 40 commits
    • mm: add bdi_set_strict_limit() function · 8e9d5ead
      Stefan Roesch authored
      Patch series "mm/block: add bdi sysfs knobs", v4.
      
      At Meta, network block devices (nbd) are used to implement remote block
      storage.  In testing and during production it has been observed that these
      network block devices can consume a huge portion of the dirty writeback
      cache and writeback can take a considerable time.
      
      To be able to give stricter limits, I'm proposing the following changes:
      
      1) introduce strictlimit knob
      
        Currently the max_ratio knob exists to limit the dirty_memory. However
        this knob only applies once (dirty_ratio + dirty_background_ratio) / 2
        has been reached.
        With the BDI_CAP_STRICTLIMIT flag, the max_ratio can be applied without
        reaching that limit. This change exposes that knob.
      
        This knob can also be useful for NFS, fuse filesystems and USB devices.
      
      2) Use part of 1000000 for internal calculations

        The max_ratio is based on percentage. With the current machine sizes
        percentage values can be very high (1% of a 256GB main memory is already
        2.56GB). This change uses part of 1000000 instead of percentages for the
        internal calculations.
      
      3) Introduce two new sysfs knobs: min_bytes and max_bytes.

        Currently all calculations are based on ratio, but for a user it is often
        more convenient to specify a limit in bytes. The new knobs will not
        store byte values; instead they will translate the byte value to a
        corresponding ratio. As the internal values are now part of 1000000, the
        ratio is closer to the specified value. However, the value should be
        seen more as an approximation, as it can fluctuate over time.
      
      
      4) Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine.

        The granularity for the existing sysfs bdi knobs min_ratio and max_ratio
        is based on percentage values. The new sysfs bdi knobs min_ratio_fine
        and max_ratio_fine allow specifying the ratio as part of 1 million.
      
      
      This patch (of 20):
      
      This adds the bdi_set_strict_limit function to be able to set/unset the
      BDI_CAP_STRICTLIMIT flag.
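
To illustrate why the series moves from percentages to part of 1000000, here is a minimal userspace sketch of the byte-to-ratio translation. The helper names and the use of total memory as the base are assumptions made for illustration; the kernel's own calculation is not shown in this log and may use a different base and units.

```c
#include <stdint.h>

/* Hypothetical scales: the old percentage granularity and the new fine one. */
#define RATIO_SCALE_PERCENT 100u
#define RATIO_SCALE_PPM     1000000u

/*
 * Translate a byte limit into a ratio of total memory at the given scale,
 * rounding down. The multiplication is done first to keep precision; it can
 * overflow for extremely large inputs, which this sketch does not guard.
 */
static uint64_t bytes_to_ratio(uint64_t limit_bytes, uint64_t total_bytes,
                               uint64_t scale)
{
    return limit_bytes * scale / total_bytes;
}

/* Translate a ratio at the given scale back into the byte limit it stands for. */
static uint64_t ratio_to_bytes(uint64_t ratio, uint64_t total_bytes,
                               uint64_t scale)
{
    return total_bytes * ratio / scale;
}
```

With 256 GiB of memory, a 100 MiB limit rounds down to 0 at percentage granularity, while at part-of-1000000 granularity it becomes 381, which maps back to roughly 99.9 MiB. This is the sense in which a byte value is only approximated by the stored ratio.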
      
      Link: https://lkml.kernel.org/r/20221119005215.3052436-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20221119005215.3052436-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chris Mason <clm@meta.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: allow TEST_MAPLE_TREE only when DEBUG_KERNEL is set · 845aad0a
      Randy Dunlap authored
      Prevent a kconfig warning that is caused by TEST_MAPLE_TREE by adding a
      "depends on" clause for TEST_MAPLE_TREE since 'select' does not follow any
      kconfig dependencies.
      
      WARNING: unmet direct dependencies detected for DEBUG_MAPLE_TREE
        Depends on [n]: DEBUG_KERNEL [=n]
        Selected by [y]:
        - TEST_MAPLE_TREE [=y] && RUNTIME_TESTING_MENU [=y]
      
      Link: https://lkml.kernel.org/r/20221119055117.14094-1-rdunlap@infradead.org
      Fixes: 120b1162 ("maple_tree: reorganize testing to restore module testing")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Revert "kmsan: unpoison @tlb in arch_tlb_gather_mmu()" · f6fbb8b2
      Alexander Potapenko authored
      This reverts commit ac801e7e.
      
      The patch in question was picked to -mm from the KMSAN v6 patch series
      (https://lore.kernel.org/linux-mm/20220905122452.2258262-1-glider@google.com/)
      and sneaked into mainline despite its removal from the v7 series
      (https://lore.kernel.org/linux-mm/20220915150417.722975-1-glider@google.com/)
      
      Currently KMSAN does not warn about origin chains hitting the maximum
      depth, so keeping @tlb poisoned won't result in any inconveniences.
      
      Link: https://lkml.kernel.org/r/20221110113541.1844156-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • folio-compat: remove try_to_release_page() · 7438899b
      Vishal Moola (Oracle) authored
      There are no more callers of try_to_release_page(), so remove it.  This
      saves 85 bytes of kernel text.
      
      Link: https://lkml.kernel.org/r/20221118073055.55694-5-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory-failure: convert truncate_error_page() to use folio · ac5efa78
      Vishal Moola (Oracle) authored
      Replace try_to_release_page() with filemap_release_folio().  This change
      is in preparation for the removal of the try_to_release_page() wrapper.
      
      Link: https://lkml.kernel.org/r/20221118073055.55694-4-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • khugepage: replace try_to_release_page() with filemap_release_folio() · 64ab3195
      Vishal Moola (Oracle) authored
      Replace some calls with their folio equivalents.  This change removes 4
      calls to compound_head() and is in preparation for the removal of the
      try_to_release_page() wrapper.
      
      Link: https://lkml.kernel.org/r/20221118073055.55694-3-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ext4: convert move_extent_per_page() to use folios · 6dd8fe86
      Vishal Moola (Oracle) authored
      Patch series "Removing the try_to_release_page() wrapper", v3.
      
      This patchset replaces the remaining calls of try_to_release_page() with
      the folio equivalent: filemap_release_folio().  This allows us to remove
      the wrapper.
      
      
      This patch (of 4):
      
      Convert move_extent_per_page() to use folios.  This change removes 5 calls
      to compound_head() and is in preparation for the removal of the
      try_to_release_page() wrapper.
      
      Link: https://lkml.kernel.org/r/20221118073055.55694-1-vishal.moola@gmail.com
      Link: https://lkml.kernel.org/r/20221118073055.55694-2-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: simplify locking during free_unref_page_list · a4bafffb
      Mel Gorman authored
      While freeing a large list, the zone lock will be released and reacquired
      to avoid long hold times since commit c24ad77d ("mm/page_alloc.c:
      avoid excessive IRQ disabled times in free_unref_page_list()").  As
      suggested by Vlastimil Babka, the lock release/reacquire logic can be
      simplified by reusing the logic that acquires a different lock when
      changing zones.
      
      Link: https://lkml.kernel.org/r/20221122131229.5263-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: leave IRQs enabled for per-cpu page allocations · 57490774
      Mel Gorman authored
      The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
      allocating from the PCP must not re-enter the allocator from IRQ context. 
      In each instance where IRQ-reentrancy is possible, the lock is acquired
      using pcp_spin_trylock_irqsave() even though IRQs are disabled and
      re-entrancy is impossible.
      
      Demoting the lock to pcp_spin_lock avoids an IRQ disable/enable in the
      common case at the cost of some IRQ allocations taking a slower path.  If
      the PCP lists need to be refilled, the zone lock still needs to disable
      IRQs but that will only happen on PCP refill and drain.  If an IRQ is
      raised when a PCP allocation is in progress, the trylock will fail and
      fall back to using the buddy lists directly.  Note that this may not be a
      universal win if an interrupt-intensive workload also allocates heavily
      from interrupt context and contends heavily on the zone->lock as a result.
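
The trylock-with-fallback pattern described above can be sketched in userspace with a C11 atomic flag. This is only an analogy: pcp_lock, pcp_trylock() and alloc_page_sketch() are hypothetical names, and the kernel's pcp_spin_trylock() and buddy allocator are not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock protecting a per-CPU page list. */
static atomic_flag pcp_lock = ATOMIC_FLAG_INIT;

/* Which path served the allocation in this sketch. */
enum alloc_path { ALLOC_FROM_PCP, ALLOC_FROM_BUDDY };

/* Try to take the lock without waiting; true means we now own it. */
static bool pcp_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&pcp_lock, memory_order_acquire);
}

static void pcp_unlock(void)
{
    atomic_flag_clear_explicit(&pcp_lock, memory_order_release);
}

/*
 * Fast path: take the per-CPU list if the lock is free. If the lock is
 * already held (for example by the context that was interrupted), do not
 * spin; report that the slower shared path has to be used instead.
 */
static enum alloc_path alloc_page_sketch(void)
{
    if (!pcp_trylock())
        return ALLOC_FROM_BUDDY;

    /* ... a page would be taken from the per-CPU list here ... */
    pcp_unlock();
    return ALLOC_FROM_PCP;
}
```

Because the caller never waits on the lock, re-entry while the lock is held cannot deadlock; the cost is that such callers take the slower path.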
      
      [mgorman@techsingularity.net: migratetype might be wrong if a PCP was locked]
        Link: https://lkml.kernel.org/r/20221122131229.5263-2-mgorman@techsingularity.net
      [yuzhao@google.com: reported lockdep issue on IO completion from softirq]
      [hughd@google.com: fix list corruption, lock improvements, micro-optimisations]
      Link: https://lkml.kernel.org/r/20221118101714.19590-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_alloc: always remove pages from temporary list · c3e58a70
      Mel Gorman authored
      Patch series "Leave IRQs enabled for per-cpu page allocations", v3.
      
      
      This patch (of 2):
      
      free_unref_page_list() has neglected to remove pages properly from the
      list of pages to free since forever.  It works by coincidence because
      list_add happened to do the right thing adding the pages to just the PCP
      lists.  However, a later patch added pages to either the PCP list or the
      zone list but only properly deleted the page from the list in one path
      leading to list corruption and a subsequent failure.  As a preparation
      patch, always delete the pages from one list properly before adding to
      another.  On its own, this fixes nothing although it adds a fractional
      amount of overhead but is critical to the next patch.
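
The rule "delete from one list properly before adding to another" can be shown with a minimal circular doubly linked list. The names below are hypothetical and only mirror the shape of the kernel's list helpers; this is a sketch of the idea, not the patch itself.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list node; the head is a node too. */
struct list_node {
    struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static bool list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Insert node right after head. Does not touch node's old neighbours. */
static void list_insert_head(struct list_node *node, struct list_node *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Unlink node from whatever list it is on and leave it self-linked. */
static void list_remove(struct list_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node->prev = node;
}

/*
 * Safe move: unlink first, then insert. Skipping list_remove() would leave
 * the source list's neighbours still pointing at node, which is the kind of
 * corruption the commit message describes.
 */
static void list_move_sketch(struct list_node *node, struct list_node *dst)
{
    list_remove(node);
    list_insert_head(node, dst);
}
```

After list_move_sketch() the source list no longer references the node, so it can be walked or reinitialised without touching entries that now belong to the destination list.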
      
      Link: https://lkml.kernel.org/r/20221118101714.19590-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20221118101714.19590-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: use memfd for hugepage-mmap test · 91a99f1d
      Peter Xu authored
      This test was overlooked, with a hard-coded mntpoint path left in the test
      when the hugetlb mntpoint was removed in commit 0796c7b8.  Fix it up so
      the test can keep running.
      
      Link: https://lkml.kernel.org/r/Y3aojfUC2nSwbCzB@x1n
      Fixes: 0796c7b8 ("selftests/vm: drop mnt point for hugetlb in run_vmtests.sh")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: Joel Savitz <jsavitz@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: remove unused stats fields · 47939359
      Sergey Senozhatsky authored
      We don't show num_reads and num_writes since we removed corresponding
      sysfs nodes in 2017.  Block layer stats are exposed via
      /sys/block/zramX/stat file.
      
      However, we still increment those atomic vars and store them in zram
      stats.  Remove leftovers.
      
      Link: https://lkml.kernel.org/r/20221117141326.1105181-1-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: remove NULL checks on NODE_DATA() · 931b6a8b
      Yu Zhao authored
      NODE_DATA() is preallocated for all possible nodes after commit
      09f49dca ("mm: handle uninitialized numa nodes gracefully").  Checking
      its return value against NULL is now unnecessary.
      
      Link: https://lkml.kernel.org/r/20221116013808.3995280-2-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: disallow FOLL_FORCE|FOLL_WRITE on hugetlb mappings · f347454d
      David Hildenbrand authored
      hugetlb does not support fake write-faults (write faults without write
      permissions).  However, we are currently able to trigger a
      FAULT_FLAG_WRITE fault on a VMA without VM_WRITE.
      
      If we'd ever want to support FOLL_FORCE|FOLL_WRITE, we'd have to teach
      hugetlb to:
      
      (1) Leave the page mapped R/O after the fake write-fault, like
          maybe_mkwrite() does.
      (2) Allow writing to an exclusive anon page that's mapped R/O when
          FOLL_FORCE is set, like can_follow_write_pte(). E.g.,
          __follow_hugetlb_must_fault() needs adjustment.
      
      For now, it's not clear if that added complexity is really required. 
      History tells us that FOLL_FORCE is dangerous and that we had better limit its
      use to a bare minimum.
      
      --------------------------------------------------------------------------
        #include <stdio.h>
        #include <stdlib.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <errno.h>
        #include <stdint.h>
        #include <sys/mman.h>
        #include <linux/mman.h>
      
        int main(int argc, char **argv)
        {
                char *map;
                int mem_fd;
      
                map = mmap(NULL, 2 * 1024 * 1024u, PROT_READ,
                           MAP_PRIVATE|MAP_ANON|MAP_HUGETLB|MAP_HUGE_2MB, -1, 0);
                if (map == MAP_FAILED) {
                        fprintf(stderr, "mmap() failed: %d\n", errno);
                        return 1;
                }
      
                mem_fd = open("/proc/self/mem", O_RDWR);
                if (mem_fd < 0) {
                        fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno);
                        return 1;
                }
      
                if (pwrite(mem_fd, "0", 1, (uintptr_t) map) == 1) {
                        fprintf(stderr, "write() succeeded, which is unexpected\n");
                        return 1;
                }
      
                printf("write() failed as expected: %d\n", errno);
                return 0;
        }
      --------------------------------------------------------------------------
      
      Fortunately, we have a sanity check in hugetlb_wp() in place ever since
      commit 1d8d1464 ("mm/hugetlb: support write-faults in shared
      mappings"), that bails out instead of silently mapping a page writable in
      a !PROT_WRITE VMA.
      
      Consequently, the above reproducer triggers a warning, similar to the one
      reported by syzbot:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 3612 at mm/hugetlb.c:5313 hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313
      Modules linked in:
      CPU: 1 PID: 3612 Comm: syz-executor250 Not tainted 6.1.0-rc2-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313
      Code: ea 03 80 3c 02 00 0f 85 31 14 00 00 49 8b 5f 20 31 ff 48 89 dd 83 e5 02 48 89 ee e8 70 ab b7 ff 48 85 ed 75 5b e8 76 ae b7 ff <0f> 0b 41 bd 40 00 00 00 e8 69 ae b7 ff 48 b8 00 00 00 00 00 fc ff
      RSP: 0018:ffffc90003caf620 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000008640070 RCX: 0000000000000000
      RDX: ffff88807b963a80 RSI: ffffffff81c4ed2a RDI: 0000000000000007
      RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000
      R10: 0000000000000000 R11: 000000000008c07e R12: ffff888023805800
      R13: 0000000000000000 R14: ffffffff91217f38 R15: ffff88801d4b0360
      FS:  0000555555bba300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fff7a47a1b8 CR3: 000000002378d000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       hugetlb_no_page mm/hugetlb.c:5755 [inline]
       hugetlb_fault+0x19cc/0x2060 mm/hugetlb.c:5874
       follow_hugetlb_page+0x3f3/0x1850 mm/hugetlb.c:6301
       __get_user_pages+0x2cb/0xf10 mm/gup.c:1202
       __get_user_pages_locked mm/gup.c:1434 [inline]
       __get_user_pages_remote+0x18f/0x830 mm/gup.c:2187
       get_user_pages_remote+0x84/0xc0 mm/gup.c:2260
       __access_remote_vm+0x287/0x6b0 mm/memory.c:5517
       ptrace_access_vm+0x181/0x1d0 kernel/ptrace.c:61
       generic_ptrace_pokedata kernel/ptrace.c:1323 [inline]
       ptrace_request+0xb46/0x10c0 kernel/ptrace.c:1046
       arch_ptrace+0x36/0x510 arch/x86/kernel/ptrace.c:828
       __do_sys_ptrace kernel/ptrace.c:1296 [inline]
       __se_sys_ptrace kernel/ptrace.c:1269 [inline]
       __x64_sys_ptrace+0x178/0x2a0 kernel/ptrace.c:1269
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [...]
      
      So let's silence that warning by teaching GUP code that FOLL_FORCE -- so
      far -- does not apply to hugetlb.
      
      Note that FOLL_FORCE for read-access seems to be working as expected.  The
      assumption is that this has been broken forever; only since the above
      commit do we actually detect the wrong handling and WARN_ON_ONCE().
      
      I assume this has been broken at least since 2014, when mm/gup.c came to
      life.  I failed to come up with a suitable Fixes tag quickly.
      
      Link: https://lkml.kernel.org/r/20221031152524.173644-1-david@redhat.com
      Fixes: 1d8d1464 ("mm/hugetlb: support write-faults in shared mappings")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: <syzbot+f0b97304ef90f0d0b1dc@syzkaller.appspotmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • habanalabs: remove FOLL_FORCE usage · 052d9b0f
      David Hildenbrand authored
      FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages
      using unpin_user_pages_dirty_lock(true), the assumption is that all these
      pages are writable.
      
      FOLL_FORCE in this case seems to be due to copy-and-paste from other
      drivers. Let's just remove it.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-20-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Oded Gabbay <ogabbay@kernel.org>
      Cc: Oded Gabbay <ogabbay@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage · 20ea7783
      David Hildenbrand authored
      FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages
      using unpin_user_pages_dirty_lock(true), the assumption is that all these
      pages are writable.
      
      FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove
      it.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-19-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • drm/exynos: remove FOLL_FORCE usage · c098ce73
      David Hildenbrand authored
      FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages
      using unpin_user_pages_dirty_lock(true), the assumption is that all these
      pages are writable.
      
      FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove
      it.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-18-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/frame-vector: remove FOLL_FORCE usage · cb78a634
      David Hildenbrand authored
      FOLL_FORCE is really only for ptrace access. According to commit
      70794724 ("media: videobuf2-vmalloc: get_userptr: buffers are always
      writable"), get_vaddr_frames() currently pins all pages writable as a
      workaround for issues with read-only buffers.
      
      FOLL_FORCE, however, seems to be a legacy leftover as it predates
      commit 70794724 ("media: videobuf2-vmalloc: get_userptr: buffers are
      always writable"). Let's just remove it.
      
      Once the read-only buffer issue has been resolved, FOLL_WRITE could
      again be set depending on the DMA direction.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-17-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Acked-by: Tomasz Figa <tfiga@chromium.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • media: pci/ivtv: remove FOLL_FORCE usage · 70b96f24
      David Hildenbrand authored
      FOLL_FORCE is really only for ptrace access. R/O pinning a page is
      supposed to fail if the VMA misses proper access permissions (no VM_READ).
      
      Let's just remove FOLL_FORCE usage here; there would have to be a pretty
      good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE
      VMA. Most probably, FOLL_FORCE usage is just some legacy leftover.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-16-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • drm/etnaviv: remove FOLL_FORCE usage · 7d96eb6a
      David Hildenbrand authored
      GUP now supports reliable R/O long-term pinning in COW mappings, such
      that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
      far in one corner case (DAXFS file with holes), which can be ignored
      because GUP does not support long-term pinning in fsdax (see
      check_vma_flags()).
      
      commit cd5297b0 ("drm/etnaviv: Use FOLL_FORCE for userptr")
      documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable
      R/O pinning.
      
      Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
      for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
      using FOLL_FORCE, which is really only for ptrace access.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-15-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: David Airlie <airlied@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • media: videobuf-dma-sg: remove FOLL_FORCE usage · 3298de2c
      David Hildenbrand authored
      GUP now supports reliable R/O long-term pinning in COW mappings, such
      that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
      far in one corner case (DAXFS file with holes), which can be ignored
      because GUP does not support long-term pinning in fsdax (see
      check_vma_flags()).
      
      Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
      for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
      using FOLL_FORCE, which is really only for ptrace access.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-14-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • RDMA/siw: remove FOLL_FORCE usage · 129e636f
      David Hildenbrand authored
      GUP now supports reliable R/O long-term pinning in COW mappings, such
      that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
      far in one corner case (DAXFS file with holes), which can be ignored
      because GUP does not support long-term pinning in fsdax (see
      check_vma_flags()).
      
      Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
      for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
      using FOLL_FORCE, which is really only for ptrace access.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-13-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Bernard Metzler <bmt@zurich.ibm.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • RDMA/usnic: remove FOLL_FORCE usage · a9d02840
      David Hildenbrand authored
      GUP now supports reliable R/O long-term pinning in COW mappings, such
      that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
      far in one corner case (DAXFS file with holes), which can be ignored
      because GUP does not support long-term pinning in fsdax (see
      check_vma_flags()).
      
      Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
      for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
      using FOLL_FORCE, which is really only for ptrace access.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Christian Benvenuti <benve@cisco.com>
      Cc: Nelson Escobar <neescoba@cisco.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • RDMA/umem: remove FOLL_FORCE usage · b40656aa
      David Hildenbrand authored
      GUP now supports reliable R/O long-term pinning in COW mappings, such
      that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
      far in one corner case (DAXFS file with holes), which can be ignored
      because GUP does not support long-term pinning in fsdax (see
      check_vma_flags()).
      
      Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
      for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
      using FOLL_FORCE, which is really only for ptrace access.
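
The flag change these driver patches describe can be sketched as two small helpers. The SKETCH_* constants are placeholder values chosen for illustration, not the kernel's definitions, and gup_flags_before() is an assumed shape of the older pattern rather than code taken from any particular driver.

```c
#include <stdbool.h>

/* Placeholder flag bits for illustration only; not the kernel's values. */
#define SKETCH_FOLL_WRITE    0x01u
#define SKETCH_FOLL_FORCE    0x10u
#define SKETCH_FOLL_LONGTERM 0x10000u

/*
 * Assumed older pattern: always request write access, and add FORCE when the
 * buffer is meant to be read-only, so that COW is broken before pinning.
 */
static unsigned int gup_flags_before(bool writable)
{
    unsigned int flags = SKETCH_FOLL_WRITE | SKETCH_FOLL_LONGTERM;

    if (!writable)
        flags |= SKETCH_FOLL_FORCE;
    return flags;
}

/*
 * Pattern after the series: a long-term pin is sufficient for a reliable
 * read-only pin, and write access is requested only when actually needed.
 */
static unsigned int gup_flags_after(bool writable)
{
    unsigned int flags = SKETCH_FOLL_LONGTERM;

    if (writable)
        flags |= SKETCH_FOLL_WRITE;
    return flags;
}
```

In the "after" form the FORCE bit never appears, which matches the stated goal of reserving FOLL_FORCE for ptrace-style access.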
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-11-david@redhat.com
      Tested-by: Leon Romanovsky <leonro@nvidia.com>	[over mlx4 and mlx5]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: reliable R/O long-term pinning in COW mappings · 84209e87
      David Hildenbrand authored
      We already support reliable R/O pinning of anonymous memory. However,
      assume we end up pinning (R/O long-term) a pagecache page or the shared
      zeropage inside a writable private ("COW") mapping. The next write access
      will trigger a write-fault and replace the pinned page by an exclusive
      anonymous page in the process page tables to break COW: the pinned page no
      longer corresponds to the page mapped into the process' page table.
      
      Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
      COW mapping, let's properly break COW first before R/O long-term
      pinning something that's not an exclusive anon page inside a COW
      mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page
      instead that can get pinned safely.
      
      With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable
      R/O long-term pinning in COW mappings.
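
The rule described above can be sketched as a small userspace model. This is only an illustration of the decision, not the kernel's GUP code: the flag bits, the `must_break_cow_first()` helper and its parameters are hypothetical names invented for this sketch, and the handling of short-term pins and of existing anonymous-memory logic is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this model; these are NOT the kernel's values. */
#define PIN_WRITE    0x1u
#define PIN_LONGTERM 0x2u

/*
 * Hypothetical predicate: must COW be broken (via an unshare fault) before
 * taking this pin? Only the R/O long-term case from the text is modeled.
 */
static bool must_break_cow_first(unsigned int pin_flags, bool cow_mapping,
				 bool page_is_exclusive_anon)
{
	if (pin_flags & PIN_WRITE)
		return false; /* a write pin breaks COW through a write fault */
	if (!(pin_flags & PIN_LONGTERM))
		return false; /* short-term pins are out of scope (see Note 1) */
	if (!cow_mapping)
		return false; /* shared mapping: there is no COW to break */
	/* Pagecache pages and the shared zeropage are not exclusive anon. */
	return !page_is_exclusive_anon;
}
```

Under this model, a R/O long-term pin of the shared zeropage in a private mapping first gets an exclusive anonymous page populated, while the same pin in a shared mapping proceeds as before.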
      
      With this change, the new R/O long-term pinning tests for non-anonymous
      memory succeed:
        # [RUN] R/O longterm GUP pin ... with shared zeropage
        ok 151 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd
        ok 152 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with tmpfile
        ok 153 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with huge zeropage
        ok 154 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
        ok 155 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
        ok 156 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
        ok 157 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd
        ok 158 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with tmpfile
        ok 159 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
        ok 160 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
        ok 161 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
        ok 162 Longterm R/O pin is reliable
      
      Note 1: We don't care about short-term R/O pins, because they have
      snapshot semantics: they are not supposed to observe modifications that
      happen after pinning.
      
      As one example, assume we start direct I/O to read from a page and store
      page content into a file: modifications to page content after starting
      direct I/O are not guaranteed to end up in the file. So even if we'd pin
      the shared zeropage, the end result would be as expected -- getting zeroes
      stored to the file.
      
      Note 2: For shared mappings we'll now always fall back to the slow path to
      look up the VMA when R/O long-term pinning. While that's the necessary price
      we have to pay right now, it's actually not that bad in practice: most
      FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
      FOLL_FORCE because they tried dealing with COW mappings correctly ...
      
      Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
      such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
      populate exclusive anon pages that we can pin. There was a concern that
      this could affect the memlock limit of existing setups.
      
      For example, a VM running with VFIO could run into the memlock limit and
      fail to run. However, we essentially had the same behavior already in
      commit 17839856 ("gup: document and work around "COW can break either
      way" issue") which got merged into some enterprise distros, and there were
      not any such complaints. So most probably, we're fine.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      84209e87
    • David Hildenbrand's avatar
      mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping · 8d6a0ac0
      David Hildenbrand authored
      Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
      COW (i.e., private writable) mapping and adjust the documentation
      accordingly.
      
      FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
      zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
      properly replacing the mapped page/pfn by a private copy (an exclusive
      anonymous page).
      
      Note that only do_wp_page() needs care: hugetlb_wp() already handles
      FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handle it
      correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE
      such that we can handle FAULT_FLAG_UNSHARE on the PTE level.
      
      This change is a requirement for reliable long-term R/O pinning in
      COW mappings.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8d6a0ac0
    • David Hildenbrand's avatar
      mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings · aea06577
      David Hildenbrand authored
      If we already have a PMD/PUD mapped write-protected in a private mapping
      and we want to break COW either due to FAULT_FLAG_WRITE or
      FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on
      the PTE path.
      
      Let's just split (->zap) + fallback in that case.
      
      This is a preparation for more generic FAULT_FLAG_UNSHARE support in
      COW mappings.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aea06577
    • David Hildenbrand's avatar
      mm: rework handling in do_wp_page() based on private vs. shared mappings · b9086fde
      David Hildenbrand authored
      We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
      COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
      Let's prepare for that by handling shared mappings first such that we can
      handle private mappings last.
      
      While at it, use folio-based functions instead of page-based functions
      where we touch the code either way.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b9086fde
    • David Hildenbrand's avatar
      mm: add early FAULT_FLAG_WRITE consistency checks · 79881fed
      David Hildenbrand authored
      Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
      care in all other handlers and don't risk "surprises" if we forget to do
      so.
      
      Write faults without VM_MAYWRITE don't make any sense, and our
      maybe_mkwrite() logic could have hidden such abuse for now.
      
      Write faults without VM_WRITE on something that is not a COW mapping are
      similarly broken: e.g., do_wp_page() could end up placing an
      anonymous page into a shared mapping, which would be bad.
      
      This is a preparation for reliable R/O long-term pinning of pages in
      private mappings, whereby we want to make sure that we will never break
      COW in a read-only private mapping.
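
The two checks described above can be modeled in userspace as follows. This is a sketch, not the kernel's fault-handling code: `write_fault_is_sane()` is a hypothetical helper name, the `VM_*` constants are copied here only so the example compiles on its own, and `is_cow_mapping()` is re-implemented locally as "private mapping that may be written".

```c
#include <assert.h>
#include <stdbool.h>

/* VMA flag bits, copied locally so the model is self-contained. */
#define VM_WRITE    0x02UL
#define VM_SHARED   0x08UL
#define VM_MAYWRITE 0x20UL

/* A COW mapping is private (not shared) and may be made writable. */
static bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical helper: is a write fault on a VMA with these flags valid? */
static bool write_fault_is_sane(unsigned long vm_flags)
{
	if (!(vm_flags & VM_MAYWRITE))
		return false; /* the mapping can never be written */
	if (!(vm_flags & VM_WRITE) && !is_cow_mapping(vm_flags))
		return false; /* would place an anon page in a shared mapping */
	return true;
}
```

A write fault on a currently read-only private mapping stays valid in this model, which is the case ptrace-style forced writes rely on; the same fault on a read-only shared mapping is rejected.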
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      79881fed
    • David Hildenbrand's avatar
      mm: add early FAULT_FLAG_UNSHARE consistency checks · cdc5021c
      David Hildenbrand authored
      For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
      implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
      dealing with a COW mapping, such that we treat it like a read fault as
      documented and don't have to worry about the flag throughout all fault
      handlers.
      
      While at it, centralize the check for mutual exclusion of
      FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
      either flag is set in the WP handler.
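
A minimal userspace sketch of the early check described above, assuming a hypothetical helper `sanitize_unshare()`. The fault-flag bit values are illustrative placeholders rather than values taken from the kernel headers, and `is_cow_mapping()` is re-implemented locally.

```c
#include <assert.h>
#include <stdbool.h>

/* VMA flag bits, copied locally so the model is self-contained. */
#define VM_SHARED   0x08UL
#define VM_MAYWRITE 0x20UL

/* Illustrative fault-flag bits for this model only. */
#define FAULT_FLAG_WRITE   (1u << 0)
#define FAULT_FLAG_UNSHARE (1u << 1)

/* A COW mapping is private (not shared) and may be made writable. */
static bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Hypothetical helper: returns false for an invalid flag combination;
 * otherwise hides FAULT_FLAG_UNSHARE when the VMA is not a COW mapping.
 */
static bool sanitize_unshare(unsigned long vm_flags, unsigned int *flags)
{
	if (*flags & FAULT_FLAG_UNSHARE) {
		if (*flags & FAULT_FLAG_WRITE)
			return false; /* the two flags are mutually exclusive */
		if (!is_cow_mapping(vm_flags))
			*flags &= ~FAULT_FLAG_UNSHARE; /* treat as a read fault */
	}
	return true;
}
```

Doing this once at the entry point means the individual handlers in the model never see an unshare request outside a COW mapping.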
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cdc5021c
    • David Hildenbrand's avatar
      selftests/vm: cow: R/O long-term pinning reliability tests for non-anon pages · 97713a3a
      David Hildenbrand authored
      Let's test whether R/O long-term pinning is reliable for non-anonymous
      memory: when R/O long-term pinning a page, the expectation is that we
      break COW early before pinning, such that actual write access via the
      page tables won't break COW later and end up replacing the R/O-pinned
      page in the page table.
      
      Consequently, R/O long-term pinning in private mappings would only target
      exclusive anonymous pages.
      
      For now, all tests fail:
      	# [RUN] R/O longterm GUP pin ... with shared zeropage
      	not ok 151 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP pin ... with memfd
      	not ok 152 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP pin ... with tmpfile
      	not ok 153 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP pin ... with huge zeropage
      	not ok 154 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
      	not ok 155 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
      	not ok 156 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
      	not ok 157 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with memfd
      	not ok 158 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with tmpfile
      	not ok 159 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
      	not ok 160 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
      	not ok 161 Longterm R/O pin is reliable
      	# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
      	not ok 162 Longterm R/O pin is reliable
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      97713a3a
    • David Hildenbrand's avatar
      selftests/vm: cow: basic COW tests for non-anonymous pages · f8664f3c
      David Hildenbrand authored
      Let's add basic tests for COW with non-anonymous pages in private
      mappings: write access should properly trigger COW and result in the
      private changes not being visible through other page mappings.
      
      Especially, add tests for:
      * Zeropage
      * Huge zeropage
      * Ordinary pagecache pages via memfd and tmpfile()
      * Hugetlb pages via memfd
      
      Fortunately, all tests pass.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8664f3c
    • David Hildenbrand's avatar
      selftests/vm: anon_cow: prepare for non-anonymous COW tests · 7aca5ca1
      David Hildenbrand authored
      Patch series "mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O
      long-term pinning)".
      
      So far, we did not support reliable R/O long-term pinning in COW
      mappings.  That means, if we would trigger R/O long-term pinning in a
      MAP_PRIVATE mapping, we could end up pinning the (R/O-mapped) shared
      zeropage or a pagecache page.
      
      The next write access would trigger a write fault and replace the pinned
      page by an exclusive anonymous page in the process page table; whatever
      the process would write to that private page copy would not be visible by
      the owner of the previous page pin: for example, RDMA could read stale
      data.  The end result is essentially an unexpected and hard-to-debug
      memory corruption.
      
      Some drivers tried working around that limitation by using
      "FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning for now. 
      FOLL_WRITE would trigger a write fault, if required, and break COW before
      pinning the page.  FOLL_FORCE is required because the VMA might lack write
      permissions, and drivers wanted to make that work as well, just like
      one would expect (no write access, but still triggering a write access to
      break COW).
      
      However, that is not a practical solution, because
      (1) Drivers that don't stick to that undocumented and debatable pattern
          would still run into that issue. For example, VFIO only uses
          FOLL_LONGTERM for R/O long-term pinning.
      (2) Using FOLL_WRITE just to work around a COW mapping + page pinning
          limitation is unintuitive. FOLL_WRITE would, for example, mark the
          page softdirty or trigger uffd-wp, even though there actually isn't
          going to be any write access.
      (3) The purpose of FOLL_FORCE is debug access, not access without
          VMA permissions by arbitrary drivers.
      
      So instead, make R/O long-term pinning work as expected, by breaking COW
      in a COW mapping early, such that we can remove any FOLL_FORCE usage from
      drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE).
      More details in patch #8.
      
      
      This patch (of 19):
      
      Originally, the plan was to have separate tests for testing COW of
      non-anonymous (e.g., shared zeropage) pages.
      
      It turns out that we'd need a lot of similar functionality and that there
      isn't really a good reason to separate it. So let's prepare for non-anon
      tests by renaming to "cow".
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20221116102659.70287-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bernard Metzler <bmt@zurich.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Benvenuti <benve@cisco.com>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Leon Romanovsky <leonro@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Nelson Escobar <neescoba@cisco.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oded Gabbay <ogabbay@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomasz Figa <tfiga@chromium.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7aca5ca1
    • Lukas Bulwahn's avatar
      mm: Kconfig: make config SECRETMEM visible with EXPERT · 74947724
      Lukas Bulwahn authored
      Commit 6a108a14 ("kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT")
      introduces CONFIG_EXPERT to carry the previous intent of CONFIG_EMBEDDED
      and just gives that intent a much better name.  That renaming was clearly
      good and long overdue, and it has helped in managing the kernel build
      configuration over the last decade.
      
      However, rather than bravely and radically just deleting CONFIG_EMBEDDED,
      this commit gives CONFIG_EMBEDDED a new intended semantics, but keeps it
      open for future contributors to implement that intended semantics:
      
          A new CONFIG_EMBEDDED option is added that automatically selects
          CONFIG_EXPERT when enabled and can be used in the future to isolate
          options that should only be considered for embedded systems (RISC
          architectures, SLOB, etc).
      
      Since then, this CONFIG_EMBEDDED implicitly had two purposes:
      
        - It can make even more options visible beyond what CONFIG_EXPERT makes
          visible. In other words, it may introduce another level of enabling the
          visibility of configuration options: always visible, visible with
          CONFIG_EXPERT and visible with CONFIG_EMBEDDED.
      
        - Set certain default values of some configurations differently,
          following the assumption that configuring a kernel build for an
          embedded system generally starts with a different set of default values
          compared to kernel builds for all other kinds of systems.
      
      Considering the second purpose, note that arguing that a kernel build
      for an embedded system would choose some values differently is already
      tricky: the set of embedded systems with Linux kernels is
      quite diverse.  Many embedded systems have powerful CPUs and it
      would not be clear that all embedded systems just optimize towards one
      specific aspect, e.g., a smaller kernel image size.  So, it is unclear if
      starting with "one set of default configuration" that is induced by
      CONFIG_EMBEDDED is a good offer for developers configuring their kernels.
      
      Also, the differences of needed user-space features in an embedded system
      compared to a non-embedded system are probably difficult or even
      impossible to name in some generic way.
      
      So it is not surprising that in the last decade hardly anyone has
      contributed changes to make something default differently in case of
      CONFIG_EMBEDDED=y.
      
      Currently, in v6.0-rc4, SECRETMEM is the only config switched off if
      CONFIG_EMBEDDED=y.
      
      As long as that is actually the only option that currently is selected or
      deselected, it is better to just make SECRETMEM configurable at build time
      by experts using menuconfig instead.
      
      Make SECRETMEM configurable when EXPERT is set and otherwise default to
      yes.  Further, SECRETMEM needs ARCH_HAS_SET_DIRECT_MAP.
      
      This allows us to remove CONFIG_EMBEDDED in the near future.
      
      Link: https://lkml.kernel.org/r/20221116131922.25533-1-lukas.bulwahn@gmail.com
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      74947724
    • Jason Gunthorpe's avatar
      mm/gup: remove the restriction on locked with FOLL_LONGTERM · 53b2d09b
      Jason Gunthorpe authored
      This restriction was created because FOLL_LONGTERM used to scan the vma
      list, so it could not tolerate becoming unlocked.  That was fixed in
      commit 52650c8b ("mm/gup: remove the vma allocation from
      gup_longterm_locked()") and the restriction on !vma was removed.
      
      However, the locked restriction remained, even though it isn't necessary
      anymore.
      
      Adjust __gup_longterm_locked() so it can handle the mmap_read_lock()
      becoming unlocked while it is looping for migration.  Migration does not
      require the mmap_read_sem because it is only handling struct pages.  If we
      had to unlock then ensure the whole thing returns unlocked.
      
      Remove __get_user_pages_remote() and __gup_longterm_unlocked().  These
      cases can now just directly call other functions.
      
      Link: https://lkml.kernel.org/r/0-v1-b9ae39aa8884+14dbb-gup_longterm_locked_jgg@nvidia.com
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53b2d09b
    • Rong Tao's avatar
      selftests/damon: fix unnecessary compilation warnings · eff6aa17
      Rong Tao authored
      When testing overflow and overread, there is no need to keep unnecessary
      compilation warnings; we should simply ignore them.
      
      The motivation for this patch is to eliminate the compilation warning.
      Maybe one day we will compile the kernel with "-Werror -Wall", at which
      point this compilation warning will turn into a compilation error, so we
      should fix it in advance.
      
      How to reproduce the problem (with gcc-11.3.1):
      
          $ make -C tools/testing/selftests/
          ...
          warning: `write' reading 4294967295 bytes from a region of size 1
          [-Wstringop-overread]
          warning: `read' writing 4294967295 bytes into a region of size 25
          overflows the destination [-Wstringop-overflow=]
      
      "-Wno-stringop-overread" is supported at least in gcc-11.1.0.
      
      Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d14c547abd484d3540b692bb8048c4a6efe92c8b
      Link: https://lkml.kernel.org/r/tencent_51C4ACA8CB3895C2D7F35178440283602107@qq.com
      Signed-off-by: Rong Tao <rongtao@cestc.cn>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eff6aa17
    • Li zeming's avatar
      hugetlbfs: inode: remove unnecessary (void*) conversions · dbaf7dc9
      Li zeming authored
      The ei pointer does not need a type cast.
      
      Link: https://lkml.kernel.org/r/20221107015659.3221-1-zeming@nfschina.com
      Signed-off-by: Li zeming <zeming@nfschina.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dbaf7dc9
    • Jan Kara's avatar
      mm: make drop_caches keep reclaiming on all nodes · e83b39d6
      Jan Kara authored
      Currently, drop_caches reclaims node-by-node, looping on each node
      until reclaim can no longer make progress.  This can however leave quite
      some slab entries (such as filesystem inodes) unreclaimed if objects on,
      say, node 1 keep objects on node 0 pinned.  So move the "loop until no
      progress" loop around the node-by-node iteration to also retry reclaim on
      other nodes if reclaim on some nodes made progress.  This fixes a problem
      where drop_caches was not reclaiming lots of inodes that were otherwise
      perfectly fine to reclaim.
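
The effect of reordering the loops can be seen in a toy userspace model. Everything here is hypothetical and exists only for illustration: `struct node_model`, `reclaim_node()`, and the two driver functions do not correspond to kernel symbols, and the model simply stops when a full sweep frees nothing, which may be simpler than the kernel's actual termination condition.

```c
#include <assert.h>

#define NR_NODES 2

struct node_model {
	int objects;   /* reclaimable objects left on this node */
	int pinned_by; /* node whose objects pin ours, or -1 for none */
};

/* One reclaim pass on a node; returns the number of objects freed. */
static int reclaim_node(struct node_model *nodes, int nid)
{
	int pin = nodes[nid].pinned_by;
	int freed;

	if (pin >= 0 && nodes[pin].objects > 0)
		return 0; /* still pinned by objects on another node */
	freed = nodes[nid].objects;
	nodes[nid].objects = 0;
	return freed;
}

/* Sum of the objects still present on all nodes. */
static int objects_left(const struct node_model *nodes)
{
	int nid, left = 0;

	for (nid = 0; nid < NR_NODES; nid++)
		left += nodes[nid].objects;
	return left;
}

/* Old order: loop on each node until it makes no progress, then move on. */
static int drop_per_node(struct node_model *nodes)
{
	int nid;

	for (nid = 0; nid < NR_NODES; nid++)
		while (reclaim_node(nodes, nid) > 0)
			;
	return objects_left(nodes);
}

/* New order: keep sweeping all nodes while any node makes progress. */
static int drop_all_nodes(struct node_model *nodes)
{
	int nid, freed;

	do {
		freed = 0;
		for (nid = 0; nid < NR_NODES; nid++)
			freed += reclaim_node(nodes, nid);
	} while (freed > 0);
	return objects_left(nodes);
}
```

With node 0 holding objects pinned by node 1, the old order gives up on node 0 before node 1 has been reclaimed, whereas the new order revisits node 0 on the next sweep.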
      
      Link: https://lkml.kernel.org/r/20221115123255.12559-1-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: You Zhou <you.zhou@intel.com>
      Reported-by: Pengfei Xu <pengfei.xu@intel.com>
      Tested-by: Pengfei Xu <pengfei.xu@intel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e83b39d6
    • Pasha Tatashin's avatar
      mm: anonymous shared memory naming · d09e8ca6
      Pasha Tatashin authored
      Since commit 9a10064f ("mm: add a field to store names for private
      anonymous memory"), a name can be set for private anonymous memory, but
      not for shared anonymous memory.  However, naming shared anonymous memory
      is just as useful for tracking purposes.
      
      Extend the functionality to be able to set names for shared anon.
      
      There are two ways to create anonymous shared memory, using memfd or
      directly via mmap():
      1. fd = memfd_create(...)
         mem = mmap(..., MAP_SHARED, fd, ...)
      2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
      
      In both cases the anonymous shared memory is created the same way by
      mapping an unlinked file on tmpfs.
      
      The memfd way allows giving a name to anonymous shared memory, but it is
      not useful when parts of the shared memory need to have distinct names.
      
      Example use case: The VMM maps VM memory as anonymous shared memory (not
      private because VMM is sandboxed and drivers are running in their own
      processes).  However, the VM reports back to the VMM how parts of the memory
      are actually used by the guest, how each of the segments should be backed
      (i.e.  4K pages, 2M pages), and some other information about the segments.
      The naming allows us to monitor the effective memory footprint for each
      of these segments from the host without looking inside the guest.
      
      Sample output:
        /* Create shared anonymous segment */
        anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        /* Name the segment: "MY-NAME" */
        rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                   anon_shmem, SIZE, "MY-NAME");
      
      cat /proc/<pid>/maps (and smaps):
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
      
      If the segment is not named, the output is:
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
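
The snippet above can be expanded into a set of small userspace helpers, sketched below. The helper names (`map_shared_anon()`, `name_segment()`, `maps_contains()`) and `SEG_SIZE` are assumptions made for this example. The fallback `#define`s are only for userspace headers that lack the `PR_SET_VMA` constants, and the `prctl()` call is expected to fail on kernels that lack this feature, so callers should treat a failure as "naming unsupported" rather than as a fatal error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

#define SEG_SIZE (1UL << 20) /* arbitrary 1 MiB segment for the demo */

/* Map a shared anonymous segment; returns MAP_FAILED on error. */
static void *map_shared_anon(size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}

/* Try to name the segment; returns 0 on success, -1 on failure. */
static int name_segment(void *addr, size_t size, const char *name)
{
	return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		     (unsigned long)addr, size, (unsigned long)name);
}

/* Return 1 if any line of /proc/self/maps contains needle, else 0. */
static int maps_contains(const char *needle)
{
	char line[512];
	int found = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, needle))
			found = 1;
	fclose(f);
	return found;
}
```

A caller would map the segment, call `name_segment(mem, SEG_SIZE, "MY-NAME")`, and, if that returns 0, look for the name in `/proc/self/maps` as in the sample output above.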
      
      Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: xu xin <cgel.zte@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d09e8ca6