1. 23 Oct, 2020 1 commit
  2. 20 Oct, 2020 4 commits
  3. 19 Oct, 2020 3 commits
    • Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · bbe85027
      Linus Torvalds authored
      Pull more xfs updates from Darrick Wong:
       "The second large pile of new stuff for 5.10, with changes even more
        monumental than last week!
      
        We are formally announcing the deprecation of the V4 filesystem format
        in 2030. All users must upgrade to the V5 format, which contains
        design improvements that greatly strengthen metadata validation,
        supports reflink and online fsck, and is the intended vehicle for
        handling timestamps past 2038. We're also deprecating the old Irix
        behavioral tweaks in September 2025.
      
        Coming along for the ride are two design changes to the deferred
        metadata ops subsystem. One of the improvements is to retain correct
        logical ordering of tasks and subtasks, which is a more logical design
        for upper layers of XFS and will become necessary when we add atomic
        file range swaps and commits. The second improvement to deferred ops
        improves the scalability of the log by helping the log tail to move
        forward during long-running operations. This reduces log contention
        when there are a large number of threads trying to run transactions.
      
        In addition to that, this fixes numerous small bugs in log recovery;
        refactors logical intent log item recovery to remove the last
        remaining place in XFS where we could have nested transactions; fixes
        a couple of ways that intent log item recovery could fail in ways that
        wouldn't have happened in the regular commit paths; fixes a deadlock
        vector in the GETFSMAP implementation (which improves its performance
        by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
        bitmap handling code.
      
        Summary:
      
         - Deprecate the V4 filesystem format, some disused mount options, and
           some legacy sysctl knobs now that we can support dates into the
           25th century. Note that removal of V4 support will not happen until
           the early 2030s.
      
         - Fix some problems with inode realtime flag propagation.
      
         - Fix some buffer handling issues when growing a rt filesystem.
      
         - Fix a problem where a BMAP_REMAP unmap call would free rt extents
           even though the purpose of BMAP_REMAP is to avoid freeing the
           blocks.
      
         - Strengthen the dabtree online scrubber to check hash values on
           child dabtree blocks.
      
         - Actually log new intent items created as part of recovering log
           intent items.
      
         - Fix a bug where quotas weren't attached to an inode undergoing bmap
           intent item recovery.
      
         - Fix a buffer overrun problem with specially crafted log buffer
           headers.
      
         - Various cleanups to type usage and slightly inaccurate comments.
      
         - More cleanups to the xattr, log, and quota code.
      
         - Don't run the (slower) shared-rmap operations on attr fork
           mappings.
      
         - Fix a bug where we failed to check the LSN of finobt blocks during
           replay and could therefore overwrite newer data with older data.
      
         - Clean up the ugly nested transaction mess that log recovery uses to
           stage intent item recovery in the correct order by creating a
           proper data structure to capture recovered chains.
      
         - Use the capture structure to resume intent item chains with the
           same log space and block reservations as when they were captured.
      
         - Fix a UAF bug in bmap intent item recovery where we failed to
           maintain our reference to the incore inode if the bmap operation
           needed to relog itself to continue.
      
         - Rearrange the defer ops mechanism to finish newly created subtasks
           of a parent task before moving on to the next parent task.
      
         - Automatically relog intent items in deferred ops chains if doing so
           would help us avoid pinning the log tail. This will help fix some
           log scaling problems now and will facilitate atomic file updates
           later.
      
         - Fix a deadlock in the GETFSMAP implementation by using an internal
           memory buffer to reduce indirect calls and copies to userspace,
           thereby improving its performance by ~20%.
      
         - Fix various problems where calling growfs on a realtime volume would
           not fully update the filesystem metadata.
      
         - Fix broken Kconfig asking about deprecated XFS when XFS is
           disabled"
      
      * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
        xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
        xfs: fix high key handling in the rt allocator's query_range function
        xfs: annotate grabbing the realtime bitmap/summary locks in growfs
        xfs: make xfs_growfs_rt update secondary superblocks
        xfs: fix realtime bitmap/summary file truncation when growing rt volume
        xfs: fix the indent in xfs_trans_mod_dquot
        xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
        xfs: fix deadlock and streamline xfs_getfsmap performance
        xfs: limit entries returned when counting fsmap records
        xfs: only relog deferred intent items if free space in the log gets low
        xfs: expose the log push threshold
        xfs: periodically relog deferred intent items
        xfs: change the order in which child and parent defer ops are finished
        xfs: fix an incore inode UAF in xfs_bui_recover
        xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
        xfs: clean up bmap intent item recovery checking
        xfs: xfs_defer_capture should absorb remaining transaction reservation
        xfs: xfs_defer_capture should absorb remaining block reservations
        xfs: proper replay of deferred ops queued during log recovery
        xfs: remove XFS_LI_RECOVERED
        ...
      bbe85027
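The reordering of deferred ops described in the summary above (finish a parent's newly created subtasks before moving on to the next parent) can be pictured with a toy queue. This is a minimal userspace sketch, not XFS code: `demo_run_chain`, the integer task ids, and the rule that every parent spawns exactly two subtasks are all hypothetical, and the tail-append mode is included only for contrast, as an assumption about the earlier ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_ITEMS 16

/*
 * Finish every queued task and record the completion order in out[].
 * Tasks with id < 10 are "parents"; finishing parent p creates two
 * subtasks, p*10+1 and p*10+2. Subtasks create nothing further.
 * children_first != 0: new subtasks go to the head of the queue.
 * children_first == 0: new subtasks are appended to the tail.
 * out[] must have room for 3 * nr_parents entries.
 * Returns the number of tasks finished.
 */
static size_t demo_run_chain(const int *parents, size_t nr_parents,
                             int children_first, int *out)
{
    int queue[DEMO_MAX_ITEMS];
    size_t len, done = 0;

    assert(nr_parents <= DEMO_MAX_ITEMS);
    memcpy(queue, parents, nr_parents * sizeof(queue[0]));
    len = nr_parents;

    while (len > 0) {
        /* Pop the task at the head of the queue. */
        int id = queue[0];

        len--;
        memmove(queue, queue + 1, len * sizeof(queue[0]));
        out[done++] = id;

        if (id < 10) {
            const int kids[2] = { id * 10 + 1, id * 10 + 2 };
            size_t pos = children_first ? 0 : len;

            /* Open a two-slot gap at pos and drop the subtasks in. */
            assert(len + 2 <= DEMO_MAX_ITEMS);
            memmove(queue + pos + 2, queue + pos,
                    (len - pos) * sizeof(queue[0]));
            memcpy(queue + pos, kids, sizeof(kids));
            len += 2;
        }
    }
    return done;
}
```

With parents 1 and 2, the children-first mode finishes tasks in the order 1, 11, 12, 2, 21, 22, while the tail-append mode gives 1, 2, 11, 12, 21, 22.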
    • Merge tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 69456535
      Linus Torvalds authored
      Pull fuse updates from Miklos Szeredi:
      
       - Support directly accessing host page cache from virtiofs. This can
         improve I/O performance for various workloads, as well as reducing
         the memory requirement by eliminating double caching. Thanks to Vivek
         Goyal for doing most of the work on this.
      
       - Allow automatic submounting inside virtiofs. This allows unique
         st_dev/st_ino values to be assigned inside the guest to files
         residing on different filesystems on the host. Thanks to Max Reitz
         for the patches.
      
       - Fix an old use after free bug found by Pradeep P V K.
      
      * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
        virtiofs: calculate number of scatter-gather elements accurately
        fuse: connection remove fix
        fuse: implement crossmounts
        fuse: Allow fuse_fill_super_common() for submounts
        fuse: split fuse_mount off of fuse_conn
        fuse: drop fuse_conn parameter where possible
        fuse: store fuse_conn in fuse_req
        fuse: add submount support to <uapi/linux/fuse.h>
        fuse: fix page dereference after free
        virtiofs: add logic to free up a memory range
        virtiofs: maintain a list of busy elements
        virtiofs: serialize truncate/punch_hole and dax fault path
        virtiofs: define dax address space operations
        virtiofs: add DAX mmap support
        virtiofs: implement dax read/write operations
        virtiofs: introduce setupmapping/removemapping commands
        virtiofs: implement FUSE_INIT map_alignment field
        virtiofs: keep a list of free dax memory ranges
        virtiofs: add a mount option to enable dax
        virtiofs: set up virtio_fs dax_device
        ...
      69456535
    • Merge tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · 922a763a
      Linus Torvalds authored
      Pull zonefs updates from Damien Le Moal:
       "Add an 'explicit-open' mount option to automatically issue a
        REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
        is opened for writing for the first time.
      
        This avoids 'insufficient zone resources' errors for write operations
        on some drives with limited zone resources or on ZNS drives with a
        limited number of active zones. From Johannes"
      
      * tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
        zonefs: document the explicit-open mount option
        zonefs: open/close zone on file open/close
        zonefs: provide no-lock zonefs_io_error variant
        zonefs: introduce helper for zone management
      922a763a
  4. 18 Oct, 2020 32 commits
    • Merge tag 'linux-kselftest-kunit-5.10-rc1' of... · 7cf726a5
      Linus Torvalds authored
      Merge tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull more KUnit updates from Shuah Khan:

       - add KUnit to kernel_init() and remove KUnit from init calls entirely.

         This addresses the concern that KUnit would not work correctly during
         late init phase.
      
       - add a linker section where KUnit can put references to its test
         suites.
      
         This is the first step in transitioning to dispatching all KUnit
         tests from a centralized executor rather than having each as its own
         separate late_initcall.
      
       - add a centralized executor to dispatch tests rather than relying on
         late_initcall to schedule each test suite separately. Centralized
         execution is for built-in tests only; modules will execute tests when
         loaded.
      
       - convert bitfield test to use KUnit framework
      
       - Documentation updates for naming guidelines and how
         kunit_test_suite() works.
      
       - add test plan to KUnit TAP format
      
      * tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE
        lib: kunit: add bitfield test conversion to KUnit
        Documentation: kunit: add a brief blurb about kunit_test_suite
        kunit: test: add test plan to KUnit TAP format
        init: main: add KUnit to kernel init
        kunit: test: create a single centralized executor for all tests
        vmlinux.lds.h: add linker section for KUnit test suites
        Documentation: kunit: Add naming guidelines
      7cf726a5
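The centralized-executor idea in the KUnit pull above (suites are collected in one place and a single dispatcher runs them, instead of each suite scheduling itself) can be sketched in portable userspace C. This is an illustration under stated assumptions, not the KUnit API: `struct demo_suite`, `demo_run_all_suites`, and both sample suites are hypothetical, and a plain array stands in for the linker section that gathers the suite pointers in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a test suite: a name and a run function
 * that returns 0 on success and non-zero on failure. */
struct demo_suite {
    const char *name;
    int (*run)(void);
};

/* Sample suite: checks a simple shift-and-mask computation. */
static int demo_bitfield_suite(void)
{
    return ((0x5u << 4) & 0xf0u) == 0x50u ? 0 : 1;
}

/* Sample suite that always fails, to show failure reporting. */
static int demo_failing_suite(void)
{
    return 1;
}

static const struct demo_suite demo_suite_a = { "bitfield", demo_bitfield_suite };
static const struct demo_suite demo_suite_b = { "failing", demo_failing_suite };

/* A plain array plays the role of the section of suite pointers. */
static const struct demo_suite *const demo_suite_table[] = {
    &demo_suite_a,
    &demo_suite_b,
};

/* Single executor: print a TAP-style plan line, run every suite in
 * [start, end), and return how many suites failed. */
static int demo_run_all_suites(const struct demo_suite *const *start,
                               const struct demo_suite *const *end)
{
    int failed = 0;
    size_t n = 1;

    printf("1..%zu\n", (size_t)(end - start));
    for (const struct demo_suite *const *p = start; p < end; p++, n++) {
        int rc = (*p)->run();

        printf("%s %zu - %s\n", rc ? "not ok" : "ok", n, (*p)->name);
        failed += rc != 0;
    }
    return failed;
}
```

Calling `demo_run_all_suites(demo_suite_table, demo_suite_table + 2)` runs both suites from one place. Passing a start and end pointer mirrors how a dispatcher would walk a contiguous region of suite pointers.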
    • Merge tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 41eea65e
      Linus Torvalds authored
      Pull RCU changes from Ingo Molnar:
      
       - Debugging for smp_call_function()
      
       - RT raw/non-raw lock ordering fixes
      
       - Strict grace periods for KASAN
      
       - New smp_call_function() torture test
      
       - Torture-test updates
      
       - Documentation updates
      
       - Miscellaneous fixes
      
      [ This doesn't actually pull the tag - I've dropped the last merge from
        the RCU branch due to questions about the series.   - Linus ]
      
      * tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
        smp: Make symbol 'csd_bug_count' static
        kernel/smp: Provide CSD lock timeout diagnostics
        smp: Add source and destination CPUs to __call_single_data
        rcu: Shrink each possible cpu krcp
        rcu/segcblist: Prevent useless GP start if no CBs to accelerate
        torture: Add gdb support
        rcutorture: Allow pointer leaks to test diagnostic code
        rcutorture: Hoist OOM registry up one level
        refperf: Avoid null pointer dereference when buf fails to allocate
        rcutorture: Properly synchronize with OOM notifier
        rcutorture: Properly set rcu_fwds for OOM handling
        torture: Add kvm.sh --help and update help message
        rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05
        torture: Update initrd documentation
        rcutorture: Replace HTTP links with HTTPS ones
        locktorture: Make function torture_percpu_rwsem_init() static
        torture: document --allcpus argument added to the kvm.sh script
        rcutorture: Output number of elapsed grace periods
        rcutorture: Remove KCSAN stubs
        rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
        ...
      41eea65e
    • Merge tag 'mailbox-v5.10' of git://git.linaro.org/landing-teams/working/fujitsu/integration · 373014bb
      Linus Torvalds authored
      Pull mailbox updates from Jassi Brar:
      
       - arm: implementation of mhu as a doorbell driver and conversion of
         dt-bindings to json-schema
      
       - mediatek: fix platform_get_irq error handling
      
       - bcm: convert tasklets to use new tasklet_setup api
      
       - core: fix race caused by hrtimer starting inappropriately
      
      * tag 'mailbox-v5.10' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
        mailbox: avoid timer start from callback
        maiblox: mediatek: Fix handling of platform_get_irq() error
        mailbox: arm_mhu: Add ARM MHU doorbell driver
        mailbox: arm_mhu: Match only if compatible is "arm,mhu"
        dt-bindings: mailbox: add doorbell support to ARM MHU
        dt-bindings: mailbox : arm,mhu: Convert to Json-schema
        mailbox: bcm: convert tasklets to use new tasklet_setup() API
      373014bb
    • Merge branch 'for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux · f66179ca
      Linus Torvalds authored
      Pull coccinelle updates from Julia Lawall.
      
      * 'for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
        coccinelle: api: add kfree_mismatch script
        coccinelle: iterators: Add for_each_child.cocci script
        scripts: coccicheck: Change default condition for parallelism
        scripts: coccicheck: Add quotes to improve portability
        coccinelle: api: kfree_sensitive: print memset position
        coccinelle: misc: add flexible_array.cocci script
        coccinelle: api: add kvmalloc script
        scripts: coccicheck: Change default value for parallelism
        coccinelle: misc: add excluded_middle.cocci script
        scripts: coccicheck: Improve error feedback when coccicheck fails
        coccinelle: api: update kzfree script to kfree_sensitive
        coccinelle: misc: add uninitialized_var.cocci script
        coccinelle: ifnullfree: add vfree(), kvfree*() functions
        coccinelle: api: add kobj_to_dev.cocci script
        coccinelle: add patch rule for dma_alloc_coherent
        scripts: coccicheck: Add chain mode to list of modes
      f66179ca
    • Merge branch 'akpm' (patches from Andrew) · 1912b04e
      Linus Torvalds authored
      Merge yet more updates from Andrew Morton:
       "Subsystems affected by this patch series: mm (memcg, migration,
        pagemap, gup, madvise, vmalloc), ia64, and misc"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
        mm: remove duplicate include statement in mmu.c
        mm: remove the filename in the top of file comment in vmalloc.c
        mm: cleanup the gfp_mask handling in __vmalloc_area_node
        mm: remove alloc_vm_area
        x86/xen: open code alloc_vm_area in arch_gnttab_valloc
        xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
        drm/i915: use vmap in i915_gem_object_map
        drm/i915: stop using kmap in i915_gem_object_map
        drm/i915: use vmap in shmem_pin_map
        zsmalloc: switch from alloc_vm_area to get_vm_area
        mm: allow a NULL fn callback in apply_to_page_range
        mm: add a vmap_pfn function
        mm: add a VM_MAP_PUT_PAGES flag for vmap
        mm: update the documentation for vfree
        mm/madvise: introduce process_madvise() syscall: an external memory hinting API
        pid: move pidfd_get_pid() to pid.c
        mm/madvise: pass mm to do_madvise
        selftests/vm: 10x speedup for hmm-tests
        binfmt_elf: take the mmap lock around find_extend_vma()
        mm/gup_benchmark: take the mmap lock around GUP
        ...
      1912b04e
    • Merge tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · 9453b2d4
      Linus Torvalds authored
      Pull UML updates from Richard Weinberger:
      
       - Improve support for non-glibc systems
      
       - Vector: Add support for scripting and dynamic tap devices
      
       - Various fixes for the vector networking driver
      
       - Various fixes for time travel mode
      
      * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: vector: Add dynamic tap interfaces and scripting
        um: Clean up stacktrace dump
        um: Fix incorrect assumptions about max pid length
        um: Remove dead usage of TIF_IA32
        um: Remove redundant NULL check
        um: change sigio_spinlock to a mutex
        um: time-travel: Return the sequence number in ACK messages
        um: time-travel: Fix IRQ handling in time_travel_handle_message()
        um: Allow static linking for non-glibc implementations
        um: Some fixes to build UML with musl
        um: vector: Use GFP_ATOMIC under spin lock
        um: Fix null pointer dereference in vector_user_bpf
      9453b2d4
    • Merge tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs · 42973127
      Linus Torvalds authored
      Pull more ubi and ubifs updates from Richard Weinberger:
       "UBI:
         - Correctly use kthread_should_stop in ubi worker
      
        UBIFS:
         - Fixes for memory leaks while iterating directory entries
         - Fix for a user triggerable error message
         - Fix for a space accounting bug in authenticated mode"
      
      * tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
        ubifs: journal: Make sure to not dirty twice for auth nodes
        ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
        ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
        ubi: check kthread_should_stop() after the setting of task state
        ubifs: dent: Fix some potential memory leaks while iterating entries
        ubifs: xattr: Fix some potential memory leaks while iterating entries
      42973127
    • Merge tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs · a96fd1cc
      Linus Torvalds authored
      Pull ubifs updates from Richard Weinberger:
      
       - Kernel-doc fixes
      
       - Fixes for memory leaks in authentication option parsing
      
      * tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
        ubifs: mount_ubifs: Release authentication resource in error handling path
        ubifs: Don't parse authentication mount options in remount process
        ubifs: Fix a memleak after dumping authentication mount options
        ubifs: Fix some kernel-doc warnings in tnc.c
        ubifs: Fix some kernel-doc warnings in replay.c
        ubifs: Fix some kernel-doc warnings in gc.c
        ubifs: Fix 'hash' kernel-doc warning in auth.c
      a96fd1cc
    • mm: remove duplicate include statement in mmu.c · c922781f
      Tian Tao authored
      asm/sections.h is included more than once. Remove the one that isn't
      necessary.
      Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: https://lkml.kernel.org/r/1600088607-17327-1-git-send-email-tiantao6@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c922781f
    • mm: remove the filename in the top of file comment in vmalloc.c · b71df8de
      Christoph Hellwig authored
      No point in having the filename inside the file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b71df8de
    • mm: cleanup the gfp_mask handling in __vmalloc_area_node · f255935b
      Christoph Hellwig authored
      Patch series "two small vmalloc cleanups".
      
      This patch (of 2):
      
      __vmalloc_area_node currently has four different gfp_t variables to
      just express this simple logic:
      
       - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
         suitable) for the underlying page allocation
       - use just the reclaim flags from the passed in mask plus __GFP_ZERO
         for allocating the page array
      
      Simplify this down to just use the pre-existing nested_gfp as-is for
      the page array allocation, and just the passed in gfp_mask for the
      page allocation, after conditionally ORing __GFP_HIGHMEM into it.  This
      also makes the allocation warning a little more correct.
      
      Also initialize two variables at the time of declaration while touching
      this area.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f255935b
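The two masks described in the commit message above can be written out as plain bit arithmetic. This is a minimal userspace sketch, not the kernel source: the `DEMO_*` flag values are hypothetical stand-ins for the real `__GFP_*` bits, and reading "if suitable" as "no DMA zone was requested" is an assumption, labeled as such in the code.

```c
#include <assert.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical bit values for illustration only; the kernel's differ. */
#define DEMO_GFP_DMA          0x001u
#define DEMO_GFP_HIGHMEM      0x002u
#define DEMO_GFP_DMA32        0x004u
#define DEMO_GFP_IO           0x040u
#define DEMO_GFP_FS           0x080u
#define DEMO_GFP_ZERO         0x100u
#define DEMO_GFP_NOWARN       0x200u
/* Stand-in for the set of reclaim-related flags. */
#define DEMO_GFP_RECLAIM_MASK (DEMO_GFP_IO | DEMO_GFP_FS)

/* Mask for allocating the page array: only the reclaim flags from the
 * caller's mask, plus zeroing. */
static demo_gfp_t demo_page_array_gfp(demo_gfp_t gfp_mask)
{
    return (gfp_mask & DEMO_GFP_RECLAIM_MASK) | DEMO_GFP_ZERO;
}

/* Mask for allocating the pages themselves: the caller's mask plus
 * NOWARN, plus HIGHMEM when suitable. ASSUMPTION: "suitable" is taken
 * here to mean that the caller did not ask for a DMA zone. */
static demo_gfp_t demo_page_gfp(demo_gfp_t gfp_mask)
{
    gfp_mask |= DEMO_GFP_NOWARN;
    if (!(gfp_mask & (DEMO_GFP_DMA | DEMO_GFP_DMA32)))
        gfp_mask |= DEMO_GFP_HIGHMEM;
    return gfp_mask;
}
```

Keeping the two derivations in separate helpers makes it easy to see that the page array never inherits zone flags from the caller, while the page allocation keeps everything the caller passed in.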
    • mm: remove alloc_vm_area · 301fa9f2
      Christoph Hellwig authored
      All users are gone now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      301fa9f2
    • x86/xen: open code alloc_vm_area in arch_gnttab_valloc · 5dd63bf1
      Christoph Hellwig authored
      Replace the last call to alloc_vm_area with an open coded version using an
      iterator in struct gnttab_vm_area instead of the triple indirection magic
      in alloc_vm_area.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-11-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dd63bf1
    • xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv · b723caec
      Christoph Hellwig authored
      Replacing alloc_vm_area with get_vm_area_caller + apply_to_page_range
      allows the phys_addr values to be filled in directly instead of doing
      another loop over all addresses.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-10-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b723caec
    • drm/i915: use vmap in i915_gem_object_map · 534a6687
      Christoph Hellwig authored
      i915_gem_object_map implements fairly low-level vmap functionality in a
      driver.  Split it into two helpers, one for remapping kernel memory which
      can use vmap, and one for I/O memory that uses vmap_pfn.

      The only practical difference is that alloc_vm_area prefaults the vmalloc
      area PTEs, which doesn't seem to be required here for the kernel memory
      case (and could be added to vmap using a flag if actually required).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-9-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      534a6687
    • drm/i915: stop using kmap in i915_gem_object_map · 46ce3a62
      Christoph Hellwig authored
      kmap for !PageHighmem is just a convoluted way to say page_address, and
      kunmap is a no-op in that case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-8-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46ce3a62
    • drm/i915: use vmap in shmem_pin_map · bfed6708
      Christoph Hellwig authored
      shmem_pin_map somewhat awkwardly reimplements vmap using alloc_vm_area and
      manual pte setup.  The only practical difference is that alloc_vm_area
      prefaults the vmalloc area PTEs, which doesn't seem to be required here
      (and could be added to vmap using a flag if actually required).  Switch to
      use vmap, and use vfree to free both the vmalloc mapping and the page
      array, as well as dropping the references to each page.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-7-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfed6708
    • zsmalloc: switch from alloc_vm_area to get_vm_area · d1b6d2e1
      Christoph Hellwig authored
      Just manually pre-fault the PTEs using apply_to_page_range.
      Co-developed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1b6d2e1
    • Christoph Hellwig's avatar
      mm: allow a NULL fn callback in apply_to_page_range · eeb4a05f
      Christoph Hellwig authored
      Besides calling the callback on each page, apply_to_page_range also has
      the effect of pre-faulting all PTEs for the range.  To support callers
      that only need the pre-faulting, make the callback optional.
      
      Based on a patch from Minchan Kim <minchan@kernel.org>.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eeb4a05f
    • Christoph Hellwig's avatar
      mm: add a vmap_pfn function · 3e9a9e25
      Christoph Hellwig authored
      Add a proper helper to remap PFNs into kernel virtual space so that
      drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
      for it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e9a9e25
    • Christoph Hellwig's avatar
      mm: add a VM_MAP_PUT_PAGES flag for vmap · b944afc9
      Christoph Hellwig authored
      Add a flag so that vmap takes ownership of the passed in page array.  When
      vfree is called on such an allocation it will put one reference on each
      page, and free the page array itself.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b944afc9
    • Matthew Wilcox (Oracle)'s avatar
      mm: update the documentation for vfree · fa307474
      Matthew Wilcox (Oracle) authored
      Patch series "remove alloc_vm_area", v4.
      
      This series removes alloc_vm_area, which was left over from the big
      vmalloc interface rework.  It is a rather arcane interface, basically the
      equivalent of get_vm_area + actually faulting in all PTEs in the allocated
      area.  It was originally added for Xen (which isn't modular to start
      with), and then grew users in zsmalloc and i915 which seem to mostly
      qualify as abuses of the interface, especially for i915 as a random driver
      should not set up PTE bits directly.
      
      This patch (of 11):
      
       * Document that you can call vfree() on an address returned from vmap()
       * Remove the note about the minimum size -- the minimum size of a vmalloc
         allocation is one page
       * Add a Context: section
       * Fix capitalisation
       * Reword the prohibition on calling from NMI context to avoid a double
         negative
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
      Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa307474
    • Minchan Kim's avatar
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim authored
      There is a use case where System Management Software (SMS) wants to give
      a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of
      Android, that software is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses the pidfd of an external process to give the
      hint.  It also supports a vector of address ranges because an Android app
      has thousands of vmas due to zygote, so it is a total waste of CPU and
      power to call the syscall once for each vma.  (In a test comparing 2000
      single-vma syscalls against one vectored syscall, the vectored call
      showed a 15% performance improvement.  I think it would be bigger in real
      practice because the test ran in a very cache-friendly environment.)
      
      Another potential use case for the vector range is to amortize the cost
      of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      future, we could find more use cases for other advice values, so let's
      make the vector part of the API now, while we are introducing a new
      syscall.  With that, existing madvise(2) users could replace it with
      process_madvise(2) on their own pid if they want batched address range
      support.
      
      Since it could affect another process's address range, only a privileged
      process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
      to ptrace the target (e.g., by running as the same UID), can use it
      successfully.  The flag argument is reserved for future use if we need to
      extend the API.
      
      I think supporting every hint that madvise supports, or will support, in
      process_madvise is rather risky: we are not sure all hints make sense
      from an external process, and the implementation of a hint may rely on
      the caller being in the current context, so it could be error-prone.
      Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this patch.
      
      If someone wants to add other hints, we can hear the use case and review
      each hint individually.  That is safer for maintenance than introducing
      a buggy syscall that is hard to fix later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidfd_open(2) for further information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address (iov_base)
            and of length (iov_len) in bytes.
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            process_madvise supports every advice value that madvise(2) has if
            the target process is in the same thread group as the calling
            process, so users can use process_madvise(2) to extend existing
            madvise(2) usage to vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
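
      As a rough userspace illustration of the interface described above
      (a sketch, not part of this patch), the call can be issued through
      syscall(2) and the return value compared against the requested byte
      count to detect partial advice.  The helper names iov_total_bytes()
      and advise_ranges() are hypothetical, vlen is typed size_t as in the
      changelog notes further down, and the fallback constants (syscall
      number 440, MADV_COLD value 20) are assumptions for headers that
      predate the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values: 440 is the x86-64 and
 * asm-generic syscall number; 20 is MADV_COLD in the generic mman header). */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Hypothetical helper: bytes a fully successful call would report. */
static size_t iov_total_bytes(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    assert(iov != NULL || vlen == 0);
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/*
 * Hypothetical helper: advise every range in the vector with one syscall.
 * Returns 1 if all bytes were advised, 0 on partial advice, and -1 (with
 * errno set) on failure.
 */
static int advise_ranges(int pidfd, const struct iovec *iov, size_t vlen,
                         int advice)
{
    /* The flags argument is reserved for future use; pass 0. */
    long done = syscall(SYS_process_madvise, pidfd, iov, vlen, advice, 0U);

    if (done < 0)
        return -1;
    return (size_t)done == iov_total_bytes(iov, vlen) ? 1 : 0;
}
```

      A caller would obtain pidfd from pidfd_open(2), fill the iovec array
      with ranges taken from the target's address space, and treat a return
      of 0 from advise_ranges() as the partial-advice case described under
      RETURN VALUE.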
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How is the race (i.e., object validation) handled between the
      time an external process gives a hint and the time the target process
      receives it?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the
      target process can run between the time the process_madvise process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The
      same applies to other APIs like move_pages and process_vm_writev.
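
      One of the options mentioned above, suspending the target with
      SIGSTOP, can be sketched as follows.  This is only an illustration
      under a simplifying assumption: the target is a child of the caller,
      so waitpid(2) can confirm that the stop took effect.  For an
      unrelated process the caller would need another way to confirm it
      (ptrace or the freezer cgroup, as noted above).  The helper names
      spawn_idle_child(), stop_child() and resume_child() are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a target process: a child that just sleeps. */
static pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid; /* child pid in the parent, -1 if fork failed */
}

/*
 * Stop a child and wait until the kernel reports it as stopped, so that it
 * cannot change its address space while the caller inspects and hints it.
 * Returns 0 on success, -1 otherwise.
 */
static int stop_child(pid_t child)
{
    int status;

    assert(child > 0); /* pid 0 or -1 would signal a whole group */
    if (kill(child, SIGSTOP) < 0)
        return -1;
    if (waitpid(child, &status, WUNTRACED) < 0)
        return -1;
    return WIFSTOPPED(status) ? 0 : -1;
}

/* Let the child run again once the hinting is done. */
static int resume_child(pid_t child)
{
    assert(child > 0);
    return kill(child, SIGCONT);
}
```

      The process_madvise call would go between stop_child() and
      resume_child(), using ranges read from the target while it is
      stopped.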
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization through another API or
      through userspace design.  It shouldn't be part of the API itself.  If
      someone needs more fine-grained synchronization than process level,
      two ideas were suggested - cookie[2] and anon-fd[3].  Both could be
      implemented using the last, reserved argument of the API, but I don't
      think that is necessary right now: we already have ways to prevent
      the race, so I don't want to add complexity with a more fine-grained
      model.
      
      To keep the API extensible, the last argument is reserved so that we
      could support this in future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act on the hint in the caller's context,
      not the callee's, because the callee is usually limited by
      cpuset/cgroups or even in a frozen state, so it can't act by itself
      quickly enough, which causes more thrashing/kills.  It also doesn't work
      if the target process is already ptraced (e.g., by strace, a debugger,
      or minidump) because a process can have at most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
    • Minchan Kim's avatar
      pid: move pidfd_get_pid() to pid.c · 1aa92cd3
      Minchan Kim authored
      The process_madvise syscall needs the pidfd_get_pid function to translate
      a pidfd to a pid, so this patch moves the function to kernel/pid.c.
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1aa92cd3
    • Minchan Kim's avatar
      mm/madvise: pass mm to do_madvise · 0726b01e
      Minchan Kim authored
      Patch series "introduce memory hinting API for external process", v9.
      
      Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
      that, an application could give hints to the kernel about which memory
      ranges are preferred to be reclaimed.  However, on some platforms (e.g.,
      Android), the information required to make the hinting decision is not
      known to the app.  Instead, it is known to a centralized userspace daemon
      (e.g., ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the concern, this patch introduces a new syscall -
      process_madvise(2).  Basically, it's the same as the madvise(2) syscall
      but it has some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
         moment.  Other hints in madvise will be opened when there are explicit
         requests from community to prevent unexpected bugs we couldn't support.
      
      3. Only privileged processes can do something for other process's
         address space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context so we shouldn't assume "current" is always the hinted process's
      task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      mm_access() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as an argument to do_madvise so that existing
      behavior doesn't change, while preparing for the next patch to make
      review easy.
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0726b01e
    • John Hubbard's avatar
      selftests/vm: 10x speedup for hmm-tests · 25596530
      John Hubbard authored
      This patch reduces the running time for hmm-tests from about 10+ seconds,
      to just under 1.0 second, for an approximately 10x speedup.  That brings
      it in line with most of the other tests in selftests/vm, which mostly run
      in < 1 sec.
      
      This is done with a one-line change that simply reduces the number of
      iterations of several tests, from 256, to 10.  Thanks to Ralph Campbell
      for suggesting changing NTIMES as a way to get the speedup.
      Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: https://lkml.kernel.org/r/20201003011721.44238-1-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25596530
    • Jann Horn's avatar
      binfmt_elf: take the mmap lock around find_extend_vma() · b2767d97
      Jann Horn authored
      create_elf_tables() runs after setup_new_exec(), so other tasks can
      already access our new mm and do things like process_madvise() on it.  (At
      the time I'm writing this commit, process_madvise() is not in mainline
      yet, but has been in akpm's tree for some time.)
      
      While I believe that there are currently no APIs that would actually allow
      another process to mess up our VMA tree (process_madvise() is limited to
      MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
      under which no syscalls have been executed yet), this seems like an
      accident waiting to happen.
      
      Let's make sure that we always take the mmap lock around GUP paths as long
      as another process might be able to see the mm.
      
      (Yes, this diff looks suspicious because we drop the lock before doing
      anything with `vma`, but that's because we actually don't do anything with
      it apart from the NULL check.)
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2767d97
    • Jann Horn's avatar
      mm/gup_benchmark: take the mmap lock around GUP · f3964599
      Jann Horn authored
      To be safe against concurrent changes to the VMA tree, we must take the
      mmap lock around GUP operations (excluding the GUP-fast family of
      operations, which will take the mmap lock by themselves if necessary).
      
      This code is only for testing, and it's only reachable by root through
      debugfs, so this doesn't really have any impact; however, if we want to
      add lockdep asserts into the GUP path, we need to have clean locking here.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3964599
    • Liam R. Howlett's avatar
      mm/mmap: add inline munmap_vma_range() for code readability · fb8090b6
      Liam R. Howlett authored
      There are two locations that have a block of code for munmapping a vma
      range.  Change those two locations to use a function and add meaningful
      comments about what happens to the arguments, which was unclear in the
      previous code.
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb8090b6
    • Liam R. Howlett's avatar
      mm/mmap: add inline vma_next() for readability of mmap code · 3903b55a
      Liam R. Howlett authored
      There are three places where the next vma is required, each using the
      same block of code.  Replace the block with a function and add comments
      on what happens in the case where NULL is encountered.
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3903b55a
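The vma_next() helper described in the commit above can be sketched in user space as follows. This is a minimal sketch, not the kernel source: the two structs are simplified stand-ins that keep only the fields the helper touches, and the assumption is that a NULL vma means "before the first VMA", so the head of the list is returned.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields used below are modelled). */
struct vm_area_struct {
    unsigned long vm_start;
    unsigned long vm_end;
    struct vm_area_struct *vm_next;   /* next VMA in address order */
};

struct mm_struct {
    struct vm_area_struct *mmap;      /* head of the VMA list */
};

/*
 * Return the VMA that follows @vma.
 *
 * A NULL @vma is treated as "before the first VMA", so the list head is
 * returned.  The result may itself be NULL: either the list is empty or
 * @vma is the last entry.
 */
static inline struct vm_area_struct *
vma_next(struct mm_struct *mm, struct vm_area_struct *vma)
{
    if (!vma)
        return mm->mmap;

    return vma->vm_next;
}
```

Callers that previously open-coded the "vma ? vma->vm_next : mm->mmap" pattern can call the helper instead, which keeps the NULL handling documented in one place.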
    • Miaohe Lin's avatar
      mm/migrate: avoid possible unnecessary process right check in kernel_move_pages() · 4dc200ce
      Miaohe Lin authored
      There is no need to check whether this process has the right to modify
      the specified process when they are the same process.  We can also skip
      the security hook call if a process is modifying its own pages.  Add a
      helper function to handle these cases.
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christopher Lameter <cl@linux.com>
      Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4dc200ce
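The idea in the commit above (skip the permission check and the security hook when a process operates on itself) can be illustrated with a small user-space model. Everything here is hypothetical: struct task, may_access(), security_hook() and find_target() are illustration names, not kernel APIs, and the sketch assumes that pid 0 means "the calling process".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task {
    int pid;
    int allows_access;   /* 1 if the caller may modify this task */
};

static int access_checks;    /* how often the permission check ran */
static int security_hooks;   /* how often the security hook ran */

/* Hypothetical permission check; counts its invocations. */
static int may_access(const struct task *t)
{
    access_checks++;
    return t->allows_access;
}

/* Hypothetical security hook; always allows, counts its invocations. */
static int security_hook(const struct task *t)
{
    (void)t;
    security_hooks++;
    return 0;
}

/*
 * Resolve the task that @pid refers to and store it in *@out.
 * Assumption: pid == 0 means the caller itself, in which case neither
 * check runs.  Returns 0 on success or a negative errno value.
 */
static int find_target(struct task *current_task, struct task *table,
                       size_t n, int pid, struct task **out)
{
    if (pid == 0) {
        *out = current_task;   /* same process: nothing to verify */
        return 0;
    }

    for (size_t i = 0; i < n; i++) {
        int err;

        if (table[i].pid != pid)
            continue;
        if (!may_access(&table[i]))
            return -EPERM;
        err = security_hook(&table[i]);
        if (err)
            return err;
        *out = &table[i];
        return 0;
    }
    return -ESRCH;             /* no such task */
}
```

Putting the lookup in one helper keeps the fast path for the caller's own pages separate from the checked path for other processes.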
    • Joonsoo Kim's avatar
      mm/memory_hotplug: remove a wrapper for alloc_migration_target() · 203e6e5c
      Joonsoo Kim authored
      To calculate the correct node to migrate the page to for hotplug, we
      need to check the node id of the page.  A wrapper for
      alloc_migration_target() exists for this purpose.
      
      However, Vlastimil points out that all migration source pages come from
      a single node.  In this case, we don't need to check the node id for
      each page and we don't need to re-set the target nodemask for each page
      by using the wrapper.  Set up the migration_target_control once and use
      it for all pages.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      203e6e5c
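The "set up once, use for all pages" approach from the commit above can be sketched with a toy model. This is an illustration under stated assumptions, not the kernel code: struct page and struct target_control are hypothetical simplifications (the latter loosely modelled on migration_target_control), the nodemask is a plain bitmask, and the policy of preferring other nodes while falling back to the source node is assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page: only the node id matters here. */
struct page {
    int nid;
};

/* Hypothetical control block shared by every page in the batch. */
struct target_control {
    int nid;               /* source node, taken from the first page */
    unsigned long nmask;   /* bit i set => node i is an allowed target */
};

/*
 * Build the control block once for a batch of pages that all come from
 * the same node, so only the first page needs to be inspected.
 * Assumed policy: prefer any other online node, but fall back to the
 * source node when no other node is available.
 */
static struct target_control
setup_target(const struct page *pages, size_t n, unsigned long online_nodes)
{
    struct target_control tc = { .nid = -1, .nmask = online_nodes };

    if (n == 0)
        return tc;                      /* nothing to migrate */

    tc.nid = pages[0].nid;              /* same node for every page */
    tc.nmask &= ~(1UL << tc.nid);       /* try other nodes first */
    if (tc.nmask == 0)
        tc.nmask = 1UL << tc.nid;       /* only this node remains */

    return tc;
}
```

Because the control block no longer depends on the individual page, the per-page wrapper becomes unnecessary and the same structure can be handed to the allocation callback for the whole list.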