1. 23 Jul, 2019 2 commits
  2. 17 Jul, 2019 1 commit
    • Thomas Gleixner's avatar
      Merge tag 'perf-core-for-mingo-5.3-20190715' of... · e0c5c5e3
      Thomas Gleixner authored
      Merge tag 'perf-core-for-mingo-5.3-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
      
      Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
      
       perf db-export:
        Adrian Hunter:
        - Improvements in how COMM details are exported to databases for
          post processing and use in the sql-viewer.py UI.
      
        - Export switch events to the database.
      
       BPF:
        Arnaldo Carvalho de Melo:
        - Bump rlimit(MEMLOCK) for 'perf test bpf' and 'perf trace', just like
          selftests/bpf/bpf_rlimit.h do, which makes errors due to exhaustion of
          this limit, which are kinda cryptic (EPERM sometimes) less frequent.
      
       perf version:
        Ravi Bangoria:
        - Fix segfault due to missing OPT_END(), noticed on PowerPC.
      
       perf vendor events:
        Thomas Richter:
        - Add JSON files for IBM s/390 machine type 8561.
      
       perf cs-etm (ARM):
        YueHaibing:
        - Fix two cases of error returns not bing done properly: Invalid ERR_PTR() use
          and loss of propagation error codes.
      e0c5c5e3
  3. 15 Jul, 2019 1 commit
  4. 13 Jul, 2019 6 commits
    • Kan Liang's avatar
      perf/x86/intel: Fix spurious NMI on fixed counter · e4557c1a
      Kan Liang authored
      If a user first sample a PEBS event on a fixed counter, then sample a
      non-PEBS event on the same fixed counter on Icelake, it will trigger
      spurious NMI. For example:
      
        perf record -e 'cycles:p' -a
        perf record -e 'cycles' -a
      
      The error message for spurious NMI:
      
        [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
        [    +0.000000] Do you have a strange power saving mode enabled?
        [    +0.000000] Dazed and confused, but trying to continue
      
      The bug was introduced by the following commit:
      
        commit 6f55967a ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
      
      The commit moves the intel_pmu_pebs_disable() after intel_pmu_disable_fixed(),
      which returns immediately.  The related bit of PEBS_ENABLE MSR will never be
      cleared for the fixed counter. Then a non-PEBS event runs on the fixed counter,
      but the bit on PEBS_ENABLE is still set, which triggers spurious NMIs.
      
      Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().
      Reported-by: default avatarYi, Ammy <ammy.yi@intel.com>
      Signed-off-by: default avatarKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarJiri Olsa <jolsa@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 6f55967a ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
      Link: https://lkml.kernel.org/r/20190625142135.22112-1-kan.liang@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e4557c1a
    • Alexander Shishkin's avatar
      perf/core: Fix exclusive events' grouping · 8a58ddae
      Alexander Shishkin authored
      So far, we tried to disallow grouping exclusive events for the fear of
      complications they would cause with moving between contexts. Specifically,
      moving a software group to a hardware context would violate the exclusivity
      rules if both groups contain matching exclusive events.
      
      This attempt was, however, unsuccessful: the check that we have in the
      perf_event_open() syscall is both wrong (looks at wrong PMU) and
      insufficient (group leader may still be exclusive), as can be illustrated
      by running:
      
        $ perf record -e '{intel_pt//,cycles}' uname
        $ perf record -e '{cycles,intel_pt//}' uname
      
      ultimately successfully.
      
      Furthermore, we are completely free to trigger the exclusivity violation
      by:
      
         perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'
      
      even though the helpful perf record will not allow that, the ABI will.
      
      The warning later in the perf_event_open() path will also not trigger, because
      it's also wrong.
      
      Fix all this by validating the original group before moving, getting rid
      of broken safeguards and placing a useful one to perf_install_in_context().
      Signed-off-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: mathieu.poirier@linaro.org
      Cc: will.deacon@arm.com
      Fixes: bed5b25a ("perf: Add a pmu capability for "exclusive" events")
      Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8a58ddae
    • Kim Phillips's avatar
      perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs · 2f217d58
      Kim Phillips authored
      Fill in the L3 performance event select register ThreadMask
      bitfield, to enable per hardware thread accounting.
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gary Hook <Gary.Hook@amd.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liska <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190628215906.4276-2-kim.phillips@amd.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2f217d58
    • Kim Phillips's avatar
      perf/x86/amd/uncore: Do not set 'ThreadMask' and 'SliceMask' for non-L3 PMCs · 16f46411
      Kim Phillips authored
      The following commit:
      
        d7cbbe49 ("perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events")
      
      enables L3 PMC events for all threads and slices by writing 1's in
      'ChL3PmcCfg' (L3 PMC PERF_CTL) register fields.
      
      Those bitfields overlap with high order event select bits in the Data
      Fabric PMC control register, however.
      
      So when a user requests raw Data Fabric events (-e amd_df/event=0xYYY/),
      the two highest order bits get inadvertently set, changing the counter
      select to events that don't exist, and for which no counts are read.
      
      This patch changes the logic to write the L3 masks only when dealing
      with L3 PMC counters.
      
      AMD Family 16h and below Northbridge (NB) counters were not affected.
      Signed-off-by: default avatarKim Phillips <kim.phillips@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gary Hook <Gary.Hook@amd.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liska <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: d7cbbe49 ("perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events")
      Link: https://lkml.kernel.org/r/20190628215906.4276-1-kim.phillips@amd.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      16f46411
    • Peter Zijlstra's avatar
      perf/core: Fix race between close() and fork() · 1cf8dfe8
      Peter Zijlstra authored
      Syzcaller reported the following Use-after-Free bug:
      
      	close()						clone()
      
      							  copy_process()
      							    perf_event_init_task()
      							      perf_event_init_context()
      							        mutex_lock(parent_ctx->mutex)
      								inherit_task_group()
      								  inherit_group()
      								    inherit_event()
      								      mutex_lock(event->child_mutex)
      								      // expose event on child list
      								      list_add_tail()
      								      mutex_unlock(event->child_mutex)
      							        mutex_unlock(parent_ctx->mutex)
      
      							    ...
      							    goto bad_fork_*
      
      							  bad_fork_cleanup_perf:
      							    perf_event_free_task()
      
      	  perf_release()
      	    perf_event_release_kernel()
      	      list_for_each_entry()
      		mutex_lock(ctx->mutex)
      		mutex_lock(event->child_mutex)
      		// event is from the failing inherit
      		// on the other CPU
      		perf_remove_from_context()
      		list_move()
      		mutex_unlock(event->child_mutex)
      		mutex_unlock(ctx->mutex)
      
      							      mutex_lock(ctx->mutex)
      							      list_for_each_entry_safe()
      							        // event already stolen
      							      mutex_unlock(ctx->mutex)
      
      							    delayed_free_task()
      							      free_task()
      
      	     list_for_each_entry_safe()
      	       list_del()
      	       free_event()
      	         _free_event()
      		   // and so event->hw.target
      		   // is the already freed failed clone()
      		   if (event->hw.target)
      		     put_task_struct(event->hw.target)
      		       // WHOOPSIE, already quite dead
      
      Which puts the lie to the the comment on perf_event_free_task():
      'unexposed, unused context' not so much.
      
      Which is a 'fun' confluence of fail; copy_process() doing an
      unconditional free_task() and not respecting refcounts, and perf having
      creative locking. In particular:
      
        82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      
      seems to have overlooked this 'fun' parade.
      
      Solve it by using the fact that detached events still have a reference
      count on their (previous) context. With this perf_event_free_task()
      can detect when events have escaped and wait for their destruction.
      Debugged-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 82d94856 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1cf8dfe8
    • Ingo Molnar's avatar
      Merge tag 'perf-core-for-mingo-5.3-20190709' of... · e5eb08ac
      Ingo Molnar authored
      Merge tag 'perf-core-for-mingo-5.3-20190709' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
      
      Pull perf/core improvements and fixes:
      
      Intel PT:
      
        Adrian Hunter:
      
        - Fix DROP VIEW power_events_view in the postgresql and sqlite export-db
          python scripts.
      
      perf script:
      
        Song Liu:
      
        - Assume native_arch for pipe mode, fixing a segfault.
      
      perf inject:
      
        Arnaldo Carvalho de Melo:
      
        - The tool->read() call may pass a NULL evsel, handle it.
      
      core:
      
        Arnaldo Carvalho de Melo:
      
        - Move zalloc/zfree.c to tools/lib, further eroding tools/perf/util.[ch]
      
        - Use zfree() where applicable instead of open coded equivalent.
      
        - Add stdlib.h and some other headers to places where its needed and were
          getting via util.h, that doesn't need that anymore.
      
        - Use list_del_init() more thoroughly.
      
      Miscellaneous:
      
        Leo Yan:
      
        - Fix use after free and potential NULL pointer derefs detected by the
          smatch tool in various places.
      
        Luke Mujica:
      
        - Remove a couple unused variables in the parse-events code.
      
        Numfor Mbiziwo-Tiapo:
      
        - Initialize variable to suppress memory sanitizer warning in the
          mmap-thread-lookup 'perf test' entry.
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e5eb08ac
  5. 12 Jul, 2019 30 commits
    • Linus Torvalds's avatar
      Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · f632a817
      Linus Torvalds authored
      Pull driver core and debugfs updates from Greg KH:
       "Here is the "big" driver core and debugfs changes for 5.3-rc1
      
        It's a lot of different patches, all across the tree due to some api
        changes and lots of debugfs cleanups.
      
        Other than the debugfs cleanups, in this set of changes we have:
      
         - bus iteration function cleanups
      
         - scripts/get_abi.pl tool to display and parse Documentation/ABI
           entries in a simple way
      
         - cleanups to Documenatation/ABI/ entries to make them parse easier
           due to typos and other minor things
      
         - default_attrs use for some ktype users
      
         - driver model documentation file conversions to .rst
      
         - compressed firmware file loading
      
         - deferred probe fixes
      
        All of these have been in linux-next for a while, with a bunch of
        merge issues that Stephen has been patient with me for"
      
      * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
        debugfs: make error message a bit more verbose
        orangefs: fix build warning from debugfs cleanup patch
        ubifs: fix build warning after debugfs cleanup patch
        driver: core: Allow subsystems to continue deferring probe
        drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
        arch_topology: Remove error messages on out-of-memory conditions
        lib: notifier-error-inject: no need to check return value of debugfs_create functions
        swiotlb: no need to check return value of debugfs_create functions
        ceph: no need to check return value of debugfs_create functions
        sunrpc: no need to check return value of debugfs_create functions
        ubifs: no need to check return value of debugfs_create functions
        orangefs: no need to check return value of debugfs_create functions
        nfsd: no need to check return value of debugfs_create functions
        lib: 842: no need to check return value of debugfs_create functions
        debugfs: provide pr_fmt() macro
        debugfs: log errors when something goes wrong
        drivers: s390/cio: Fix compilation warning about const qualifiers
        drivers: Add generic helper to match by of_node
        driver_find_device: Unify the match function with class_find_device()
        bus_find_device: Unify the match callback with class_find_device
        ...
      f632a817
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · ef8f3d48
      Linus Torvalds authored
      Merge updates from Andrew Morton:
       "Am experimenting with splitting MM up into identifiable subsystems
        perhaps with a view to gitifying it in complex ways. Also with more
        verbose "incoming" emails.
      
        Most of MM is here and a few other trees.
      
        Subsystems affected by this patch series:
         - hotfixes
         - iommu
         - scripts
         - arch/sh
         - ocfs2
         - mm:slab-generic
         - mm:slub
         - mm:kmemleak
         - mm:kasan
         - mm:cleanups
         - mm:debug
         - mm:pagecache
         - mm:swap
         - mm:memcg
         - mm:gup
         - mm:pagemap
         - mm:infrastructure
         - mm:vmalloc
         - mm:initialization
         - mm:pagealloc
         - mm:vmscan
         - mm:tools
         - mm:proc
         - mm:ras
         - mm:oom-kill
      
        hotfixes:
            mm: vmscan: scan anonymous pages on file refaults
            mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
            mm/memcontrol: fix wrong statistics in memory.stat
            mm/z3fold.c: lock z3fold page before  __SetPageMovable()
            nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
            MAINTAINERS: nilfs2: update email address
      
        iommu:
            include/linux/dmar.h: replace single-char identifiers in macros
      
        scripts:
            scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
            scripts/decode_stacktrace: look for modules with .ko.debug extension
            scripts/spelling.txt: drop "sepc" from the misspelling list
            scripts/spelling.txt: add spelling fix for prohibited
            scripts/decode_stacktrace: Accept dash/underscore in modules
            scripts/spelling.txt: add more spellings to spelling.txt
      
        arch/sh:
            arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
            sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
            sh: prevent warnings when using iounmap
      
        ocfs2:
            fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
            ocfs2/dlm: use struct_size() helper
            ocfs2: add last unlock times in locking_state
            ocfs2: add locking filter debugfs file
            ocfs2: add first lock wait time in locking_state
            ocfs: no need to check return value of debugfs_create functions
            fs/ocfs2/dlmglue.c: unneeded variable: "status"
            ocfs2: use kmemdup rather than duplicating its implementation
      
        mm:slab-generic:
          Patch series "mm/slab: Improved sanity checking":
            mm/slab: validate cache membership under freelist hardening
            mm/slab: sanity-check page type when looking up cache
            lkdtm/heap: add tests for freelist hardening
      
        mm:slub:
            mm/slub.c: avoid double string traverse in kmem_cache_flags()
            slub: don't panic for memcg kmem cache creation failure
      
        mm:kmemleak:
            mm/kmemleak.c: fix check for softirq context
            mm/kmemleak.c: change error at _write when kmemleak is disabled
            docs: kmemleak: add more documentation details
      
        mm:kasan:
            mm/kasan: print frame description for stack bugs
            Patch series "Bitops instrumentation for KASAN", v5:
              lib/test_kasan: add bitops tests
              x86: use static_cpu_has in uaccess region to avoid instrumentation
              asm-generic, x86: add bitops instrumentation for KASAN
            Patch series "mm/kasan: Add object validation in ksize()", v3:
              mm/kasan: introduce __kasan_check_{read,write}
              mm/kasan: change kasan_check_{read,write} to return boolean
              lib/test_kasan: Add test for double-kzfree detection
              mm/slab: refactor common ksize KASAN logic into slab_common.c
              mm/kasan: add object validation in ksize()
      
        mm:cleanups:
            include/linux/pfn_t.h: remove pfn_t_to_virt()
            Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
              arm: remove ARCH_SELECT_MEMORY_MODEL
              s390: remove ARCH_SELECT_MEMORY_MODEL
              sparc: remove ARCH_SELECT_MEMORY_MODEL
            mm/gup.c: make follow_page_mask() static
            mm/memory.c: trivial clean up in insert_page()
            mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
            include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
            mm: remove the account_page_dirtied export
            mm/page_isolation.c: change the prototype of undo_isolate_page_range()
            include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
            mm: remove the exporting of totalram_pages
            include/linux/pagemap.h: document trylock_page() return value
      
        mm:debug:
            mm/failslab.c: by default, do not fail allocations with direct reclaim only
            Patch series "debug_pagealloc improvements":
              mm, debug_pagelloc: use static keys to enable debugging
              mm, page_alloc: more extensive free page checking with debug_pagealloc
              mm, debug_pagealloc: use a page type instead of page_ext flag
      
        mm:pagecache:
            Patch series "fix filler_t callback type mismatches", v2:
              mm/filemap.c: fix an overly long line in read_cache_page
              mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
              jffs2: pass the correct prototype to read_cache_page
              9p: pass the correct prototype to read_cache_page
            mm/filemap.c: correct the comment about VM_FAULT_RETRY
      
        mm:swap:
            mm, swap: fix race between swapoff and some swap operations
            mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
            mm, swap: use rbtree for swap_extent
            mm/mincore.c: fix race between swapoff and mincore
      
        mm:memcg:
            memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
            memcg, fsnotify: no oom-kill for remote memcg charging
            mm, memcg: introduce memory.events.local
            mm: memcontrol: dump memory.stat during cgroup OOM
            Patch series "mm: reparent slab memory on cgroup removal", v7:
              mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
              mm: memcg/slab: rename slab delayed deactivation functions and fields
              mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
              mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
              mm: memcg/slab: unify SLAB and SLUB page accounting
              mm: memcg/slab: don't check the dying flag on kmem_cache creation
              mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
              mm: memcg/slab: rework non-root kmem_cache lifecycle management
              mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
              mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
            mm, memcg: add a memcg_slabinfo debugfs file
      
        mm:gup:
            Patch series "switch the remaining architectures to use generic GUP", v4:
              mm: use untagged_addr() for get_user_pages_fast addresses
              mm: simplify gup_fast_permitted
              mm: lift the x86_32 PAE version of gup_get_pte to common code
              MIPS: use the generic get_user_pages_fast code
              sh: add the missing pud_page definition
              sh: use the generic get_user_pages_fast code
              sparc64: add the missing pgd_page definition
              sparc64: define untagged_addr()
              sparc64: use the generic get_user_pages_fast code
              mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
              mm: reorder code blocks in gup.c
              mm: consolidate the get_user_pages* implementations
              mm: validate get_user_pages_fast flags
              mm: move the powerpc hugepd code to mm/gup.c
              mm: switch gup_hugepte to use try_get_compound_head
              mm: mark the page referenced in gup_hugepte
            mm/gup: speed up check_and_migrate_cma_pages() on huge page
            mm/gup.c: remove some BUG_ONs from get_gate_page()
            mm/gup.c: mark undo_dev_pagemap as __maybe_unused
      
        mm:pagemap:
            asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
            alpha: switch to generic version of pte allocation
            arm: switch to generic version of pte allocation
            arm64: switch to generic version of pte allocation
            csky: switch to generic version of pte allocation
            m68k: sun3: switch to generic version of pte allocation
            mips: switch to generic version of pte allocation
            nds32: switch to generic version of pte allocation
            nios2: switch to generic version of pte allocation
            parisc: switch to generic version of pte allocation
            riscv: switch to generic version of pte allocation
            um: switch to generic version of pte allocation
            unicore32: switch to generic version of pte allocation
            mm/pgtable: drop pgtable_t variable from pte_fn_t functions
            mm/memory.c: fail when offset == num in first check of __vm_map_pages()
      
        mm:infrastructure:
            mm/mmu_notifier: use hlist_add_head_rcu()
      
        mm:vmalloc:
            Patch series "Some cleanups for the KVA/vmalloc", v5:
              mm/vmalloc.c: remove "node" argument
              mm/vmalloc.c: preload a CPU with one object for split purpose
              mm/vmalloc.c: get rid of one single unlink_va() when merge
              mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
            mm/vmalloc.c: spelling> s/informaion/information/
      
        mm:initialization:
            mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
            mm/large system hash: clear hashdist when only one node with memory is booted
      
        mm:pagealloc:
            arm64: move jump_label_init() before parse_early_param()
            Patch series "add init_on_alloc/init_on_free boot options", v10:
              mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
              mm: init: report memory auto-initialization features at boot time
      
        mm:vmscan:
            mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
            mm: vmscan: correct some vmscan counters for THP swapout
      
        mm:tools:
            tools/vm/slabinfo: order command line options
            tools/vm/slabinfo: add partial slab listing to -X
            tools/vm/slabinfo: add option to sort by partial slabs
            tools/vm/slabinfo: add sorting info to help menu
      
        mm:proc:
            proc: use down_read_killable mmap_sem for /proc/pid/maps
            proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
            proc: use down_read_killable mmap_sem for /proc/pid/pagemap
            proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
            proc: use down_read_killable mmap_sem for /proc/pid/map_files
            mm: use down_read_killable for locking mmap_sem in access_remote_vm
            mm: smaps: split PSS into components
            mm: vmalloc: show number of vmalloc pages in /proc/meminfo
      
        mm:ras:
            mm/memory-failure.c: clarify error message
      
        mm:oom-kill:
            mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
            mm, oom: refactor dump_tasks for memcg OOMs
            mm, oom: remove redundant task_in_mem_cgroup() check
            oom: decouple mems_allowed from oom_unkillable_task
            mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"
      
      * akpm: (147 commits)
        mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
        oom: decouple mems_allowed from oom_unkillable_task
        mm, oom: remove redundant task_in_mem_cgroup() check
        mm, oom: refactor dump_tasks for memcg OOMs
        mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
        mm/memory-failure.c: clarify error message
        mm: vmalloc: show number of vmalloc pages in /proc/meminfo
        mm: smaps: split PSS into components
        mm: use down_read_killable for locking mmap_sem in access_remote_vm
        proc: use down_read_killable mmap_sem for /proc/pid/map_files
        proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
        proc: use down_read_killable mmap_sem for /proc/pid/pagemap
        proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
        proc: use down_read_killable mmap_sem for /proc/pid/maps
        tools/vm/slabinfo: add sorting info to help menu
        tools/vm/slabinfo: add option to sort by partial slabs
        tools/vm/slabinfo: add partial slab listing to -X
        tools/vm/slabinfo: order command line options
        mm: vmscan: correct some vmscan counters for THP swapout
        mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
        ...
      ef8f3d48
    • Tetsuo Handa's avatar
      mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process() · 2c207985
      Tetsuo Handa authored
      Since commit bbbe4802 ("mm, oom: remove 'prefer children over
      parent' heuristic") removed the
      
        "%s: Kill process %d (%s) score %u or sacrifice child\n"
      
      line, oc->chosen_points is no longer used after select_bad_process().
      
      Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c207985
    • Shakeel Butt's avatar
      oom: decouple mems_allowed from oom_unkillable_task · ac311a14
      Shakeel Butt authored
      Commit ef08e3b4 ("[PATCH] cpusets: confine oom_killer to
      mem_exclusive cpuset") introduces a heuristic where a potential
      oom-killer victim is skipped if the intersection of the potential victim
      and the current (the process triggered the oom) is empty based on the
      reason that killing such victim most probably will not help the current
      allocating process.
      
      However the commit 7887a3da ("[PATCH] oom: cpuset hint") changed the
      heuristic to just decrease the oom_badness scores of such potential
      victim based on the reason that the cpuset of such processes might have
      changed and previously they may have allocated memory on mems where the
      current allocating process can allocate from.
      
      Unintentionally 7887a3da ("[PATCH] oom: cpuset hint") introduced a
      side effect as the oom_badness is also exposed to the user space through
      /proc/[pid]/oom_score, so, readers with different cpusets can read
      different oom_score of the same process.
      
      Later, commit 6cf86ac6 ("oom: filter tasks not sharing the same
      cpuset") fixed the side effect introduced by 7887a3da by moving the
      cpuset intersection back to only oom-killer context and out of
      oom_badness.  However the combination of ab290adb ("oom: make
      oom_unkillable_task() helper function") and 26ebc984 ("oom:
      /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
      brought back the cpuset intersection check into the oom_badness
      calculation function.
      
      Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
      oom context is also doing cpuset/mempolicy intersection which is quite
      wrong and is caught by syzcaller with the following report:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
        oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
        mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
        select_bad_process mm/oom_kill.c:374 [inline]
        out_of_memory mm/oom_kill.c:1088 [inline]
        out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
        mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
        mem_cgroup_oom mm/memcontrol.c:1905 [inline]
        try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
        mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
        mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
        do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
        do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
        wp_huge_pmd mm/memory.c:3793 [inline]
        __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
        handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
        do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
        __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
        do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
        page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
      RIP: 0033:0x400590
      Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
      8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
      00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
      RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
      RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
      RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
      R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
      Modules linked in:
      ---[ end trace a65689219582ffff ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      
      The fix is to decouple the cpuset/mempolicy intersection check from
      oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
      only done in the global oom context.
      
      [shakeelb@google.com: change function name and update comment]
        Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac311a14
    • Shakeel Butt's avatar
      mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt authored
      oom_unkillable_task() can be called from three different contexts i.e.
      global OOM, memcg OOM and oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform task_in_mem_cgroup()
      check for global OOM and oom_score procfs interface, those contexts
      provide NULL memcg and skips the task_in_mem_cgroup() check.  However
      for memcg OOM context, the oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ba749ee
    • Shakeel Butt's avatar
      mm, oom: refactor dump_tasks for memcg OOMs · 5eee7e1c
      Shakeel Butt authored
      dump_tasks() traverses all the existing processes even for the memcg OOM
      context which is not only unnecessary but also wasteful.  This imposes a
      long RCU critical section even from a contained context which can be quite
      disruptive.
      
      Change dump_tasks() to be aligned with select_bad_process and use
      mem_cgroup_scan_tasks to selectively traverse only processes of the target
      memcg hierarchy during memcg OOM.
      
      Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5eee7e1c
    • Tetsuo Handa's avatar
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · f168a9a5
      Tetsuo Handa authored
      Since commit c03cd773 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f168a9a5
    • Jane Chu's avatar
      mm/memory-failure.c: clarify error message · 135e5351
      Jane Chu authored
      Some user who install SIGBUS handler that does longjmp out therefore
      keeping the process alive is confused by the error message
      
        "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"
      
      Slightly modify the error message to improve clarity.
      
      Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.comSigned-off-by: default avatarJane Chu <jane.chu@oracle.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      135e5351
    • Roman Gushchin's avatar
      mm: vmalloc: show number of vmalloc pages in /proc/meminfo · 97105f0a
      Roman Gushchin authored
      Vmalloc() is getting more and more used these days (kernel stacks, bpf and
      percpu allocator are new top users), and the total % of memory consumed by
      vmalloc() can be pretty significant and changes dynamically.
      
      /proc/meminfo is the best place to display this information: its top goal
      is to show top consumers of the memory.
      
      Since the VmallocUsed field in /proc/meminfo is not in use for quite a
      long time (it has been defined to 0 by a5ad88ce ("mm: get rid of
      'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
      actual physical memory consumption of vmalloc().
      
      Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.comSigned-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97105f0a
    • Luigi Semenzato's avatar
      mm: smaps: split PSS into components · ee2ad71b
      Luigi Semenzato authored
      Report separate components (anon, file, and shmem) for PSS in
      smaps_rollup.
      
      This helps understand and tune the memory manager behavior in consumer
      devices, particularly mobile devices.  Many of them (e.g.  chromebooks and
      Android-based devices) use zram for anon memory, and perform disk reads
      for discarded file pages.  The difference in latency is large (e.g.
      reading a single page from SSD is 30 times slower than decompressing a
      zram page on one popular device), thus it is useful to know how much of
      the PSS is anon vs.  file.
      
      All the information is already present in /proc/pid/smaps, but much more
      expensive to obtain because of the large size of that procfs entry.
      
      This patch also removes a small code duplication in smaps_account, which
      would have gotten worse otherwise.
      
      Also updated Documentation/filesystems/proc.txt (the smaps section was a
      bit stale, and I added a smaps_rollup section) and
      Documentation/ABI/testing/procfs-smaps_rollup.
      
      [semenzato@chromium.org: v5]
        Link: http://lkml.kernel.org/r/20190626234333.44608-1-semenzato@chromium.org
      Link: http://lkml.kernel.org/r/20190626180429.174569-1-semenzato@chromium.orgSigned-off-by: default avatarLuigi Semenzato <semenzato@chromium.org>
      Acked-by: default avatarYu Zhao <yuzhao@chromium.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Yu Zhao <yuzhao@chromium.org>
      Cc: Brian Geffon <bgeffon@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee2ad71b
    • Konstantin Khlebnikov's avatar
      mm: use down_read_killable for locking mmap_sem in access_remote_vm · 1e426fe2
      Konstantin Khlebnikov authored
      This function is used by ptrace and proc files like /proc/pid/cmdline and
      /proc/pid/environ.
      
      Access_remote_vm never returns error codes, all errors are ignored and
      only size of successfully read data is returned.  So, if current task was
      killed we'll simply return 0 (bytes read).
      
      Mmap_sem could be locked for a long time or forever if something goes
      wrong.  Using a killable lock permits cleanup of stuck tasks and
      simplifies investigation.
      
      Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e426fe2
    • Konstantin Khlebnikov's avatar
      proc: use down_read_killable mmap_sem for /proc/pid/map_files · cd9e2bb8
      Konstantin Khlebnikov authored
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      It seems ->d_revalidate() could return any error (except ECHILD) to abort
      validation and pass error as result of lookup sequence.
      
      [akpm@linux-foundation.org: fix proc_map_files_lookup() return value, per Andrei]
      Link: http://lkml.kernel.org/r/156007493995.3335.9595044802115356911.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd9e2bb8
    • Konstantin Khlebnikov's avatar
      proc: use down_read_killable mmap_sem for /proc/pid/clear_refs · c4603801
      Konstantin Khlebnikov authored
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      Replace the only unkillable mmap_sem lock in clear_refs_write().
      
      Link: http://lkml.kernel.org/r/156007493826.3335.5424884725467456239.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4603801
    • Konstantin Khlebnikov's avatar
      proc: use down_read_killable mmap_sem for /proc/pid/pagemap · ad80b932
      Konstantin Khlebnikov authored
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      Link: http://lkml.kernel.org/r/156007493638.3335.4872164955523928492.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad80b932
    • Konstantin Khlebnikov's avatar
      proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup · a26a9781
      Konstantin Khlebnikov authored
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      Link: http://lkml.kernel.org/r/156007493429.3335.14666825072272692455.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a26a9781
    • Konstantin Khlebnikov's avatar
      proc: use down_read_killable mmap_sem for /proc/pid/maps · 8a713e7d
      Konstantin Khlebnikov authored
      Do not remain stuck forever if something goes wrong.  Using a killable
      lock permits cleanup of stuck tasks and simplifies investigation.
      
      This function is also used for /proc/pid/smaps.
      
      Link: http://lkml.kernel.org/r/156007493160.3335.14447544314127417266.stgit@buzzSigned-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a713e7d
    • Tobin C. Harding's avatar
      tools/vm/slabinfo: add sorting info to help menu · cbf800d9
      Tobin C. Harding authored
      Passing more than one sorting option has undefined behaviour.
      
      Add an explicit statement as such to the help menu, this also has the
      advantage of highlighting all the sorting options.
      
      Link: http://lkml.kernel.org/r/20190426022622.4089-5-tobin@kernel.orgSigned-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>,
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@iki.fi>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cbf800d9
    • Tobin C. Harding's avatar
      tools/vm/slabinfo: add option to sort by partial slabs · 53a83f97
      Tobin C. Harding authored
      We would like to get a better view of the level of fragmentation within
      the SLUB allocator.  Total number of partial slabs is an indicator of
      fragmentation.
      
      Add a command line option (-P | --partial) to sort the slab list by total
      number of partial slabs.
      
      Link: http://lkml.kernel.org/r/20190426022622.4089-4-tobin@kernel.orgSigned-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>,
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@iki.fi>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      53a83f97
    • Tobin C. Harding's avatar
      tools/vm/slabinfo: add partial slab listing to -X · 1106b205
      Tobin C. Harding authored
      We would like to see how fragmented the SLUB allocator is, one window into
      fragmentation is the total number of partial slabs.
      
      Currently `slabinfo -X` shows slabs sorted by loss and by size.  We can
      use this option to also show slabs sorted by number of partial slabs.
      
      Option '-X' can be used in conjunction with '-N' to control the number of
      slabs shown e.g.  list of top 5 slabs:
      
      	slabinfo -X -N5
      
      Add list of slabs ordered by number of partial slabs to output of
      `slabinfo -X`.
      
      Link: http://lkml.kernel.org/r/20190426022622.4089-3-tobin@kernel.orgSigned-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>,
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@iki.fi>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1106b205
    • Tobin C. Harding's avatar
      tools/vm/slabinfo: order command line options · d9149996
      Tobin C. Harding authored
      During recent discussion on LKML over SLAB vs SLUB it was suggested by
      Jesper that it would be nice to have a tool to view the current
      fragmentation of the slab allocators.  CC list for this set is taken
      from that thread.
      
      For SLUB we have all the information for this already exposed by the
      kernel and also we have a userspace tool for displaying this info:
      
      	tools/vm/slabinfo.c
      
      Extend slabinfo to improve the fragmentation information by enabling
      sorting of caches by number of partial slabs.
      
      Also add cache list sorted in this manner to the output of `slabinfo -X`.
      
      This patch (of 4):
      
      get_opt() has a spurious character within the option string.  Remove it
      and reorder the options in alphabetic order so that it is easier to keep
      the options correct.  Use the same ordering for command help output and
      long option handling code.
      
      Link: http://lkml.kernel.org/r/20190426022622.4089-2-tobin@kernel.orgSigned-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@iki.fi>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>,
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9149996
    • Yang Shi's avatar
      mm: vmscan: correct some vmscan counters for THP swapout · 98879b3b
      Yang Shi authored
      Commit bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped
      out"), THP can be swapped out in a whole.  But, nr_reclaimed and some
      other vm counters still get inc'ed by one even though a whole THP (512
      pages) gets swapped out.
      
      This doesn't make too much sense to memory reclaim.
      
      For example, direct reclaim may just need reclaim SWAP_CLUSTER_MAX
      pages, reclaiming one THP could fulfill it.  But, if nr_reclaimed is not
      increased correctly, direct reclaim may just waste time to reclaim more
      pages, SWAP_CLUSTER_MAX * 512 pages in worst case.
      
      And, it may cause pgsteal_{kswapd|direct} is greater than
      pgscan_{kswapd|direct}, like the below:
      
      pgsteal_kswapd 122933
      pgsteal_direct 26600225
      pgscan_kswapd 174153
      pgscan_direct 14678312
      
      nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
      break some page reclaim logic, e.g.
      
      vmpressure: this looks at the scanned/reclaimed ratio so it won't change
      semantics as long as scanned & reclaimed are fixed in parallel.
      
      compaction/reclaim: compaction wants a certain number of physical pages
      freed up before going back to compacting.
      
      kswapd priority raising: kswapd raises priority if we scan fewer pages
      than the reclaim target (which itself is obviously expressed in order-0
      pages).  As a result, kswapd can falsely raise its aggressiveness even
      when it's making great progress.
      
      Other than nr_scanned and nr_reclaimed, some other counters, e.g.
      pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed too
      since they are user visible via cgroup, /proc/vmstat or trace points,
      otherwise they would be underreported.
      
      When isolating pages from LRUs, nr_taken has been accounted in base page,
      but nr_scanned and nr_skipped are still accounted in THP.  It doesn't make
      too much sense too since this may cause trace point underreport the
      numbers as well.
      
      So accounting those counters in base page instead of accounting THP as one
      page.
      
      nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
      file cache, so they are not impacted by THP swap.
      
      This change may result in lower steal/scan ratio in some cases since THP
      may get split during page reclaim, then a part of tail pages get reclaimed
      instead of the whole 512 pages, but nr_scanned is accounted by 512,
      particularly for direct reclaim.  But, this should be not a significant
      issue.
      
      Link: http://lkml.kernel.org/r/1559025859-72759-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98879b3b
    • Yang Shi's avatar
      mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned · af5d4403
      Yang Shi authored
      Commit 9092c71b ("mm: use sc->priority for slab shrink targets") has
      broken up the relationship between sc->nr_scanned and slab pressure.
      The sc->nr_scanned can't double slab pressure anymore.  So, it sounds no
      sense to still keep sc->nr_scanned inc'ed.  Actually, it would prevent
      from adding pressure on slab shrink since excessive sc->nr_scanned would
      prevent from scan->priority raise.
      
      The bonnie test doesn't show this would change the behavior of slab
      shrinkers.
      
      				w/		w/o
      			  /sec    %CP      /sec      %CP
      Sequential delete: 	3960.6    94.6    3997.6     96.2
      Random delete: 		2518      63.8    2561.6     64.6
      
      The slight increase of "/sec" without the patch would be caused by the
      slight increase of CPU usage.
      
      Link: http://lkml.kernel.org/r/1559025859-72759-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af5d4403
    • Alexander Potapenko's avatar
      mm: init: report memory auto-initialization features at boot time · 23a5c8cb
      Alexander Potapenko authored
      Print the currently enabled stack and heap initialization modes.
      
      Stack initialization is enabled by a config flag, while heap
      initialization is configured at boot time with defaults being set in the
      config.  It's more convenient for the user to have all information about
      these hardening measures in one place at boot, so the user can reason
      about the expected behavior of the running system.
      
      The possible options for stack are:
       - "all" for CONFIG_INIT_STACK_ALL;
       - "byref_all" for CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL;
       - "byref" for CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF;
       - "__user" for CONFIG_GCC_PLUGIN_STRUCTLEAK_USER;
       - "off" otherwise.
      
      Depending on the values of init_on_alloc and init_on_free boottime options
      we also report "heap alloc" and "heap free" as "on"/"off".
      
      In the init_on_free mode initializing pages at boot time may take a while,
      so print a notice about that as well.  This depends on how much memory is
      installed, the memory bandwidth, etc.  On a relatively modern x86 system,
      it takes about 0.75s/GB to wipe all memory:
      
        [    0.418722] mem auto-init: stack:byref_all, heap alloc:off, heap free:on
        [    0.419765] mem auto-init: clearing system memory may take some time...
        [   12.376605] Memory: 16408564K/16776672K available (14339K kernel code, 1397K rwdata, 3756K rodata, 1636K init, 11460K bss, 368108K reserved, 0K cma-reserved)
      
      Link: http://lkml.kernel.org/r/20190617151050.92663-3-glider@google.comSigned-off-by: default avatarAlexander Potapenko <glider@google.com>
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Kaiwan N Billimoria <kaiwan@kaiwantech.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      23a5c8cb
    • Alexander Potapenko's avatar
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko authored
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • Kees Cook's avatar
      arm64: move jump_label_init() before parse_early_param() · ba5c5e4a
      Kees Cook authored
      While jump_label_init() was moved earlier in the boot process in
      efd9e03f ("arm64: Use static keys for CPU features"), it wasn't early
      enough for early params to use it.  The old state of things was as
      described here...
      
      init/main.c calls out to arch-specific things before general jump label
      and early param handling:
      
        asmlinkage __visible void __init start_kernel(void)
        {
              ...
              setup_arch(&command_line);
              ...
              smp_prepare_boot_cpu();
              ...
              /* parameters may set static keys */
              jump_label_init();
              parse_early_param();
              ...
        }
      
      x86 setup_arch() wants those earlier, so it handles jump label and
      early param:
      
        void __init setup_arch(char **cmdline_p)
        {
              ...
              jump_label_init();
              ...
              parse_early_param();
              ...
        }
      
      arm64 setup_arch() only had early param:
      
        void __init setup_arch(char **cmdline_p)
        {
              ...
              parse_early_param();
              ...
      }
      
      with jump label later in smp_prepare_boot_cpu():
      
        void __init smp_prepare_boot_cpu(void)
        {
              ...
              jump_label_init();
              ...
        }
      
      This moves arm64 jump_label_init() from smp_prepare_boot_cpu() to
      setup_arch(), as done already on x86, in preparation from early param
      usage in the init_on_alloc/free() series:
      https://lkml.kernel.org/r/1561572949.5154.81.camel@lca.pw
      
      Link: http://lkml.kernel.org/r/201906271003.005303B52@keescookSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba5c5e4a
    • Nicholas Piggin's avatar
      mm/large system hash: clear hashdist when only one node with memory is booted · e03a5125
      Nicholas Piggin authored
      CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
      when booting on single node machines.  This causes the large system hashes
      to be allocated with vmalloc, and mapped with small pages.
      
      This change clears hashdist if only one node has come up with memory.
      
      This results in the important large inode and dentry hashes using memblock
      allocations.  All others are within 4MB size up to about 128GB of RAM,
      which allows them to be allocated from the linear map on most non-NUMA
      images.
      
      Other big hashes like futex and TCP should eventually be moved over to the
      same style of allocation as those vfs caches that use HASH_EARLY if
      !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.
      
      This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
      ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
      (performance is in the noise, under 1% difference, page tables are likely
      to be well cached for this workload).
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.comSigned-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e03a5125
    • Nicholas Piggin's avatar
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist · ec11408a
      Nicholas Piggin authored
      The kernel currently clamps large system hashes to MAX_ORDER when hashdist
      is not set, which is rather arbitrary.
      
      vmalloc space is limited on 32-bit machines, but this shouldn't result in
      much more used because of small physical memory limiting system hash
      sizes.
      
      Include "vmalloc" or "linear" in the kernel log message.
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.comSigned-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec11408a
    • Geert Uytterhoeven's avatar
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va() · 460e42d1
      Uladzislau Rezki (Sony) authored
      Trigger a warning if an object that is about to be freed is detached.  We
      used to have a BUG_ON(), but even though it is considered as faulty
      behaviour that is not a good reason to break a system.
      
      Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      460e42d1
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc.c: get rid of one single unlink_va() when merge · 54f63d9d
      Uladzislau Rezki (Sony) authored
      It does not make sense to try to "unlink" the node that is definitely not
      linked with a list nor tree.  On the first merge step VA just points to
      the previously disconnected busy area.
      
      On the second step, check if the node has been merged and do "unlink" if
      so, because now it points to an object that must be linked.
      
      Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: default avatarHillf Danton <hdanton@sina.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54f63d9d