1. 13 Aug, 2019 16 commits
    • Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Andrea Arcangeli authored
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the end
      of the merge window.  We now understand that the regression was a false
      positive caused by a significant increase in fairness during a
      swap thrashing benchmark.  So it's safe to re-apply the fix and continue
      improving the code from there.  The benchmark that reported the
      regression is very useful, but it provides a meaningful result only when
      there is no significant alteration in fairness during the workload.  The
      removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
      set to "madvise") was never meant to provide an implicit MPOL_BIND on
      the "current" node the task is running on, causing swap storms and
      providing a much more aggressive behavior than even zone_reclaim_mode =
      3.
      
      Any workload that could have benefited from __GFP_THISNODE now has to
      enable zone_reclaim_mode=1|2|3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
      semantics.  In fact the two are orthogonal: zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristics of memory depend on the hardware
      details.  The numbers below were obtained on the Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for guest mode
      kernel the performance characteristic of RAM is more complex and an
      ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node, which the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we should aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new multi order
      allocator API and replace the two alloc_pages_vma() calls in the page
      fault path with a single multi order call:
      
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on any
      zone with order 9, it tries again with an order 0 before falling
      back to the next zone in the zonelist.
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Andrea Arcangeli authored
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" were rightfully reverted to be sure not to introduce
      regressions at the end of a merge window after a severe regression report
      from the kernel bot.  We can safely re-apply them now that we have had
      time to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel bot however forced us to revert
      the good fixes to be sure not to introduce regressions and to give us
      the time to analyze the issue further.  The silver lining is that this
      extra time allowed us to think more about this issue and also plan for a
      future direction to improve things further in terms of THP NUMA
      locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask"").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidation of the THP allocation flags in one place was meant to
      be a cleanup that makes it easier to handle otherwise scattered code,
      which imposes a maintenance burden.  There were no real problems observed
      with the gfp mask consolidation, but the reversion was rushed through
      without a larger consensus regardless.
      
      This patch brings the consolidation back because it should make
      long term maintenance easier and future changes less error prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/asm-generic/5level-fixup.h: fix variable 'p4d' set but not used · 0cfaee2a
      Qian Cai authored
      A compiler throws a warning on an arm64 system since commit 9849a569
      ("arch, mm: convert all architectures to use 5level-fixup.h"),
      
        mm/kasan/init.c: In function 'kasan_free_p4d':
        mm/kasan/init.c:344:9: warning: variable 'p4d' set but not used [-Wunused-but-set-variable]
         p4d_t *p4d;
                ^~~
      
      because p4d_none() in "5level-fixup.h" is compiled away while it is a
      static inline function in "pgtable-nopud.h".
      
      However, if p4d_none() were converted to a static inline there, powerpc
      would be unhappy, as it reads those in assembly language in
      "arch/powerpc/include/asm/book3s/64/pgtable.h", so the static inline C
      function needs to be skipped for assembly includes.
      
      While at it, convert a few similar functions to be consistent with the
      ones in "pgtable-nopud.h".
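
      The difference between the two forms can be sketched in plain C.  This
      is only an illustration under assumed names: demo_p4d_t,
      demo_p4d_none_macro(), demo_p4d_none() and demo_count_none() are
      hypothetical stand-ins, not the definitions from the kernel headers.
      The __ASSEMBLY__ guard mirrors the idea of hiding the C function from
      assembly includes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's p4d_t; not the real definition. */
typedef struct { unsigned long val; } demo_p4d_t;

/*
 * Macro form: the preprocessor throws the argument away, so in
 *
 *	demo_p4d_t *p4d;
 *	p4d = table + i;
 *	if (demo_p4d_none_macro(*p4d)) ...
 *
 * the compiler sees "p4d" assigned but never read, which is what
 * -Wunused-but-set-variable reports.
 */
#define demo_p4d_none_macro(p4d)	0

/*
 * Static inline form: the argument is a real use of the variable.
 * The guard keeps the C function away from assembly includes.
 */
#ifndef __ASSEMBLY__
static inline int demo_p4d_none(demo_p4d_t p4d)
{
	(void)p4d;
	return 0;
}
#endif

/* Counts entries reported as "none"; always 0 with the stub above. */
static int demo_count_none(demo_p4d_t *table, int n)
{
	demo_p4d_t *p4d;
	int i, count = 0;

	assert(table != NULL && n >= 0);
	for (i = 0; i < n; i++) {
		p4d = table + i;		/* set here ... */
		if (demo_p4d_none(*p4d))	/* ... and genuinely used here */
			count++;
	}
	return count;
}
```

      Swapping demo_p4d_none() for demo_p4d_none_macro() in the loop is what
      would make "p4d" look set but unused to the compiler.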
      
      Link: http://lkml.kernel.org/r/20190806232917.881-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • seq_file: fix problem when seeking mid-record · 6a2aeab5
      NeilBrown authored
      If you use lseek or similar (e.g.  pread) to access a location in a
      seq_file file that is within a record, rather than at a record boundary,
      then the first read will return the remainder of the record, and the
      second read will return the whole of that same record (instead of the
      next record).  When seeking to a record boundary, the next record is
      correctly returned.
      
      This bug was introduced by a recent patch (identified below).  Before
      that patch, seq_read() would increment m->index when the last of the
      buffer was returned (m->count == 0).  After that patch, we rely on
      ->next to increment m->index after filling the buffer - but there was
      one place where that didn't happen.
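
      The symptom can be reproduced with a toy model.  This is not the
      seq_file code: struct toy_seq, toy_seek() and toy_read() are
      hypothetical, and the flag advance_after_partial merely stands in for
      "the index is incremented after a partial record has been returned".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_RECORDS 3

static const char *const toy_records[TOY_NR_RECORDS] = {
	"alpha\n", "beta\n", "gamma\n",
};

struct toy_seq {
	size_t index;			/* next record to emit (role of m->index) */
	size_t skip;			/* bytes of that record to skip after a seek */
	bool advance_after_partial;	/* true models the fixed behaviour */
};

/* Position the reader at byte offset pos of the concatenated records. */
static void toy_seek(struct toy_seq *m, size_t pos)
{
	assert(m != NULL);
	m->index = 0;
	while (m->index < TOY_NR_RECORDS &&
	       pos >= strlen(toy_records[m->index])) {
		pos -= strlen(toy_records[m->index]);
		m->index++;
	}
	m->skip = (m->index < TOY_NR_RECORDS) ? pos : 0;
}

/* Return the rest of the current record, or NULL at end of file. */
static const char *toy_read(struct toy_seq *m)
{
	const char *out;

	assert(m != NULL);
	if (m->index >= TOY_NR_RECORDS)
		return NULL;
	out = toy_records[m->index] + m->skip;
	/* The modelled bug: no index increment after a partial record. */
	if (m->skip == 0 || m->advance_after_partial)
		m->index++;
	m->skip = 0;
	return out;
}
```

      With advance_after_partial unset, seeking to offset 2 and reading
      twice yields the tail of the first record followed by the whole first
      record again; with it set, the second read yields the second record.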
      
      Link: https://lkml.kernel.org/lkml/877e7xl029.fsf@notabene.neil.brown.name/
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Reported-by: Sergei Turchanov <turchanov@farpost.com>
      Tested-by: Sergei Turchanov <turchanov@farpost.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: fix vmstat counters for shadow nodes · ec9f0238
      Roman Gushchin authored
      Memcg counters for shadow nodes are broken because the memcg pointer is
      obtained in the wrong way.  The following approach is used:
              virt_to_page(xa_node)->mem_cgroup
      
      Since commit 4d96ba35 ("mm: memcg/slab: stop setting
      page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
      set for slab pages, so memcg_from_slab_page() should be used instead.
      
      Also I doubt that it ever worked correctly: virt_to_head_page() should
      be used instead of virt_to_page().  Otherwise objects residing on tail
      pages are not accounted, because only the head page contains a valid
      mem_cgroup pointer.  That has been the case since the introduction of
      these counters by commit 68d48e6a ("mm: workingset: add vmstat counter
      for shadow nodes").
      
      Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/usercopy: use memory range to be accessed for wraparound check · 95153169
      Isaac J. Manjarres authored
      Currently, when checking to see if accessing n bytes starting at address
      "ptr" will cause a wraparound in the memory addresses, the check in
      check_bogus_address() adds an extra byte, which is incorrect, as the
      range of addresses that will be accessed is [ptr, ptr + (n - 1)].
      
      This can lead to a wraparound being detected incorrectly when trying
      to read 4 KB from memory that is mapped to the last possible page in
      the virtual address space, when in fact accessing that range of memory
      would not cause a wraparound to occur.
      
      Use the memory range that will actually be accessed when considering
      whether accessing a certain number of bytes will cause the memory
      address to wrap around.
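
      The off-by-one can be shown with two small userspace predicates.  This
      is a sketch based on the description above, not the body of
      check_bogus_address(); wraps_old() and wraps_new() are hypothetical
      names, and uintptr_t stands in for the kernel's unsigned long
      arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old check: includes ptr + n, one byte past the range that is accessed. */
static bool wraps_old(uintptr_t ptr, uintptr_t n)
{
	return ptr + n < ptr;
}

/* New check: the last byte accessed is ptr + (n - 1); n must be non-zero. */
static bool wraps_new(uintptr_t ptr, uintptr_t n)
{
	assert(n > 0);
	return ptr + (n - 1) < ptr;
}
```

      For a 4 KB access that starts at the first byte of the last page
      (ptr = UINTPTR_MAX - 4095, n = 4096), ptr + n overflows to 0 and the
      old check reports a wraparound, while ptr + (n - 1) is exactly
      UINTPTR_MAX and the new check does not.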
      
      Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Trilok Soni <tsoni@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kmemleak: disable early logging in case of error · fcf3a5b6
      Catalin Marinas authored
      If an error occurs during kmemleak_init() (e.g.  kmem cache cannot be
      created), kmemleak is disabled but kmemleak_early_log remains enabled.
      Subsequently, when the .init.text section is freed, the log_early()
      function no longer exists.  To avoid a page fault in such a scenario,
      ensure that kmemleak_disable() also disables early logging.
      
      Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: fix percpu free VM area search criteria · 5336e52c
      Kuppuswamy Sathyanarayanan authored
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen at https://lkml.org/lkml/2019/6/19/939.
      Another related panic that was observed:
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  The pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used to allocate congruent VM
      areas for the percpu memory allocator.  In order not to conflict with
      VMALLOC users, pcpu_get_vm_areas() allocates VM areas near the end of the
      VMALLOC space.  So the search for a free vm_area for the given requirement
      starts near VMALLOC_END and moves downwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for a free vm_area in
      pcpu_get_vm_areas() involved the following two main steps.
      
      Step 1:
          Find an aligned "base" address near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      The above change is the root cause of the spurious percpu memory
      allocation failures.  For example, consider a case where a relatively
      large vm_area (~ 30 TB) was ignored in the free vm_area search because
      it did not pass the base + end <= va->vm_end boundary check.  Ignoring
      such large free vm_areas leads to not finding a free vm_area within the
      VMALLOC_START to VMALLOC_END range, which in turn leads to allocation
      failures.
      
      So modify the search algorithm to include Step 2.2.
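
      The per-area decision described in the steps above can be sketched as
      two small functions.  This is a simplified model for one requested
      range and one candidate free area, not the pcpu_get_vm_areas() code:
      struct toy_free_area, enum toy_action and both function names are
      hypothetical, and alignment as well as the walk over multiple areas
      are left out.

```c
#include <assert.h>

/* Hypothetical free area covering [vm_start, vm_end). */
struct toy_free_area {
	unsigned long vm_start;
	unsigned long vm_end;
};

enum toy_action {
	TOY_FITS,		/* range fits, go on with the next request */
	TOY_MOVE_BASE_DOWN,	/* Step 2.2: stay in this area, lower the base */
	TOY_TRY_PREVIOUS_AREA,	/* Step 2.3: give up on this area */
};

/* Steps 2.2 and 2.3 as separate checks (the behaviour being restored). */
static enum toy_action toy_check_split(const struct toy_free_area *va,
				       unsigned long base,
				       unsigned long start, unsigned long end)
{
	assert(start < end);
	if (base + end > va->vm_end)
		return TOY_MOVE_BASE_DOWN;
	if (base + start < va->vm_start)
		return TOY_TRY_PREVIOUS_AREA;
	return TOY_FITS;
}

/* The combined check: an overshoot at the top also abandons the area. */
static enum toy_action toy_check_combined(const struct toy_free_area *va,
					  unsigned long base,
					  unsigned long start, unsigned long end)
{
	assert(start < end);
	if (base + start < va->vm_start || base + end > va->vm_end)
		return TOY_TRY_PREVIOUS_AREA;
	return TOY_FITS;
}
```

      For a large free area and a base that only overshoots its upper end,
      the split check keeps the area and asks for a lower base, whereas the
      combined check skips the area entirely, which is the failure mode
      described above.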
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: fix use after free in mem_cgroup_iter() · 54a83d6b
      Miles Chen authored
      This patch is sent to report a use after free in mem_cgroup_iter()
      after merging commit be2657752e9e ("mm: memcg: fix use after free in
      mem_cgroup_iter()").
      
      I work with the Android kernel trees (4.9 & 4.14), and commit be2657752e9e
      ("mm: memcg: fix use after free in mem_cgroup_iter()") has been merged
      into those trees.  However, I can still observe the use after free issues
      addressed by commit be2657752e9e (on low-end devices, a few times
      this month).
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
        {
              for_each_node(node)
                      free_mem_cgroup_per_node_info(memcg, node);
              free_percpu(memcg->stat);
        +     /* poison memcg before freeing it */
        +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
              kfree(memcg);
        }
      
      The coredump shows that the memcg at position=0xdbbc2a00 has been freed
      (note the 0x78 poison pattern in the dump below).
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not set up
      sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg non-hierarchical mode.  When we release a memcg:
      invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
      If non-hierarchical mode is used, invalidate_reclaim_iterators() never
      reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;
      
              for (; memcg; memcg = parent_mem_cgroup(memcg))
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
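
      The idea can be modelled in a few lines of userspace C.  This is a toy
      under stated assumptions, not the memcontrol.c code: struct toy_memcg
      keeps a single cached position instead of the per-node, per-priority
      iterators, toy_root stands in for root_mem_cgroup, and the also_root
      flag selects between the old and the proposed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_memcg {
	struct toy_memcg *parent;	 /* NULL for root and in non-hierarchical mode */
	struct toy_memcg *iter_position; /* cached reclaim iterator position */
};

static struct toy_memcg toy_root;	/* stands in for root_mem_cgroup */

/* Drop the cached position of "from" if it points at the dying memcg. */
static void toy_clear_position(struct toy_memcg *from, struct toy_memcg *dead)
{
	if (from->iter_position == dead)
		from->iter_position = NULL;
}

static void toy_invalidate_iterators(struct toy_memcg *dead, bool also_root)
{
	struct toy_memcg *memcg = dead;
	struct toy_memcg *last;

	assert(dead != NULL);
	/* Walk the dying memcg and its parents, as the existing loop does. */
	do {
		toy_clear_position(memcg, dead);
		last = memcg;
	} while ((memcg = memcg->parent));

	/* Without a parent chain the walk never reaches the root. */
	if (also_root && last != &toy_root)
		toy_clear_position(&toy_root, dead);
}
```

      With a parentless child cached in toy_root.iter_position, calling
      toy_invalidate_iterators() with also_root unset leaves the stale
      pointer in place; with also_root set the pointer is cleared before
      the child would be freed.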
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/z3fold.c: fix z3fold_destroy_pool() race condition · b997052b
      Henry Burns authored
      The constraint from the zpool use of z3fold_destroy_pool() is that
      there are no outstanding handles to memory (so no active allocations),
      but it is possible for there to be outstanding work on either of the
      two wqs in the pool.
      
      Calling z3fold_deregister_migration() before the workqueues are drained
      means that there can be allocated pages referencing a freed inode,
      causing any thread in compaction to be able to trip over the bad pointer
      in PageMovable().
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
      Fixes: 1f862989 ("mm/z3fold.c: support page migration")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/z3fold.c: fix z3fold_destroy_pool() ordering · 6051d3bd
      Henry Burns authored
      The constraint from the zpool use of z3fold_destroy_pool() is that
      there are no outstanding handles to memory (so no active allocations),
      but it is possible for there to be outstanding work on either of the
      two wqs in the pool.
      
      If there is work queued on pool->compact_wq when
      z3fold_destroy_pool() is called, it will do:
      
         z3fold_destroy_pool()
           destroy_workqueue(pool->release_wq)
           destroy_workqueue(pool->compact_wq)
             drain_workqueue(pool->compact_wq)
               do_compact_page(zhdr)
                 kref_put(&zhdr->refcount)
                   __release_z3fold_page(zhdr, ...)
                     queue_work_on(pool->release_wq, &pool->work) *BOOM*
      
      So compact_wq needs to be destroyed before release_wq.
      
      Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
      Fixes: 5d03a661 ("mm/z3fold.c: use kref to prevent page free/compact race")
      Signed-off-by: Henry Burns <henryburns@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Jonathan Adams <jwadams@google.com>
      Cc: Vitaly Vul <vitaly.vul@sony.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Henry Burns <henrywolfeburns@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind · a53190a4
      Yang Shi authored
      When running syzkaller internally, we ran into the bug below on a 4.9.x
      kernel:
      
        kernel BUG at mm/huge_memory.c:2124!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
        task: ffff880067b34900 task.stack: ffff880068998000
        RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
        Call Trace:
          split_huge_page include/linux/huge_mm.h:100 [inline]
          queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
          walk_pmd_range mm/pagewalk.c:50 [inline]
          walk_pud_range mm/pagewalk.c:90 [inline]
          walk_pgd_range mm/pagewalk.c:116 [inline]
          __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
          walk_page_range+0x154/0x370 mm/pagewalk.c:285
          queue_pages_range+0x115/0x150 mm/mempolicy.c:694
          do_mbind mm/mempolicy.c:1241 [inline]
          SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
          SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
          do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
          entry_SYSCALL_64_after_swapgs+0x5d/0xdb
        Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
        RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
         RSP <ffff88006899f980>
      
      with the below test:
      
        uint64_t r[1] = {0xffffffffffffffff};
      
        int main(void)
        {
              syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
              intptr_t res = 0;
              res = syscall(__NR_socket, 0x11, 3, 0x300);
              if (res != -1)
                      r[0] = res;
              *(uint32_t*)0x20000040 = 0x10000;
              *(uint32_t*)0x20000044 = 1;
              *(uint32_t*)0x20000048 = 0xc520;
              *(uint32_t*)0x2000004c = 1;
              syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
              syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
              *(uint64_t*)0x20000340 = 2;
              syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
              return 0;
        }
      
      Actually the test does:
      
        mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
        socket(AF_PACKET, SOCK_RAW, 768)        = 3
        setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
        mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
        mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
      
      The setsockopt() would allocate compound pages (16 pages in this test)
      for packet tx ring, then the mmap() would call packet_mmap() to map the
      pages into the user address space specified by the mmap() call.
      
      When calling mbind(), it would scan the vma to queue the pages for
      migration to the new node.  It would split any huge page since 4.9
      doesn't support THP migration; however, the packet tx ring compound
      pages are not THP and are not even movable.  So, the above bug is triggered.
      
      However, later kernels are not hit by this issue due to commit
      d44d363f ("mm: don't assume anonymous pages have SwapBacked flag"),
      which just removes the PageSwapBacked check for a different reason.
      
      But, there is a deeper issue.  According to the semantics of mbind(), it
      should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
      MPOL_MF_STRICT was also specified, but the kernel was unable to move all
      existing pages in the range.  The tx ring of the packet socket is
      definitely not movable, however, mbind() returns success for this case.
      
      Although most socket files are associated with non-movable pages, XDP
      may have movable pages from gup.  So it is not sufficient to just check
      the underlying file type of the vma in vma_migratable().
      
      Change migrate_page_add() to check if the page is movable or not; if it
      is unmovable, just return -EIO.  But do not abort the pte walk immediately,
      since there may be pages off LRU temporarily.  We should migrate other
      pages if MPOL_MF_MOVE* is specified.  Set the has_unmovable flag if some
      pages could not be moved, then return -EIO for mbind() eventually.
      
      With this change the above test would return -EIO as expected.
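
The walk described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct model_page` and `queue_pages_model()` are hypothetical names, not kernel code, and the model ignores vmas, page tables and the LRU entirely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct model_page {
    bool movable;   /* false for e.g. packet tx ring pages */
    bool queued;    /* set when the page is queued for migration */
};

/*
 * Walk every page in the range. An unmovable page is recorded but does
 * not stop the walk, so the remaining movable pages are still queued.
 * Returns 0 if every page was queued, -EIO if any page was unmovable.
 */
int queue_pages_model(struct model_page *pages, size_t n)
{
    bool has_unmovable = false;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].movable) {
            has_unmovable = true;   /* remember it, keep walking */
            continue;
        }
        pages[i].queued = true;
    }
    return has_unmovable ? -EIO : 0;
}
```

Calling `queue_pages_model()` on a range that mixes movable and unmovable pages queues the movable ones and still reports -EIO, which is the behavior the commit message asks for.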
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a53190a4
    • Yang Shi's avatar
      mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified · d8835445
      Yang Shi authored
      When both MPOL_MF_MOVE* and MPOL_MF_STRICT are specified, mbind() should
      try its best to migrate misplaced pages; if some of the pages could not be
      migrated, it should then return -EIO.
      
      There are three different sub-cases:
       1. vma is not migratable
       2. vma is migratable, but there are unmovable pages
       3. vma is migratable, pages are movable, but migrate_pages() fails
      
      If #1 happens, kernel would just abort immediately, then return -EIO,
      after a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified").
      
      If #3 happens, kernel would set policy and migrate pages with
      best-effort, but won't rollback the migrated pages and reset the policy
      back.
      
      Before that commit, they behaved in the same way.  It would be better to
      keep their behavior consistent.  But rolling back the migrated pages and
      resetting the policy does not sound feasible, so just make #1 behave the
      same as #3.
      
      Userspace will know that not everything was successfully migrated (via
      -EIO), and can take whatever steps it deems necessary - attempt
      rollback, determine which exact page(s) are violating the policy, etc.
      
      Make queue_pages_range() return 1 to indicate there are unmovable pages
      or vma is not migratable.
      
      Case #2 is not handled correctly in the current kernel; the following
      patch will fix it.
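
How the three sub-cases fold into one result can be sketched as follows. This is a simplified model, not the kernel's do_mbind(): `mbind_result_model()` and `MODEL_MF_STRICT` are hypothetical names, and passing a negative walk result through unchanged is an assumption of the model.

```c
#include <errno.h>

#define MODEL_MF_STRICT 0x1u  /* hypothetical stand-in for MPOL_MF_STRICT */

/*
 * queue_ret: result of the page walk. 1 means the range held unmovable
 *            pages or a non-migratable vma (sub-cases #1 and #2), 0 means
 *            everything was queued, negative means a hard error.
 * nr_failed: number of queued pages that migration could not move
 *            (sub-case #3).
 */
int mbind_result_model(int queue_ret, int nr_failed, unsigned int flags)
{
    if (queue_ret < 0)
        return queue_ret;   /* assumption: hard errors pass through */

    /* The policy is set and migration is attempted on a best-effort basis. */
    if (queue_ret > 0)
        return -EIO;
    if (nr_failed > 0 && (flags & MODEL_MF_STRICT))
        return -EIO;
    return 0;
}
```

With this shape, sub-cases #1 and #3 both end in -EIO after a best-effort migration, and nothing is rolled back.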
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8835445
    • Ralph Campbell's avatar
      mm/hmm: fix bad subpage pointer in try_to_unmap_one · 1de13ee5
      Ralph Campbell authored
      When migrating an anonymous private page to a ZONE_DEVICE private page,
      the source page->mapping and page->index fields are copied to the
      destination ZONE_DEVICE struct page and the page_mapcount() is
      increased.  This is so rmap_walk() can be used to unmap and migrate the
      page back to system memory.
      
      However, try_to_unmap_one() computes the subpage pointer from a swap pte,
      which yields an invalid page pointer and results in a kernel panic such
      as:
      
        BUG: unable to handle page fault for address: ffffea1fffffffc8
      
      Currently, only single pages can be migrated to device private memory so
      no subpage computation is needed and it can be set to "page".
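
A rough model of the problem, assuming the subpage is derived from the difference between the frame number decoded from the pte and the page's own frame number. `model_subpage_index()` and its parameters are hypothetical; the model works on array indices rather than struct page pointers.

```c
#include <stdbool.h>

/*
 * Index of the subpage within a hypothetical page array.
 * head_index and head_pfn describe the page being unmapped; pte_pfn is
 * the frame number decoded from the pte. For a swap-style pte that value
 * is not a real frame number, so the computed index is meaningless.
 */
unsigned long model_subpage_index(unsigned long head_index,
                                  unsigned long head_pfn,
                                  unsigned long pte_pfn,
                                  bool device_private)
{
    if (device_private)
        return head_index;  /* single page: the subpage is the page itself */
    return head_index + (pte_pfn - head_pfn);
}
```

With `device_private` set, the bogus `pte_pfn` is never consulted, which mirrors setting the subpage to "page".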
      
      [rcampbell@nvidia.com: add comment]
        Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1de13ee5
    • Ralph Campbell's avatar
      mm/hmm: fix ZONE_DEVICE anon page mapping reuse · 7ab0ad0e
      Ralph Campbell authored
      When a ZONE_DEVICE private page is freed, the page->mapping field can be
      set.  If this page is reused as an anonymous page, the previous value
      can prevent the page from being inserted into the CPU's anon rmap table.
      For example, when migrating a pte_none() page to device memory:
      
        migrate_vma(ops, vma, start, end, src, dst, private)
          migrate_vma_collect()
            src[] = MIGRATE_PFN_MIGRATE
          migrate_vma_prepare()
            /* no page to lock or isolate so OK */
          migrate_vma_unmap()
            /* no page to unmap so OK */
          ops->alloc_and_copy()
            /* driver allocates ZONE_DEVICE page for dst[] */
          migrate_vma_pages()
            migrate_vma_insert_page()
              page_add_new_anon_rmap()
                __page_set_anon_rmap()
                  /* This check sees the page's stale mapping field */
                  if (PageAnon(page))
                    return
                  /* page->mapping is not updated */
      
      The result is that the migration appears to succeed but a subsequent CPU
      fault will be unable to migrate the page back to system memory or worse.
      
      Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
      pointer data doesn't affect future page use.
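
The stale-field problem can be illustrated with a small model in which the anonymous marker is a low tag bit in the mapping field. All names here (`struct model_page`, `MODEL_MAPPING_ANON`, `model_set_anon_rmap()`, `model_free_page()`) are hypothetical and only mimic the shape of the check in the call chain above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON 0x1u  /* low bit marks an anonymous mapping */

struct model_page {
    uintptr_t mapping;  /* tagged pointer value, 0 when unused */
};

bool model_page_is_anon(const struct model_page *page)
{
    return (page->mapping & MODEL_MAPPING_ANON) != 0;
}

/* Returns true if the mapping was installed, false if it was skipped. */
bool model_set_anon_rmap(struct model_page *page, uintptr_t anon_vma)
{
    if (model_page_is_anon(page))
        return false;   /* stale tag seen: mapping is left untouched */
    page->mapping = anon_vma | MODEL_MAPPING_ANON;
    return true;
}

/* Free path; clear_mapping == true models the fix. */
void model_free_page(struct model_page *page, bool clear_mapping)
{
    if (clear_mapping)
        page->mapping = 0;
}
```

Freeing without clearing leaves the old tag in place, so the next `model_set_anon_rmap()` call silently keeps the previous value. Clearing on free makes reuse behave like a fresh page.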
      
      Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
      Fixes: b7a52310 ("mm: don't clear ->mapping in hmm_devmem_free")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ab0ad0e
    • Ralph Campbell's avatar
      mm: document zone device struct page field usage · 76470ccd
      Ralph Campbell authored
      Patch series "mm/hmm: fixes for device private page migration", v3.
      
      Testing the latest linux git tree turned up a few bugs with page
      migration to and from ZONE_DEVICE private and anonymous pages.
      Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
      mapping and index fields from the source anonymous page mapping.
      
      This patch (of 3):
      
      Struct page for ZONE_DEVICE private pages uses the page->mapping and
      page->index fields while the source anonymous pages are migrated to
      device private memory.  This is so rmap_walk() can find the page when
      migrating the ZONE_DEVICE private page back to system memory.
      ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
      page->index fields when files are mapped into a process address space.
      
      Add comments to struct page and remove the unused "_zd_pad_1" field to
      make this more clear.
      
      Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76470ccd
  2. 11 Aug, 2019 3 commits
  3. 10 Aug, 2019 21 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv/for-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 296d05cb
      Linus Torvalds authored
      Pull RISC-V updates from Paul Walmsley:
       "A few minor RISC-V updates for v5.3-rc4:
      
         - Remove __udivdi3() from the 32-bit Linux port, converting the only
           upstream user to use do_div(), per Linux policy
      
         - Convert the RISC-V standard clocksource away from per-cpu data
           structures, since only one is used by Linux, even on a multi-CPU
           system
      
         - A set of DT binding updates that remove an obsolete text binding in
           favor of a YAML binding, fix a bogus compatible string in the
           schema (thus fixing a "make dtbs_check" warning), and clarifies the
           future values expected in one of the RISC-V CPU properties"
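
For context on the first item: on 32-bit targets a plain 64-bit division is typically compiled into a call to a libgcc helper such as __udivdi3(), which is why kernel code uses do_div() instead. The sketch below only models do_div()'s calling convention (quotient stored in place, remainder returned) in userspace; `model_do_div()` is a hypothetical name, and it uses ordinary C division internally, so it is not the kernel's implementation.

```c
#include <stdint.h>

/*
 * Divide *n by base in place and return the remainder, mirroring the
 * calling convention of do_div(). base must be non-zero.
 */
uint32_t model_do_div(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}
```

A caller that wants both results passes the dividend by pointer, reads the quotient back from it, and takes the remainder from the return value.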
      
      * tag 'riscv/for-v5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        dt-bindings: riscv: fix the schema compatible string for the HiFive Unleashed board
        dt-bindings: riscv: remove obsolete cpus.txt
        RISC-V: Remove udivdi3
        riscv: delay: use do_div() instead of __udivdi3()
        dt-bindings: Update the riscv,isa string description
        RISC-V: Remove per cpu clocksource
      296d05cb
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6d8f809c
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A few fixes for x86:
      
         - Don't reset the carefully adjusted build flags for the purgatory
           and remove the unwanted flags instead. The 'reset all' approach led
           to build fails under certain circumstances.
      
         - Unbreak CLANG build of the purgatory by avoiding the builtin
           memcpy/memset implementations.
      
         - Address missing prototype warnings by including the proper header
      
         - Fix yet more fall-through issues"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/lib/cpu: Address missing prototypes warning
        x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS
        x86/purgatory: Do not use __builtin_memcpy and __builtin_memset
        x86: mtrr: cyrix: Mark expected switch fall-through
        x86/ptrace: Mark expected switch fall-through
      6d8f809c
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d2359a51
      Linus Torvalds authored
      Pull perf tooling fixes from Thomas Gleixner:
       "Perf tooling fixes all over the place:
      
         - Fix the selection of the main thread COMM in db-export
      
         - Fix the disassembly display for BPF in annotate
      
         - Fix cpumap mask setup in perf ftrace when only one CPU is present
      
         - Add the missing 'cpu_clk_unhalted.core' event
      
         - Fix CPU 0 bindings in NUMA benchmarks
      
         - Fix the module size calculations for s390
      
         - Handle the gap between kernel end and module start on s390
           correctly
      
         - Build and typo fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf pmu-events: Fix missing "cpu_clk_unhalted.core" event
        perf annotate: Fix s390 gap between kernel end and module start
        perf record: Fix module size on s390
        perf tools: Fix include paths in ui directory
        perf tools: Fix a typo in a variable name in the Documentation Makefile
        perf cpumap: Fix writing to illegal memory in handling cpumap mask
        perf ftrace: Fix failure to set cpumask when only one cpu is present
        perf db-export: Fix thread__exec_comm()
        perf annotate: Fix printing of unaugmented disassembled instructions from BPF
        perf bench numa: Fix cpu0 binding
      d2359a51
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · dcbb4a15
      Linus Torvalds authored
      Pull scheduler fixes from Thomas Gleixner:
       "Three fixlets for the scheduler:
      
         - Avoid double bandwidth accounting in the push & pull code
      
         - Use a sane FIFO priority for the Pressure Stall Information (PSI)
           thread.
      
         - Avoid permission checks when setting the scheduler params for the
           PSI thread"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/psi: Do not require setsched permission from the trigger creator
        sched/psi: Reduce psimon FIFO priority
        sched/deadline: Fix double accounting of rq/running bw in push & pull
      dcbb4a15
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ed254bb5
      Linus Torvalds authored
      Pull irq fix from Thomas Gleixner:
       "A small fix for the affinity spreading code.
      
        It failed to handle situations where a single vector was requested
        either due to only one CPU being available or vector exhaustion
        causing only a single interrupt to be granted.
      
        The fix is to simply remove the requirement in the affinity spreading
        code for more than one interrupt being available"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq/affinity: Create affinity mask for single vector
      ed254bb5
    • Linus Torvalds's avatar
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 6054f4ec
      Linus Torvalds authored
      Pull objtool warning fix from Thomas Gleixner:
       "The recent objtool fixes/enhancements unearthed an unbalanced CLAC in
        the i915 driver.
      
        Chris asked me to pick the fix up and route it through"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        drm/i915: Remove redundant user_access_end() from __copy_from_user() error path
      6054f4ec
    • Linus Torvalds's avatar
      Merge tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 829890d2
      Linus Torvalds authored
      Pull gfs2 fix from Andreas Gruenbacher:
       "Fix incorrect lseek / fiemap results"
      
      * tag 'gfs2-v5.3-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: gfs2_walk_metadata fix
      829890d2
    • Joe Perches's avatar
      Makefile: Convert -Wimplicit-fallthrough=3 to just -Wimplicit-fallthrough for clang · bfd77145
      Joe Perches authored
      A compilation -Wimplicit-fallthrough warning was enabled by commit
      a035d552 ("Makefile: Globally enable fall-through warning")
      
      Even though clang 10.0.0 does not currently support this warning without
      a patch, clang currently does not support a value for this option.
      
        Link: https://bugs.llvm.org/show_bug.cgi?id=39382
      
      The gcc default for this warning is 3 so removing the =3 has no effect
      for gcc and enables the warning for patched versions of clang.
      
      Also remove the =3 from an existing use in a parisc Makefile:
      arch/parisc/math-emu/Makefile
      Signed-off-by: Joe Perches <joe@perches.com>
      Reviewed-and-tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfd77145
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 5aa91007
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small char/misc driver fixes for 5.3-rc4.
      
        Two of these are for the habanalabs driver for issues found when
        running on a big-endian system (are they still alive?) The others are
        tiny fixes reported by people, and a MAINTAINERS update about the
        location of the fpga development tree.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        coresight: Fix DEBUG_LOCKS_WARN_ON for uninitialized attribute
        MAINTAINERS: Move linux-fpga tree to new location
        nvmem: Use the same permissions for eeprom as for nvmem
        habanalabs: fix host memory polling in BE architecture
        habanalabs: fix F/W download in BE architecture
      5aa91007
    • Linus Torvalds's avatar
      Merge tag 'driver-core-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · 36e630ed
      Linus Torvalds authored
      Pull driver core fixes from Greg KH:
       "Here are two small fixes for some driver core issues that have been
        reported. There is also a kernfs "fix" here, which was then reverted
        because it was found to cause problems in linux-next.
      
        The driver core fixes both resolve reported issues, one with gpioint
        stuff that showed up in 5.3-rc1, and the other finally (and hopefully)
        resolves a very long standing race when removing glue directories.
        It's nice to get that issue finally resolved and the developers
        involved should be applauded for the persistence it took to get this
        patch finally accepted.
      
        All of these have been in linux-next for a while with no reported
        issues. Well, the one reported issue, hence the revert :)"
      
      * tag 'driver-core-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        Revert "kernfs: fix memleak in kernel_ops_readdir()"
        kernfs: fix memleak in kernel_ops_readdir()
        driver core: Fix use-after-free and double free on glue directory
        driver core: platform: return -ENXIO for missing GpioInt
      36e630ed
    • Linus Torvalds's avatar
      Merge tag 'tty-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · c13f8670
      Linus Torvalds authored
      Pull tty fix from Greg KH:
       "Here is a single tty kgdb fix for 5.3-rc4.
      
        It fixes an annoying log message that has caused kdb to become
        useless. It's another fallout from commit ddde3c18 ("vt: More
        locking checks") which tries to enforce locking checks more strictly
        in the tty layer, unfortunately when kdb is stopped, there's no need
        for locks :)
      
        This patch has been in linux-next for a while with no reported issues"
      
      * tag 'tty-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        kgdboc: disable the console lock when in kgdb
      c13f8670
    • Linus Torvalds's avatar
      Merge tag 'staging-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 15fa98e4
      Linus Torvalds authored
      Pull staging / IIO driver fixes from Greg KH:
       "Here are some small staging and IIO driver fixes for 5.3-rc4.
      
        Nothing major, just resolutions for a number of small reported issues,
        full details in the shortlog.
      
        All have been in linux-next for a while with no reported issues"
      
      * tag 'staging-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        iio: adc: gyroadc: fix uninitialized return code
        docs: generic-counter.rst: fix broken references for ABI file
        staging: android: ion: Bail out upon SIGKILL when allocating memory.
        Staging: fbtft: Fix GPIO handling
        staging: unisys: visornic: Update the description of 'poll_for_irq()'
        staging: wilc1000: flush the workqueue before deinit the host
        staging: gasket: apex: fix copy-paste typo
        Staging: fbtft: Fix reset assertion when using gpio descriptor
        Staging: fbtft: Fix probing of gpio descriptor
        iio: imu: mpu6050: add missing available scan masks
        iio: cros_ec_accel_legacy: Fix incorrect channel setting
        IIO: Ingenic JZ47xx: Set clock divider on probe
        iio: adc: max9611: Fix misuse of GENMASK macro
      15fa98e4
    • Linus Torvalds's avatar
      Merge tag 'usb-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 1041f509
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 5.3-rc4.
      
        The "biggest" one here is moving code from one file to another in
        order to fix a long-standing race condition with the creation of sysfs
        files for USB devices. Turns out that there are now userspace tools
        out there that are hitting this long-known bug, so it's time to fix
        them. Thankfully the tool-maker in this case fixed the issue :)
      
        The other patches in here are all fixes for reported issues. Now that
        syzbot knows how to fuzz USB drivers better, and is starting to now
        fuzz the userspace facing side of them at the same time, there will be
        more and more small fixes like these coming, which is a good thing.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: setup authorized_default attributes using usb_bus_notify
        usb: iowarrior: fix deadlock on disconnect
        Revert "USB: rio500: simplify locking"
        usb: usbfs: fix double-free of usb memory upon submiturb error
        usb: yurex: Fix use-after-free in yurex_delete
        usb: typec: tcpm: Ignore unsupported/unknown alternate mode requests
        xhci: Fix NULL pointer dereference at endpoint zero reset.
        usb: host: xhci-rcar: Fix timeout in xhci_suspend()
        usb: typec: ucsi: ccg: Fix uninitilized symbol error
        usb: typec: tcpm: remove tcpm dir if no children
        usb: typec: tcpm: free log buf memory when remove debug file
        usb: typec: tcpm: Add NULL check before dereferencing config
      1041f509
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 97946f59
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
      
       - Delay acquisition of regmaps in the Aspeed G5 driver.
      
       - Make a symbol static to reduce compiler noise.
      
      * tag 'pinctrl-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: aspeed: Make aspeed_pinmux_ips static
        pinctrl: aspeed-g5: Delay acquisition of regmaps
      97946f59
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 23df57af
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Just one fix, a revert of a commit that was meant to be a minor
        improvement to some inline asm, but ended up having no real benefit
        with GCC and broke booting 32-bit machines when using Clang.
      
        Thanks to: Arnd Bergmann, Christophe Leroy, Nathan Chancellor, Nick
        Desaulniers, Segher Boessenkool"
      
      * tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        Revert "powerpc: slightly improve cache helpers"
      23df57af
    • Linus Torvalds's avatar
      Merge tag 'Wimplicit-fallthrough-5.3-rc4' of... · bf1881cf
      Linus Torvalds authored
      Merge tag 'Wimplicit-fallthrough-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
      
      Pull fall-through fixes from Gustavo A. R. Silva:
       "Mark more switch cases where we are expecting to fall through, fixing
        fall-through warnings in arm, sparc64, mips, i386 and s390"
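
As a small illustration of what "marking" means here: with -Wimplicit-fallthrough, GCC warns when one case runs into the next unless the fall-through is annotated, for example with a comment it recognizes. The function below is a made-up example (`cleanup_steps()` is hypothetical and unrelated to the listed drivers).

```c
/* Counts how many cleanup steps run for a given starting stage. */
int cleanup_steps(int stage)
{
    int steps = 0;

    switch (stage) {
    case 2:
        steps++;
        /* fall through */
    case 1:
        steps++;
        /* fall through */
    case 0:
        steps++;
        break;
    default:
        break;
    }
    return steps;
}
```

Removing either comment would make GCC report "this statement may fall through" for that case when the warning is enabled.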
      
      * tag 'Wimplicit-fallthrough-5.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
        ARM: ep93xx: Mark expected switch fall-through
        scsi: fas216: Mark expected switch fall-throughs
        pcmcia: db1xxx_ss: Mark expected switch fall-throughs
        video: fbdev: omapfb_main: Mark expected switch fall-throughs
        watchdog: riowd: Mark expected switch fall-through
        s390/net: Mark expected switch fall-throughs
        crypto: ux500/crypt: Mark expected switch fall-throughs
        watchdog: wdt977: Mark expected switch fall-through
        watchdog: scx200_wdt: Mark expected switch fall-through
        watchdog: Mark expected switch fall-throughs
        ARM: signal: Mark expected switch fall-through
        mfd: omap-usb-host: Mark expected switch fall-throughs
        mfd: db8500-prcmu: Mark expected switch fall-throughs
        ARM: OMAP: dma: Mark expected switch fall-throughs
        ARM: alignment: Mark expected switch fall-throughs
        ARM: tegra: Mark expected switch fall-through
        ARM/hw_breakpoint: Mark expected switch fall-throughs
      bf1881cf
    • Linus Torvalds's avatar
      Merge tag 'kbuild-fixes-v5.3-3' of... · 451577f3
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - revive single target %.ko
      
       - do not create built-in.a where it is unneeded
      
       - do not create modules.order where it is unneeded
      
       - show a warning if subdir-y/m is used to visit a module Makefile
      
      * tag 'kbuild-fixes-v5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: show hint if subdir-y/m is used to visit module Makefile
        kbuild: generate modules.order only in directories visited by obj-y/m
        kbuild: fix false-positive need-builtin calculation
        kbuild: revive single target %.ko
      451577f3
    • Gustavo A. R. Silva's avatar
      ARM: ep93xx: Mark expected switch fall-through · 1f7585f3
      Gustavo A. R. Silva authored
      Mark switch cases where we are expecting to fall through.
      
      Fix the following warnings (Building: arm-ep93xx_defconfig arm):
      
      arch/arm/mach-ep93xx/crunch.c: In function 'crunch_do':
      arch/arm/mach-ep93xx/crunch.c:46:3: warning: this statement may
      fall through [-Wimplicit-fallthrough=]
            memset(crunch_state, 0, sizeof(*crunch_state));
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         arch/arm/mach-ep93xx/crunch.c:53:2: note: here
           case THREAD_NOTIFY_EXIT:
           ^~~~
      
      Notice that, in this particular case, the code comment is
      modified in accordance with what GCC is expecting to find.
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      1f7585f3
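      For readers unfamiliar with the warning, the pattern these patches
      apply can be sketched as below. This is a minimal illustration, not
      the code from crunch.c: the enum, struct, and function names are
      hypothetical, and only the shape (a case that clears state and then
      deliberately continues into the next case) mirrors the warning above.
      GCC's default -Wimplicit-fallthrough level accepts a comment such as
      "Fall through" placed immediately before the next case label.

```c
#include <string.h>

/* Hypothetical events, loosely modelled on the THREAD_NOTIFY_* names in
 * the warning above; names and values are illustrative only. */
enum notify_event { NOTIFY_FLUSH, NOTIFY_EXIT, NOTIFY_SWITCH };

/* Hypothetical per-thread coprocessor state. */
struct unit_state {
	unsigned char regs[16];
	int owned;	/* non-zero while this thread owns the unit */
};

/* Returns 1 if ownership of the unit was released, 0 otherwise. */
int handle_event(enum notify_event ev, struct unit_state *st)
{
	int released = 0;

	switch (ev) {
	case NOTIFY_FLUSH:
		memset(st->regs, 0, sizeof(st->regs));
		/* Fall through */
	case NOTIFY_EXIT:
		/* Reached directly for EXIT, and via fall-through for FLUSH. */
		st->owned = 0;
		released = 1;
		break;
	case NOTIFY_SWITCH:
		/* Nothing to release on a plain context switch. */
		break;
	}

	return released;
}
```

      Without the marker comment, building this with -Wimplicit-fallthrough
      would be expected to flag the memset() line in the same way as the
      warning quoted in the commit message.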
    • Gustavo A. R. Silva's avatar
      scsi: fas216: Mark expected switch fall-throughs · fccf01b6
      Gustavo A. R. Silva authored
      Mark switch cases where we are expecting to fall through.
      
      Fix the following warnings (Building: rpc_defconfig arm):
      
      drivers/scsi/arm/fas216.c: In function ‘fas216_disconnect_intr’:
      drivers/scsi/arm/fas216.c:913:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
         if (fas216_get_last_msg(info, info->scsi.msgin_fifo) == ABORT) {
            ^
      drivers/scsi/arm/fas216.c:919:2: note: here
        default:    /* huh?     */
        ^~~~~~~
      drivers/scsi/arm/fas216.c: In function ‘fas216_kick’:
      drivers/scsi/arm/fas216.c:1959:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
         fas216_allocate_tag(info, SCpnt);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/scsi/arm/fas216.c:1960:2: note: here
        case TYPE_OTHER:
        ^~~~
      drivers/scsi/arm/fas216.c: In function ‘fas216_busservice_intr’:
      drivers/scsi/arm/fas216.c:1413:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
         fas216_stoptransfer(info);
         ^~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/scsi/arm/fas216.c:1414:2: note: here
        case STATE(STAT_STATUS, PHASE_SELSTEPS):/* Sel w/ steps -> Status       */
        ^~~~
      drivers/scsi/arm/fas216.c:1424:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
         fas216_stoptransfer(info);
         ^~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/scsi/arm/fas216.c:1425:2: note: here
        case STATE(STAT_MESGIN, PHASE_COMMAND): /* Command -> Message In */
        ^~~~
      drivers/scsi/arm/fas216.c: In function ‘fas216_funcdone_intr’:
      drivers/scsi/arm/fas216.c:1573:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
         if ((stat & STAT_BUSMASK) == STAT_MESGIN) {
            ^
      drivers/scsi/arm/fas216.c:1579:2: note: here
        default:
        ^~~~~~~
      drivers/scsi/arm/fas216.c: In function ‘fas216_handlesync’:
      drivers/scsi/arm/fas216.c:605:20: warning: this statement may fall through [-Wimplicit-fallthrough=]
         info->scsi.phase = PHASE_MSGOUT_EXPECT;
         ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
      drivers/scsi/arm/fas216.c:607:2: note: here
        case async:
        ^~~~
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      fccf01b6
    • Gustavo A. R. Silva's avatar
      pcmcia: db1xxx_ss: Mark expected switch fall-throughs · 5f163f33
      Gustavo A. R. Silva authored
      Mark switch cases where we are expecting to fall through.
      
      This patch fixes the following warnings (Building: db1xxx_defconfig mips):
      
      drivers/pcmcia/db1xxx_ss.c:257:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/pcmcia/db1xxx_ss.c:269:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      5f163f33
    • Gustavo A. R. Silva's avatar
      video: fbdev: omapfb_main: Mark expected switch fall-throughs · 70a2783c
      Gustavo A. R. Silva authored
      Mark switch cases where we are expecting to fall through.
      
      This patch fixes the following warnings (Building: omap1_defconfig arm):
      
      drivers/watchdog/wdt285.c:170:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/watchdog/ar7_wdt.c:237:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:449:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1549:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1547:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1545:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1543:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1540:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1538:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      drivers/video/fbdev/omap/omapfb_main.c:1535:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      70a2783c