1. 26 Aug, 2020 37 commits
  2. 21 Aug, 2020 3 commits
    • Greg Kroah-Hartman
      8d71b611
    • Oscar Salvador
      mm: Avoid calling build_all_zonelists_init under hotplug context · 23feab18
      Oscar Salvador authored
      Recently a customer of ours experienced a crash when booting the
      system while enabling memory-hotplug.
      
      The problem is that Normal zones on different nodes don't get their private
      zone->pageset allocated, and keep sharing the initial boot_pageset.
      The sharing between zones is normally safe, as the comment for boot_pageset
      explains: it is a percpu structure, manipulations are done with interrupts
      disabled, and boot_pageset is set up so that any page placed on its pcplist
      is immediately flushed to the zone's freelist, because pcp->high == 1.
      However, the hotplug operation updates pcp->high to a higher value as it
      expects to be operating on a private pageset.
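
      The pcp->high == 1 behaviour described above can be sketched as a toy model in
      plain C. The names (toy_pcp, toy_free_page) are illustrative stand-ins, not the
      kernel's actual structures: with a drain threshold of 1, nothing ever lingers on
      the pcplist, which is what makes sharing boot_pageset tolerable.

```c
#include <assert.h>

/* Toy model of a per-cpu pageset feeding a zone freelist. With
 * pcp->high == 1 (boot_pageset's setup), every freed page is drained
 * to the zone's freelist immediately. Illustrative names only. */
struct toy_pcp {
	int count;	/* pages currently sitting on the pcplist */
	int high;	/* drain threshold; boot_pageset uses 1 */
};

static void toy_free_page(struct toy_pcp *pcp, int *zone_freelist)
{
	pcp->count++;
	if (pcp->count >= pcp->high) {
		*zone_freelist += pcp->count;	/* drain to zone freelist */
		pcp->count = 0;
	}
}
```

      With high == 1 the pcplist is always empty after a free, so even a pageset
      shared between zones cannot mix pages from different zones.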
      
      The problem is in build_all_zonelists(), which is called when the first range
      of pages is onlined for the Normal zone of node X or Y:
      
      	if (system_state == SYSTEM_BOOTING) {
      		build_all_zonelists_init();
      	} else {
      	#ifdef CONFIG_MEMORY_HOTPLUG
      		if (zone)
      			setup_zone_pageset(zone);
      	#endif
      		/* we have to stop all cpus to guarantee there is no user
      		   of zonelist */
      		stop_machine(__build_all_zonelists, pgdat, NULL);
      		/* cpuset refresh routine should be here */
      	}
      
      When called during hotplug, it should execute setup_zone_pageset(zone),
      which allocates the private pageset.
      However, with memhp_default_state=online, this happens early, while
      system_state == SYSTEM_BOOTING is still true, so this step is skipped
      (and build_all_zonelists_init() is probably unsafe anyway at this point).
      
      A later hotplug operation on the same zone then leads to zone_pcp_update(zone)
      being called from online_pages(), which raises pcp->high for the shared
      boot_pageset to a value higher than 1.
      From that point, pages freed from the Node X and Node Y Normal zones can end
      up on the same pcplist, and from there they can be freed to the wrong zone's
      freelist, leading to corruption and crashes.
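
      The failure mode above can be sketched as a toy model in plain C (illustrative
      names, not kernel code): once pcp->high has been raised above 1 on a pageset
      still shared by two zones, pages from both zones batch on one pcplist, and the
      eventual drain credits all of them to a single zone's freelist.

```c
#include <assert.h>

/* Toy model of the bug: one pageset wrongly shared by two zones after
 * zone_pcp_update() raised pcp->high above 1. Illustrative only. */
struct toy_pcp {
	int count;
	int high;
};

/* The drain hands every batched page to the draining zone's freelist,
 * including pages that originally belonged to the other zone. */
static void toy_free_page(struct toy_pcp *pcp, int *this_zone_freelist)
{
	pcp->count++;
	if (pcp->count >= pcp->high) {
		*this_zone_freelist += pcp->count;
		pcp->count = 0;
	}
}
```

      In this sketch, a page freed from zone X ends up on zone Y's freelist when
      zone Y triggers the drain, which is the cross-zone corruption the patch avoids.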
      
      Please note that upstream has fixed this differently (and unintentionally) by
      adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
      That should happen before memory hotplug events even with memhp_default_state=online.
      Backporting that change would be too intrusive.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Debugged-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com> # for stable trees
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      23feab18
    • Hugh Dickins
      khugepaged: retract_page_tables() remember to test exit · dc3ff4f6
      Hugh Dickins authored
      commit 18e77600 upstream.
      
      Only once have I seen this scenario (and forgot even to notice what forced
      the eventual crash): a sequence of "BUG: Bad page map" alerts from
      vm_normal_page(), from zap_pte_range() servicing exit_mmap();
      pmd:00000000, pte values corresponding to data in physical page 0.
      
      The pte mappings being zapped in this case were supposed to be from a huge
      page of ext4 text (but could as well have been shmem): my belief is that
      it was racing with collapse_file()'s retract_page_tables(), found *pmd
      pointing to a page table, locked it, but *pmd had become 0 by the time
      start_pte was decided.
      
      In most cases, that possibility is excluded by holding mmap lock; but
      exit_mmap() proceeds without mmap lock.  Most of what's run by khugepaged
      checks khugepaged_test_exit() after acquiring mmap lock:
      khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
      for example.  But retract_page_tables() did not: fix that.
      
      The fix is for retract_page_tables() to check khugepaged_test_exit(),
      after acquiring mmap lock, before doing anything to the page table.
      Getting the mmap lock serializes with __mmput(), which briefly takes and
      drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
      mm_users makes sure we don't touch the page table once exit_mmap() might
      reach it, since exit_mmap() will be proceeding without mmap lock, not
      expecting anyone to be racing with it.
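
      The ordering the fix enforces, take the mmap lock first, then re-check for exit
      before touching anything, can be sketched in plain C. This is a userspace toy
      under stated assumptions: toy_mm, toy_test_exit and the lock flag are stand-ins
      for the kernel's mm_struct, khugepaged_test_exit() and mmap lock, not real APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* mm_users drops to 0 once the process is exiting; a stand-in for the
 * count that khugepaged_test_exit() reads. */
struct toy_mm {
	int mm_users;
	bool locked;	/* stand-in for the mmap lock */
};

static bool toy_test_exit(const struct toy_mm *mm)
{
	return mm->mm_users == 0;
}

/* Fixed flow: acquire the lock (serializing with __khugepaged_exit(),
 * which briefly takes and drops it), then check for exit before doing
 * anything to the page table. Returns true if the retract proceeded. */
static bool toy_retract_page_table(struct toy_mm *mm)
{
	mm->locked = true;
	if (toy_test_exit(mm)) {
		mm->locked = false;
		return false;	/* bail: exit_mmap() may own this mm now */
	}
	/* ... safe to clear the pmd and free the page table here ... */
	mm->locked = false;
	return true;
}
```

      The key point is the order: checking for exit before taking the lock would leave
      a window in which __mmput() proceeds concurrently; checking after the lock does not.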
      
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc3ff4f6