    drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
    David Hildenbrand authored
    test_pages_in_a_zone() is just another nasty PFN walker that can easily
    stumble over ZONE_DEVICE memory ranges falling into the same memory block
    as ordinary System RAM: the memmap of parts of these ranges might be
    uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
    
      UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
      index 7 is out of range for type 'zone [5]'
      CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
      Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
      Call Trace:
       dump_stack+0x9a/0xf0
       ubsan_epilogue+0x9/0x7a
       __ubsan_handle_out_of_bounds+0x13a/0x181
       test_pages_in_a_zone+0x3c4/0x500
       show_valid_zones+0x1fa/0x380
       dev_attr_show+0x43/0xb0
       sysfs_kf_seq_show+0x1c5/0x440
       seq_read+0x49d/0x1190
       vfs_read+0xff/0x300
       ksys_read+0xb8/0x170
       do_syscall_64+0xa5/0x4b0
       entry_SYSCALL_64_after_hwframe+0x6a/0xdf
      RIP: 0033:0x7f01f4439b52
    
    We seem to stumble over a memmap that contains a garbage zone id.  While
    we could try inserting pfn_to_online_page() calls, it will just make
    memory offlining slower, because we use test_pages_in_a_zone() to make
    sure we're offlining pages that all belong to the same zone.
    
    Let's just get rid of this PFN walker and determine the single zone of a
    memory block -- if any -- for early memory blocks during boot.  For memory
    onlining, we know the single zone already.  Let's avoid any additional
    memmap scanning and just rely on the zone information available during
    boot.
    
    For memory hot(un)plug, we only really care about memory blocks that:
    * span a single zone (and, thereby, a single node)
    * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
    If one of these conditions is not met, we reject memory offlining.
    Hotplugged memory blocks (starting out offline) always meet both
    conditions.
    
    There are three scenarios to handle:
    
    (1) Memory hot(un)plug
    
    A memory block with zone == NULL cannot be offlined, corresponding to
    our previous test_pages_in_a_zone() check.
    
    After successful memory onlining/offlining, we simply set the zone
    accordingly.
    * Memory onlining: set the zone we just used for onlining
    * Memory offlining: set zone = NULL
    
    So a hotplugged memory block starts with zone = NULL. Once memory
    onlining is done, we set the proper zone.
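
    The lifecycle in (1) can be sketched as a small userspace model. This is
    not the kernel code: the struct layouts, the state enum and the function
    names block_online()/block_offline() are hypothetical stand-ins, and the
    -EINVAL return value for the rejected case is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct zone / struct memory_block. */
struct zone {
	const char *name;
};

enum block_state { BLOCK_OFFLINE, BLOCK_ONLINE };

struct memory_block {
	enum block_state state;
	struct zone *zone;	/* the single zone of this block, or NULL */
};

/* Onlining: the caller already knows the zone it onlines to, so store it. */
static int block_online(struct memory_block *mem, struct zone *zone)
{
	assert(zone != NULL);
	if (mem->state == BLOCK_ONLINE)
		return 0;
	mem->state = BLOCK_ONLINE;
	mem->zone = zone;
	return 0;
}

/*
 * Offlining: a block without a single known zone (multiple zones, holes,
 * ZONE_DEVICE) is rejected, replacing the old PFN-walking check.
 */
static int block_offline(struct memory_block *mem)
{
	if (mem->state == BLOCK_OFFLINE)
		return 0;
	if (!mem->zone)
		return -EINVAL;
	mem->state = BLOCK_OFFLINE;
	mem->zone = NULL;
	return 0;
}
```

    In this model a hotplugged block goes NULL -> zone -> NULL across an
    online/offline cycle, while an online boot block with zone == NULL can
    never be offlined.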
    
    (2) Boot memory with !CONFIG_NUMA
    
    We know that there is just a single pgdat, so we simply scan all zones
    of that pgdat for an intersection with our memory block PFN range when
    adding the memory block. If more than one zone intersects (e.g., DMA and
    DMA32 on x86 for the first memory block) we set zone = NULL and
    consequently mimic what test_pages_in_a_zone() used to do.
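
    The scan in (2) can be sketched as follows. This is a simplified
    userspace model under stated assumptions: the reduced struct zone and
    struct pgdat, the MAX_NR_ZONES value and the name single_zone_for_block()
    are illustrative only, and zone_intersects() is reimplemented here
    rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_ZONES 4	/* assumption for this model */

/* Reduced, hypothetical versions of the kernel structures. */
struct zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;	/* 0 means the zone is empty */
};

struct pgdat {
	struct zone node_zones[MAX_NR_ZONES];
};

/* Does [start_pfn, start_pfn + nr_pages) overlap the span of the zone? */
static int zone_intersects(const struct zone *zone,
			   unsigned long start_pfn, unsigned long nr_pages)
{
	if (zone->spanned_pages == 0)
		return 0;
	if (start_pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;
	if (start_pfn + nr_pages <= zone->zone_start_pfn)
		return 0;
	return 1;
}

/* Return the only intersecting zone, or NULL if there is none or several. */
static struct zone *single_zone_for_block(struct pgdat *pgdat,
					  unsigned long start_pfn,
					  unsigned long nr_pages)
{
	struct zone *matching = NULL;
	int i;

	assert(nr_pages > 0);
	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (!zone_intersects(zone, start_pfn, nr_pages))
			continue;
		if (matching)
			return NULL;	/* second hit: more than one zone */
		matching = zone;
	}
	return matching;
}
```

    With a DMA zone covering PFNs [0, 4096) and a DMA32 zone starting at
    PFN 4096, a first memory block of 32768 pages intersects both zones and
    therefore yields NULL in this model.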
    
    (3) Boot memory with CONFIG_NUMA
    
    At the point in time we create the memory block devices during boot, we
    don't know yet which nodes *actually* span a memory block. While we could
    scan all zones of all nodes for intersections, overlapping nodes complicate
    the situation and scanning all nodes is possibly expensive. But that
    problem has already been solved by the code that sets the node of a memory
    block and creates the link in sysfs --
    do_register_memory_block_under_node().
    
    So, we hook into the code that sets the node id for a memory block. If
    we already have a different node id set for the memory block, we know
    that multiple nodes *actually* have PFNs falling into our memory block:
    we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
    to do. If there is no node id set, we do the same as (2) for the given
    node.
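
    The hook in (3) can be sketched like this. Again a userspace model, not
    the kernel implementation: block_add_nid() and node_single_zone() are
    hypothetical names, and node_single_zone() is a stub that stands in for
    the per-node zone scan described in (2).

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE	(-1)
#define NR_NODES	2	/* assumption for this model */

struct zone {
	int nid;
};

struct memory_block {
	int nid;		/* NUMA_NO_NODE until a node is linked */
	struct zone *zone;	/* the single zone of this block, or NULL */
};

/* One zone per node; a stand-in for the real per-node zone data. */
static struct zone node_zone[NR_NODES] = { { 0 }, { 1 } };

/* Stub for the scan in (2): pretend each node has one matching zone. */
static struct zone *node_single_zone(int nid)
{
	return &node_zone[nid];
}

/* Called whenever a boot memory block gets linked to a node. */
static void block_add_nid(struct memory_block *mem, int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (mem->nid == NUMA_NO_NODE)
		mem->zone = node_single_zone(nid);	/* first node: as in (2) */
	else if (mem->nid != nid)
		mem->zone = NULL;	/* a second, different node: no single zone */
	mem->nid = nid;
}
```

    Linking the same node twice leaves the zone untouched in this model;
    only a second, different node id resets it to NULL.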
    
    Note that the call order in driver_init() is:
    -> memory_dev_init(): create memory block devices
    -> node_dev_init(): link memory block devices to the node and set the
    		    node id
    
    So in summary, we detect whether a single zone is responsible for this
    memory block; if so, we store that zone in the memory block and update
    it during memory onlining/offlining.
    
    Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Reported-by: Rafael Parra <rparrazo@redhat.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: "Rafael J. Wysocki" <rafael@kernel.org>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rafael Parra <rparrazo@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>