• Omar Sandoval's avatar
    mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore · c12cd77c
    Omar Sandoval authored
    Commit 3ee48b6a ("mm, x86: Saving vmcore with non-lazy freeing of
    vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
    lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
    purge the vmap areas instead of doing it lazily.
    
    Commit 690467c8 ("mm/vmalloc: Move draining areas out of caller
    context") moved the purging from the vunmap() caller to a worker thread.
    Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
    (possibly forever).  For example, consider the following scenario:
    
     1. Thread reads from /proc/vmcore. This eventually calls
        __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
        vmap_lazy_nr to lazy_max_pages() + 1.
    
     2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
        pages (one page plus the guard page) to the purge list and
        vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
        drain_vmap_work is scheduled.
    
     3. Thread returns from the kernel and is scheduled out.
    
     4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
        frees the 2 pages on the purge list. vmap_lazy_nr is now
        lazy_max_pages() + 1.
    
     5. This is still over the threshold, so it tries to purge areas again,
        but doesn't find anything.
    
     6. Repeat 5.
    
    If the system is running with only one CPU (which is typicial for kdump)
    and preemption is disabled, then this will never make forward progress:
    there aren't any more pages to purge, so it hangs.  If there is more
    than one CPU or preemption is enabled, then the worker thread will spin
    forever in the background.  (Note that if there were already pages to be
    purged at the time that set_iounmap_nonlazy() was called, this bug is
    avoided.)
    
    This can be reproduced with anything that reads from /proc/vmcore
    multiple times.  E.g., vmcore-dmesg /proc/vmcore.
    
    It turns out that improvements to vmap() over the years have obsoleted
    the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
    of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
    The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
    5.18-rc1 with set_iounmap_nonlazy() removed entirely:
    
        |5.17  |5.18+fix|5.18+removal
      4k|40.86s|  40.09s|      26.73s
      1M|24.47s|  23.98s|      21.84s
    
    The removal was the fastest (by a wide margin with 4k reads).  This
    patch removes set_iounmap_nonlazy().
    
    Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
    Fixes: 690467c8  ("mm/vmalloc: Move draining areas out of caller context")
    Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
    Acked-by: default avatarChris Down <chris@chrisdown.name>
    Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Acked-by: default avatarBaoquan He <bhe@redhat.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    c12cd77c
vmalloc.c 106 KB