05 May, 2021 (40 commits)
    • mm: compaction: update the COMPACT[STALL|FAIL] events properly · 06dac2f4
      Charan Teja Reddy authored
      By definition, the COMPACT[STALL|FAIL] events need to be counted only
      when compaction was neither deferred nor skipped in at least one zone
      during direct compaction.  Yet when compaction is skipped or deferred,
      COMPACT_SKIPPED is returned and these events are still updated, which is
      wrong: COMPACT[STALL|FAIL] is counted without compaction even being
      tried.
      
      Correct this by not counting these events when COMPACT_SKIPPED is
      returned for compaction.  This also avoids an unnecessary call into
      get_page_from_freelist() when compaction was not even tried.
      
      There is one corner case where compaction is skipped but the
      COMPACTSTALL event is still counted: an IRQ came in and freed the page,
      and that page was captured via capture_control.
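
      As a minimal illustration of the accounting rule (a standalone model,
      not the kernel diff; the enum values mirror the ones named above, and
      the counters stand in for count_vm_event()):

        enum compact_result { COMPACT_SKIPPED, COMPACT_COMPLETE };

        static unsigned long compactstall, compactfail;

        /* Only account a stall/fail once compaction was actually tried. */
        static void account_direct_compaction(enum compact_result res,
                                              int got_page)
        {
                if (res == COMPACT_SKIPPED)
                        return;         /* nothing was tried: do not count */
                compactstall++;
                if (!got_page)
                        compactfail++;
        }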
      
      Link: https://lkml.kernel.org/r/1613151184-21213-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: remove unused variable sysctl_compact_memory · ef498438
      Pintu Kumar authored
      The variable sysctl_compact_memory is mostly unused in mm/compaction.c;
      it only acts as a placeholder for sysctl to store .data.
      
      But the .data itself is not needed here.
      
      So we can get rid of this variable completely and set .data to NULL.
      This also eliminates the extern declaration from the header file.  No
      functionality is broken or changed this way.
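
      For illustration, a sketch of what the corresponding ctl_table entry can
      look like with the backing variable dropped (field values other than
      .data are assumed from the existing "compact_memory" entry):

        static struct ctl_table vm_compaction_table_sketch[] = {
                {
                        .procname     = "compact_memory",
                        .data         = NULL,   /* no backing variable needed */
                        .maxlen       = sizeof(int),
                        .mode         = 0200,
                        .proc_handler = sysctl_compaction_handler,
                },
                { }
        };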
      
      Link: https://lkml.kernel.org/r/1614852224-14671-1-git-send-email-pintu@codeaurora.org
      Signed-off-by: Pintu Kumar <pintu@codeaurora.org>
      Signed-off-by: Pintu Agarwal <pintu.ping@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: shrink deferred objects proportional to priority · 18bb473e
      Yang Shi authored
      The number of deferred objects can wind up at an absurd value, which
      results in slab objects being clamped hard.  That is undesirable for
      sustaining the workingset.
      
      So shrink deferred objects proportional to reclaim priority and cap
      nr_deferred to twice the number of cache items.
      
      The idea is borrowed from Dave Chinner's patch:
        https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
      
      Tested with kernel build and vfs metadata heavy workload in our
      production environment, no regression is spotted so far.
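
      A small standalone model of the arithmetic described above (the names
      are illustrative, not the kernel's): the deferred count contributes scan
      work scaled by reclaim priority, and whatever is carried forward is
      capped at twice the number of freeable objects.

        /* Lower `priority` means more reclaim pressure, so a larger share of
         * the deferred objects is added to this scan. */
        static long deferred_scan_share(long deferred, int priority)
        {
                return deferred >> priority;
        }

        /* Carry the remaining deferred work forward, but never beyond twice
         * the number of freeable cache items. */
        static long cap_deferred(long deferred, long freeable)
        {
                if (deferred < 0)
                        deferred = 0;
                if (deferred > 2 * freeable)
                        deferred = 2 * freeable;
                return deferred;
        }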
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-14-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: reparent nr_deferred when memcg offline · a178015c
      Yang Shi authored
      Now that a shrinker's nr_deferred is per memcg for memcg-aware
      shrinkers, add it to the parent's corresponding nr_deferred when a memcg
      goes offline.
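
      A hedged sketch of the reparenting step (the helper name and array
      layout are assumptions for illustration): each child counter is added
      into the parent's counter for the same shrinker slot.

        /* Illustrative only: fold a child's per-shrinker deferred counts
         * into its parent when the child memcg goes offline. */
        static void reparent_deferred_sketch(atomic_long_t *parent,
                                             atomic_long_t *child,
                                             int shrinker_nr_max)
        {
                int i;

                for (i = 0; i < shrinker_nr_max; i++)
                        atomic_long_add(atomic_long_read(&child[i]),
                                        &parent[i]);
        }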
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-13-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers · 476b30a0
      Yang Shi authored
      Now that nr_deferred is available per memcg for memcg-aware shrinkers,
      there is no need to allocate shrinker->nr_deferred for such shrinkers
      anymore.
      
      prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg
      is disabled on the kernel command line, in which case the shrinker's
      SHRINKER_MEMCG_AWARE flag is cleared.  This keeps the implementation of
      this patch simple.
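
      A sketch of that control flow (simplified, not the exact patch): try the
      per-memcg path first; if it reports -ENOSYS, clear SHRINKER_MEMCG_AWARE
      and fall back to the plain per-shrinker array.

        static int prealloc_shrinker_sketch(struct shrinker *shrinker)
        {
                unsigned int size;
                int err;

                if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
                        err = prealloc_memcg_shrinker(shrinker);
                        if (err != -ENOSYS)
                                return err;     /* success or a real error */
                        /* !CONFIG_MEMCG, or memcg disabled on the command line */
                        shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
                }

                size = sizeof(*shrinker->nr_deferred);
                if (shrinker->flags & SHRINKER_NUMA_AWARE)
                        size *= nr_node_ids;

                shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
                return shrinker->nr_deferred ? 0 : -ENOMEM;
        }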
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-12-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: use per memcg nr_deferred of shrinker · 86750830
      Yang Shi authored
      Use the per-memcg nr_deferred for memcg-aware shrinkers.  The shrinker's
      own nr_deferred will be used in the following cases:
      
          1. Non memcg aware shrinkers
          2. !CONFIG_MEMCG
          3. memcg is disabled by boot parameter
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-11-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: add per memcg shrinker nr_deferred · 3c6f17e6
      Yang Shi authored
      Currently the number of deferred objects is tracked per shrinker, but
      some slabs, for example the vfs inode/dentry caches, are per memcg.
      This results in poor isolation among memcgs.
      
      The deferred objects are typically generated by __GFP_NOFS allocations.
      One memcg with excessive __GFP_NOFS allocations may blow up its deferred
      objects, and then other innocent memcgs may suffer from over-shrinking,
      excessive reclaim latency, etc.
      
      For example, two workloads run in memcgA and memcgB respectively, and
      the workload in B is vfs heavy.  The workload in A generates excessive
      deferred objects, and then B's vfs cache might be hit heavily (half of
      the caches dropped) by B's limit reclaim or by global reclaim.
      
      We observed this in our production environment, which was running a vfs
      heavy workload, as shown by the tracing log below:
      
        <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
        nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
        cache items 246404277 delta 31345 total_scan 123202138
        <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
        nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
        last shrinker return val 123186855
      
      The ratio of vfs cache to page cache was 10:1 on this machine, and half
      of the caches were dropped.  This also resulted in a significant amount
      of page cache being dropped due to inode eviction.
      
      Making nr_deferred per memcg for memcg-aware shrinkers solves the
      unfairness and brings better isolation.
      
      The following patch will add a child's nr_deferred to its parent memcg
      when the memcg goes offline.  To preserve nr_deferred when reparenting
      memcgs to root, the root memcg needs shrinker_info allocated too.
      
      When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the
      shrinker's own nr_deferred is used.  Non-memcg-aware shrinkers use the
      shrinker's own nr_deferred all the time.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-10-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: use a new flag to indicate shrinker is registered · 41ca668a
      Yang Shi authored
      Currently registered shrinker is indicated by non-NULL
      shrinker->nr_deferred.  This approach is fine with nr_deferred at the
      shrinker level, but the following patches will move MEMCG_AWARE
      shrinkers' nr_deferred to memcg level, so their shrinker->nr_deferred
      would always be NULL.  This would prevent the shrinkers from
      unregistering correctly.
      
      Remove SHRINKER_REGISTERING, since with the new flag we can check
      whether a shrinker was registered successfully.
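
      A hedged sketch of the idea (the flag name and its bit value are
      assumptions for illustration): mark the shrinker as registered with an
      explicit bit in shrinker->flags instead of relying on nr_deferred being
      non-NULL.

        #define SHRINKER_REGISTERED_SKETCH (1 << 30)    /* illustrative value */

        static void register_shrinker_prepared_sketch(struct shrinker *shrinker)
        {
                down_write(&shrinker_rwsem);
                list_add_tail(&shrinker->list, &shrinker_list);
                shrinker->flags |= SHRINKER_REGISTERED_SKETCH;
                up_write(&shrinker_rwsem);
        }

        static void unregister_shrinker_sketch(struct shrinker *shrinker)
        {
                if (!(shrinker->flags & SHRINKER_REGISTERED_SKETCH))
                        return;         /* never fully registered */
                down_write(&shrinker_rwsem);
                list_del(&shrinker->list);
                shrinker->flags &= ~SHRINKER_REGISTERED_SKETCH;
                up_write(&shrinker_rwsem);
        }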
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-9-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: add shrinker_info_protected() helper · 468ab843
      Yang Shi authored
      The shrinker_info is dereferenced in a couple of places via
      rcu_dereference_protected() with different calling conventions, for
      example using the mem_cgroup_nodeinfo helper or dereferencing
      memcg->nodeinfo[nid]->shrinker_info directly.  A later patch will add
      more dereference sites.
      
      So extract the dereference into a helper to make the code more readable.
      No functional change.
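
      The helper plausibly looks something like the following (a sketch based
      on the description above; using shrinker_rwsem as the lockdep condition
      is an assumption):

        static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
                                                             int nid)
        {
                return rcu_dereference_protected(
                                memcg->nodeinfo[nid]->shrinker_info,
                                lockdep_is_held(&shrinker_rwsem));
        }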
      
      [akpm@linux-foundation.org: retain rcu_dereference_protected() in free_shrinker_info(), per Hugh]
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-8-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rename shrinker_map to shrinker_info · e4262c4f
      Yang Shi authored
      The following patch is going to add nr_deferred into shrinker_map; with
      that change the structure will no longer hold only the map, so rename it
      to "shrinker_info", dropping the "memcg_" prefix.  This should make the
      patch that adds nr_deferred cleaner and easier to review.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: use kvfree_rcu instead of call_rcu · 72673e86
      Yang Shi authored
      Use kvfree_rcu() instead of call_rcu() to free the old shrinker_maps, so
      we no longer have to define a dedicated callback for call_rcu().
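
      The shape of the change, as a hedged sketch (assuming the structure
      embeds a struct rcu_head member named "rcu"):

        /* Old approach: a dedicated callback queued with call_rcu(). */
        static void free_info_rcu(struct rcu_head *head)
        {
                kvfree(container_of(head, struct shrinker_info, rcu));
        }
        /* ... call_rcu(&old->rcu, free_info_rcu); ... */

        /* New approach: no callback needed. */
        /* ... kvfree_rcu(old, rcu); ... */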
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-6-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: remove memcg_shrinker_map_size · a2fb1261
      Yang Shi authored
      Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the
      map size can be calculated from shrinker_nr_max, so keeping both seems
      unnecessary.  Remove memcg_shrinker_map_size, since shrinker_nr_max is
      also used for iterating the bitmap.
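
      A sketch of the size calculation that makes the separate variable
      unnecessary (the helper name is illustrative): the bitmap size in bytes
      can always be derived from shrinker_nr_max.

        static inline unsigned int shrinker_map_size_sketch(int nr_items)
        {
                return DIV_ROUND_UP(nr_items, BITS_PER_LONG) *
                       sizeof(unsigned long);
        }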
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-5-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation · d27cf2aa
      Yang Shi authored
      Since memcg_shrinker_map_size can only be changed while holding
      shrinker_rwsem exclusively, and the read side can be protected by
      holding the read lock, a dedicated mutex seems superfluous.
      
      Kirill Tkhai suggested using the write lock because:
      
        * We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
        * shrink_slab_memcg() dereferences it with rcu_dereference_protected(), but
          if we used the read lock in alloc_shrinker_maps(), the dereferencing
          would not actually be protected.
        * The read lock makes alloc_shrinker_info() racy against memory allocation
          failure: alloc_shrinker_info()->free_shrinker_info() may free the memory
          right after shrink_slab_memcg() dereferenced it. You may say
          shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
          but this is not something we want to rely on in the future, since it
          erodes modularity.
      
      And a test with a heavy paging workload did not show that the write lock
      makes things worse.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-4-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: consolidate shrinker_maps handling code · 2bfd3637
      Yang Shi authored
      The shrinker map management is not purely memcg specific; it sits at the
      intersection between memory cgroups and shrinkers.  It is the allocation
      and assignment of a structure, and the only memcg-specific part is that
      the map is stored in a memcg structure.  So move the shrinker_maps
      handling code into vmscan.c for tighter integration with the shrinker
      code, and remove the "memcg_" prefix.  There is no functional change.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: use nid from shrink_control for tracepoint · 8efb4b59
      Yang Shi authored
      Patch series "Make shrinker's nr_deferred memcg aware", v10.
      
      Recently a huge one-off slab drop was seen on some vfs metadata heavy
      workloads; it turned out there was a huge amount of accumulated
      nr_deferred objects seen by the shrinker.
      
      On our production machine, I saw an absurd number of nr_deferred
      objects, shown in the tracing result below:
      
        <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
        super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
        2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
        9300 cache items 1667 delta 11 total_scan 833
      
      That is 2.5 trillion deferred objects on one node; assuming all of them
      are dentries (192 bytes per object), the total size of the deferred
      objects on one node is ~480TB.  It is definitely ridiculous.
      
      I managed to reproduce this problem with kernel build workload plus
      negative dentry generator.
      
      First step, run the below kernel build test script:
      
      NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
      
      cd /root/Buildarea/linux-stable
      
      for i in `seq 1500`; do
              cgcreate -g memory:kern_build
              echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
      
              echo 3 > /proc/sys/vm/drop_caches
              cgexec -g memory:kern_build make clean > /dev/null 2>&1
              cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
      
              cgdelete -g memory:kern_build
      done
      
      Then run the below negative dentry generator script:
      
      NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
      
      mkdir /sys/fs/cgroup/memory/test
      echo $$ > /sys/fs/cgroup/memory/test/tasks
      
      for i in `seq $NR_CPUS`; do
              while true; do
                      FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                      cat $FILE 2>/dev/null
              done &
      done
      
      Then kswapd will shrink half of dentry cache in just one loop as the below
      tracing result showed:
      
      	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
      	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
      
      There were a huge number of deferred objects before the shrinker was
      called.  The behavior does match the code, but it might not be desirable
      from the user's standpoint.
      
      The excessive amount of nr_deferred might accumulate for various
      reasons, for example:
      
      * GFP_NOFS allocations
      
      * Many rounds of small scans (< scan_batch, 1024 for vfs metadata)
      
      However, the LRUs of slabs are per memcg (for memcg-aware shrinkers)
      while the deferred objects are per shrinker; this may have some bad
      effects:
      
      * Poor isolation among memcgs.  Some memcgs which happen to have
        frequent limit reclaim may get nr_deferred accumulated to a huge number,
        then other innocent memcgs may take the fall.  In our case the main
        workload was hit.
      
      * Unbounded deferred objects.  There is no cap for deferred objects, it
        can outgrow ridiculously as the tracing result showed.
      
      * Easy to get out of control.  Although shrinkers take deferred objects
        into account, the count can go out of control easily.  One
        misconfigured memcg could incur an absurd amount of deferred objects
        in a short period of time.
      
      * Various reclaim problems, i.e. over-reclaim, long reclaim latency,
        etc.  There may be hundreds of GB of slab caches for a vfs metadata
        heavy workload, and shrinking half of them may take minutes.  We
        observed latency spikes due to the prolonged reclaim.
      
      These issues also have been discussed in
      https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
      The patchset is the outcome of that discussion.
      
      So this patchset makes nr_deferred per-memcg to tackle the problem.  It
      does:
      
      * Have memcg_shrinker_deferred per memcg per node, just like what
        shrinker_map does.  It is an atomic_long_t array instead, and each
        element represents one shrinker even if that shrinker is not memcg
        aware; this simplifies the implementation.  For memcg-aware shrinkers,
        the deferred objects are simply accumulated in their own memcg, and
        the shrinkers just see nr_deferred from their own memcg.  Non memcg
        aware shrinkers still use the global nr_deferred from struct shrinker.
      
      * Once the memcg is offlined, its nr_deferred will be reparented to its
        parent along with LRUs.
      
      * The root memcg has memcg_shrinker_deferred array too.  It simplifies
        the handling of reparenting to root memcg.
      
      * Cap nr_deferred to 2x of the length of lru.  The idea is borrowed from
        Dave Chinner's series
        (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
      
      The downside is each memcg has to allocate extra memory to store the
      nr_deferred array.  On our production environment, there are typically
      around 40 shrinkers, so each memcg needs ~320 bytes.  10K memcgs would
      need ~3.2MB memory.  It seems fine.
      
      We have been running the patched kernel on some hosts of our fleet (test
      and production) for months, it works very well.  The monitor data shows
      the working set is sustained as expected.
      
      This patch (of 13):
      
      The tracepoint's nid should show which node the shrink happens on.  The
      start tracepoint uses the nid from shrinkctl, but the nid might be set
      to 0 before the end tracepoint if the shrinker is not NUMA aware, so the
      tracing log may show the shrink starting on one node but ending on
      another, which is confusing.  The following patch will also stop using
      nid directly in do_shrink_slab(), so this patch helps clean up the code.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-1-shy828301@gmail.com
      Link: https://lkml.kernel.org/r/20210311190845.9708-2-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks · 202e35db
      Dave Hansen authored
      RECLAIM_ZONE was assumed to be unused because it was never explicitly
      used in the kernel.  However, there were a number of places where it was
      checked implicitly by checking 'node_reclaim_mode' for a zero value.
      
      These zero checks are not great because it is not obvious what a zero
      mode *means* in the code.  Replace them with a helper which makes it
      more obvious: node_reclaim_enabled().
      
      This helper also provides a handy place to explicitly check the
      RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
      obvious where the bit can affect behavior.
      
      This should have no functional impact.
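
      The helper plausibly looks like this (a sketch; the exact set of
      RECLAIM_* bit names used here is an assumption):

        static inline bool node_reclaim_enabled(void)
        {
                /* Is any node_reclaim_mode bit set? */
                return node_reclaim_mode &
                       (RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP);
        }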
      
      Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: move RECLAIM* bits to uapi header · b6676de8
      Dave Hansen authored
      It is currently not obvious that the RECLAIM_* bits are part of the uapi
      since they are defined in vmscan.c.  Move them to a uapi header to make it
      obvious.
      
      This should have no functional impact.
      
      Link: https://lkml.kernel.org/r/20210219172557.08074910@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd/selftests: add test exercising minor fault handling · f0fa9433
      Axel Rasmussen authored
      Fix a dormant bug in userfaultfd_events_test(), where we did `return
      faulting_process(0)` instead of `exit(faulting_process(0))`.  This
      caused the forked process to keep running, trying to execute any further
      test cases after the events test in parallel with the "real" process.
      
      Add a simple test case which exercises minor faults.  In short, it does
      the following:
      
      1. "Sets up" an area (area_dst) and a second shared mapping to the same
         underlying pages (area_dst_alias).
      
      2. Register one of these areas with userfaultfd, in minor fault mode.
      
      3. Start a second thread to handle any minor faults.
      
      4. Populate the underlying pages with the non-UFFD-registered side of
         the mapping. Basically, memset() each page with some arbitrary
         contents.
      
      5. Then, using the UFFD-registered mapping, read all of the page
         contents, asserting that the contents match expectations (we expect
         the minor fault handling thread can modify the page contents before
         resolving the fault).
      
      The minor fault handling thread, upon receiving an event, flips all the
      bits (~) in that page, just to prove that it can modify it in some
      arbitrary way.  Then it issues a UFFDIO_CONTINUE ioctl, to setup the
      mapping and resolve the fault.  The reading thread should wake up and
      see this modification.
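
      A hedged userspace sketch of that handler-thread step (simplified;
      variable names are illustrative): modify the page through the non-UFFD
      alias, then resolve the fault with UFFDIO_CONTINUE.

        /* Resolve one minor fault at page-aligned `addr` by flipping the bits
         * through the non-UFFD alias, then issuing UFFDIO_CONTINUE on the
         * UFFD-registered mapping. */
        static int resolve_minor_fault(int uffd, char *alias, unsigned long addr,
                                       unsigned long page_size)
        {
                struct uffdio_continue cont;
                unsigned long i;

                for (i = 0; i < page_size; i++)
                        alias[i] = ~alias[i];   /* prove we can modify the page */

                memset(&cont, 0, sizeof(cont));
                cont.range.start = addr;
                cont.range.len = page_size;
                return ioctl(uffd, UFFDIO_CONTINUE, &cont);
        }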
      
      Currently the minor fault test is only enabled in hugetlb_shared mode,
      as this is the only configuration the kernel feature supports.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-7-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: update documentation to describe minor fault handling · b8da5cd4
      Axel Rasmussen authored
      Reword / reorganize things a little bit into "lists", so new features /
      modes / ioctls can sort of just be appended.
      
      Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used to
      intercept and resolve minor faults.  Make it clear that COPY and ZEROPAGE
      are used for MISSING faults, whereas CONTINUE is used for MINOR faults.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-6-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Axel Rasmussen authored
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does something very
      close to what we want, if an existing page is provided via `struct page
      **pagep`.  We already special-case the behavior a bit for the
      UFFDIO_ZEROPAGE case, so just extend that design: add an enum for the
      three modes of operation, and make the small adjustments needed for the
      MCOPY_ATOMIC_CONTINUE case.  (Basically, look up the existing page, and
      avoid adding the existing page to the page cache or calling
      set_page_huge_active() on it.)
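
      For illustration, a hedged sketch of such an enum (the CONTINUE name
      follows the text above; the other two names are assumptions):

        /* The three modes of operation for the atomic-copy helpers. */
        enum mcopy_atomic_mode {
                MCOPY_ATOMIC_NORMAL,    /* UFFDIO_COPY: copy in a new page */
                MCOPY_ATOMIC_ZEROPAGE,  /* UFFDIO_ZEROPAGE: map the zero page */
                MCOPY_ATOMIC_CONTINUE,  /* UFFDIO_CONTINUE: use the existing page */
        };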
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled · 714c1891
      Axel Rasmussen authored
      For background, mm/userfaultfd.c provides a general mcopy_atomic
      implementation.  But some types of memory (i.e., hugetlb and shmem) need
      a slightly different implementation, so they provide their own helpers
      for this.  In other words, userfaultfd is the only caller of these
      functions.
      
      This patch achieves two things:
      
      1. Don't spend time compiling code which will end up never being
         referenced anyway (a small build time optimization).
      
      2. In patches later in this series, we extend the signature of these
         helpers with UFFD-specific state (a mode enumeration).  Once this
         happens, we *have to* either not compile the helpers, or
         unconditionally define the UFFD-only state (which seems messier to me).
         This includes the declarations in the headers, as otherwise they'd
         yield warnings about implicitly defining the type of those arguments.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: disable huge PMD sharing for MINOR registered VMAs · 0d9cadab
      Axel Rasmussen authored
      As the comment says: for the MINOR fault use case, although the page
      might be present and populated in the other (non-UFFD-registered) half
      of the mapping, it may be out of date, and we explicitly want userspace
      to get a minor fault so it can check and potentially update the page's
      contents.
      
      Huge PMD sharing would prevent these faults from occurring for suitably
      aligned areas, so disable it upon UFFD registration.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: add minor fault registration mode · 7677f7fd
      Axel Rasmussen authored
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
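
      For illustration, a hedged userspace sketch of that setup (error
      handling omitted; the hugetlbfs mapping itself is assumed to already
      exist): enable the feature via UFFDIO_API, then register the area in
      MINOR mode.

        /* Open a userfaultfd, enable minor-fault events for hugetlbfs, and
         * register [addr, addr + len) in MINOR mode. */
        static int setup_uffd_minor(void *addr, unsigned long len)
        {
                struct uffdio_api api;
                struct uffdio_register reg;
                int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

                memset(&api, 0, sizeof(api));
                api.api = UFFD_API;
                api.features = UFFD_FEATURE_MINOR_HUGETLBFS;
                ioctl(uffd, UFFDIO_API, &api);

                memset(&reg, 0, sizeof(reg));
                reg.range.start = (unsigned long)addr;
                reg.range.len = len;
                reg.mode = UFFDIO_REGISTER_MODE_MINOR;
                ioctl(uffd, UFFDIO_REGISTER, &reg);

                return uffd;
        }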
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires that we allocate a VM_* flag for the
      new registration mode.  On 32-bit systems, there are no unused bits, so
      this feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig · eb14d4ee
      Oscar Salvador authored
      pfn_range_valid_contig() bails out when it finds an in-use page or a
      hugetlb page, among other things.  We can drop the in-use page check since
      __alloc_contig_pages can migrate away those pages, and the hugetlb page
      check can go too since isolate_migratepages_range is now capable of
      dealing with hugetlb pages.  Either way, those checks are racy so let the
      end function handle it when the time comes.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-8-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make alloc_contig_range handle in-use hugetlb pages · ae37c7ff
      Oscar Salvador authored
      alloc_contig_range() will fail if it finds a HugeTLB page within the
      range, without a chance to handle it.  Since HugeTLB pages can be
      migrated like any LRU or Movable page, it does not make sense to bail
      out without trying.  Enable the interface to recognize in-use HugeTLB
      pages so we can migrate them, giving the call much better chances to
      succeed.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make alloc_contig_range handle free hugetlb pages · 369fa227
      Oscar Salvador authored
      alloc_contig_range will fail if it ever sees a HugeTLB page within the
      range we are trying to allocate, even when that page is free and can be
      easily reallocated.
      
      This has proved to be problematic for some users of alloc_contig_range,
      e.g. CMA and virtio-mem, where the call would fail even when those pages
      lie in ZONE_MOVABLE and are free.
      
      We can do better by trying to replace such a page.
      
      Free hugepages are tricky to handle: so that no userspace application
      notices the disruption, we need to replace the current free hugepage
      with a new one.
      
      In order to do that, a new function called alloc_and_dissolve_huge_page is
      introduced.  This function will first try to get a new fresh hugepage, and
      if it succeeds, it will replace the old one in the free hugepage pool.
      
      The free page replacement is done under hugetlb_lock, so no external users
      of hugetlb will notice the change.  To allocate the new huge page, we use
      alloc_buddy_huge_page(), so we do not have to deal with any counters, and
      prep_new_huge_page() is not called.  This is valuable because if we
      need to free the new page, we only need to call __free_pages().
      
      Once we know that the page to be replaced is a genuine 0-refcounted huge
      page, we remove the old page from the freelist by remove_hugetlb_page().
      Then, we can call __prep_new_huge_page() and
      __prep_account_new_huge_page() for the new huge page to properly
      initialize it and increment the hstate->nr_huge_pages counter (previously
      decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
      enqueue_huge_page() and it is ready to be used.
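
      Putting those steps together, a simplified sketch of the new helper
      (error handling trimmed; the PageHugeFreed name follows the changelog,
      and function signatures are illustrative):

        static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
        {
                gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
                int nid = page_to_nid(old_page);
                struct page *new_page;

                /* A fresh, not yet prepped, hugepage-sized buddy allocation. */
                new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
                if (!new_page)
                        return -ENOMEM;

        retry:
                spin_lock_irq(&hugetlb_lock);
                if (!PageHuge(old_page)) {
                        /* Freed in the meantime (case 1 in the diagram below). */
                        spin_unlock_irq(&hugetlb_lock);
                        __free_pages(new_page, huge_page_order(h));
                        return 0;
                } else if (!PageHugeFreed(old_page)) {
                        /* Freeing is in flight: drop the lock and retry. */
                        spin_unlock_irq(&hugetlb_lock);
                        cond_resched();
                        goto retry;
                }

                /* Genuine 0-refcounted free hugepage: swap it for the new one. */
                remove_hugetlb_page(h, old_page, false);
                __prep_new_huge_page(new_page);
                __prep_account_new_huge_page(h, nid);
                enqueue_huge_page(h, new_page);
                spin_unlock_irq(&hugetlb_lock);

                /* old_page is no longer a hugetlb page; give it back to buddy. */
                update_and_free_page(h, old_page);
                return 0;
        }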
      
      There is one tricky case when the page's refcount is 0 because it is in
      the process of being released.  A missing PageHugeFreed bit will tell us
      that freeing is in flight, so we retry after dropping the hugetlb_lock.  The
      race window should be small and the next retry should make forward
      progress.
      
      E.g:
      
      CPU0                              CPU1
      free_huge_page()                  isolate_or_dissolve_huge_page
                                          PageHuge() == T
                                          alloc_and_dissolve_huge_page
                                            alloc_buddy_huge_page()
                                            spin_lock_irq(hugetlb_lock)
                                            // PageHuge() && !PageHugeFreed &&
                                            // !PageCount()
                                            spin_unlock_irq(hugetlb_lock)
        spin_lock_irq(hugetlb_lock)
        1) update_and_free_page
             PageHuge() == F
             __free_pages()
        2) enqueue_huge_page
             SetPageHugeFreed()
        spin_unlock_irq(&hugetlb_lock)
                                          spin_lock_irq(hugetlb_lock)
                                            1) PageHuge() == F (freed by case#1 from CPU0)
                                            2) PageHuge() == T
                                                 PageHugeFreed() == T
                                                 - proceed with replacing the page
      
      In the case above we retry, as the race window is quite small and we have
      a high chance of succeeding next time.
      
      With regard to the allocation, we restrict it to the node the page belongs
      to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
      
      Note that gigantic hugetlb pages are fenced off since there is a cyclic
      dependency between them and alloc_contig_range.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      369fa227
    • Oscar Salvador's avatar
      mm,hugetlb: split prep_new_huge_page functionality · d3d99fcc
      Oscar Salvador authored
      Currently, prep_new_huge_page() performs two functions.  It sets the
      right state for a new hugetlb, and increases the hstate's counters to
      account for the new page.
      
      Let us split its functionality into two separate functions, decoupling
      the handling of the counters from initializing a hugepage.  The outcome
      is having __prep_new_huge_page(), which only initializes the page, and
      __prep_account_new_huge_page(), which adds the new page to the hstate's
      counters.
      
      This allows us to set up a hugetlb page without having to worry about
      the counters or locking.  It will prove useful in the next patch.
      prep_new_huge_page() still calls both functions.
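
      A condensed sketch of the resulting split (the actual initialization does
      a bit more; this is illustrative only):

        /* Page-local initialization; no hstate counters touched. */
        static void __prep_new_huge_page(struct page *page)
        {
                INIT_LIST_HEAD(&page->lru);
                set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
                set_hugetlb_cgroup(page, NULL);
        }

        /* Counter update only; must be called with hugetlb_lock held. */
        static void __prep_account_new_huge_page(struct hstate *h, int nid)
        {
                h->nr_huge_pages++;
                h->nr_huge_pages_node[nid]++;
        }

        static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
        {
                __prep_new_huge_page(page);
                spin_lock_irq(&hugetlb_lock);
                __prep_account_new_huge_page(h, nid);
                spin_unlock_irq(&hugetlb_lock);
        }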
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-5-osalvador@suse.de
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3d99fcc
    • Oscar Salvador's avatar
      mm,hugetlb: drop clearing of flag from prep_new_huge_page · 9f27b34f
      Oscar Salvador authored
      Pages allocated via the page allocator or CMA get their private field
      cleared by means of post_alloc_hook().
      
      Pages allocated during boot, that is directly from the memblock
      allocator, get cleared by paging_init()-> ..  ->memmap_init_zone-> ..
      ->__init_single_page() before any memblock allocation.
      
      On these grounds, let us remove the clearing of the flag from
      prep_new_huge_page(), as it is not needed.  It was a leftover from
      commit 6c037149 ("hugetlb: convert PageHugeFreed to HPageFreed
      flag").
      
      Previously the explicit clearing was necessary because compound
      allocations do not get this initialization (see prep_compound_page).
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-4-osalvador@suse.de
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f27b34f
    • Oscar Salvador's avatar
      mm,compaction: let isolate_migratepages_{range,block} return error codes · c2ad7a1f
      Oscar Salvador authored
      Currently, isolate_migratepages_{range,block} and their callers use a pfn
      == 0 vs pfn != 0 scheme to let the caller know whether there was any error
      during isolation.
      
      This does not work as soon as we need to start reporting different error
      codes and make sure we pass them down the chain, so they are properly
      interpreted by functions such as alloc_contig_range.
      
      Let us rework isolate_migratepages_{range,block} so we can report error
      codes.  Since isolate_migratepages_block will stop returning the next pfn
      to be scanned, we reuse the cc->migrate_pfn field to keep track of that.
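
      In terms of the calling convention, the change amounts to something like
      this (illustrative fragment; variable names are assumptions, not the
      exact upstream diff):

        /* Before: pfn-or-zero return value. */
        pfn = isolate_migratepages_block(cc, pfn, block_end_pfn, isolate_mode);
        if (!pfn)
                return -EINTR;          /* the only error we could express */

        /* After: proper error codes; progress lives in cc->migrate_pfn. */
        ret = isolate_migratepages_block(cc, pfn, block_end_pfn, isolate_mode);
        if (ret)
                return ret;             /* e.g. -EINTR or -ENOMEM */
        pfn = cc->migrate_pfn;          /* next pfn to be scanned */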
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-3-osalvador@suse.de
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c2ad7a1f
    • Oscar Salvador's avatar
      mm,page_alloc: bail out earlier on -ENOMEM in alloc_contig_migrate_range · c8e28b47
      Oscar Salvador authored
      Patch series "Make alloc_contig_range handle Hugetlb pages", v10.
      
      alloc_contig_range lacks the ability to handle HugeTLB pages.  This can
      be problematic for some users, e.g: CMA and virtio-mem, where those
      users will fail the call if alloc_contig_range ever sees a HugeTLB page,
      even when those pages lay in ZONE_MOVABLE and are free.  That problem
      can be easily solved by replacing the page in the free hugepage pool.
      
      In-use HugeTLB are no exception though, as those can be isolated and
      migrated as any other LRU or Movable page.
      
      This aims to improve alloc_contig_range->isolate_migratepages_block, so
      that HugeTLB pages can be recognized and handled.
      
      Since we also need to start reporting errors down the chain (e.g:
      -ENOMEM due to not being able to allocate a new hugetlb page), the
      isolate_migratepages_{range,block} interfaces need to change to start
      reporting error codes instead of the pfn == 0 vs pfn != 0 scheme they are
      using right now.  From now on, isolate_migratepages_block will not
      return the next pfn to be scanned anymore, but -EINTR, -ENOMEM or 0; the
      next pfn to be scanned will instead be recorded in the cc->migrate_pfn
      field (as is already done in isolate_migratepages_range()).
      
      Below is an insight from David (thanks), where the problem can clearly be
      seen:
      
       "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
        ZONE_MOVABLE. Allocate 512 huge pages.
      
        [root@localhost ~]# cat /proc/meminfo
        MemTotal:        5061512 kB
        MemFree:         3319396 kB
        MemAvailable:    3457144 kB
        ...
        HugePages_Total:     512
        HugePages_Free:      512
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
      
        The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
        1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
      
        [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
        [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
        [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
        [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
        [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
        [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
        [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
        [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
        [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
        [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"
      
      And then with this patchset running:
      
       "Same experiment with ZONE_MOVABLE:
      
        a) Free huge pages: all memory can get unplugged again.
      
        b) Allocated/populated but idle huge pages: all memory can get unplugged
           again.
      
        c) Allocated/populated but all 512 huge pages are read/written in a
           loop: all memory can get unplugged again, but I get a single
      
           [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy
      
           Most probably because it happened to try migrating a huge page
           while it was busy.  As virtio-mem retries on ZONE_MOVABLE a couple of
           times, it can deal with this temporary failure.
      
        Last but not least, I did something extreme:
      
        # cat /proc/meminfo
        MemTotal:        5061568 kB
        MemFree:          186560 kB
        MemAvailable:     354524 kB
        ...
        HugePages_Total:    2048
        HugePages_Free:     2048
        HugePages_Rsvd:        0
        HugePages_Surp:        0
      
        Triggering unplug would require to dissolve+alloc - which now fails
        when trying to allocate an additional ~512 huge pages (1G).
      
        As expected, I can properly see memory unplug not fully succeeding.  +
        I get a fairly continuous stream of
      
        [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
        ...
      
        But more importantly, the hugepage count remains stable, as configured
        by the admin (me):
      
        HugePages_Total:    2048
        HugePages_Free:     2048
        HugePages_Rsvd:        0
        HugePages_Surp:        0"
      
      This patch (of 7):
      
      Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or
      -EBUSY, and report them down the chain.  The problem is that when
      migrate_pages() reports -ENOMEM, we keep going till we exhaust all the
      try-attempts (5 at the moment) instead of bailing out.
      
      migrate_pages() bails out right away on -ENOMEM because it is considered a
      fatal error.  Do the same here instead of continuing to retry.  Note
      that this is not fixing a real issue, just a cosmetic change, although we
      can save some cycles by backing off earlier.
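
      A trimmed sketch of how the retry loop in __alloc_contig_migrate_range()
      ends up looking once the whole series is applied (reclaim of clean pages
      and final cleanup omitted; details simplified):

        /* Sketch: cc, start and end come from alloc_contig_range(). */
        unsigned long pfn = start;
        unsigned int tries = 0;
        int ret = 0;
        struct migration_target_control mtc = {
                .nid = zone_to_nid(cc->zone),
                .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
        };

        while (pfn < end || !list_empty(&cc->migratepages)) {
                if (fatal_signal_pending(current)) {
                        ret = -EINTR;
                        break;
                }

                if (list_empty(&cc->migratepages)) {
                        cc->nr_migratepages = 0;
                        ret = isolate_migratepages_range(cc, pfn, end);
                        if (ret)
                                break;
                        pfn = cc->migrate_pfn;
                        tries = 0;
                } else if (++tries == 5) {
                        ret = -EBUSY;
                        break;
                }

                ret = migrate_pages(&cc->migratepages, alloc_migration_target,
                                    NULL, (unsigned long)&mtc, cc->mode,
                                    MR_CONTIG_RANGE);
                /*
                 * migrate_pages() treats -ENOMEM as fatal; bail out right away
                 * instead of burning the remaining retries.
                 */
                if (ret == -ENOMEM)
                        break;
        }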
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20210419075413.1064-2-osalvador@suse.de
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8e28b47
    • Mike Kravetz's avatar
      hugetlb: add lockdep_assert_held() calls for hugetlb_lock · 9487ca60
      Mike Kravetz authored
      After making the hugetlb lock irq safe and separating some functionality
      done under the lock, add lockdep_assert_held() calls to help verify
      locking.
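
      For example (a sketch; the exact set of annotated helpers is in the patch
      itself, and the flag bookkeeping is abbreviated):

        static void enqueue_huge_page(struct hstate *h, struct page *page)
        {
                int nid = page_to_nid(page);

                /* Cheap with lockdep enabled, free otherwise; catches unlocked callers. */
                lockdep_assert_held(&hugetlb_lock);

                list_move(&page->lru, &h->hugepage_freelists[nid]);
                h->free_huge_pages++;
                h->free_huge_pages_node[nid]++;
        }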
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-9-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9487ca60
    • Mike Kravetz's avatar
      hugetlb: make free_huge_page irq safe · db71ef79
      Mike Kravetz authored
      Commit c77c0a8a ("mm/hugetlb: defer freeing of huge pages if in
      non-task context") was added to address the issue of free_huge_page being
      called from irq context.  That commit hands off free_huge_page processing
      to a workqueue if !in_task.  However, this doesn't cover all the cases, as
      pointed out by the 0day bot lockdep report [1].
      
      :  Possible interrupt unsafe locking scenario:
      :
      :        CPU0                    CPU1
      :        ----                    ----
      :   lock(hugetlb_lock);
      :                                local_irq_disable();
      :                                lock(slock-AF_INET);
      :                                lock(hugetlb_lock);
      :   <Interrupt>
      :     lock(slock-AF_INET);
      
      Shakeel later explained that this is very likely the TCP TX zerocopy from
      hugetlb pages scenario, where the networking code drops the last reference
      to a hugetlb page while IRQs are disabled.  The hugetlb freeing path doesn't
      disable IRQs while holding hugetlb_lock, so a lock dependency chain can lead
      to a deadlock.
      
      This commit addresses the issue by doing the following:
       - Make hugetlb_lock irq safe. This is mostly a simple process of
         changing spin_*lock calls to spin_*lock_irq* calls.
       - Make subpool lock irq safe in a similar manner.
       - Revert the !in_task check and workqueue handoff.
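
      The first two items are mostly mechanical conversions along these lines
      (illustrative fragments, not the full diff):

        /* Plain process context, IRQs known to be enabled: */
        spin_lock_irq(&hugetlb_lock);
        /* ... update hstate counters, free lists, ... */
        spin_unlock_irq(&hugetlb_lock);

        /* Paths that may run with IRQs already disabled (e.g. free_huge_page()): */
        unsigned long flags;

        spin_lock_irqsave(&hugetlb_lock, flags);
        /* ... */
        spin_unlock_irqrestore(&hugetlb_lock, flags);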
      
      [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-8-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db71ef79
    • Mike Kravetz's avatar
      hugetlb: change free_pool_huge_page to remove_pool_huge_page · 10c6ec49
      Mike Kravetz authored
      free_pool_huge_page was called with hugetlb_lock held.  It would remove
      a hugetlb page, and then free the corresponding pages to the lower level
      allocators such as buddy.  free_pool_huge_page was called in a loop to
      remove hugetlb pages and these loops could hold the hugetlb_lock for a
      considerable time.
      
      Create new routine remove_pool_huge_page to replace free_pool_huge_page.
      remove_pool_huge_page will remove the hugetlb page, and it must be
      called with the hugetlb_lock held.  It will return the removed page and
      it is the responsibility of the caller to free the page to the lower
      level allocators.  The hugetlb_lock is dropped before freeing to these
      allocators which results in shorter lock hold times.
      
      Add new helper routine to call update_and_free_page for a list of pages.
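
      A sketch of the resulting pattern (the bulk helper is called
      update_and_free_pages_bulk upstream; nr_to_free is a hypothetical count
      used only for illustration, and error paths are omitted):

        static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
        {
                struct page *page, *t_page;

                list_for_each_entry_safe(page, t_page, list, lru) {
                        update_and_free_page(h, page);
                        cond_resched();
                }
        }

        /* Caller pattern: collect pages under the lock, free them outside it. */
        LIST_HEAD(page_list);

        spin_lock_irq(&hugetlb_lock);
        while (nr_to_free--) {
                struct page *page = remove_pool_huge_page(h, nodes_allowed, 0);

                if (!page)
                        break;
                list_add(&page->lru, &page_list);
        }
        spin_unlock_irq(&hugetlb_lock);

        update_and_free_pages_bulk(h, &page_list);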
      
      Note: Some changes to the routine return_unused_surplus_pages are in
      need of explanation.  Commit e5bbc8a6 ("mm/hugetlb.c: fix
      reservation race when freeing surplus pages") modified this routine to
      address a race which could occur when dropping the hugetlb_lock in the
      loop that removes pool pages.  Accounting changes introduced in that
      commit were subtle and took some thought to understand.  This commit
      removes the cond_resched_lock() and the potential race.  Therefore,
      remove the subtle code and restore the more straightforward accounting,
      effectively reverting that commit.
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-7-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10c6ec49
    • Mike Kravetz's avatar
      hugetlb: call update_and_free_page without hugetlb_lock · 1121828a
      Mike Kravetz authored
      With the introduction of remove_hugetlb_page(), there is no need for
      update_and_free_page to hold the hugetlb lock.  Change all callers to
      drop the lock before calling.
      
      With additional code modifications, this will allow loops which decrease
      the huge page pool to drop the hugetlb_lock with each page to reduce
      long hold times.
      
      The ugly unlock/lock cycle in free_pool_huge_page will be removed in a
      subsequent patch which restructures free_pool_huge_page.
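
      The interim pattern in free_pool_huge_page() looks roughly like this
      (sketch; the lock flavor shown is the one in use at this point in the
      series, before the later irq-safe conversion):

        if (page) {
                remove_hugetlb_page(h, page, acct_surplus);
                /*
                 * Temporarily drop the lock: update_and_free_page() no longer
                 * needs it.  This unlock/lock cycle goes away in the next patch.
                 */
                spin_unlock(&hugetlb_lock);
                update_and_free_page(h, page);
                spin_lock(&hugetlb_lock);
                ret = 1;
        }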
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-6-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1121828a
    • Mike Kravetz's avatar
      hugetlb: create remove_hugetlb_page() to separate functionality · 6eb4e88a
      Mike Kravetz authored
      The new remove_hugetlb_page() routine is designed to remove a hugetlb
      page from hugetlbfs processing.  It will remove the page from the active
      or free list, update global counters and set the compound page
      destructor to NULL so that PageHuge() will return false for the 'page'.
      After this call, the 'page' can be treated as a normal compound page or
      a collection of base size pages.
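
      A condensed sketch of the new routine (surplus accounting and a few
      details omitted; flag and destructor names are illustrative):

        static void remove_hugetlb_page(struct hstate *h, struct page *page,
                                        bool adjust_surplus)
        {
                int nid = page_to_nid(page);

                /* Caller holds hugetlb_lock. */

                /* Off the active or free list. */
                list_del(&page->lru);
                if (HPageFreed(page)) {
                        h->free_huge_pages--;
                        h->free_huge_pages_node[nid]--;
                }

                /* Global accounting moves here from update_and_free_page(). */
                h->nr_huge_pages--;
                h->nr_huge_pages_node[nid]--;

                /* PageHuge() is false from now on; treat as a plain compound page. */
                set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
                set_page_refcounted(page);
        }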
      
      update_and_free_page no longer decrements h->nr_huge_pages{_node} as
      this is performed in remove_hugetlb_page.  The only functionality
      performed by update_and_free_page is to free the base pages to the lower
      level allocators.
      
      update_and_free_page is typically called after remove_hugetlb_page.
      
      remove_hugetlb_page is to be called with the hugetlb_lock held.
      
      Creating this routine and separating functionality is in preparation for
      restructuring code to reduce lock hold times.  This commit should not
      introduce any changes to functionality.
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-5-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6eb4e88a
    • Mike Kravetz's avatar
      hugetlb: add per-hstate mutex to synchronize user adjustments · 29383967
      Mike Kravetz authored
      The helper routine hstate_next_node_to_alloc accesses and modifies the
      hstate variable next_nid_to_alloc.  The helper is used by the routines
      alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
      called with hugetlb_lock held.  However, alloc_pool_huge_page cannot be
      called with the hugetlb lock held as it will call the page allocator.
      Two instances of alloc_pool_huge_page could run in parallel, or
      alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
      which may result in the variable next_nid_to_alloc becoming invalid for
      the caller and pages being allocated on the wrong node.
      
      Both alloc_pool_huge_page and adjust_pool_surplus are only called from
      the routine set_max_huge_pages after boot.  set_max_huge_pages is only
      called as the result of a user writing to the proc/sysfs nr_hugepages
      or nr_hugepages_mempolicy files to adjust the number of hugetlb pages.
      
      It makes little sense to allow multiple adjustments to the number of
      hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
      allow one hugetlb page adjustment at a time.  This will synchronize
      modifications to the next_nid_to_alloc variable.
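
      A minimal sketch of the synchronization (the mutex is a new field in
      struct hstate, called resize_lock upstream; the body is heavily
      abbreviated and illustrative):

        static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
                                      nodemask_t *nodes_allowed)
        {
                /*
                 * Only one pool adjustment at a time: this also serializes all
                 * users of next_nid_to_alloc reached from here.
                 */
                mutex_lock(&h->resize_lock);

                /*
                 * ... adjust_pool_surplus() runs under hugetlb_lock, while
                 * alloc_pool_huge_page() runs with that lock dropped ...
                 */

                mutex_unlock(&h->resize_lock);
                return 0;
        }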
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29383967
    • Mike Kravetz's avatar
      hugetlb: no need to drop hugetlb_lock to call cma_release · 262443c0
      Mike Kravetz authored
      Now that cma_release is non-blocking and irq safe, there is no need to
      drop hugetlb_lock before calling.
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-3-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      262443c0
    • Mike Kravetz's avatar
      mm/cma: change cma mutex to irq safe spinlock · 0ef7dcac
      Mike Kravetz authored
      Patch series "make hugetlb put_page safe for all calling contexts", v5.
      
      This effort is the result of a recent bug report [1].  Syzbot found a
      potential deadlock in the hugetlb put_page/free_huge_page path:
      "WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected".  Since the
      free_huge_page path already has code to 'hand off' page free requests to a
      workqueue, a suggestion was proposed to make the in_irq() detection
      accurate by always enabling PREEMPT_COUNT [2].  The outcome of that
      discussion was that the hugetlb put_page path (free_huge_page) should
      be properly fixed and made safe for all calling contexts.
      
      [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
      [2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com
      
      This patch (of 8):
      
      cma_release is currently a sleepable operation because the bitmap
      manipulation is protected by the cma->lock mutex.  Hugetlb code, which
      relies on cma_release for CMA backed (giga) hugetlb pages, however, needs
      to be irq safe.
      
      The lock doesn't protect any sleepable operation, so it can be changed to
      an (irq aware) spin lock.  The bitmap processing should be quite fast in
      the typical case, but if cma sizes grow to TB then we will likely need to
      replace the lock with a more optimized bitmap implementation.
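
      A sketch of what the bitmap manipulation ends up looking like with the
      spinlock (helper names as in mm/cma.c; abbreviated and illustrative):

        static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
                                     unsigned long count)
        {
                unsigned long bitmap_no, bitmap_count;
                unsigned long flags;

                bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
                bitmap_count = cma_bitmap_pages_to_bits(cma, count);

                /* Previously mutex_lock(&cma->lock); now safe from any context. */
                spin_lock_irqsave(&cma->lock, flags);
                bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
                spin_unlock_irqrestore(&cma->lock, flags);
        }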
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20210409205254.242291-2-mike.kravetz@oracle.com
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ef7dcac
    • Miaohe Lin's avatar
      mm/hugetlb: remove unused variable pseudo_vma in remove_inode_hugepages() · 15b83653
      Miaohe Lin authored
      The local variable pseudo_vma is not used anymore.
      
      Link: https://lkml.kernel.org/r/20210410072348.20437-6-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Feilong Lin <linfeilong@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15b83653
    • Miaohe Lin's avatar
      mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts() · da56388c
      Miaohe Lin authored
      A rare out-of-memory error can prevent removal of the reserve map region
      for a page.  hugetlb_fix_reserve_counts() handles this rare case to avoid
      being left with dangling, incorrect counts.  Unfortunately,
      hugepage_subpool_get_pages and hugetlb_acct_memory could possibly fail
      too.  We should correctly handle these cases.
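
      One possible shape of the fix (a sketch under the assumption that both
      helpers report failure via their return values; the warning text is
      illustrative):

        void hugetlb_fix_reserve_counts(struct inode *inode)
        {
                struct hugepage_subpool *spool = subpool_inode(inode);
                long rsv_adjust;
                bool reserved = false;

                rsv_adjust = hugepage_subpool_get_pages(spool, 1);
                if (rsv_adjust > 0) {
                        struct hstate *h = hstate_inode(inode);

                        if (!hugetlb_acct_memory(h, 1))
                                reserved = true;
                } else if (!rsv_adjust) {
                        reserved = true;
                }

                if (!reserved)
                        pr_warn("hugetlb: fixing reserve count failed\n");
        }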
      
      Link: https://lkml.kernel.org/r/20210410072348.20437-5-linmiaohe@huawei.com
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Cc: Feilong Lin <linfeilong@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da56388c