    hugetlb_cgroup: fix offline of hugetlb cgroup with reservations · 7a5bde37
    Mike Kravetz authored
    Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
    using hugetlbfs.  In this environment the issue is reproduced by:
    
     - Start a simple pod that uses the recently added HugePages medium
       feature (pod yaml attached)
    
 - Start a DPDK app. It doesn't need to run successfully (as in transfer
   packets) or interact with real hardware. It seems that just initializing
   the EAL layer (which handles hugepage reservation and locking) is
   enough to trigger the issue.
    
     - Delete the Pod (or let it "Complete").
    
    This would result in a kworker thread going into a tight loop (top output):
    
       1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy
    
    'perf top -g' reports:
    
      -   63.28%     0.01%  [kernel]                    [k] worker_thread
         - 49.97% worker_thread
            - 52.64% process_one_work
               - 62.08% css_killed_work_fn
                  - hugetlb_cgroup_css_offline
                       41.52% _raw_spin_lock
                     - 2.82% _cond_resched
                          rcu_all_qs
                       2.66% PageHuge
            - 0.57% schedule
               - 0.57% __schedule
    
    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning.  Little else can be done on the system as
    cgroup_mutex cannot be acquired.
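
    For reference, the offline path looks roughly like this (a simplified
    sketch of the pre-fix code in mm/hugetlb_cgroup.c; details
    abbreviated):

      static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
      {
          struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
          struct hstate *h;
          struct page *page;
          int idx = 0;    /* note: initialized only once, see below */

          do {
              for_each_hstate(h) {
                  spin_lock(&hugetlb_lock);
                  /* move this cgroup's 'usage' charges to its parent */
                  list_for_each_entry(page, &h->hugepage_activelist, lru)
                      hugetlb_cgroup_move_parent(idx, h_cg, page);
                  spin_unlock(&hugetlb_lock);
                  idx++;
              }
              cond_resched();
          } while (hugetlb_cgroup_have_usage(h_cg));
      }

    If hugetlb_cgroup_have_usage() never returns false, this loop never
    exits, and the cgroup core holds cgroup_mutex the entire time.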
    
    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.
    
    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup.  This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts.  What to do with reservation
    counts when reparenting was discussed here:
    
    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
    
    The decision was made to leave a zombie cgroup for cases with
    reservation counts.  Unfortunately, the code checking reservation
    counts was incorrectly added to hugetlb_cgroup_have_usage.
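
    Concretely, the termination check looked roughly like this before this
    fix (a sketch following the upstream counter helpers; the _rsvd
    accessor reads the reservation counter for an hstate):

      static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
      {
          int idx;

          for (idx = 0; idx < hugetlb_max_hstate; idx++) {
              /* 'usage' counter: drained by hugetlb_cgroup_move_parent() */
              if (page_counter_read(
                      hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
                  /* 'reservation' counter: never drained by the offline loop */
                  page_counter_read(
                      hugetlb_cgroup_counter_from_cgroup_rsvd(h_cg, idx)))
                  return true;
          }
          return false;
      }

    Because the offline loop only moves 'usage' counts, any nonzero
    reservation count keeps this returning true forever.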
    
    To fix the issue, simply remove the check for reservation counts.  While
    working on this fix, a related bug in hugetlb_cgroup_css_offline was
    noticed: the hstate index is not reinitialized each time through the
    do-while loop.  Fix this as well.
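
    With both changes applied, the logic looks roughly like this (a sketch
    of the post-fix behavior described above):

      static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
      {
          int idx;

          /*
           * Only 'usage' counts gate offlining; reservation counts are
           * intentionally left behind in the zombie cgroup.
           */
          for (idx = 0; idx < hugetlb_max_hstate; idx++) {
              if (page_counter_read(
                      hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
                  return true;
          }
          return false;
      }

    and in hugetlb_cgroup_css_offline() the index initialization moves
    inside the loop:

      do {
          idx = 0;    /* reinitialized on every pass over the hstates */
          for_each_hstate(h) {
              ...
              idx++;
          }
          cond_resched();
      } while (hugetlb_cgroup_have_usage(h_cg));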
    
    Fixes: 1adc4d41 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno <amorenoz@redhat.com>
    Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Tested-by: Adrian Moreno <amorenoz@redhat.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Sandipan Das <sandipan@linux.ibm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: <stable@vger.kernel.org>
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>