    watchdog/softlockup: Low-overhead detection of interrupt storm · d7037381
    Bitao Hu authored
    
    
    The following softlockup is caused by an interrupt storm, but this cannot
    be identified from the call tree, because the call tree is just a snapshot
    and does not fully capture the behavior of the CPU during the soft lockup:
      watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
      ...
      Call trace:
        __do_softirq+0xa0/0x37c
        __irq_exit_rcu+0x108/0x140
        irq_exit+0x14/0x20
        __handle_domain_irq+0x84/0xe0
        gic_handle_irq+0x80/0x108
        el0_irq_naked+0x50/0x58
    
    Therefore, it is necessary to report CPU utilization during the
    softlockup_threshold period (one report every sample_period, for a total
    of 5 reports), like this:
      watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
      CPU#28 Utilization every 4s during lockup:
        #1: 0% system, 0% softirq, 100% hardirq, 0% idle
        #2: 0% system, 0% softirq, 100% hardirq, 0% idle
        #3: 0% system, 0% softirq, 100% hardirq, 0% idle
        #4: 0% system, 0% softirq, 100% hardirq, 0% idle
        #5: 0% system, 0% softirq, 100% hardirq, 0% idle
      ...
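
    As an illustration of how such a report could be produced, the sketch
    below converts cumulative per-category CPU time counters into percentages
    once per sample period and keeps the last 5 results in a ring buffer. It
    is a simplified userspace sketch, not the code added to watchdog.c: the
    names util_history, util_sample, util_get and stat_kind are hypothetical,
    and the counters are assumed to be cumulative nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Number of reports kept per soft lockup threshold (5, per the text above). */
#define NUM_SAMPLE_PERIODS 5

/* Hypothetical categories mirroring the columns of the report. */
enum stat_kind { STAT_SYSTEM, STAT_SOFTIRQ, STAT_HARDIRQ, STAT_IDLE, NUM_STATS };

/* Hypothetical per-CPU state: previous counters plus a ring of results. */
struct util_history {
	uint64_t last[NUM_STATS];                     /* cumulative ns at the previous sample */
	uint8_t util[NUM_SAMPLE_PERIODS][NUM_STATS];  /* percent per sample period */
	unsigned int tail;                            /* next slot to overwrite (oldest entry) */
};

/* Record one sample: turn counter deltas over period_ns into percentages. */
static void util_sample(struct util_history *h, const uint64_t now[NUM_STATS],
			uint64_t period_ns)
{
	assert(period_ns > 0);

	for (int i = 0; i < NUM_STATS; i++) {
		uint64_t delta = now[i] - h->last[i];
		uint64_t pct = (100 * delta) / period_ns;

		/* Clamp in case the sampling timer fired slightly late. */
		h->util[h->tail][i] = pct > 100 ? 100 : (uint8_t)pct;
		h->last[i] = now[i];
	}
	h->tail = (h->tail + 1) % NUM_SAMPLE_PERIODS;
}

/* Return the percentage for the n-th period, where 0 is the oldest one kept. */
static unsigned int util_get(const struct util_history *h, unsigned int n,
			     enum stat_kind k)
{
	assert(n < NUM_SAMPLE_PERIODS);
	return h->util[(h->tail + n) % NUM_SAMPLE_PERIODS][k];
}
```

    Calling util_sample() from the periodic watchdog timer and walking
    util_get() from 0 to 4 when a lockup is detected would yield the five
    lines of the report, oldest first. A ring buffer keeps the per-sample
    cost constant, which matters because the sampling runs on every period
    whether or not a lockup ever occurs.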
    
    This is helpful in determining whether an interrupt storm has occurred or
    in identifying the cause of the softlockup. The criteria for determination
    are as follows:
    
      a. If the hardirq utilization is high, then interrupt storm should be
         considered and the root cause cannot be determined from the call tree.
      b. If the softirq utilization is high, then the call tree might not
         necessarily point at the root cause.
      c. If the system utilization is high, then analyzing the root
         cause from the call tree is possible in most cases.
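
    The criteria above can be read as a small decision procedure. The sketch
    below is only an illustration for someone triaging a report by hand or in
    a script; it is not part of the patch. The 50% cut-off for "high" and the
    order in which the categories are checked are assumptions, since the
    criteria name no number, and classify_util and lockup_hint are
    hypothetical names.

```c
#include <assert.h>

/* Hypothetical cut-off for "high" utilization; the criteria give no number. */
#define HIGH_UTIL_PCT 50

enum lockup_hint {
	HINT_INTERRUPT_STORM,  /* (a) hardirq high: call tree cannot show the root cause */
	HINT_SOFTIRQ_HEAVY,    /* (b) softirq high: call tree may not point at the root cause */
	HINT_TRUST_CALL_TREE,  /* (c) system high: call tree analysis works in most cases */
	HINT_INCONCLUSIVE,     /* nothing is high by the assumed cut-off */
};

/* Map one line of the utilization report (percentages) to a triage hint. */
static enum lockup_hint classify_util(unsigned int system, unsigned int softirq,
				      unsigned int hardirq)
{
	assert(system <= 100 && softirq <= 100 && hardirq <= 100);

	/* Assumed priority: hardirq first, then softirq, then system. */
	if (hardirq >= HIGH_UTIL_PCT)
		return HINT_INTERRUPT_STORM;
	if (softirq >= HIGH_UTIL_PCT)
		return HINT_SOFTIRQ_HEAVY;
	if (system >= HIGH_UTIL_PCT)
		return HINT_TRUST_CALL_TREE;
	return HINT_INCONCLUSIVE;
}
```

    For the example report above, every line has 100% hardirq, so each line
    would map to HINT_INTERRUPT_STORM under these assumptions.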
    
    The mechanism requires a considerable amount of global storage space
    when configured for the maximum number of CPUs. Therefore, add a
    SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes"
    if the max number of CPUs is <= 128.
    
    Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Liu Song <liusong@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
watchdog.c 30.6 KB