    sparc64: Measure receiver forward progress to avoid send mondo timeout · 9d53caec
    Jane Chu authored
    A large sun4v SPARC system may go through periods of intensive xcall activity,
    usually caused by unmapping many pages on many CPUs concurrently. This can
    flood receivers with CPU mondo interrupts for an extended period, causing
    some unlucky senders to hit the send-mondo timeout. The problem gets worse
    as the CPU count increases, because sometimes mappings must be invalidated on
    all CPUs, and sometimes all CPUs may gang up on a single CPU.
    
    But a busy system is not a broken system. In the above scenario, as long
    as the receiver is making forward progress processing mondo interrupts,
    the sender should continue to retry.
    
    This patch implements the receiver's forward progress meter by introducing
    a per-CPU counter 'cpu_mondo_counter[cpu]', where 'cpu' is in the range
    0..NR_CPUS-1. The receiver increments its counter as soon as it receives
    a mondo, and the sender tracks the receiver's counter. If the receiver has
    stopped making forward progress when the retry limit is reached, the sender
    declares a send-mondo timeout and panics; otherwise, the sender keeps
    retrying while the receiver continues to make forward progress.
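
    The sender/receiver protocol described above can be sketched in user-space C.
    This is an illustration only, not the code in smp_64.c: RETRY_LIMIT, the
    small NR_CPUS value, send_mondo(), receive_mondo(), and the two simulated
    receivers are hypothetical names chosen for the sketch. Only
    'cpu_mondo_counter' comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS     8   /* assumption: small value for the sketch */
#define RETRY_LIMIT 4   /* hypothetical; not the value used by the patch */

/* Per-CPU forward progress meter, named as in the commit message. */
static uint64_t cpu_mondo_counter[NR_CPUS];

/* Receiver side: bump the counter as soon as a mondo is received. */
static void receive_mondo(int cpu)
{
    cpu_mondo_counter[cpu]++;
}

/* Hypothetical stand-in for one delivery attempt to the target CPU. */
typedef bool (*try_send_fn)(int cpu, void *ctx);

/*
 * Sender side: returns true once the mondo is delivered, false when the
 * retry limit is reached with no receiver progress (where the kernel
 * would declare a send-mondo timeout and panic).
 */
static bool send_mondo(int cpu, try_send_fn try_send, void *ctx)
{
    uint64_t seen = cpu_mondo_counter[cpu];
    int retries = 0;

    for (;;) {
        if (try_send(cpu, ctx))
            return true;
        if (++retries < RETRY_LIMIT)
            continue;

        /* Retry limit reached: has the receiver advanced since 'seen'? */
        uint64_t now = cpu_mondo_counter[cpu];
        if (now == seen)
            return false;   /* no forward progress: timeout */
        seen = now;         /* busy, not broken: start a new retry window */
        retries = 0;
    }
}

/* Simulation state for the two example receivers below. */
struct sim {
    int attempts;   /* delivery attempts observed so far */
    int busy_for;   /* attempts rejected before ours is accepted */
};

/* A flooded receiver: keeps processing other mondos, accepts ours late. */
static bool busy_receiver(int cpu, void *ctx)
{
    struct sim *s = ctx;
    bool accepted = s->attempts++ >= s->busy_for;

    receive_mondo(cpu);     /* progress on our mondo or on someone else's */
    return accepted;
}

/* A stuck receiver: never accepts and never advances its counter. */
static bool stuck_receiver(int cpu, void *ctx)
{
    struct sim *s = ctx;

    (void)cpu;
    s->attempts++;
    return false;
}
```

    With these definitions, a sender targeting busy_receiver with busy_for = 10
    outlasts several retry windows and eventually succeeds, while a sender
    targeting stuck_receiver gives up after RETRY_LIMIT attempts.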
    
    In addition, it has been observed that PCIe hotplug events generate Correctable
    Errors that are handled first by the hypervisor and then by the OS. The hypervisor
    'borrows' a guest CPU strand briefly to provide the service. If that strand is
    simultaneously the only CPU targeted by a mondo, it may not be available
    for the mondo within 20 msec, causing a SUN4V mondo timeout. One second appears
    to be the agreed wait time between the hypervisor and the guest OS, so this patch
    raises the timeout accordingly.
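
    To make the size of that adjustment concrete, the following sketch converts
    a wait budget into a retry count. The helper name retry_limit_for() and the
    per-retry delay of 2 usec used in the note below are assumptions for
    illustration; the commit message states only the 20 msec and 1 second figures.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many retries fit in a total wait budget when
 * the sender pauses delay_usec between attempts. delay_usec must be
 * non-zero.
 */
static uint64_t retry_limit_for(uint64_t budget_usec, uint64_t delay_usec)
{
    return budget_usec / delay_usec;
}
```

    Under the assumed 2 usec delay, a 20 msec budget allows 10000 retries and a
    1 second budget allows 500000, a fifty-fold increase in how long a sender
    waits for a borrowed strand before giving up.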
    
    Orabug: 25476541
    Orabug: 26417466
    Signed-off-by: Jane Chu <jane.chu@oracle.com>
    Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
    Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
    Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
    Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
smp_64.c 38.6 KB