• Austin Clements's avatar
    runtime: start an M when handing off a P when there's GC work · f309bf3e
    Austin Clements authored
    Currently it's possible for the scheduler to deadlock with the right
    confluence of locked Gs, assists, and scheduling of background mark
    workers. Broadly, this happens because handoffp is stricter than
    findrunnable, and if the only work for a P is GC work, handoffp will
    put the P into idle, rather than starting an M to execute that P. One
    way this can happen is as follows:
    
    0. There is only one user G, which we'll call G 1. There is more than
       one P, but they're all idle except the one running G 1.
    
    1. G 1 locks itself to an M using runtime.LockOSThread.
    
    2. GC starts up and enters mark 1.
    
    3. G 1 performs a GC assist, which completes mark 1 without being
       fully satisfied. Completing mark 1 causes all background mark
       workers to park. And since the assist isn't fully satisfied, it
       parks as well, waiting for a background mark worker to satisfy its
       remaining assist debt.
    
    4. The assist park enters the scheduler. Since G 1 is locked to the M,
       the scheduler releases the P and calls handoffp to hand the P to
       another M.
    
    5. handoffp checks the local and global run queues, which are empty,
       and sees that there are idle Ps, so rather than start an M, it puts
       the P into idle.
    
    At this point, all of the Gs are waiting and all of the Ps are idle.
    In particular, none of the GC workers are running, so no mark work
    gets done and the assist on the main G is never satisfied, so the
    whole process soft locks up.
    
    Fix this by making handoffp start an M if there is GC work. This
    reintroduces a key invariant: that in any situation where findrunnable
    would return a G to run on a P, handoffp for that P will start an M to
    run work on that P.
    
    Fixes #13645.
    
    Tested by running 2,689 iterations of `go tool dist test -no-rebuild
    runtime:cpu124` across 10 linux-amd64-noopt VMs with no failures.
    Without this change, the failure rate was somewhere around 1%.
    
    Performance change is negligible.
    
    name              old time/op  new time/op  delta
    XBenchGarbage-12  2.48ms ± 2%  2.48ms ± 1%  -0.24%  (p=0.000 n=92+93)
    
    name                      old time/op    new time/op    delta
    BinaryTree17-12              2.86s ± 2%     2.87s ± 2%    ~     (p=0.667 n=19+20)
    Fannkuch11-12                2.52s ± 1%     2.47s ± 1%  -2.05%  (p=0.000 n=18+20)
    FmtFprintfEmpty-12          51.7ns ± 1%    51.5ns ± 3%    ~     (p=0.931 n=16+20)
    FmtFprintfString-12          170ns ± 1%     168ns ± 1%  -0.65%  (p=0.000 n=19+19)
    FmtFprintfInt-12             160ns ± 0%     160ns ± 0%  +0.18%  (p=0.033 n=17+19)
    FmtFprintfIntInt-12          265ns ± 1%     273ns ± 1%  +2.98%  (p=0.000 n=17+19)
    FmtFprintfPrefixedInt-12     235ns ± 1%     239ns ± 1%  +1.99%  (p=0.000 n=16+19)
    FmtFprintfFloat-12           315ns ± 0%     315ns ± 1%    ~     (p=0.250 n=17+19)
    FmtManyArgs-12              1.04µs ± 1%    1.05µs ± 0%  +0.87%  (p=0.000 n=17+19)
    GobDecode-12                7.93ms ± 0%    7.85ms ± 1%  -1.03%  (p=0.000 n=16+18)
    GobEncode-12                6.62ms ± 1%    6.58ms ± 1%  -0.60%  (p=0.000 n=18+19)
    Gzip-12                      322ms ± 1%     320ms ± 1%  -0.46%  (p=0.009 n=20+20)
    Gunzip-12                   42.5ms ± 1%    42.5ms ± 0%    ~     (p=0.751 n=19+19)
    HTTPClientServer-12         69.7µs ± 1%    70.0µs ± 2%    ~     (p=0.056 n=19+19)
    JSONEncode-12               16.9ms ± 1%    16.7ms ± 1%  -1.13%  (p=0.000 n=19+19)
    JSONDecode-12               61.5ms ± 1%    61.3ms ± 1%  -0.35%  (p=0.001 n=20+17)
    Mandelbrot200-12            3.94ms ± 0%    3.91ms ± 0%  -0.67%  (p=0.000 n=20+18)
    GoParse-12                  3.71ms ± 1%    3.70ms ± 1%    ~     (p=0.244 n=17+19)
    RegexpMatchEasy0_32-12       101ns ± 1%     102ns ± 2%  +0.54%  (p=0.037 n=19+20)
    RegexpMatchEasy0_1K-12       349ns ± 0%     350ns ± 0%  +0.33%  (p=0.000 n=17+18)
    RegexpMatchEasy1_32-12      84.5ns ± 2%    84.2ns ± 1%  -0.43%  (p=0.048 n=19+20)
    RegexpMatchEasy1_1K-12       510ns ± 1%     513ns ± 2%  +0.58%  (p=0.002 n=18+20)
    RegexpMatchMedium_32-12      132ns ± 1%     134ns ± 1%  +0.95%  (p=0.000 n=20+20)
    RegexpMatchMedium_1K-12     40.1µs ± 1%    39.6µs ± 1%  -1.39%  (p=0.000 n=20+20)
    RegexpMatchHard_32-12       2.08µs ± 0%    2.06µs ± 1%  -0.95%  (p=0.000 n=18+18)
    RegexpMatchHard_1K-12       62.2µs ± 1%    61.9µs ± 1%  -0.42%  (p=0.001 n=19+20)
    Revcomp-12                   537ms ± 0%     536ms ± 0%    ~     (p=0.076 n=20+20)
    Template-12                 71.3ms ± 1%    69.3ms ± 1%  -2.75%  (p=0.000 n=20+20)
    TimeParse-12                 361ns ± 0%     360ns ± 1%    ~     (p=0.056 n=19+19)
    TimeFormat-12                353ns ± 0%     352ns ± 0%  -0.23%  (p=0.000 n=17+18)
    [Geo mean]                  62.6µs         62.5µs       -0.17%
    
    Change-Id: I0fbbbe4d7d99653ba5600ffb4394fa03558bc4e9
    Reviewed-on: https://go-review.googlesource.com/19107Reviewed-by: default avatarRick Hudson <rlh@golang.org>
    Reviewed-by: default avatarRuss Cox <rsc@golang.org>
    Run-TryBot: Austin Clements <austin@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    f309bf3e
proc.go 116 KB