Commit fbaf881c authored by Russ Cox's avatar Russ Cox

net/http: fix Transport.MaxConnsPerHost limits & idle pool races

There were at least three races in the implementation of the pool of
idle HTTP connections before this CL.

The first race is that HTTP/2 connections can be shared for many
requests, but each requesting goroutine would take the connection out
of the pool and then immediately return it before using it; this
created unnecessary, tiny little race windows during which another
goroutine might dial a second connection instead of reusing the first.
This CL changes the idle pool to just leave the HTTP/2 connection in
the pool permanently (until there is reason to close it), instead of
doing the take-it-out-put-it-back dance race.

The second race is that “is there an idle connection?” and
“register to wait for an idle connection” were implemented as two
separate steps, in different critical sections. So a client could end
up registered to wait for an idle connection and be waiting or perhaps
dialing, not having noticed the idle connection sitting in the pool
that arrived between the two steps.

The third race is that t.getIdleConnCh assumes that the inability to
send on the channel means the client doesn't need the result, when it
could mean that the client has not yet entered the select.
That is, the main dial does:

	idleConnCh := t.getIdleConnCh(cm)
	select {
	case v := <-dialc:
		...
	case pc := <-idleConnCh
		...
	...
	}

But then tryPutIdleConn does:

	waitingDialer := t.idleConnCh[key] // what getIdleConnCh(cm) returned
	select {
	case waitingDialer <- pconn:
		// We're done ...
		return nil
	default:
		if waitingDialer != nil {
			// They had populated this, but their dial won
			// first, so we can clean up this map entry.
			delete(t.idleConnCh, key)
		}
	}

If the client has returned from getIdleConnCh but not yet reached the
select, tryPutIdleConn will be unable to do the send, incorrectly
conclude that the client does not care anymore, and put the connection
in the idle pool instead, again leaving the client dialing unnecessarily
while a connection sits in the idle pool.

(It's also odd that the success case does not clean up the map entry,
and also that the map has room for only a single waiting goroutine for
a given host.)

None of these races mattered too much before Go 1.11: at most they
meant that connections were not reused quite as promptly as possible,
or a few more than necessary would be created. But Go 1.11 added
Transport.MaxConnsPerHost, which limited the number of connections
created for a given host. The default is 0 (unlimited), but if a user
did explicitly impose a low limit (2 is common), all these misplaced
conns could easily add up to the entire limit, causing a deadlock.
This was causing intermittent timeouts in TestTransportMaxConnsPerHost.

The addition of the MaxConnsPerHost support added its own races.

For example, here t.incHostConnCount could increment the count
and return a channel ready for receiving, and then the client would
not receive from it nor ever issue the decrement, because the select
need not evaluate these two cases in order:

	select {
	case <-t.incHostConnCount(cmKey):
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		if trace != nil && trace.GotConn != nil {
			trace.GotConn(httptrace.GotConnInfo{Conn: pc.conn, Reused: pc.isReused()})
		}
		return pc, nil
	...
	}

Obviously, unmatched increments are another way to get to a deadlock.
TestTransportMaxConnsPerHost deadlocked approximately 100% of
the time with a small random sleep added between incHostConnCount
and the select:

	ch := t.incHostConnCount(cmKey):
	time.Sleep(time.Duration(rand.Intn(10))*time.Millisecond)
	select {
	case <-ch
		// count below conn per host limit; proceed
	case pc := <-t.getIdleConnCh(cm):
		...
	}

The limit also did not properly apply to HTTP/2, because of the
decrement being attached to the underlying net.Conn.Close
and net/http not having access to the underlying HTTP/2 conn.
The alternate decrements for HTTP/2 may also have introduced
spurious decrements (discussion in #29889). Perhaps those
spurious decrements or other races caused the other intermittent
non-deadlock failures in TestTransportMaxConnsPerHost,
in which the HTTP/2 phase created too many connections (#31982).

This CL replaces the buggy, racy code with new code that is hopefully
neither buggy nor racy.

Fixes #29889.
Fixes #31982.
Fixes #32336.

Change-Id: I0dfac3a6fe8a6cdf5f0853722781fe2ec071ac97
Reviewed-on: https://go-review.googlesource.com/c/go/+/184262
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: default avatarBryan C. Mills <bcmills@google.com>
parent ddc8439b
...@@ -166,30 +166,40 @@ func (t *Transport) IdleConnCountForTesting(scheme, addr string) int { ...@@ -166,30 +166,40 @@ func (t *Transport) IdleConnCountForTesting(scheme, addr string) int {
return 0 return 0
} }
func (t *Transport) IdleConnChMapSizeForTesting() int { func (t *Transport) IdleConnWaitMapSizeForTesting() int {
t.idleMu.Lock() t.idleMu.Lock()
defer t.idleMu.Unlock() defer t.idleMu.Unlock()
return len(t.idleConnCh) return len(t.idleConnWait)
} }
func (t *Transport) IsIdleForTesting() bool { func (t *Transport) IsIdleForTesting() bool {
t.idleMu.Lock() t.idleMu.Lock()
defer t.idleMu.Unlock() defer t.idleMu.Unlock()
return t.wantIdle return t.closeIdle
} }
func (t *Transport) RequestIdleConnChForTesting() { func (t *Transport) QueueForIdleConnForTesting() {
t.getIdleConnCh(connectMethod{nil, "http", "example.com", false}) t.queueForIdleConn(nil)
} }
// PutIdleTestConn reports whether it was able to insert a fresh
// persistConn for scheme, addr into the idle connection pool.
func (t *Transport) PutIdleTestConn(scheme, addr string) bool { func (t *Transport) PutIdleTestConn(scheme, addr string) bool {
c, _ := net.Pipe() c, _ := net.Pipe()
key := connectMethodKey{"", scheme, addr, false} key := connectMethodKey{"", scheme, addr, false}
select {
case <-t.incHostConnCount(key): if t.MaxConnsPerHost > 0 {
default: // Transport is tracking conns-per-host.
return false // Increment connection count to account
// for new persistConn created below.
t.connsPerHostMu.Lock()
if t.connsPerHost == nil {
t.connsPerHost = make(map[connectMethodKey]int)
} }
t.connsPerHost[key]++
t.connsPerHostMu.Unlock()
}
return t.tryPutIdleConn(&persistConn{ return t.tryPutIdleConn(&persistConn{
t: t, t: t,
conn: c, // dummy conn: c, // dummy
......
...@@ -2407,6 +2407,7 @@ func TestTimeoutHandlerRace(t *testing.T) { ...@@ -2407,6 +2407,7 @@ func TestTimeoutHandlerRace(t *testing.T) {
} }
// See issues 8209 and 8414. // See issues 8209 and 8414.
// Both issues involved panics in the implementation of TimeoutHandler.
func TestTimeoutHandlerRaceHeader(t *testing.T) { func TestTimeoutHandlerRaceHeader(t *testing.T) {
setParallel(t) setParallel(t)
defer afterTest(t) defer afterTest(t)
...@@ -2434,7 +2435,9 @@ func TestTimeoutHandlerRaceHeader(t *testing.T) { ...@@ -2434,7 +2435,9 @@ func TestTimeoutHandlerRaceHeader(t *testing.T) {
defer func() { <-gate }() defer func() { <-gate }()
res, err := c.Get(ts.URL) res, err := c.Get(ts.URL)
if err != nil { if err != nil {
t.Error(err) // We see ECONNRESET from the connection occasionally,
// and that's OK: this test is checking that the server does not panic.
t.Log(err)
return return
} }
defer res.Body.Close() defer res.Body.Close()
...@@ -5507,19 +5510,23 @@ func TestServerSetKeepAlivesEnabledClosesConns(t *testing.T) { ...@@ -5507,19 +5510,23 @@ func TestServerSetKeepAlivesEnabledClosesConns(t *testing.T) {
if a1 != a2 { if a1 != a2 {
t.Fatal("expected first two requests on same connection") t.Fatal("expected first two requests on same connection")
} }
var idle0 int addr := strings.TrimPrefix(ts.URL, "http://")
if !waitCondition(2*time.Second, 10*time.Millisecond, func() bool {
idle0 = tr.IdleConnKeyCountForTesting() // The two requests should have used the same connection,
return idle0 == 1 // and there should not have been a second connection that
}) { // was created by racing dial against reuse.
t.Fatalf("idle count before SetKeepAlivesEnabled called = %v; want 1", idle0) // (The first get was completed when the second get started.)
n := tr.IdleConnCountForTesting("http", addr)
if n != 1 {
t.Fatalf("idle count for %q after 2 gets = %d, want 1", addr, n)
} }
// SetKeepAlivesEnabled should discard idle conns.
ts.Config.SetKeepAlivesEnabled(false) ts.Config.SetKeepAlivesEnabled(false)
var idle1 int var idle1 int
if !waitCondition(2*time.Second, 10*time.Millisecond, func() bool { if !waitCondition(2*time.Second, 10*time.Millisecond, func() bool {
idle1 = tr.IdleConnKeyCountForTesting() idle1 = tr.IdleConnCountForTesting("http", addr)
return idle1 == 0 return idle1 == 0
}) { }) {
t.Fatalf("idle count after SetKeepAlivesEnabled called = %v; want 0", idle1) t.Fatalf("idle count after SetKeepAlivesEnabled called = %v; want 0", idle1)
......
This diff is collapsed.
...@@ -655,13 +655,17 @@ func TestTransportMaxConnsPerHost(t *testing.T) { ...@@ -655,13 +655,17 @@ func TestTransportMaxConnsPerHost(t *testing.T) {
expected := int32(tr.MaxConnsPerHost) expected := int32(tr.MaxConnsPerHost)
if dialCnt != expected { if dialCnt != expected {
t.Errorf("Too many dials (%s): %d", scheme, dialCnt) t.Errorf("round 1: too many dials (%s): %d != %d", scheme, dialCnt, expected)
} }
if gotConnCnt != expected { if gotConnCnt != expected {
t.Errorf("Too many get connections (%s): %d", scheme, gotConnCnt) t.Errorf("round 1: too many get connections (%s): %d != %d", scheme, gotConnCnt, expected)
} }
if ts.TLS != nil && tlsHandshakeCnt != expected { if ts.TLS != nil && tlsHandshakeCnt != expected {
t.Errorf("Too many tls handshakes (%s): %d", scheme, tlsHandshakeCnt) t.Errorf("round 1: too many tls handshakes (%s): %d != %d", scheme, tlsHandshakeCnt, expected)
}
if t.Failed() {
t.FailNow()
} }
(<-connCh).Close() (<-connCh).Close()
...@@ -670,13 +674,13 @@ func TestTransportMaxConnsPerHost(t *testing.T) { ...@@ -670,13 +674,13 @@ func TestTransportMaxConnsPerHost(t *testing.T) {
doReq() doReq()
expected++ expected++
if dialCnt != expected { if dialCnt != expected {
t.Errorf("Too many dials (%s): %d", scheme, dialCnt) t.Errorf("round 2: too many dials (%s): %d", scheme, dialCnt)
} }
if gotConnCnt != expected { if gotConnCnt != expected {
t.Errorf("Too many get connections (%s): %d", scheme, gotConnCnt) t.Errorf("round 2: too many get connections (%s): %d != %d", scheme, gotConnCnt, expected)
} }
if ts.TLS != nil && tlsHandshakeCnt != expected { if ts.TLS != nil && tlsHandshakeCnt != expected {
t.Errorf("Too many tls handshakes (%s): %d", scheme, tlsHandshakeCnt) t.Errorf("round 2: too many tls handshakes (%s): %d != %d", scheme, tlsHandshakeCnt, expected)
} }
} }
...@@ -2795,8 +2799,8 @@ func TestIdleConnChannelLeak(t *testing.T) { ...@@ -2795,8 +2799,8 @@ func TestIdleConnChannelLeak(t *testing.T) {
<-didRead <-didRead
} }
if got := tr.IdleConnChMapSizeForTesting(); got != 0 { if got := tr.IdleConnWaitMapSizeForTesting(); got != 0 {
t.Fatalf("ForDisableKeepAlives = %v, map size = %d; want 0", disableKeep, got) t.Fatalf("for DisableKeepAlives = %v, map size = %d; want 0", disableKeep, got)
} }
} }
} }
...@@ -3378,9 +3382,9 @@ func TestTransportCloseIdleConnsThenReturn(t *testing.T) { ...@@ -3378,9 +3382,9 @@ func TestTransportCloseIdleConnsThenReturn(t *testing.T) {
} }
wantIdle("after second put", 0) wantIdle("after second put", 0)
tr.RequestIdleConnChForTesting() // should toggle the transport out of idle mode tr.QueueForIdleConnForTesting() // should toggle the transport out of idle mode
if tr.IsIdleForTesting() { if tr.IsIdleForTesting() {
t.Error("shouldn't be idle after RequestIdleConnChForTesting") t.Error("shouldn't be idle after QueueForIdleConnForTesting")
} }
if !tr.PutIdleTestConn("http", "example.com") { if !tr.PutIdleTestConn("http", "example.com") {
t.Fatal("after re-activation") t.Fatal("after re-activation")
...@@ -3802,8 +3806,8 @@ func TestNoCrashReturningTransportAltConn(t *testing.T) { ...@@ -3802,8 +3806,8 @@ func TestNoCrashReturningTransportAltConn(t *testing.T) {
ln := newLocalListener(t) ln := newLocalListener(t)
defer ln.Close() defer ln.Close()
handledPendingDial := make(chan bool, 1) var wg sync.WaitGroup
SetPendingDialHooks(nil, func() { handledPendingDial <- true }) SetPendingDialHooks(func() { wg.Add(1) }, wg.Done)
defer SetPendingDialHooks(nil, nil) defer SetPendingDialHooks(nil, nil)
testDone := make(chan struct{}) testDone := make(chan struct{})
...@@ -3873,7 +3877,7 @@ func TestNoCrashReturningTransportAltConn(t *testing.T) { ...@@ -3873,7 +3877,7 @@ func TestNoCrashReturningTransportAltConn(t *testing.T) {
doReturned <- true doReturned <- true
<-madeRoundTripper <-madeRoundTripper
<-handledPendingDial wg.Wait()
} }
func TestTransportReuseConnection_Gzip_Chunked(t *testing.T) { func TestTransportReuseConnection_Gzip_Chunked(t *testing.T) {
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment