go/zodb/internal/weak: Try to fix GC crashes via reworking Ref to keep only one word instead of two
We are observing garbage-collector crashes due to weak package under load(*) with GC crashing as e.g. runtime: full=0xc0001f10000005 next=205 jobs=204 nDataRoots=1 nBSSRoots=1 nSpanRoots=16 nStackRoots=184 panic: non-empty mark queue after concurrent mark fatal error: panic on system stack runtime stack: runtime.throw({0x5c60fe?, 0x601d70?}) /home/kirr/src/tools/go/go1.21/src/runtime/panic.go:1077 +0x5c fp=0xc000051e88 sp=0xc000051e58 pc=0x436efc panic({0x585100?, 0x601d70?}) /home/kirr/src/tools/go/go1.21/src/runtime/panic.go:840 +0x6ea fp=0xc000051f38 sp=0xc000051e88 pc=0x436e0a runtime.gcMark(0x118946?) /home/kirr/src/tools/go/go1.21/src/runtime/mgc.go:1464 +0x40c fp=0xc000051fb0 sp=0xc000051f38 pc=0x41bd6c runtime.gcMarkTermination.func1() /home/kirr/src/tools/go/go1.21/src/runtime/mgc.go:962 +0x17 fp=0xc000051fc8 sp=0xc000051fb0 pc=0x41b277 runtime.systemstack() /home/kirr/src/tools/go/go1.21/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000051fd8 sp=0xc000051fc8 pc=0x468eea One problem in current implementation is that weak.Ref keeps two words for the copy of original interface object and recreates that interface object on Ref.Get from those two words _nonatomically_. Which is explicitly documented by Go memory model as something that can lead to corrupted memory and crashes. From https://go.dev/ref/mem#restrictions: This means that races on multiword data structures can lead to inconsistent values not corresponding to a single write. When the values depend on the consistency of internal (pointer, length) or (pointer, type) pairs, as can be the case for interface values, maps, slices, and strings in most Go implementations, such races can in turn lead to arbitrary memory corruption. We can avoid doing multiword writes during object resurrection by using concrete type T* instead of interface{}. Unfortunately as wendelin.core@9b44fc23 shows it does not help with the above issue and the GC continues to crash with the same "panic: non-empty mark queue after concurrent mark" message. This is because weak.Ref implementation needs tight integration with concurrent GC that Go does and in practice we are unable to do that from outside of Go runtime internals. See e.g. https://github.com/golang/go/commit/dfc86e922cd0 and https://github.com/golang/go/commit/79fd633632cd to get an idea what kind of integration that needs to be. Anyway before disabling weak references support I wanted to commit this change to show that one-word approach was tried and it does not work. This leaves a trace in the history. On the good side we are going to use standard package weak in the future hopefully (https://github.com/golang/go/issues/67552). That needs to wait for at least Go 1.24 though. (*) see https://github.com/golang/go/issues/41303 for details
Showing
Please register or sign in to comment