Groxx opened 6 months ago
Yeah, applying this patch brings me from a ~20% failure rate to zero after thousands of iterations:
```diff
diff --git a/transport/grpc/transport.go b/transport/grpc/transport.go
index 15d69460..2c7a2c32 100644
--- a/transport/grpc/transport.go
+++ b/transport/grpc/transport.go
@@ -39,6 +39,7 @@ type Transport struct {
 	once          *lifecycle.Once
 	options       *transportOptions
 	addressToPeer map[string]*grpcPeer
+	waitPeers     []*grpcPeer
 }
 
 // NewTransport returns a new Transport.
@@ -71,6 +72,9 @@ func (t *Transport) Stop() error {
 		for _, grpcPeer := range t.addressToPeer {
 			grpcPeer.wait()
 		}
+		for _, stoppedGrpcPeer := range t.waitPeers {
+			stoppedGrpcPeer.wait()
+		}
 		return nil
 	})
 }
@@ -144,6 +148,7 @@ func (t *Transport) ReleasePeer(pid peer.Identifier, ps peer.Subscriber) error {
 	if p.NumSubscribers() == 0 {
 		delete(t.addressToPeer, address)
 		p.stop()
+		t.waitPeers = append(t.waitPeers, p)
 	}
 	return nil
 }
```
I have no idea if this^ is worth using as-is; I'm not familiar enough with the code/expectations in here. But it's an effective proof of concept, at least.
While converting some internal tests to zaptest loggers, I started getting occasional test panics like:

After digging around a bit, I can see we are using some single-peer choosers with grpc, and:
So if shutdown calls `peer.Single.Stop()` and then `grpc.Transport.Stop()`, the peer will be removed after having only been told to `stop()`, and the transport's `Stop()` will not wait for it to stop its background goroutine.

I'm not 100% certain that shutdown occurs in this order (fx logs don't make that explicit), but it seems like it probably has to, as peers are used in outbounds. Stop RPC == stop outbounds -> stop peers -> stop transports, right?
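To make the race concrete, here's a minimal, self-contained Go sketch of the pattern as I understand it. All names here (`fakePeer`, `stopC`, `doneC`, `releasePeer`) are invented for illustration; they're not yarpc's actual types or API:

```go
package main

import "fmt"

// fakePeer stands in for grpcPeer: stop() signals the background
// goroutine, wait() blocks until that goroutine has actually exited.
type fakePeer struct {
	stopC chan struct{} // closed by stop()
	doneC chan struct{} // closed when the background goroutine exits
}

func newFakePeer() *fakePeer {
	p := &fakePeer{stopC: make(chan struct{}), doneC: make(chan struct{})}
	go func() {
		<-p.stopC
		// ...teardown work that may still log via the test logger...
		close(p.doneC)
	}()
	return p
}

func (p *fakePeer) stop() { close(p.stopC) }
func (p *fakePeer) wait() { <-p.doneC }

type fakeTransport struct {
	addressToPeer map[string]*fakePeer
}

// releasePeer mirrors ReleasePeer: the peer is told to stop() and
// forgotten, but nothing ever wait()s for its goroutine.
func (t *fakeTransport) releasePeer(addr string) {
	p := t.addressToPeer[addr]
	delete(t.addressToPeer, addr)
	p.stop()
}

// stop mirrors Stop: it only waits for peers still in the map, so a
// released peer's goroutine can outlive the "stopped" transport.
func (t *fakeTransport) stop() {
	for _, p := range t.addressToPeer {
		p.stop()
		p.wait()
	}
}

func main() {
	t := &fakeTransport{addressToPeer: map[string]*fakePeer{}}
	t.addressToPeer["127.0.0.1:1234"] = newFakePeer()

	t.releasePeer("127.0.0.1:1234") // the peer.Single.Stop() path
	t.stop()                        // the grpc.Transport.Stop() path

	// The released peer's goroutine may still be running here, which is
	// exactly the window where a zaptest logger panics about logging
	// after the test has finished.
	fmt.Println("transport stopped")
}
```

The patch above closes this window by having `ReleasePeer` remember released peers in `waitPeers`, so `Stop()` can `wait()` on them too.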
I'm not seeing any way to patch this from the outside, as the peer instance and its API don't seem to be exposed anywhere. Which is probably a good thing. So I think this has to be fixed internally.
As a possibly simple option: maybe `grpc.Transport` should just keep all stop-chans (remove the peer but not the chan in `ReleasePeer`) and wait on all of them during `Stop()`? It would leak empty chans unless some cleanup process was run, but if that's an issue then closed chans could probably be cleared out in `ReleasePeer` as a garbage collector (sketched below).

Or should `ReleasePeer` just wait too? I'm not sure what the semantics are here, but it seems like it may be intentional that it doesn't wait.
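For what that could look like, here's a hedged sketch of the stop-chan bookkeeping, continuing the fake types from the sketch above; `waitChans` and the cleanup loop are invented names, not a concrete proposal for yarpc's internals:

```go
// fakeTransportGC extends fakeTransport with the stop-chan idea: keep
// released peers' done-chans so stop() can wait on them, and
// opportunistically drop closed ones so the slice doesn't grow forever
// on a long-lived transport.
type fakeTransportGC struct {
	addressToPeer map[string]*fakePeer
	waitChans     []chan struct{} // done-chans of released peers
}

func (t *fakeTransportGC) releasePeer(addr string) {
	p := t.addressToPeer[addr]
	delete(t.addressToPeer, addr)
	p.stop()
	t.waitChans = append(t.waitChans, p.doneC)

	// "Garbage collection": a closed chan means that peer's goroutine
	// has already exited, so it can be dropped without blocking.
	kept := t.waitChans[:0]
	for _, c := range t.waitChans {
		select {
		case <-c: // already closed; discard
		default:
			kept = append(kept, c)
		}
	}
	t.waitChans = kept
}

func (t *fakeTransportGC) stop() {
	for _, p := range t.addressToPeer {
		p.stop()
		p.wait()
	}
	// Also wait for peers released earlier. Receiving from a closed
	// chan returns immediately, so finished peers cost nothing here.
	for _, c := range t.waitChans {
		<-c
	}
}
```

Either way, the property that matters is the same as in the patch above: `Stop()` doesn't return until every peer's background goroutine has exited.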
I haven't carefully checked the other transports to see if they have similar issues, but e.g. http is sufficiently different that it doesn't obviously have the same problem.