Open renjiezh opened 4 months ago
DEADLINE_EXCEEDED is not retriable by default. It's only retriable if you know the method is idempotent, i.e. calling it multiple times with the same params will always give the same result.
Another DEADLINE_EXCEEDED error found in rpc advanceComputationStage
.
Zq8-beACz3M@aggregator-liquid-legions-v2-mill-daemon-deployment-7d7478n9595: Error advancing computation stage
org.wfanet.measurement.duchy.db.computation.ComputationDataClients$PermanentErrorException: Error advancing computation stage
at org.wfanet.measurement.duchy.db.computation.ComputationDataClients.transitionComputationToStage(ComputationDataClients.kt:80)
at org.wfanet.measurement.duchy.db.computation.ComputationDataClients$transitionComputationToStage$1.invokeSuspend(ComputationDataClients.kt)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:32)
at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
at org.wfanet.measurement.duchy.deploy.common.daemon.mill.liquidlegionsv2.LiquidLegionsV2MillDaemon.run(LiquidLegionsV2MillDaemon.kt:225)
at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemon.run(GcsLiquidLegionsV2MillDaemon.kt:34)
at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
at picocli.CommandLine.access$1100(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
at picocli.CommandLine.execute(CommandLine.java:2058)
at org.wfanet.measurement.common.CommandLinesKt.commandLineMain(CommandLines.kt:47)
at org.wfanet.measurement.common.CommandLinesKt.commandLineMain$default(CommandLines.kt:42)
at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemonKt.main(GcsLiquidLegionsV2MillDaemon.kt:38)
Caused by: io.grpc.StatusException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=/10.4.1.104:8443]]
at io.grpc.Status.asException(Status.java:552)
at io.grpc.kotlin.ClientCalls$rpcImpl$1$1$1.onClose(ClientCalls.kt:300)
at io.opentelemetry.javaagent.shaded.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall$TracingClientCallListener.onClose(TracingClientInterceptor.java:159)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializeReentrantCallsDirectExecutor.execute(SerializeReentrantCallsDirectExecutor.java:49)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closedInternal(ClientCallImpl.java:750)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closed(ClientCallImpl.java:690)
at io.grpc.internal.RetriableStream$4.run(RetriableStream.java:840)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.RetriableStream.safeCloseMasterListener(RetriableStream.java:835)
at io.grpc.internal.RetriableStream.access$2200(RetriableStream.java:55)
at io.grpc.internal.RetriableStream$Sublistener.closed(RetriableStream.java:1028)
at io.grpc.internal.ForwardingClientStreamListener.closed(ForwardingClientStreamListener.java:34)
at io.grpc.internal.InternalSubchannel$CallTracingTransport$1$1.closed(InternalSubchannel.java:691)
at io.grpc.internal.AbstractClientStream$TransportState.closeListener(AbstractClientStream.java:458)
at io.grpc.internal.AbstractClientStream$TransportState.access$400(AbstractClientStream.java:221)
at io.grpc.internal.AbstractClientStream$TransportState$1.run(AbstractClientStream.java:441)
at io.grpc.internal.AbstractClientStream$TransportState.deframerClosed(AbstractClientStream.java:278)
at io.grpc.internal.Http2ClientStreamTransportState.deframerClosed(Http2ClientStreamTransportState.java:31)
at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
at io.grpc.internal.MessageDeframer.closeWhenComplete(MessageDeframer.java:192)
at io.grpc.internal.AbstractStream$TransportState.closeDeframer(AbstractStream.java:201)
at io.grpc.internal.AbstractClientStream$TransportState.transportReportStatus(AbstractClientStream.java:444)
at io.grpc.netty.NettyClientHandler.onRstStreamRead(NettyClientHandler.java:412)
at io.grpc.netty.NettyClientHandler.access$1300(NettyClientHandler.java:95)
at io.grpc.netty.NettyClientHandler$FrameListener.onRstStreamRead(NettyClientHandler.java:946)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:442)
at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259)
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159)
at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173)
at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:393)
at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:453)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
Is this issue for trying to investigate the cause of DEADLINE_EXCEEDED? Or is it for making the Duchy more tolerant to DEADLINE_EXCEEDED?
If it's the latter, then you need a retry mechanism. For methods that mutate, DEADLINE_EXCEEDED is not inherently retryable. To make it retryable, you need a way to check whether the mutation actually happened. In practice, this either means an idempotency mechanism (i.e. passing a deterministic request ID) or a freshness mechanism (i.e. reading to check if the previous RPC performed the mutation and passing an etag to ensure the read and mutate are consistent).
Describe the bug During QA stress test, the recording output blob error from duchy with code DEADLINE_EXCEEDED is regarded as permanent. However, retry should've resolved this error. It should be a transient error.
Steps to reproduce Run stress test in cloud and there is chance to reproduce this error.
Component(s) affected Duchy
Version v0.5.5
Environment halo-cmm-qa
Additional context
measurementId=7313662499326307228 externalComputationId= 491229066843608164 globalComputationId="BtEyO9CPVGQ"
From the kingdom
From worker1 mill