world-federation-of-advertisers / cross-media-measurement

Apache License 2.0
35 stars 11 forks source link

Rpc calls raise error DEADLINE_EXCEEDED in Duchy #1640

Open renjiezh opened 4 months ago

renjiezh commented 4 months ago

Describe the bug During QA stress test, the recording output blob error from duchy with code DEADLINE_EXCEEDED is regarded as permanent. However, retry should've resolved this error. It should be a transient error.

Steps to reproduce Run stress test in cloud and there is chance to reproduce this error.

Component(s) affected Duchy

Version v0.5.5

Environment halo-cmm-qa

Additional context

measurementId=7313662499326307228 externalComputationId= 491229066843608164 globalComputationId="BtEyO9CPVGQ"

From the kingdom

"COMPUTATION_PARTICIPANT_FAILED","message":"Computation Participant failed. Error recording blob output path"

From worker1 mill

BtEyO9CPVGQ@worker1-liquid-legions-v2-mill-daemon-deployment-6c5f89dcfkz9fd: Error recording blob output path
org.wfanet.measurement.duchy.db.computation.ComputationDataClients$PermanentErrorException: Error recording blob output path
    at org.wfanet.measurement.duchy.db.computation.ComputationDataClients.writeBlobIfNotPresent(ComputationDataClients.kt:147)
    at org.wfanet.measurement.duchy.db.computation.ComputationDataClients.access$writeBlobIfNotPresent(ComputationDataClients.kt:39)
    at org.wfanet.measurement.duchy.db.computation.ComputationDataClients$writeBlobIfNotPresent$1.invokeSuspend(ComputationDataClients.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:32)
    at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
    at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
    at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
    at org.wfanet.measurement.duchy.deploy.common.daemon.mill.liquidlegionsv2.LiquidLegionsV2MillDaemon.run(LiquidLegionsV2MillDaemon.kt:225)
    at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemon.run(GcsLiquidLegionsV2MillDaemon.kt:34)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
    at picocli.CommandLine.access$1100(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
    at picocli.CommandLine.execute(CommandLine.java:2058)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain(CommandLines.kt:47)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain$default(CommandLines.kt:42)
    at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemonKt.main(GcsLiquidLegionsV2MillDaemon.kt:38)
Caused by: io.grpc.StatusException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=/10.76.10.138:8443]]
    at io.grpc.Status.asException(Status.java:552)
    at io.grpc.kotlin.ClientCalls$rpcImpl$1$1$1.onClose(ClientCalls.kt:300)
    at io.opentelemetry.javaagent.shaded.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall$TracingClientCallListener.onClose(TracingClientInterceptor.java:159)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializeReentrantCallsDirectExecutor.execute(SerializeReentrantCallsDirectExecutor.java:49)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closedInternal(ClientCallImpl.java:750)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closed(ClientCallImpl.java:690)
    at io.grpc.internal.RetriableStream$4.run(RetriableStream.java:840)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
    at io.grpc.internal.RetriableStream.safeCloseMasterListener(RetriableStream.java:835)
    at io.grpc.internal.RetriableStream.access$2200(RetriableStream.java:55)
    at io.grpc.internal.RetriableStream$Sublistener.closed(RetriableStream.java:1028)
    at io.grpc.internal.ForwardingClientStreamListener.closed(ForwardingClientStreamListener.java:34)
    at io.grpc.internal.InternalSubchannel$CallTracingTransport$1$1.closed(InternalSubchannel.java:691)
    at io.grpc.internal.AbstractClientStream$TransportState.closeListener(AbstractClientStream.java:458)
    at io.grpc.internal.AbstractClientStream$TransportState.access$400(AbstractClientStream.java:221)
    at io.grpc.internal.AbstractClientStream$TransportState$1.run(AbstractClientStream.java:441)
    at io.grpc.internal.AbstractClientStream$TransportState.deframerClosed(AbstractClientStream.java:278)
    at io.grpc.internal.Http2ClientStreamTransportState.deframerClosed(Http2ClientStreamTransportState.java:31)
    at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
    at io.grpc.internal.MessageDeframer.closeWhenComplete(MessageDeframer.java:192)
    at io.grpc.internal.AbstractStream$TransportState.closeDeframer(AbstractStream.java:201)
    at io.grpc.internal.AbstractClientStream$TransportState.transportReportStatus(AbstractClientStream.java:444)
    at io.grpc.netty.NettyClientHandler.onRstStreamRead(NettyClientHandler.java:412)
    at io.grpc.netty.NettyClientHandler.access$1300(NettyClientHandler.java:95)
    at io.grpc.netty.NettyClientHandler$FrameListener.onRstStreamRead(NettyClientHandler.java:946)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:442)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173)
    at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:393)
    at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:453)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
SanjayVas commented 4 months ago

DEADLINE_EXCEEDED is not retriable by default. It's only retriable if you know the method is idempotent, i.e. calling it multiple times with the same params will always give the same result.

renjiezh commented 4 months ago

Another DEADLINE_EXCEEDED error found in rpc advanceComputationStage.

Zq8-beACz3M@aggregator-liquid-legions-v2-mill-daemon-deployment-7d7478n9595: Error advancing computation stage
org.wfanet.measurement.duchy.db.computation.ComputationDataClients$PermanentErrorException: Error advancing computation stage
    at org.wfanet.measurement.duchy.db.computation.ComputationDataClients.transitionComputationToStage(ComputationDataClients.kt:80)
    at org.wfanet.measurement.duchy.db.computation.ComputationDataClients$transitionComputationToStage$1.invokeSuspend(ComputationDataClients.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:32)
    at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
    at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
    at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
    at org.wfanet.measurement.duchy.deploy.common.daemon.mill.liquidlegionsv2.LiquidLegionsV2MillDaemon.run(LiquidLegionsV2MillDaemon.kt:225)
    at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemon.run(GcsLiquidLegionsV2MillDaemon.kt:34)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
    at picocli.CommandLine.access$1100(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
    at picocli.CommandLine.execute(CommandLine.java:2058)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain(CommandLines.kt:47)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain$default(CommandLines.kt:42)
    at org.wfanet.measurement.duchy.deploy.gcloud.daemon.mill.liquidlegionsv2.GcsLiquidLegionsV2MillDaemonKt.main(GcsLiquidLegionsV2MillDaemon.kt:38)
Caused by: io.grpc.StatusException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [closed=[CANCELLED], committed=[remote_addr=/10.4.1.104:8443]]
    at io.grpc.Status.asException(Status.java:552)
    at io.grpc.kotlin.ClientCalls$rpcImpl$1$1$1.onClose(ClientCalls.kt:300)
    at io.opentelemetry.javaagent.shaded.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall$TracingClientCallListener.onClose(TracingClientInterceptor.java:159)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializeReentrantCallsDirectExecutor.execute(SerializeReentrantCallsDirectExecutor.java:49)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closedInternal(ClientCallImpl.java:750)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closed(ClientCallImpl.java:690)
    at io.grpc.internal.RetriableStream$4.run(RetriableStream.java:840)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
    at io.grpc.internal.RetriableStream.safeCloseMasterListener(RetriableStream.java:835)
    at io.grpc.internal.RetriableStream.access$2200(RetriableStream.java:55)
    at io.grpc.internal.RetriableStream$Sublistener.closed(RetriableStream.java:1028)
    at io.grpc.internal.ForwardingClientStreamListener.closed(ForwardingClientStreamListener.java:34)
    at io.grpc.internal.InternalSubchannel$CallTracingTransport$1$1.closed(InternalSubchannel.java:691)
    at io.grpc.internal.AbstractClientStream$TransportState.closeListener(AbstractClientStream.java:458)
    at io.grpc.internal.AbstractClientStream$TransportState.access$400(AbstractClientStream.java:221)
    at io.grpc.internal.AbstractClientStream$TransportState$1.run(AbstractClientStream.java:441)
    at io.grpc.internal.AbstractClientStream$TransportState.deframerClosed(AbstractClientStream.java:278)
    at io.grpc.internal.Http2ClientStreamTransportState.deframerClosed(Http2ClientStreamTransportState.java:31)
    at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
    at io.grpc.internal.MessageDeframer.closeWhenComplete(MessageDeframer.java:192)
    at io.grpc.internal.AbstractStream$TransportState.closeDeframer(AbstractStream.java:201)
    at io.grpc.internal.AbstractClientStream$TransportState.transportReportStatus(AbstractClientStream.java:444)
    at io.grpc.netty.NettyClientHandler.onRstStreamRead(NettyClientHandler.java:412)
    at io.grpc.netty.NettyClientHandler.access$1300(NettyClientHandler.java:95)
    at io.grpc.netty.NettyClientHandler$FrameListener.onRstStreamRead(NettyClientHandler.java:946)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:442)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173)
    at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:393)
    at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:453)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
SanjayVas commented 3 months ago

Is this issue for trying to investigate the cause of DEADLINE_EXCEEDED? Or is it for making the Duchy more tolerant to DEADLINE_EXCEEDED?

If it's the latter, then you need a retry mechanism. For methods that mutate, DEADLINE_EXCEEDED is not inherently retryable. To make it retryable, you need a way to check whether the mutation actually happened. In practice, this either means an idempotency mechanism (i.e. passing a deterministic request ID) or a freshness mechanism (i.e. reading to check if the previous RPC performed the mutation and passing an etag to ensure the read and mutate are consistent).