world-federation-of-advertisers / cross-media-measurement

A privacy centric system for cross publisher, cross media ads measurement through secure multiparty computations.
https://halo.wfanet.org/
Apache License 2.0
36 stars 11 forks source link

Error advancing computation at Duchy with error message type UNKNOWN. #1785

Closed ple13 closed 2 months ago

ple13 commented 2 months ago

Describe the bug The nightly build (https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/10591455886/job/29349447074) reported a failed llv2 R/F test.

More details: There was 1 failure: 1) reach and frequency measurement completes with expected result(org.wfanet.measurement.integration.k8s.SyntheticGeneratorCorrectnessTest) java.lang.IllegalStateException: Measurement failed with reason COMPUTATION_PARTICIPANT_FAILED: Computation Participant failed. Error advancing computation at other Duchy at org.wfanet.measurement.loadtest.measurementconsumer.MeasurementConsumerSimulator.checkNotFailed(MeasurementConsumerSimulator.kt:879) at org.wfanet.measurement.loadtest.measurementconsumer.MeasurementConsumerSimulator.getReachAndFrequencyResult(MeasurementConsumerSimulator.kt:826) at org.wfanet.measurement.loadtest.measurementconsumer.MeasurementConsumerSimulator.access$getReachAndFrequencyResult(MeasurementConsumerSimulator.kt:134) at org.wfanet.measurement.loadtest.measurementconsumer.MeasurementConsumerSimulator$getReachAndFrequencyResult$1.invokeSuspend(MeasurementConsumerSimulator.kt) at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:32) at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102) at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46) at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:108) at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280) at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85) at kotlinx.coroutines.BuildersKtBuildersKt.runBlocking(Builders.kt:59) at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source) at kotlinx.coroutines.BuildersKtBuildersKt.runBlocking$default(Builders.kt:38) at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source) at org.wfanet.measurement.integration.k8s.AbstractCorrectnessTest.reach and frequency measurement completes with expected result(AbstractCorrectnessTest.kt:51)

Steps to reproduce This is a flaky test.

Component(s) affected Duchies.

Version nightly/20240827.1

Environment halo-cmm-head

Additional context Run-time error happens at https://github.com/world-federation-of-advertisers/cross-media-measurement/blob/f30c0951cf6c5a39cc545ef790c6c71010340995/src/main/kotlin/org/wfanet/measurement/duchy/service/internal/computationcontrol/AsyncComputationControlService.kt#L76.

SanjayVas commented 2 months ago

Just filing that there's an UNKNOWN at the async service isn't particularly helpful. What we really need is the underlying error from the backend. We should have the error logging interceptor installed there - if not, that's a separate bug. Given that, you should be able to find the log entries with the matching trace ID.

To clarify, the issue isn't a flaky/failing test on nightly. It's instead whatever caused that test to fail.

I took the liberty of fixing the version and environment in the initial filing.

SanjayVas commented 2 months ago

Log entry from system API server (found by searching for FailComputationParticipant around the test failure time):

[pool-6-thread-1] gRPC 882dfb63d7955cf35a874dbd2f3985f2 request: Metadata(content-type=application/grpc,user-agent=grpc-java-netty/1.60.2,duchy-identity=aggregator,traceparent=00-882dfb63d7955cf35a874dbd2f3985f2-8c4306552f3774e7-01,grpc-accept-encoding=gzip) name: "computations/Txc0N5fA5NM/participants/aggregator"
failure {
  participant_child_reference_id: "aggregator-llv2-mill-job-tlb9k"
  error_message: "Error advancing computation at other Duchy"
  error_time {
    seconds: 1724829812
    nanos: 67410818
  }
  stage_attempt {
    stage: 8
    stage_name: "EXECUTION_PHASE_ONE"
    stage_start_time {
      seconds: 1724829812
      nanos: 67410818
    }
    attempt_number: 1
  }
}
etag: "W/\"9374F4A6101A7FA3E4EC95040B710AD1ECC13A041BB0FBE4FA73E183F948C967\""

This means the error is from the aggregator Duchy calling another Duchy. Looking at the logs from the specified Mill job, it looks like it's unfortunately not logged by a gRPC interceptor so we don't have the trace ID nor any information on which Duchy is being called. Therefore we'll have to look at the computation control server at the other Duchies and try to match by timestamp.

Txc0N5fA5NM@aggregator-llv2-mill-job-tlb9k: Failing Computation. Error advancing computation at other Duchy
org.wfanet.measurement.duchy.db.computation.ComputationDataClients$PermanentErrorException: Error advancing computation at other Duchy
    at org.wfanet.measurement.duchy.mill.MillBase.sendAdvanceComputationRequest(MillBase.kt:626)
    at org.wfanet.measurement.duchy.mill.MillBase$sendAdvanceComputationRequest$1.invokeSuspend(MillBase.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.internal.ScopeCoroutine.afterResume(Scopes.kt:32)
    at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at kotlinx.coroutines.UndispatchedCoroutine.afterResume(CoroutineContext.kt:270)
    at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:108)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
    at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
    at org.wfanet.measurement.duchy.deploy.common.job.mill.liquidlegionsv2.LiquidLegionsV2MillJob.run(LiquidLegionsV2MillJob.kt:175)
    at org.wfanet.measurement.duchy.deploy.gcloud.job.mill.liquidlegionsv2.GcsLiquidLegionsV2MillJob.run(GcsLiquidLegionsV2MillJob.kt:34)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1919)
    at picocli.CommandLine.access$1100(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2159)
    at picocli.CommandLine.execute(CommandLine.java:2058)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain(CommandLines.kt:47)
    at org.wfanet.measurement.common.CommandLinesKt.commandLineMain$default(CommandLines.kt:42)
    at org.wfanet.measurement.duchy.deploy.gcloud.job.mill.liquidlegionsv2.GcsLiquidLegionsV2MillJobKt.main(GcsLiquidLegionsV2MillJob.kt:38)
Caused by: io.grpc.StatusException: UNKNOWN
    at io.grpc.Status.asException(Status.java:552)
    at io.grpc.kotlin.ClientCalls$rpcImpl$1$1$1.onClose(ClientCalls.kt:300)
    ...

No errors appear in worker1, so we need to check worker2 in AWS EKS. The logs there appear in CloudWatch. There we find the log entry of the UNKNOWN error:

WARNING: [DefaultDispatcher-worker-2] gRPC 66cecfe332b8d7dc2baaa785a6e576ca error: UNKNOWN
io.r2dbc.postgresql.ExceptionFactory$PostgresqlAuthenticationFailure: [28P01] password authentication failed for user \"postgres\"

This is a known issue for the AWS Postgres Duchy implementation that we're still trying to address, where it loses access to the underlying Postgres database. The workaround is to manually restart the internal API server.

SanjayVas commented 2 months ago

Duplicate of #1648