ydb-platform / ydb-java-sdk

YDB Java SDK
https://ydb.tech
Apache License 2.0

PeriodicDiscoveryTask can deadlock scheduler thread #262

Closed uranix closed 4 months ago

uranix commented 4 months ago

Under heavy load, we observed that several nodes of our application stopped performing their regular tasks. Further debugging revealed that the scheduler thread, which is shared application-wide and passed to tech.ydb.core.grpc.GrpcTransport::withSchedulerFactory, was stuck with the following stack trace:

java.base@17.0.3-vanilla/jdk.internal.misc.Unsafe.park(Native Method)
java.base@17.0.3-vanilla/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
java.base@17.0.3-vanilla/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3463)
java.base@17.0.3-vanilla/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3434)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture.join(CompletableFuture.java:2117)
app//tech.ydb.core.impl.pool.GrpcChannelPool.removeChannels(GrpcChannelPool.java:73)
app//tech.ydb.core.impl.YdbTransportImpl$YdbDiscoveryHandler.handleDiscoveryResult(YdbTransportImpl.java:178)
app//tech.ydb.core.impl.discovery.PeriodicDiscoveryTask.handleDiscoveryResponse(PeriodicDiscoveryTask.java:123)
app//tech.ydb.core.impl.discovery.PeriodicDiscoveryTask.lambda$runDiscovery$1(PeriodicDiscoveryTask.java:151)
app//tech.ydb.core.impl.discovery.PeriodicDiscoveryTask$$Lambda$1003/0x0000000801425e78.accept(Unknown Source)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887)
java.base@17.0.3-vanilla/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325)
app//tech.ydb.core.impl.discovery.PeriodicDiscoveryTask.runDiscovery(PeriodicDiscoveryTask.java:139)
app//tech.ydb.core.impl.discovery.PeriodicDiscoveryTask.run(PeriodicDiscoveryTask.java:99)
java.base@17.0.3-vanilla/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
java.base@17.0.3-vanilla/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base@17.0.3-vanilla/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
java.base@17.0.3-vanilla/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base@17.0.3-vanilla/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
java.base@17.0.3-vanilla/java.lang.Thread.run(Thread.java:833)

The thread was waiting indefinitely (for 3 days, in fact, before we found the issue) for the channels to shut down at https://github.com/ydb-platform/ydb-java-sdk/blob/4bfda6831fe0fd64c12169aadebbe8f5cd8c6873/core/src/main/java/tech/ydb/core/impl/pool/GrpcChannelPool.java#L73

IMO, doing any non-trivial work in the scheduler thread should be avoided, because a blocked scheduler thread stalls every other task scheduled on it.
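
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the SDK's code: the class name, the method `otherTaskRuns`, the `shutdown` future, and the `worker` executor are all hypothetical stand-ins. It only assumes what the stack trace shows, namely a `CompletableFuture.join()` call made on a single scheduler thread. The sketch compares blocking on the scheduler thread with handing the blocking wait off to a separate executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SchedulerOffloadSketch {

    /**
     * Returns true if an unrelated task gets to run on a single-threaded
     * scheduler while a (hypothetical) channel shutdown is still pending.
     */
    static boolean otherTaskRuns(boolean offload) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // Stand-in for the channel shutdown future; it stays incomplete
        // for the whole measurement, like the stuck shutdown in the report.
        CompletableFuture<Void> shutdown = new CompletableFuture<>();
        CountDownLatch otherTaskRan = new CountDownLatch(1);
        try {
            if (offload) {
                // The scheduler thread only hands the blocking wait to another thread.
                scheduler.execute(() -> worker.execute(shutdown::join));
            } else {
                // The scheduler thread itself parks inside join().
                scheduler.execute(shutdown::join);
            }
            // An unrelated periodic-style task sharing the same scheduler.
            scheduler.schedule(() -> otherTaskRan.countDown(), 50, TimeUnit.MILLISECONDS);
            return otherTaskRan.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            shutdown.complete(null); // release the waiter so both executors can stop
            scheduler.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking in scheduler, other task ran: " + otherTaskRuns(false));
        System.out.println("offloaded to worker, other task ran: " + otherTaskRuns(true));
    }
}
```

With the blocking variant the unrelated task should not run within the timeout, while with the offloaded variant it should. Offloading is only one option; composing on the future without blocking (for example with `whenComplete`) or bounding the wait with a timeout would avoid parking the scheduler thread as well.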