
[CDC + xCluster + Rolling Upgrade] LRU CDC stopped after ~4 hours of upgrade with GCed error #15066

Open shamanthchandra-yb opened 1 year ago

shamanthchandra-yb commented 1 year ago

Jira Link: DB-4291

Description

We have been running LRU for more than a few weeks now. On Nov 16th 2022, a rolling upgrade to b201 was performed by @Arjun-yb from ~1:30 PM to 2 PM IST.

(Screenshot attached: 2022-11-18 at 3:10:29 PM)

Here are the observations:
• We verify CDC by checking the latest timestamp on Postgres (the sink database); a minimal sketch of such a check is shown after this list.
• At 5:30 PM IST on the same day, when I checked, CDC was still running and I could see the latest timestamp.
• After that, at around 5:46 PM IST, CDC stopped working. The latest timestamp recorded in Postgres is epoch 1668600968682 (i.e. 16 November 2022 17:46:08.682 GMT+05:30).
• CDC was enabled on 2 tables, cdc_test and employees_1. Both of them have stopped.
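For reference, the following is a minimal sketch of the kind of sink freshness check described in the first bullet. It assumes a JDBC-reachable Postgres sink and an epoch-millis column named event_ts; the JDBC URL and column name are placeholders, not the actual test schema.

// Requires the PostgreSQL JDBC driver (org.postgresql:postgresql) on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.time.Duration;
import java.time.Instant;

public class SinkFreshnessCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and column name.
        String url = "jdbc:postgresql://localhost:5432/sinkdb?user=postgres&password=postgres";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT max(event_ts) FROM employees_1")) {
            if (rs.next()) {
                long epochMillis = rs.getLong(1);              // e.g. 1668600968682
                Instant latest = Instant.ofEpochMilli(epochMillis);
                Duration lag = Duration.between(latest, Instant.now());
                System.out.println("Latest sink timestamp: " + latest + ", lag: " + lag);
            }
        }
    }
}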

Universe details: Universe link

Connector log: connector_log.zip

The GCed error is seen when CDC stopped, i.e. almost 4 hours after the upgrade.

2022-11-16 12:16:13,170 ERROR  ||  WorkerSourceTask{id=ybconnector_employees-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted   [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
        at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:42)
        at io.debezium.connector.yugabytedb.YugabyteDBStreamingChangeEventSource.execute(YugabyteDBStreamingChangeEventSource.java:188)
        at io.debezium.connector.yugabytedb.YugabyteDBStreamingChangeEventSource.execute(YugabyteDBStreamingChangeEventSource.java:44)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:165)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:127)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.yb.client.CDCErrorException: Server[7049ebf96b424dff93c4ed07b669c9e7] INTERNAL_ERROR[code 21]: CDCSDK Trying to fetch already GCed intents for transaction eff57145-8029-4c8f-9e16-5077112d4967
        at org.yb.client.TabletClient.dispatchCDCErrorOrReturnException(TabletClient.java:495)
        at org.yb.client.TabletClient.decode(TabletClient.java:426)
        at org.yb.client.TabletClient.decode(TabletClient.java:87)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.yb.client.TabletClient.handleUpstream(TabletClient.java:641)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.yb.client.AsyncYBClient$TabletClientPipeline.sendUpstream(AsyncYBClient.java:2525)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        ... 3 more 
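As the first log line says, the task will not recover until it is manually restarted. A hedged sketch of doing that through the Kafka Connect REST API, assuming the default REST port 8083 on localhost and the connector name ybconnector_employees inferred from the failed task id:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartConnectorTask {
    public static void main(String[] args) throws Exception {
        // Assumptions: Kafka Connect REST API on localhost:8083 and the
        // connector name "ybconnector_employees" taken from the failed task id.
        String connectUrl = "http://localhost:8083";
        String connector = "ybconnector_employees";
        int taskId = 0;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest restart = HttpRequest.newBuilder()
                .uri(URI.create(connectUrl + "/connectors/" + connector + "/tasks/" + taskId + "/restart"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        // HTTP 204 indicates the restart request was accepted by Connect.
        HttpResponse<String> resp = client.send(restart, HttpResponse.BodyHandlers.ofString());
        System.out.println("Restart task " + taskId + ": HTTP " + resp.statusCode());
    }
}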

Another observation from @Arjun-yb: after the upgrade, he saw a soft memory limit issue on the tserver.

Tserver INFO log

I1116 11:53:43.307467 30693 service_util.cc:422] Rejecting Write request: Soft memory limit exceeded for root (at 88.53% of capacity), score: 0.95 [suppressed 3 similar messages]
I1116 11:53:43.823812 28890 mem_tracker.cc:997] Rejecting CQL call: Soft memory limit exceeded for root (at 88.53% of capacity), score: 0.92 [suppressed 1 similar messages]
Postgresql log
I1116 11:54:50.770002 7873 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
Client log
3550132327 [cluster1-worker-32616] ERROR com.datastax.driver.core.ControlConnection - [Control connection] Unexpected error while refreshing node list and token map
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OverloadedException: Queried host (/172.151.24.171:9042) was overloaded:
    at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:476)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:435)
    at com.datastax.driver.core.ControlConnection.refreshNodeListAndTokenMap(ControlConnection.java:843)
    at com.datastax.driver.core.ControlConnection.refreshNodeListAndTokenMap(ControlConnection.java:457)
    at com.datastax.driver.core.Cluster$Manager$NodeListRefreshRequestDeliveryCallback$1.runMayThrow(Cluster.java:3242)
    at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.datastax.driver.core.exceptions.OverloadedException: Queried host (/172.151.24.171:9042) was overloaded:
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:161)
    at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:234)
    at com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:261)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1292)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1210)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at com.datastax.driver.core.InboundTrafficMeter.channelRead(InboundTrafficMeter.java:38)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1304)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:921)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:135)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:546)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:500)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    ... 2 more

Disk details:

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1   40G   27G   14G  67% /
devtmpfs        3.7G     0  3.7G   0% /dev
tmpfs           3.7G  180K  3.7G   1% /dev/shm
tmpfs           3.7G  385M  3.3G  11% /run
tmpfs           3.7G     0  3.7G   0% /sys/fs/cgroup
/dev/nvme1n1    250G  170G   81G  68% /mnt/d0
/dev/nvme2n1    250G  177G   74G  71% /mnt/d1
/dev/nvme3n1    250G  169G   82G  68% /mnt/d2
tmpfs           753M     0  753M   0% /run/user/997
sureshdash2022-yb commented 1 year ago

Hi @shamanthchandra-yb, during the analysis we found the following:

  1. A soft memory limit error on the tserver (172.151.17.239):

I1116 09:25:42.695812 8203 mem_tracker.cc:997] Rejecting CQL call: Soft memory limit exceeded for root (at 85.03% of capacity), score: 1.00

  2. To retain the intents and WAL segments, the CDC service has a separate thread that regularly updates the intent retention time and the min checkpoint in the tablet peer. This thread stopped after I1116 08:15:23; because of this, we also see the CDC min replicated index reset to MAX (this is for xCluster). An illustrative sketch of this mechanism is included after point 3 below.

I1116 12:12:10.053333 8154 tablet_peer.cc:1075] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 [state=RUNNING]: Resetting cdc min replicated index. Seconds since last update: 4090.41
I1116 12:12:10.053362 8154 tablet_peer.cc:1054] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 [state=RUNNING]: Setting cdc min replicated index to 9223372036854775807
I1116 12:12:10.057525 8154 log.cc:1323] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7: Running Log GC on /mnt/d0/yb-data/tserver/wals/table-000033e6000030008000000000004005/tablet-f71eb98a74ae4a8dae073a4d3f6ba510: retaining ops >= 129686440, log segment size = 67108864

  3. Even though GetChanges was still being called for the tablet (up to E1116 12:09) and the stream is active, the separate CDC service thread mentioned in step 2 stopped after I1116 08:15:23, so roughly 4 hours later we see this GC issue (the ~4-hour gap likely corresponds to the intent retention window that was no longer being refreshed).
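To make the mechanism in steps 2 and 3 easier to follow, here is a purely illustrative Java sketch (not the actual tserver code, which is C++; all names below are hypothetical): a background task keeps refreshing the CDC retention barrier, and once it has been silent long enough the barrier falls back to Long.MAX_VALUE (the 9223372036854775807 seen in the log above), at which point log GC is free to reclaim WAL segments and intents.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class RetentionBarrierSketch {
    // MAX_VALUE means "nothing is retained for CDC"; matches the
    // 9223372036854775807 value in the tablet_peer.cc log above.
    private final AtomicLong cdcMinReplicatedIndex = new AtomicLong(Long.MAX_VALUE);
    private volatile long lastUpdateMillis = System.currentTimeMillis();

    // Stands in for the periodic UpdatePeersAndMetrics-style refresh: as long
    // as it keeps running, the retention barrier tracks the consumer's checkpoint.
    void startRefresher(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            cdcMinReplicatedIndex.set(fetchMinCheckpointFromStreams()); // hypothetical checkpoint source
            lastUpdateMillis = System.currentTimeMillis();
        }, 0, 60, TimeUnit.SECONDS);
    }

    // Stands in for log GC: if the refresher has been silent for too long, the
    // barrier is reset to MAX and old WAL segments / intents become reclaimable.
    void maybeRunLogGc(long staleAfterMillis) {
        if (System.currentTimeMillis() - lastUpdateMillis > staleAfterMillis) {
            cdcMinReplicatedIndex.set(Long.MAX_VALUE);
            System.out.println("Resetting cdc min replicated index to " + cdcMinReplicatedIndex.get());
        }
    }

    private long fetchMinCheckpointFromStreams() {
        return 129686440L; // placeholder checkpoint value
    }

    public static void main(String[] args) {
        RetentionBarrierSketch sketch = new RetentionBarrierSketch();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        sketch.startRefresher(scheduler);
        sketch.maybeRunLogGc(TimeUnit.HOURS.toMillis(4)); // the ~4-hour window from the logs
        scheduler.shutdownNow();
    }
}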

Here is the complete analysis doc: https://docs.google.com/document/d/1FlASehcgSHCZOXaVFjSJcMtGxoGVD8aTwFG_AUYtPKk/edit#

Conclusion: The log analysis shows that the "UpdatePeersAndMetrics" thread was aborted, causing this issue. The abort may be due to the "Soft memory limit exceeded" error, which may have created a resource crunch in the cluster. From this, it doesn't look like an issue on the CDC side, but we need further investigation into the real cause of the "UpdatePeersAndMetrics" thread stopping.