Open shamanthchandra-yb opened 1 year ago
Hi @shamanthchandra-yb, during the analysis we found the following soft memory limit error on the tserver (172.151.17.239):
I1116 09:25:42.695812 8203 mem_tracker.cc:997] Rejecting CQL call: Soft memory limit exceeded for root (at 85.03% of capacity), score: 1.00
I1116 12:12:10.053333 8154 tablet_peer.cc:1075] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 [state=RUNNING]: Resetting cdc min replicated index. Seconds since last update: 4090.41
I1116 12:12:10.053362 8154 tablet_peer.cc:1054] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 [state=RUNNING]: Setting cdc min replicated index to 9223372036854775807
I1116 12:12:10.057525 8154 log.cc:1323] T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7: Running Log GC on /mnt/d0/yb-data/tserver/wals/table-000033e6000030008000000000004005/tablet-f71eb98a74ae4a8dae073a4d3f6ba510: retaining ops >= 129686440, log segment size = 67108864
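For anyone repeating this analysis on other tserver logs, here is a minimal Python sketch that splits the three tablet log entries quoted above and pulls out the fields that matter. It is not part of the original analysis. The regular expression assumes the usual glog header layout (level, MMDD, time, thread id, `file:line]`), and the helper name `split_entries` is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed glog header: level letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_HEADER = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] "
)


def split_entries(text):
    """Split text containing one or more glog entries into a list of dicts."""
    headers = list(GLOG_HEADER.finditer(text))
    entries = []
    for i, match in enumerate(headers):
        # An entry's message runs until the next header (or the end of the text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        entry = match.groupdict()
        entry["message"] = text[match.end():end].strip()
        entries.append(entry)
    return entries


# The three entries from the comment above, copied verbatim and run together.
SAMPLE = (
    "I1116 12:12:10.053333 8154 tablet_peer.cc:1075] "
    "T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 "
    "[state=RUNNING]: Resetting cdc min replicated index. "
    "Seconds since last update: 4090.41 "
    "I1116 12:12:10.053362 8154 tablet_peer.cc:1054] "
    "T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7 "
    "[state=RUNNING]: Setting cdc min replicated index to 9223372036854775807 "
    "I1116 12:12:10.057525 8154 log.cc:1323] "
    "T f71eb98a74ae4a8dae073a4d3f6ba510 P 7049ebf96b424dff93c4ed07b669c9e7: "
    "Running Log GC on /mnt/d0/yb-data/tserver/wals/"
    "table-000033e6000030008000000000004005/"
    "tablet-f71eb98a74ae4a8dae073a4d3f6ba510: "
    "retaining ops >= 129686440, log segment size = 67108864"
)

entries = split_entries(SAMPLE)

# How long the cdc min replicated index had gone without an update.
stale_seconds = float(
    re.search(r"Seconds since last update: ([\d.]+)", entries[0]["message"]).group(1)
)
reset_time = datetime.strptime(entries[0]["time"], "%H:%M:%S.%f")
last_update = reset_time - timedelta(seconds=stale_seconds)

# The value the index was reset to, compared with the largest signed 64-bit integer.
new_index = int(
    re.search(r"cdc min replicated index to (\d+)", entries[1]["message"]).group(1)
)
is_int64_max = new_index == 2**63 - 1

# First op id that Log GC keeps after the reset.
retain_from = int(re.search(r"retaining ops >= (\d+)", entries[2]["message"]).group(1))

print("last update at about", last_update.strftime("%H:%M:%S"), "(log time)")
print("index reset to INT64_MAX:", is_int64_max)
print("Log GC retains ops >=", retain_from)
```

The reset value equals the largest signed 64-bit integer, so it looks like a sentinel rather than a real op index. That would fit with Log GC running immediately after the reset, though the exact semantics should be confirmed against the tserver code.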
Here is the complete analysis doc: https://docs.google.com/document/d/1FlASehcgSHCZOXaVFjSJcMtGxoGVD8aTwFG_AUYtPKk/edit#
Conclusion: The log analysis shows that the “UpdatePeersAndMetrics” thread was aborted, which caused this issue. The abort may be due to the soft memory limit exceeded error, which may have created a resource crunch in the cluster. Based on this, it does not look like an issue on the CDC side, but we need to investigate further why the “UpdatePeersAndMetrics” thread stopped.
Jira Link: DB-4291
Description
We have been running the LRU for more than a few weeks. On Nov 16th 2022, a rolling upgrade to b201 was done by @Arjun-yb from ~1:30 PM to 2 PM IST.
Here are the observations:
• We verify CDC by checking the latest timestamp on postgres (sink database).
• At 5:30 PM IST on the same day, CDC was still running and I was able to see the latest timestamp.
• But after that, around 5:46 PM IST, CDC stopped working. The latest timestamp recorded in postgres is epoch 1668600968682 (i.e. 16 November 2022 17:46:08.682 GMT+05:30).
• CDC was enabled on 2 tables, cdc_test and employees_1. Both of them have stopped.
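The epoch-to-IST conversion in the observations can be double-checked with a short Python sketch. The helper name `epoch_ms_to_ist` is made up for illustration, and the epoch is assumed to be in milliseconds, as its 13 digits suggest.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is UTC+05:30.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def epoch_ms_to_ist(epoch_ms: int) -> datetime:
    """Convert a Unix epoch in milliseconds to a timezone-aware IST datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    utc_time = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return (utc_time + timedelta(milliseconds=millis)).astimezone(IST)


last_record = epoch_ms_to_ist(1668600968682)
print(last_record.isoformat(timespec="milliseconds"))
# prints 2022-11-16T17:46:08.682+05:30
```

This matches the 17:46:08.682 GMT+05:30 timestamp reported for the last record that reached postgres.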
Universe details: Universe link
Connector log: connector_log.zip
A GCed error is seen at the time CDC stopped, i.e. almost 4 hours after the upgrade.
Other observations from @Arjun-yb: after the upgrade, he saw the soft memory limit issue on the tserver.
Tserver INFO log
Disk details: