Per @ttyusupov: "Compaction task priorities change frequently because they are based on the number of SST files in the current RocksDB state. This causes tasks to be paused repeatedly while control is transferred to higher-priority tasks. That amplified the effect of leaving the stable state, because the node then works slowly on almost all of its 150-250 background compactions, switching between them instead of completing them one by one."
The following tserver gflag changes helped, as they avoided frequent priority changes to compaction tasks and therefore the repeated pausing/resuming of compactions:
- `compaction_priority_step_size=10`
- `compaction_priority_start_bound=20`
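As a rough illustration of why these flags help, here is a minimal C++ sketch of bucketed priority computation. The flag names come from the issue itself, but the mapping from SST-file count to priority below is an assumption for illustration, not the actual YugabyteDB formula.

```cpp
// Hypothetical illustration of coarse priority buckets; the real
// YugabyteDB formula may differ.
constexpr int kStartBound = 20;  // compaction_priority_start_bound
constexpr int kStepSize = 10;    // compaction_priority_step_size

int CompactionPriority(int num_sst_files) {
  // Below the start bound, every tablet gets the same base priority, so
  // small fluctuations in SST-file count no longer reshuffle the queue.
  if (num_sst_files <= kStartBound) {
    return 0;
  }
  // Above the bound, priority rises only once per kStepSize additional
  // files, instead of on every newly flushed SST file.
  return 1 + (num_sst_files - kStartBound) / kStepSize;
}
```

With these settings, a tablet going from 21 to 29 SST files keeps the same priority instead of being re-prioritized eight times, so running compactions are preempted far less often.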
As part of the analysis by @ttyusupov, we identified two items:
- [x] #24540: Avoid adding a new compaction task for a RocksDB instance that already has a pending one. This is how the compaction logic worked in the original RocksDB, but it was never carried over to compaction tasks when the priority thread pool was implemented. The logic allows one large (major) and one small (minor) compaction to be scheduled per tablet (see the first sketch after this list).
- [ ] #24541 (in review): Determine input files at compaction start time instead of at the time the compaction was scheduled. The original RocksDB compaction queue stored pointers to ColumnFamilyData rather than prebuilt compactions, and the decision about which files to include was made when the compaction started. This was changed in YugabyteDB as part of D2700 (in 2017) to support small/large compaction queues. The plan is to go back to the RocksDB strategy by adding a rocksdb_determine_compaction_input_at_start flag, which switches RocksDB to put a pointer to the ColumnFamilyData and the presumed compaction size (small/large) into the priority thread pool's CompactionTask object. Once the compaction task actually starts, we pick the appropriate compaction and its input files. Delaying file selection should lead to better compaction decisions and potentially avoid repeated work (see the second sketch below).
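A minimal sketch of the first item's deduplication idea, assuming a hypothetical tracker the scheduler consults before enqueueing a task. The actual change (#24540) lives in the priority thread pool; all names here are placeholders.

```cpp
#include <mutex>
#include <unordered_map>

enum class CompactionSizeClass { kSmall, kLarge };

// Tracks which compaction tasks are already queued per RocksDB instance,
// so the scheduler can skip adding duplicates (one small + one large slot
// per tablet, as described above).
class PendingCompactionTracker {
 public:
  // Returns true if the caller should enqueue a new task; false if an
  // equivalent task is already pending for this instance.
  bool TryMarkPending(const void* db, CompactionSizeClass size_class) {
    std::lock_guard<std::mutex> lock(mutex_);
    bool& pending = Slot(db, size_class);
    if (pending) {
      return false;
    }
    pending = true;
    return true;
  }

  // Called once the task starts (or is aborted), freeing the slot so a
  // follow-up compaction of the same class can be scheduled later.
  void MarkNotPending(const void* db, CompactionSizeClass size_class) {
    std::lock_guard<std::mutex> lock(mutex_);
    Slot(db, size_class) = false;
  }

 private:
  struct Slots {
    bool small_pending = false;
    bool large_pending = false;
  };

  bool& Slot(const void* db, CompactionSizeClass size_class) {
    Slots& slots = pending_[db];
    return size_class == CompactionSizeClass::kSmall ? slots.small_pending
                                                     : slots.large_pending;
  }

  std::mutex mutex_;
  std::unordered_map<const void*, Slots> pending_;
};
```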
Together, these two ideas should enable better compaction strategies and prevent too many compactions from piling up in the queue.
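And a second sketch, for the deferred file selection: the queued task captures only the column family and the presumed size class, and picks concrete input files in Run(). The type and function names are placeholders, not the real RocksDB/YugabyteDB API; the actual behavior would be gated by the rocksdb_determine_compaction_input_at_start flag mentioned above.

```cpp
#include <memory>

// Placeholder types standing in for the real RocksDB/YugabyteDB ones.
struct ColumnFamilyData {};
struct Compaction {};  // would describe the chosen input files

// Stub for the compaction picker, run against the *current* LSM state;
// returns nullptr when there is nothing worth compacting anymore.
std::unique_ptr<Compaction> PickCompactionNow(ColumnFamilyData* /*cfd*/,
                                              bool /*presumed_large*/) {
  return nullptr;
}

class CompactionTask {
 public:
  CompactionTask(ColumnFamilyData* cfd, bool presumed_large)
      : cfd_(cfd), presumed_large_(presumed_large) {}

  void Run() {
    // Input files are determined here, at start time, so the decision sees
    // any files flushed or compacted away since the task was queued,
    // instead of operating on a stale snapshot from scheduling time.
    std::unique_ptr<Compaction> compaction =
        PickCompactionNow(cfd_, presumed_large_);
    if (!compaction) {
      return;  // state changed; this compaction is no longer needed
    }
    // ... execute the compaction ...
  }

 private:
  ColumnFamilyData* cfd_;
  bool presumed_large_;
};
```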
Jira Link: DB-12659
Description
A potential bug has been observed in the YugabyteDB cluster where SST files are not being compacted as expected, and node n2 is experiencing frequent compaction pauses along with high CPU usage. Please see the Slack thread linked in the JIRA description.
Setup Details:
Configuration:
Observations:
Issue Type
kind/bug