Per @ttyusupov: "Compaction task priorities change frequently because they are based on the number of SST files in the current RocksDB state. This causes tasks to be paused repeatedly while control is transferred to higher-priority tasks. That amplified the effect of leaving the stable state, because the node then works slowly on almost all of its 150-250 background compactions, switching between them instead of completing them one by one."
The following tserver gflag changes helped, as they avoided frequent priority changes to compaction tasks and therefore the repeated pausing/resuming of compactions:
- `compaction_priority_step_size=10`
- `compaction_priority_start_bound=20`
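As a rough illustration of why these flags help, here is a minimal C++ sketch of bucketed priority computation. The flag names come from the issue itself, but the mapping from SST-file count to priority below is an assumption for illustration, not the actual YugabyteDB formula.

```cpp
// Hypothetical illustration of coarse priority buckets; the real
// YugabyteDB formula may differ.
constexpr int kStartBound = 20;  // compaction_priority_start_bound
constexpr int kStepSize = 10;    // compaction_priority_step_size

int CompactionPriority(int num_sst_files) {
  // Below the start bound, every tablet gets the same base priority, so
  // small fluctuations in SST-file count no longer reshuffle the queue.
  if (num_sst_files <= kStartBound) {
    return 0;
  }
  // Above the bound, priority rises only once per kStepSize additional
  // files, instead of on every newly flushed SST file.
  return 1 + (num_sst_files - kStartBound) / kStepSize;
}
```

With these settings, a tablet going from 21 to 29 SST files keeps the same priority instead of being re-prioritized eight times, so running compactions are preempted far less often.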
As part of the analysis by @ttyusupov, we identified two items:
- [x] #24540: Avoid adding a new compaction task for a RocksDB instance that already has a pending one. This is how the compaction logic worked in the original RocksDB, but it was never carried over to compaction tasks when the priority thread pool was implemented. The logic allows one large (major) and one small (minor) compaction to be scheduled per tablet (see the first sketch after this list).
- [ ] #24541 (in review): Determine input files at compaction start time instead of at the time the compaction was scheduled. The original RocksDB compaction queue stored pointers to ColumnFamilyData rather than prebuilt compactions, and the decision about which files to include was made when the compaction started. This was changed in YugabyteDB as part of D2700 (in 2017) to support small/large compaction queues. The plan is to go back to the RocksDB strategy by adding a rocksdb_determine_compaction_input_at_start flag, which switches RocksDB to put a pointer to the ColumnFamilyData and the presumed compaction size (small/large) into the priority thread pool's CompactionTask object. Once the compaction task actually starts, we pick the appropriate compaction and its input files. Delaying file selection should lead to better compaction decisions and potentially avoid repeated work (see the second sketch below).
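A minimal sketch of the first item's deduplication idea, assuming a hypothetical tracker the scheduler consults before enqueueing a task. The actual change (#24540) lives in the priority thread pool; all names here are placeholders.

```cpp
#include <mutex>
#include <unordered_map>

enum class CompactionSizeClass { kSmall, kLarge };

// Tracks which compaction tasks are already queued per RocksDB instance,
// so the scheduler can skip adding duplicates (one small + one large slot
// per tablet, as described above).
class PendingCompactionTracker {
 public:
  // Returns true if the caller should enqueue a new task; false if an
  // equivalent task is already pending for this instance.
  bool TryMarkPending(const void* db, CompactionSizeClass size_class) {
    std::lock_guard<std::mutex> lock(mutex_);
    bool& pending = Slot(db, size_class);
    if (pending) {
      return false;
    }
    pending = true;
    return true;
  }

  // Called once the task starts (or is aborted), freeing the slot so a
  // follow-up compaction of the same class can be scheduled later.
  void MarkNotPending(const void* db, CompactionSizeClass size_class) {
    std::lock_guard<std::mutex> lock(mutex_);
    Slot(db, size_class) = false;
  }

 private:
  struct Slots {
    bool small_pending = false;
    bool large_pending = false;
  };

  bool& Slot(const void* db, CompactionSizeClass size_class) {
    Slots& slots = pending_[db];
    return size_class == CompactionSizeClass::kSmall ? slots.small_pending
                                                     : slots.large_pending;
  }

  std::mutex mutex_;
  std::unordered_map<const void*, Slots> pending_;
};
```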
Together, these two ideas should enable better compaction strategies and prevent too many compactions from piling up in the queue.
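And a second sketch, for the deferred file selection: the queued task captures only the column family and the presumed size class, and picks concrete input files in Run(). The type and function names are placeholders, not the real RocksDB/YugabyteDB API; the actual behavior would be gated by the rocksdb_determine_compaction_input_at_start flag mentioned above.

```cpp
#include <memory>

// Placeholder types standing in for the real RocksDB/YugabyteDB ones.
struct ColumnFamilyData {};
struct Compaction {};  // would describe the chosen input files

// Stub for the compaction picker, run against the *current* LSM state;
// returns nullptr when there is nothing worth compacting anymore.
std::unique_ptr<Compaction> PickCompactionNow(ColumnFamilyData* /*cfd*/,
                                              bool /*presumed_large*/) {
  return nullptr;
}

class CompactionTask {
 public:
  CompactionTask(ColumnFamilyData* cfd, bool presumed_large)
      : cfd_(cfd), presumed_large_(presumed_large) {}

  void Run() {
    // Input files are determined here, at start time, so the decision sees
    // any files flushed or compacted away since the task was queued,
    // instead of operating on a stale snapshot from scheduling time.
    std::unique_ptr<Compaction> compaction =
        PickCompactionNow(cfd_, presumed_large_);
    if (!compaction) {
      return;  // state changed; this compaction is no longer needed
    }
    // ... execute the compaction ...
  }

 private:
  ColumnFamilyData* cfd_;
  bool presumed_large_;
};
```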
Jira Link: DB-12659
Description
A potential bug has been observed in the YugabyteDB cluster where SST files are not being compacted as expected, and node n2 is experiencing frequent compaction pauses along with high CPU usage. Please see the Slack thread linked in the JIRA description.
Setup Details:
Configuration:
Observations:
Issue Type
kind/bug