yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Tablet counts exceeding the tablet guardrail limit #24589

Open massifcoder-yb opened 1 month ago

massifcoder-yb commented 1 month ago

Jira Link: DB-13628

Description

Steps:

Artifacts: https://drive.google.com/file/d/1k0W604pvSCuuDgwvOSPodrZsEpZ8QWbQ/view?usp=drive_link


druzac commented 2 weeks ago

Looks like tablet splits caused the number of tablet replicas to exceed the limit.

At 11:47, the number of running tablet replicas was 726 (the projected total of 735 in the error below, minus the 9 requested replicas), which was under the tablet limit of 730:

2024-10-23 11:47:12.087 UTC [42654] ERROR:  Invalid table definition: Error creating table tablet_guardrail_pitr_2.verify_range_table_1 on the master: The requested number of tablet replicas (9) would cause the total running tablet replica count (735) to exceed the safe system maximum (730)
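To make the arithmetic explicit, here is a minimal Python sketch that pulls the three numbers out of the error message above and derives the running count. The helper name `parse_guardrail_error` and the regular expression are illustrative assumptions, not part of YugabyteDB; the pattern is matched only against the message text quoted in this issue.

```python
import re

# Message text copied from the postgres log line quoted above.
LOG_LINE = (
    "The requested number of tablet replicas (9) would cause the total "
    "running tablet replica count (735) to exceed the safe system maximum (730)"
)

# Hypothetical pattern, written against the wording of this one message.
PATTERN = re.compile(
    r"requested number of tablet replicas \((\d+)\) would cause the total "
    r"running tablet replica count \((\d+)\) to exceed the safe system maximum \((\d+)\)"
)


def parse_guardrail_error(message):
    """Return the requested, projected, limit and derived running replica counts."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not a tablet guardrail error")
    requested, projected, limit = (int(group) for group in match.groups())
    return {
        "requested": requested,
        "projected": projected,
        "limit": limit,
        # Replicas already running before the rejected request.
        "running": projected - requested,
    }


counts = parse_guardrail_error(LOG_LINE)
print(counts)
```

For this message the derived running count is 735 - 9 = 726, which is 4 replicas below the limit of 730.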

However, tablet splits were still being scheduled around this time:

I1023 11:47:02.460975 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: 05698db926094aca9d026feea9f0bca8 with size 338075 bytes
I1023 11:47:07.080113 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: 1916406415964693844ea8c7c3e08ea9 with size 338052 bytes
I1023 11:47:13.148397 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: d6140c39a1eb404c806de214a8735bce with size 336653 bytes

It seems the tablet split at 11:47:13 was scheduled when there were 729 tablet replicas either live or in the process of being created by an already scheduled split. That is still under the limit, so the split was allowed, but it brought the number of tablet replicas to 731, which is greater than the limit of 730. Tablet splits did seem to respect the limit after that: we see a break in tablet splits until some delete tablet requests at 12:13.
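One way an overshoot like this can happen is a check that compares the current count against the limit rather than the count after the split. The Python sketch below illustrates that difference using the numbers from this comment. It is an assumption about the general pattern, not the actual logic in `tablet_split_manager.cc`; both function names are hypothetical, and the net increase of 2 replicas is simply inferred from 729 going to 731.

```python
TABLET_REPLICA_LIMIT = 730  # limit reported in the error message in this issue


def allows_split_checking_current(current_replicas, limit=TABLET_REPLICA_LIMIT):
    """Hypothetical check: allow the split while the current count is under the limit."""
    return current_replicas < limit


def allows_split_checking_projected(current_replicas, added_replicas,
                                    limit=TABLET_REPLICA_LIMIT):
    """Hypothetical check: allow the split only if the resulting count fits the limit."""
    return current_replicas + added_replicas <= limit


current = 729  # live replicas plus those being created by scheduled splits
added = 2      # net increase inferred from the 729 -> 731 jump described above

print(allows_split_checking_current(current))           # split is allowed
print(allows_split_checking_projected(current, added))  # split is rejected
print(current + added)                                  # resulting replica count
```

With the first check the split at 11:47:13 goes ahead and the count ends at 731. With the second it would have been held back until enough replicas were deleted.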