Open massifcoder-yb opened 1 month ago
Looks like tablet splits caused the number of tablet replicas to exceed the limit.
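At 11:47 there were 726 tablet replicas, which was under the tablet replica limit of 730. This follows from a rejected table creation at that time, where 9 requested replicas would have raised the total to 735: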
2024-10-23 11:47:12.087 UTC [42654] ERROR: Invalid table definition: Error creating table tablet_guardrail_pitr_2.verify_range_table_1 on the master: The requested number of tablet replicas (9) would cause the total running tablet replica count (735) to exceed the safe system maximum (730)
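The replica count at 11:47 can be recovered from the numbers in that error message. A quick arithmetic check, assuming the reported total (735) already includes the 9 requested replicas, which is consistent with the figure of 726 quoted above:

```python
# Numbers taken directly from the error message above.
requested_replicas = 9
projected_total = 735
safe_maximum = 730

# Assumption: projected_total = current replicas + requested replicas.
current_replicas = projected_total - requested_replicas
headroom = safe_maximum - current_replicas

print(current_replicas)  # 726
print(headroom)          # 4
```

With only 4 replicas of headroom left, the 9-replica table creation was correctly rejected.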
However tablet splits were still scheduled around this time:
I1023 11:47:02.460975 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: 05698db926094aca9d026feea9f0bca8 with size 338075 bytes
I1023 11:47:07.080113 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: 1916406415964693844ea8c7c3e08ea9 with size 338052 bytes
I1023 11:47:13.148397 40149 tablet_split_manager.cc:550] Scheduled split for tablet_id: d6140c39a1eb404c806de214a8735bce with size 336653 bytes
It seems the tablet split at 11:47:13 was scheduled when there were 729 tablet replicas either live or in the process of being created via a scheduled split. This additional split brought the number of tablet replicas to 731, which is greater than the limit of 730. After that, tablet splits did seem to respect the limit: there is a break in tablet splits until some delete tablet requests at 12:13.
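One possible reading of this, not confirmed by the logs, is that the split scheduler only checks whether the current count is below the limit and does not account for the replicas the new split itself adds. The sketch below illustrates that hypothesis with the numbers above. The function names are hypothetical and are not taken from tablet_split_manager.cc, and the per-split increase of 2 is simply 731 - 729 as reported here.

```python
LIMIT = 730  # safe system maximum from the error message


def naive_check(counted: int, limit: int = LIMIT) -> bool:
    """Hypothetical check: admit a split while the counted replicas are under the limit."""
    return counted < limit


def reserving_check(counted: int, added_by_split: int, limit: int = LIMIT) -> bool:
    """Hypothetical check: admit a split only if the post-split total still fits."""
    return counted + added_by_split <= limit


counted = 729        # live + in-flight replicas at 11:47:13, per the analysis above
added_by_split = 2   # 731 - 729, as reported above

print(naive_check(counted))                      # True: split is scheduled, total becomes 731
print(reserving_check(counted, added_by_split))  # False: split would be deferred
```

If the real check resembles `naive_check`, the last split admitted before the limit can overshoot it by up to the number of replicas a single split adds, which would match the observed 731.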
Jira Link: DB-13628
Description
Steps:
Artifacts: https://drive.google.com/file/d/1k0W604pvSCuuDgwvOSPodrZsEpZ8QWbQ/view?usp=drive_link