yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] DDL fails with WaitForBackendsCatalogVersion RPC timeout #21152

Closed shishir2001-yb closed 7 months ago

shishir2001-yb commented 8 months ago

Jira Link: DB-10087

Description

Tried on version: 2.21.1.0-b124

Encountered the following error while running the Cross DB DDLs sample app:

DB Name: postgres_4
DDL query: CREATE INDEX idx2_tb_0 ON tempTable2_tb_0(k)

ERROR: WaitForBackendsCatalogVersion RPC (request call id 661) to 172.151.30.177:9100 timed out after 25.000s
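
For triaging many such failures in the app logs, the error line can be split into its fields. The following is a minimal sketch: the regular expression is inferred only from the single message quoted above, and `parse_rpc_timeout` is a hypothetical helper, not part of YugabyteDB or the sample app.

```python
import re

# Pattern inferred from the one error line quoted in this issue; other
# versions or RPCs may word the message differently.
RPC_TIMEOUT_RE = re.compile(
    r"(?P<rpc>\w+) RPC \(request call id (?P<call_id>\d+)\) "
    r"to (?P<host>[\d.]+):(?P<port>\d+) "
    r"timed out after (?P<seconds>[\d.]+)s"
)


def parse_rpc_timeout(message):
    """Return the fields of an RPC timeout error, or None if it does not match."""
    match = RPC_TIMEOUT_RE.search(message)
    if match is None:
        return None
    return {
        "rpc": match.group("rpc"),
        "call_id": int(match.group("call_id")),
        "host": match.group("host"),
        "port": int(match.group("port")),
        "seconds": float(match.group("seconds")),
    }


error = (
    "ERROR: WaitForBackendsCatalogVersion RPC (request call id 661) "
    "to 172.151.30.177:9100 timed out after 25.000s"
)
info = parse_rpc_timeout(error)
print(info)
```

Grouping parsed results by `rpc` and `host` would show whether the timeouts concentrate on one node or one RPC type.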

Logs: https://drive.google.com/file/d/1hwVm1XqXGbG9kf_9gNxPYLgVlGOvpnx6/view?usp=sharing

Test details

1. Run a workload that changes databases for 20-30 minutes.
2. Schedule point-in-time recovery (PITR) for 10 random databases.
3. Create a backup for one random database.
4. Start and stop the workload after 10 minutes.
5. Note the time for the first PITR.
6. Keep the workload running.
7. Perform another PITR at a random time while the workload continues.
tserver_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),  # 30 minutes
    "yb_client_admin_operation_timeout_sec": str(30 * 60),  # 30 minutes
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
},
master_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "tablet_split_high_phase_shard_count_per_node": 20000,
    "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2 MB
    "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
    "tablet_split_low_phase_shard_count_per_node": 10000,
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
}
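
To make the computed timeout values concrete, here is a minimal sketch that evaluates them and renders a subset of the tserver flags in the usual gflags `--name=value` form. How the test harness actually passes these flags is not shown in this issue; `to_flag_args` is a hypothetical helper used only for illustration.

```python
def to_flag_args(gflags):
    """Render a gflags dict as a list of --name=value arguments."""
    return ["--{}={}".format(name, value) for name, value in gflags.items()]


# Subset of the tserver flags from this report (values copied from the issue).
tserver_gflags = {
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),    # 30 minutes in ms
    "yb_client_admin_operation_timeout_sec": str(30 * 60),  # 30 minutes in s
    "ysql_enable_db_catalog_version_mode": "true",
}

args = to_flag_args(tserver_gflags)
for arg in args:
    print(arg)
```

Both client timeouts evaluate to 30 minutes (1800000 ms and 1800 s). The 25.000s in the reported error matches neither value, so that wait appears to be governed by a different setting, which this issue does not identify.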

List of DDLs executed in sample app

private static List<List<String>> ddlList = List.of(
    List.of(
        "CREATE INDEX idx1 ON ? (k)",
        "DROP INDEX idx1"),
    List.of(
        "CREATE TABLE tempTable1 AS SELECT * FROM ? limit 1000000",
        "ALTER TABLE tempTable1 RENAME TO tempTable1_new",
        "DROP TABLE tempTable1_new"),
    List.of(
        "CREATE MATERIALIZED VIEW mv1 as SELECT k from ? limit 10000",
        "REFRESH MATERIALIZED VIEW mv1",
        "DROP MATERIALIZED VIEW mv1"),
    List.of(
        "ALTER TABLE ? ADD newColumn1 TEXT DEFAULT 'dummyString'",
        "ALTER TABLE ? DROP newColumn1"),
    List.of(
        "ALTER TABLE ? ADD newColumn2 TEXT NULL",
        "ALTER TABLE ? DROP newColumn2"),
    List.of(
        "CREATE VIEW view1_? AS SELECT k from ?",
        "DROP VIEW view1_?"),
    List.of(
        "ALTER TABLE ? ADD newColumn3 TEXT DEFAULT 'dummyString'",
        "ALTER TABLE ? ALTER newColumn3 TYPE VARCHAR(1000)",
        "ALTER TABLE ? DROP newColumn3"),
    List.of(
        "CREATE TABLE tempTable2 AS SELECT * FROM ? limit 1000000",
        "CREATE INDEX idx2 ON tempTable2(k)",
        "ALTER TABLE ? ADD newColumn4 TEXT DEFAULT 'dummyString'",
        "ALTER TABLE tempTable2 ADD newColumn2 TEXT DEFAULT 'dummyString'",
        "TRUNCATE table ? cascade",
        "ALTER TABLE ? DROP newColumn4",
        "ALTER TABLE tempTable2 DROP newColumn2",
        "DROP INDEX idx2",
        "DROP TABLE tempTable2"),
    List.of(
        "CREATE VIEW view2_? AS SELECT k from ?",
        "CREATE MATERIALIZED VIEW mv2 as SELECT k from ? limit 10000",
        "REFRESH MATERIALIZED VIEW mv2",
        "DROP MATERIALIZED VIEW mv2",
        "DROP VIEW view2_?")
);
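
The `?` in these templates stands for a table name. A minimal sketch of the substitution, assuming a plain textual replacement; `expand_ddl` and the table name `tb_0` are illustrative, and the sample app's real expansion logic is not shown in this issue. The failing statement (`idx2_tb_0`, `tempTable2_tb_0`) suggests the app also suffixes temporary object names with the table name, which this sketch does not model.

```python
def expand_ddl(template, table_name):
    """Replace every '?' placeholder in a DDL template with the table name."""
    return template.replace("?", table_name)


# One DDL group from the list above, expanded for a hypothetical table tb_0.
group = [
    "CREATE VIEW view1_? AS SELECT k from ?",
    "DROP VIEW view1_?",
]
expanded = [expand_ddl(ddl, "tb_0") for ddl in group]
for ddl in expanded:
    print(ddl)
```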

Issue Type

kind/bug


myang2021 commented 7 months ago

Per @jaki's Slack message:

I'm seeing other DDLs time out as well:

Exception occurred during DDL postgres_4 DDL query ALTER TABLE tempTable2_tb_0 ADD newColumn5 TEXT DEFAULT 'dummyString': ERROR: Timed out waiting for Alter Table

If the system is that overloaded, perhaps this is expected.

It looks like the master is overloaded, which is causing various DDLs to time out, not just CREATE INDEX.
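
If these timeouts are indeed a symptom of overload, one possible client-side mitigation is to retry a timed-out DDL with backoff. The sketch below is an assumption-laden illustration, not a fix proposed in this issue: the marker strings are taken only from the two error messages quoted in this thread, and `execute` is a hypothetical callable supplied by the caller that runs one statement.

```python
import time

# Substrings taken from the two timeout errors quoted in this issue.
TIMEOUT_MARKERS = ("timed out after", "Timed out waiting for")


def is_timeout_error(message):
    """Return True if the error text looks like one of the reported timeouts."""
    return any(marker in message for marker in TIMEOUT_MARKERS)


def run_ddl_with_retry(execute, ddl, attempts=3, base_delay_s=5.0, sleep=time.sleep):
    """Run a DDL, retrying timeout errors with exponential backoff.

    Non-timeout errors, and the final failed attempt, are re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return execute(ddl)
        except Exception as exc:
            is_last_attempt = attempt == attempts - 1
            if is_last_attempt or not is_timeout_error(str(exc)):
                raise
            # Delay doubles on each retry: base, 2x base, 4x base, ...
            sleep(base_delay_s * (2 ** attempt))
```

A DDL that timed out on the client may still have taken effect on the server, so a blind retry of a statement such as CREATE INDEX may fail with an "already exists" error. A real workload would need to check for that case.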