Open shishir2001-yb opened 4 months ago
Issue seems to be racing DDL and PITR. Two schema bump requests are in flight.
In this run, the schema version mutation from the drop index DDL wanted to set the schema version to 10. The schema version mutation from the restore wanted to set the schema version to 9. The restore mutation was processed after the drop index mutation on the tservers, which left them in the state of having all schemas - 8, 9, and 10 - but with their current schema version set to 9. On a tablet report, master noticed the discrepancy - its own schema for the table was 10 - and it sent an alter table. Tservers printed this error message because they did have a schema for version 10, it just wasn't their current schema.
# restore touching tablet 3e220a7081da43f9acd970fa26ffd2ad. schema version for this table is initially 8.
I0401 18:27:06.619601 31368 master_snapshot_coordinator.cc:2095] Creating a new restoration entry with id ff4fe147-9cc9-48f7-9981-a4b88938db59
I0401 18:27:06.621199 33107 master_snapshot_coordinator.cc:584] Restore sys catalog from snapshot: { id: 0c6e334b-5944-41b4-afc4-fa48d1020058 snapshot_hybrid_time: { physical: 1711995401844906 } schedule_id: fdccc0ae-e3ce-4e1f-856a-8a23618beaea previous_snapshot_hybrid_time: { physical: 1711993631997185 } version: 3 initial_state: CREATING tablets: [{ id: 08fb93825a8e481a8d0eb7a175c28390 state: 2 last_error: OK running: 0 }, { id: ef25f698c9174f19a9063c253209280e state: 2 last_error: OK running: 0 }, { id: 2f6d373059fb4d40bbbd320e4853bcf8 state: 2 last_error: OK running: 0 }, { id: ab8dabb20ba64f15b83829ff4ce55bb9 state: 2 last_error: OK running: 0 }, { id: 864c71e225224603b6919eea175d9ff8 state: 2 last_error: OK running: 0 }, { id: 3e220a7081da43f9acd970fa26ffd2ad state: 2 last_error: OK running: 0 }, { id: 5b96194bb5fe4fdab4ac26fd0a7f3f6b state: 2 last_error: OK running: 0 }, { id: c3b47eeabcbf4aaaac6cb6475abbdd70 state: 2 last_error: OK running: 0 }, { id: 0c9212f0ba0645ce96bf4bfd3c1fd7fa state: 2 last_error: OK running: 0 }] }, schedule: { id: fdccc0ae-e3ce-4e1f-856a-8a23618beaea options: filter { tables { tables { table_name: "" namespace { id: "0000405a000030008000000000000000" name: "postgres_90" database_type: YQL_DATABASE_PGSQL } } } } interval_sec: 12000 retention_duration_sec: 60000 restoration_times: 7012333183402876928 } at { physical: 1711994529957599 }, op id: 1.665832
I0401 18:27:08.060552 33107 master_snapshot_coordinator.cc:1205] PITR: ff4fe147-9cc9-48f7-9981-a4b88938db59, tablets to restore: [c3b47eeabcbf4aaaac6cb6475abbdd70, ab8dabb20ba64f15b83829ff4ce55bb9, 3e220a7081da43f9acd970fa26ffd2ad, 0c9212f0ba0645ce96bf4bfd3c1fd7fa, ef25f698c9174f19a9063c253209280e, 864c71e225224603b6919eea175d9ff8, 2f6d373059fb4d40bbbd320e4853bcf8, 08fb93825a8e481a8d0eb7a175c28390, 5b96194bb5fe4fdab4ac26fd0a7f3f6b]
# this updates schema version to 9 on master.
I0401 18:27:08.325865 177726 restore_sys_catalog_state.cc:409] PITR: Patching the schema version for table 0000405a00003000800000000000426c. Existing version 8, restoring version 7
# this drop index DDL bumps the schema version on the master leader. so it is now 10.
I0401 18:27:10.307303 350355 catalog_manager.cc:6077] MarkIndexInfoFromTableForDeletion table 0000405a00003000800000000000426c index 0000405a00003000800000000000428f multi_stage=0
I0401 18:27:10.307324 350355 catalog_manager.cc:6112] DeleteIndexInfoFromTable table 0000405a00003000800000000000426c index 0000405a00003000800000000000428f
I0401 18:27:10.309399 350361 async_rpc_tasks.cc:201] Alter Table RPC for tablet 0x00002467659c0600 -> ab8dabb20ba64f15b83829ff4ce55bb9 (table tb_0 [id=0000405a00003000800000000000426c]) (tb_0 [id=0000405a00003000800000000000426c]) (task=0x0000246756f64558, state=kWaiting): Unable to pick replica: Not found (yb/master/catalog_entity_info.cc:222): No leader found for tablet ab8dabb20ba64f15b83829ff4ce55bb9 (table tb_0 [id=0000405a00003000800000000000426c]) with 0 replicas: [].
# this alter schema just bumps version from 8 to 10. successful, no version collision
I0401 18:27:10.311276 266648 tablet.cc:2350] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: Alter schema from Schema... version 8 to ... version 10
I0401 18:27:10.311378 266648 doc_read_context.cc:48] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: DocReadContext, copy and add: [0, 1, 2, 3, 4, 5, 6, 7, 8, 10], added: 10
# tablet leader starts restoring 3e220a7081da43f9acd970fa26ffd2ad. during this tablet restore, schema version is set to 9 as that is set in the tablet restore operation.
I0401 18:27:10.362481 43986 backup_service.cc:70] Processing TabletSnapshotOp for tablet 3e220a7081da43f9acd970fa26ffd2ad from 172.151.22.165:35403: RESTORE_ON_TABLET
I0401 18:27:10.365592 204524 version_set.cc:2966] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: Recovered from manifest file:/mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad/MANIFEST-000011 succeeded,manifest_file_number is 11, next_file_number is 13, last_sequence is 1125899906842809, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000006871c9ece40 -> Global filter: <invalid>, Cotables filter: []{ op_id: 1.112 hybrid_time: { physical: 1711995269632341 } history_cutoff: { cotables cutoff: <invalid>, primary cutoff: <invalid> } max_value_level_ttl_expiration_time: <invalid> primary_schema_version: 8 cotable_schema_versions: [] }
I0401 18:27:10.374156 204524 doc_read_context.cc:48] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: DocReadContext, copy and add: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], added: 9
I0401 18:27:10.374197 204524 tablet_snapshots.cc:441] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: Merging metadata with restored: /mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad.snapshots/0c6e334b-5944-41b4-afc4-fa48d1020058/tablet.metadata , force overwrite of schema packing 0
I0401 18:27:10.374764 204524 doc_read_context.cc:78] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: LogAfterMerge: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], overwrite: 0
I0401 18:27:10.391292 204524 tablet_snapshots.cc:464] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: Checkpoint restored from /mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad.snapshots/0c6e334b-5944-41b4-afc4-fa48d1020058
I0401 18:27:10.391300 204524 tablet_snapshots.cc:465] T 3e220a7081da43f9acd970fa26ffd2ad P 7e53919270d141569de17c96ece87561: Re-enabling compactions
I0401 18:27:10.409065 33000 version_set.cc:2966] T 3e220a7081da43f9acd970fa26ffd2ad P fc707a69eb1f48339771e125e821d7ec: Recovered from manifest file:/mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad/MANIFEST-000011 succeeded,manifest_file_number is 11, next_file_number is 13, last_sequence is 1125899906842809, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000014f87913b680 -> Global filter: <invalid>, Cotables filter: []{ op_id: 1.112 hybrid_time: { physical: 1711995269632341 } history_cutoff: { cotables cutoff: <invalid>, primary cutoff: <invalid> } max_value_level_ttl_expiration_time: <invalid> primary_schema_version: 8 cotable_schema_versions: [] }
I0401 18:27:10.420894 33000 doc_read_context.cc:48] T 3e220a7081da43f9acd970fa26ffd2ad P fc707a69eb1f48339771e125e821d7ec: DocReadContext, copy and add: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], added: 9
I0401 18:27:10.428366 33000 version_set.cc:2966] T 3e220a7081da43f9acd970fa26ffd2ad P fc707a69eb1f48339771e125e821d7ec [R]: Recovered from manifest file:/mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad/MANIFEST-000013 succeeded,manifest_file_number is 13, next_file_number is 15, last_sequence is 1125899906842809, log_number is 0,prev_log_number is 0,max_column_family is 0, flushed_values is 0x000014f88f597b00 -> Global filter: <invalid>, Cotables filter: []{ op_id: 1.118 hybrid_time: { physical: 1711996030362598 } history_cutoff: { cotables cutoff: <invalid>, primary cutoff: <invalid> } max_value_level_ttl_expiration_time: <invalid> primary_schema_version: <nullopt> cotable_schema_versions: [] }
I0401 18:27:10.442688 33000 tablet_snapshots.cc:464] T 3e220a7081da43f9acd970fa26ffd2ad P fc707a69eb1f48339771e125e821d7ec: Checkpoint restored from /mnt/d0/yb-data/tserver/data/rocksdb/table-0000405a00003000800000000000426c/tablet-3e220a7081da43f9acd970fa26ffd2ad.snapshots/0c6e334b-5944-41b4-afc4-fa48d1020058
I0401 18:27:10.442695 33000 tablet_snapshots.cc:465] T 3e220a7081da43f9acd970fa26ffd2ad P fc707a69eb1f48339771e125e821d7ec: Re-enabling compactions
# restore finishes
I0401 18:27:10.406605 350359 state_with_tablets.cc:161] Restoration[ff4fe147-9cc9-48f7-9981-a4b88938db59/0c6e334b-5944-41b4-afc4-fa48d1020058/fdccc0ae-e3ce-4e1f-856a-8a23618beaea]: Finished RESTORING at 3e220a7081da43f9acd970fa26ffd2ad, 9 was running
I0401 18:27:10.406663 350359 catalog_manager_ext.cc:3109] Handling Restore Tablet Snapshot Response for tablet 3e220a7081da43f9acd970fa26ffd2ad (table tb_0 [id=0000405a00003000800000000000426c]) OK
# later the tservers send full tablet reports to master.
I0401 18:28:09.660087 31997 heartbeater.cc:411] P 7e53919270d141569de17c96ece87561: Sending a full tablet report to master...
# version is 9 instead of master's 10, so master sends an alterschema request
I0401 18:28:09.899121 33053 catalog_manager.cc:8242] TS 7e53919270d141569de17c96ece87561 does not have the latest schema for tablet 3e220a7081da43f9acd970fa26ffd2ad (table tb_0 [id=0000405a00003000800000000000426c]). Expected version 10 got 9
# alter schema request tells tservers to add schema version 10, which they already have.
E0401 18:28:09.903576 266648 schema_packing.cc:423] Duplicate schema version: 10,
Jira Link: DB-10127
Description
Tried on version: 2.21.1.0-b124
Logs: https://drive.google.com/file/d/15ZRoz1VHWiKI4fbhMVnQsj0NzJVHwmQM/view?usp=sharing Encountered the following Tserver error while running PITR in parallel with Cross-DB-DDLs sample app. Find the whole error in logs.
Test details
List of DDLs executed in sample app
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information