Closed pilshchikov closed 1 year ago
Possibly DUP of #15090
@rthallamko3 this is not a tablet splitting issue. I've checked the latest logs provided by @pilshchikov: no tablet splitting related logs are found, metrics have no splitting operations also, and the full error message has no children specified:
W0412 23:05:05.482554 22593 meta_cache.cc:857] 0x00005600a9e641d8 -> LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1): Not found (yb/master/catalog_manager.cc:11572): Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC (split child tablet IDs [])
W0412 23:05:05.482600 22593 tablet_rpc.cc:460] Not found (yb/master/catalog_manager.cc:11572): Failed Read(tablet: 194072ae1600413e9b8625512864e57d, num_ops: 1, num_attempts: 4, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 194072ae1600413e9b8625512864e57d (no tablet server available) after 4 attempt(s): LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC (split child tablet IDs [])
There's an attempt to access the tablet 8 minutes after it had been deleted. It's actually looking like a stale request/cache issue on a client:
I0412 22:57:16.686081 31065 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=22b05d8bb58f4818a51656342af043de (task=0x000055e5c5d74c58, state=kRunning): TS 22b05d8bb58f4818a51656342af043de: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
I0412 22:57:16.694391 31067 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=d273c45e2a844b7eaa28b3b3366d2cc1 (task=0x000055e5bea80558, state=kRunning): TS d273c45e2a844b7eaa28b3b3366d2cc1: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
I0412 22:57:18.089290 31077 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=5fb3760afbce45c9b91892a33be4b7a8 (task=0x000055e5c1976a98, state=kRunning): TS 5fb3760afbce45c9b91892a33be4b7a8: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
W0412 23:05:05.482035 21003 catalog_manager.cc:11652] Expected replicas 3 but found 0 for tablet 194072ae1600413e9b8625512864e57d: expected_live_replicas: 3 expected_read_replicas: 0
W0412 23:05:05.543643 2801 backfill_index.cc:1512] TS d273c45e2a844b7eaa28b3b3366d2cc1: backfill failed for tablet e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]) no further retry: Illegal state (yb/tablet/tablet.cc:2256): ERROR: LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC response was error { code: OPERATION_NOT_SUPPORTED status { code: ILLEGAL_STATE message: "ERROR: LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC" source_file: "../../src/yb/tablet/tablet.cc" source_line: 2256 errors: "\000" } } propagated_hybrid_time: 6886771529663635456 backfilled_until: "" failed_index_ids: "00004000000030008000000000004035" number_rows_processed: 0
I0412 23:05:05.544221 2801 backfill_index.cc:1264] Failed to backfill the tablet 0x000055e5bafe42c0 -> e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]): Illegal state (yb/tablet/tablet.cc:2256): ERROR: LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC
E0412 23:05:05.544232 2801 backfill_index.cc:934] Backfill Index Table(s) { table_172_151_20_226_87e80b_sqlsecondaryindex_index } failed to backfill the index: [00004000000030008000000000004035] due to Illegal state (yb/tablet/tablet.cc:2256): ERROR: LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC
Looked into the logs deeper:
I0412 22:48:08.057664 21002 catalog_manager.cc:3542] CreateTable from 172.151.31.84:34401:
name: "sqlfkusers"
table_id: "00004000000030008000000000004030"
…
I0412 22:48:08.290817 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 194072ae1600413e9b8625512864e57d: 22b05d8bb58f4818a51656342af043de
I0412 22:48:08.290853 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 4b2d29c10c944716a8e2116bd0b5af87: 5fb3760afbce45c9b91892a33be4b7a8
I0412 22:48:08.290885 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 6583680a4df64141a98f0dc6c5871ea3: d273c45e2a844b7eaa28b3b3366d2cc1
…
I0412 22:57:15.625838 21341 catalog_manager.cc:5773] Servicing DeleteTable request from 172.151.25.245:45939: table { table_id: "00004000000030008000000000004030" } is_index_table: false
…
I0412 22:57:18.335369 31077 catalog_manager.cc:6238] Marking table as DELETED: sqlfkusers [id=00004000000030008000000000004030]
…
I0412 23:04:59.902832 25332 catalog_manager.cc:3542] CreateTable from 172.151.28.43:34555:
name: "table_172_151_20_226_87e80b_sqlsecondaryindex"
table_id: "00004000000030008000000000004030"
…
I0412 23:04:59.913269 19887 catalog_manager.cc:11074] Initial tserver protege for tablet bd559be11f764e67aefe44fbb538d7fd: 22b05d8bb58f4818a51656342af043de
I0412 23:04:59.913306 19887 catalog_manager.cc:11074] Initial tserver protege for tablet dc3d9490157d472490a73bd7d99f9fa1: 5fb3760afbce45c9b91892a33be4b7a8
I0412 23:04:59.913340 19887 catalog_manager.cc:11074] Initial tserver protege for tablet e78883fb71ed494d96bb3ba8c50d44c9: d273c45e2a844b7eaa28b3b3366d2cc1
…
I0412 23:05:00.896263 20976 catalog_manager.cc:3542] CreateTable from 172.151.28.43:34366:
name: "table_172_151_20_226_87e80b_sqlsecondaryindex_index"
…
I0412 23:05:05.544221 2801 backfill_index.cc:1264] Failed to backfill the tablet 0x000055e5bafe42c0 -> e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]): Illegal state (yb/tablet/tablet.cc:2256): ERROR: LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC
The tablet, which is failing, actually belongs to the another table sqlfkusers
, which had been dropped before table_172_151_20_226_87e80b_sqlsecondaryindex
was created. The thing is the later table got the same TableId
as the deleted one and meta-cache was not able to handle this case correctly. So, on the first glance it is looking fine to reuse the table id (as it is allowed to do it manually) and the problem is in proper handling of dropped tablets at pg-table-cache and meta-cache side.
Steps to reproduce the same error:
(1) # set yb_enable_create_with_table_oid=true;
(2) # create table test_a(k int primary key, v int) with (table_oid=17172);
(3) # insert into test_a select i, i from generate_series(1, 100) as i;
(4) # drop table test_a;
(5) # create table test_b(k int primary key, v int) with (table_oid=17172);
(6) # create index idx_b on test_b(v);
Easiest way to reproduce the error:
(1) # set yb_enable_create_with_table_oid=true;
(2) # create table test_a(k int primary key, v int) with (table_oid=17172);
(3) # select count(*) from test_a;
(4) # drop table test_a;
(5) # create table test_b(k int primary key, v int) with (table_oid=17172);
(6) # select count(*) from test_b;
Step (2) is required in both cases to cache the table and its tablets in the meta-cache.
W/A is to repeat any operation, which will make meta-cache update partitions information as all "old" replicas are in failed state.
Gentle reminder @arybochkin
A stress case is failing across, almost every run. Just thought I'd inform you about it so that it can be addressed.
Testcase name: test_rapid_sql_inserts_mixed
cc: @pilshchikov
This issue will be fixed in the context of https://github.com/yugabyte/yugabyte-db/issues/16130 as it is the same issue, confirmed with @yifanguan.
Jira Link: DB-4766
Description
ENV:
Case:
On cycle №4 we dropping tables that are created by workloads After that we running SqlInserts - which is starting find, but after that, we starting SqlUpdates And it throw error:
It happen after tablet was created and next we making 1000 inserts like
insert into test(k, v) values ('[uuid]:1', '[uuid]:val:1')