yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.99k stars 1.07k forks source link

[DocDB] No found LookupByIdRpc tablet because it deleted during SQL insert #15468

Closed pilshchikov closed 1 year ago

pilshchikov commented 1 year ago

Jira Link: DB-4766

Description

ENV:

Case:

  1. Start all SQL workloads of sample-apps like SqlInserts, SqlUpdate, SqlDataLoad, SqlSecondadyIndex
    • start workload sequentially one after each other
    • workload starts after it start start inserting values (waiting for table creation)
  2. Wait for 4 minutes, in that time watching that workloads are running
  3. With 20% chance dropping tables before cycle
  4. After that kill sample-apps and start cycle again from step 1

On cycle №4 we dropping tables that are created by workloads After that we running SqlInserts - which is starting find, but after that, we starting SqlUpdates And it throw error:

54 [main] INFO com.yugabyte.sample.common.CmdLineOpts  - Read only load: false
57 [main] INFO com.yugabyte.sample.common.CmdLineOpts  - SqlTransactions: will use 1 value columns for the main table
456 [main] INFO com.yugabyte.sample.apps.SqlUpdates  - Created table: table_172_151_29_238_3853a1_sqlupdates
490 [main] ERROR com.yugabyte.sample.apps.SqlUpdates  - Failed to prepare data for workload
com.yugabyte.util.PSQLException: ERROR: Not found: LookupByIdRpc(tablet: ac02db33192e477cb885be987ef1e591, num_attempts: 1) failed: Tablet deleted: Table deleted at 2022-12-29 20:50:05 UTC
    at com.yugabyte.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2679)
    at com.yugabyte.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2359)
    at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:349)
    at com.yugabyte.jdbc.PgStatement.executeInternal(PgStatement.java:484)
    at com.yugabyte.jdbc.PgStatement.execute(PgStatement.java:404)
    at com.yugabyte.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:162)
    at com.yugabyte.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:151)
    at com.yugabyte.sample.apps.SqlUpdates.lambda$createTablesIfNeeded$0(SqlUpdates.java:92)
    at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
    at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:617)
    at com.yugabyte.sample.apps.SqlUpdates.createTablesIfNeeded(SqlUpdates.java:87)
    at com.yugabyte.sample.Main.run(Main.java:164)
    at com.yugabyte.sample.Main.main(Main.java:251)

It happen after tablet was created and next we making 1000 inserts like insert into test(k, v) values ('[uuid]:1', '[uuid]:val:1')

rthallamko3 commented 1 year ago

Possibly DUP of #15090

arybochkin commented 1 year ago

@rthallamko3 this is not a tablet splitting issue. I've checked the latest logs provided by @pilshchikov: no tablet splitting related logs are found, metrics have no splitting operations also, and the full error message has no children specified:

W0412 23:05:05.482554 22593 meta_cache.cc:857] 0x00005600a9e641d8 -> LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1): Not found (yb/master/catalog_manager.cc:11572): Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC (split child tablet IDs [])
W0412 23:05:05.482600 22593 tablet_rpc.cc:460] Not found (yb/master/catalog_manager.cc:11572): Failed Read(tablet: 194072ae1600413e9b8625512864e57d, num_ops: 1, num_attempts: 4, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 194072ae1600413e9b8625512864e57d (no tablet server available) after 4 attempt(s): LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC (split child tablet IDs [])

There's an attempt to access the tablet 8 minutes after it had been deleted. It's actually looking like a stale request/cache issue on a client:

I0412 22:57:16.686081 31065 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=22b05d8bb58f4818a51656342af043de (task=0x000055e5c5d74c58, state=kRunning): TS 22b05d8bb58f4818a51656342af043de: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
I0412 22:57:16.694391 31067 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=d273c45e2a844b7eaa28b3b3366d2cc1 (task=0x000055e5bea80558, state=kRunning): TS d273c45e2a844b7eaa28b3b3366d2cc1: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
I0412 22:57:18.089290 31077 async_rpc_tasks.cc:826] DeleteTablet RPC for tablet 194072ae1600413e9b8625512864e57d (sqlfkusers [id=00004000000030008000000000004030]) on TS=5fb3760afbce45c9b91892a33be4b7a8 (task=0x000055e5c1976a98, state=kRunning): TS 5fb3760afbce45c9b91892a33be4b7a8: tablet 194072ae1600413e9b8625512864e57d (table sqlfkusers [id=00004000000030008000000000004030]) successfully done
...
W0412 23:05:05.482035 21003 catalog_manager.cc:11652] Expected replicas 3 but found 0 for tablet 194072ae1600413e9b8625512864e57d: expected_live_replicas: 3 expected_read_replicas: 0
W0412 23:05:05.543643  2801 backfill_index.cc:1512] TS d273c45e2a844b7eaa28b3b3366d2cc1: backfill failed for tablet e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]) no further retry: Illegal state (yb/tablet/tablet.cc:2256): ERROR:  LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC response was error { code: OPERATION_NOT_SUPPORTED status { code: ILLEGAL_STATE message: "ERROR:  LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC" source_file: "../../src/yb/tablet/tablet.cc" source_line: 2256 errors: "\000" } } propagated_hybrid_time: 6886771529663635456 backfilled_until: "" failed_index_ids: "00004000000030008000000000004035" number_rows_processed: 0
I0412 23:05:05.544221  2801 backfill_index.cc:1264] Failed to backfill the tablet 0x000055e5bafe42c0 -> e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]): Illegal state (yb/tablet/tablet.cc:2256): ERROR:  LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC
E0412 23:05:05.544232  2801 backfill_index.cc:934] Backfill Index Table(s) { table_172_151_20_226_87e80b_sqlsecondaryindex_index } failed to backfill the index: [00004000000030008000000000004035] due to Illegal state (yb/tablet/tablet.cc:2256): ERROR:  LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC
arybochkin commented 1 year ago

Looked into the logs deeper:

I0412 22:48:08.057664 21002 catalog_manager.cc:3542] CreateTable from 172.151.31.84:34401:
name: "sqlfkusers"
table_id: "00004000000030008000000000004030"
…
I0412 22:48:08.290817 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 194072ae1600413e9b8625512864e57d: 22b05d8bb58f4818a51656342af043de
I0412 22:48:08.290853 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 4b2d29c10c944716a8e2116bd0b5af87: 5fb3760afbce45c9b91892a33be4b7a8
I0412 22:48:08.290885 19887 catalog_manager.cc:11074] Initial tserver protege for tablet 6583680a4df64141a98f0dc6c5871ea3: d273c45e2a844b7eaa28b3b3366d2cc1
…
I0412 22:57:15.625838 21341 catalog_manager.cc:5773] Servicing DeleteTable request from 172.151.25.245:45939: table { table_id: "00004000000030008000000000004030" } is_index_table: false
…
I0412 22:57:18.335369 31077 catalog_manager.cc:6238] Marking table as DELETED: sqlfkusers [id=00004000000030008000000000004030]
…
I0412 23:04:59.902832 25332 catalog_manager.cc:3542] CreateTable from 172.151.28.43:34555:
name: "table_172_151_20_226_87e80b_sqlsecondaryindex"
table_id: "00004000000030008000000000004030"
…
I0412 23:04:59.913269 19887 catalog_manager.cc:11074] Initial tserver protege for tablet bd559be11f764e67aefe44fbb538d7fd: 22b05d8bb58f4818a51656342af043de
I0412 23:04:59.913306 19887 catalog_manager.cc:11074] Initial tserver protege for tablet dc3d9490157d472490a73bd7d99f9fa1: 5fb3760afbce45c9b91892a33be4b7a8 
I0412 23:04:59.913340 19887 catalog_manager.cc:11074] Initial tserver protege for tablet e78883fb71ed494d96bb3ba8c50d44c9: d273c45e2a844b7eaa28b3b3366d2cc1
…
I0412 23:05:00.896263 20976 catalog_manager.cc:3542] CreateTable from 172.151.28.43:34366:
name: "table_172_151_20_226_87e80b_sqlsecondaryindex_index"
…
I0412 23:05:05.544221  2801 backfill_index.cc:1264] Failed to backfill the tablet 0x000055e5bafe42c0 -> e78883fb71ed494d96bb3ba8c50d44c9 (table table_172_151_20_226_87e80b_sqlsecondaryindex [id=00004000000030008000000000004030]): Illegal state (yb/tablet/tablet.cc:2256): ERROR:  LookupByIdRpc(tablet: 194072ae1600413e9b8625512864e57d, num_attempts: 1) failed: Tablet deleted: Table deleted at 2023-04-12 22:57:15 UTC

The tablet, which is failing, actually belongs to the another table sqlfkusers, which had been dropped before table_172_151_20_226_87e80b_sqlsecondaryindex was created. The thing is the later table got the same TableId as the deleted one and meta-cache was not able to handle this case correctly. So, on the first glance it is looking fine to reuse the table id (as it is allowed to do it manually) and the problem is in proper handling of dropped tablets at pg-table-cache and meta-cache side.

arybochkin commented 1 year ago

Steps to reproduce the same error:

(1) # set yb_enable_create_with_table_oid=true;
(2) # create table test_a(k int primary key, v int) with (table_oid=17172);
(3) # insert into test_a select i, i from generate_series(1, 100) as i;
(4) # drop table test_a;
(5) # create table test_b(k int primary key, v int) with (table_oid=17172);
(6) # create index idx_b on test_b(v);

Easiest way to reproduce the error:

(1) # set yb_enable_create_with_table_oid=true;
(2) # create table test_a(k int primary key, v int) with (table_oid=17172);
(3) # select count(*) from test_a;
(4) # drop table test_a;
(5) # create table test_b(k int primary key, v int) with (table_oid=17172);
(6) # select count(*) from test_b;

Step (2) is required in both cases to cache the table and its tablets in the meta-cache.

arybochkin commented 1 year ago

W/A is to repeat any operation, which will make meta-cache update partitions information as all "old" replicas are in failed state.

shamanthchandra-yb commented 1 year ago

Gentle reminder @arybochkin A stress case is failing across, almost every run. Just thought I'd inform you about it so that it can be addressed. Testcase name: test_rapid_sql_inserts_mixed cc: @pilshchikov

arybochkin commented 1 year ago

This issue will be fixed in the context of https://github.com/yugabyte/yugabyte-db/issues/16130 as it is the same issue, confirmed with @yifanguan.