yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Insert query fails with ERROR: Table <unknown_table_name> (00004024000030008000000000004201) not found in Raft group 7bb5a004d0024d69b99d2588e3a94570 #21200

Closed shishir2001-yb closed 7 months ago

shishir2001-yb commented 7 months ago

Jira Link: DB-10130

Description

Tried on version: 2.21.1.0-b124

Logs: https://drive.google.com/file/d/1bNVlhO6rXE6AM7l1f7NMYgjKfa5S2iU_/view?usp=sharing

An INSERT query fails with the error below while running the Cross-DB-DDL sample app.

ERROR: Table <unknown_table_name> (00004024000030008000000000004201) not found in Raft group 7bb5a004d0024d69b99d2588e3a94570
  Where: Catalog Version Mismatch: A DDL occurred while processing this query. Try again.  Call getNextException to see other errors in the batch.

Sample app details:


Start the Cross-DB Concurrent DDLs sample app, which executes both DDLs and DMLs across databases in parallel (15 write threads, 10 read threads, and 40 databases: 20 colocated and 20 non-colocated).
tserver_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "ysql_max_connections": "500",
                'client_read_write_timeout_ms': str(30 * 60 * 1000),
                'yb_client_admin_operation_timeout_sec': str(30 * 60),
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
            },
master_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "tablet_split_high_phase_shard_count_per_node": 20000,
                "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2MB
                # low_phase_size 100KB
                "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
                "tablet_split_low_phase_shard_count_per_node": 10000,
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
            }

List of DDLs executed in sample app

private static List<List<String>> ddlList = List.of(
            List.of("CREATE INDEX idx1 ON ? (k)", "DROP INDEX idx1"),
            List.of("CREATE TABLE tempTable1 AS SELECT * FROM ? limit 1000000", "ALTER TABLE tempTable1 RENAME TO tempTable1_new", "DROP TABLE tempTable1_new"),
            List.of("CREATE MATERIALIZED VIEW mv1 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv1", "DROP MATERIALIZED VIEW mv1"),
            List.of("ALTER TABLE ? ADD newColumn1 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? DROP newColumn1"),
            List.of("ALTER TABLE ? ADD newColumn2 TEXT NULL", "ALTER TABLE ? DROP newColumn2"),
            List.of("CREATE VIEW view1_? AS SELECT k from ?", "DROP VIEW view1_?"),
            List.of("ALTER TABLE ? ADD newColumn3 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? ALTER newColumn3 TYPE VARCHAR(1000)", "ALTER TABLE ? DROP newColumn3"),
            List.of("CREATE TABLE tempTable2 AS SELECT * FROM ? limit 1000000", "CREATE INDEX idx2 ON tempTable2(k)", "ALTER TABLE ? ADD newColumn4 TEXT DEFAULT 'dummyString'", "ALTER TABLE tempTable2 ADD newColumn2 TEXT DEFAULT 'dummyString'", "TRUNCATE table ? cascade", "ALTER TABLE ? DROP newColumn4", "ALTER TABLE tempTable2 DROP newColumn2", "DROP INDEX idx2", "DROP TABLE tempTable2"),
            List.of("CREATE VIEW view2_? AS SELECT k from ?", "CREATE MATERIALIZED VIEW mv2 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv2", "DROP MATERIALIZED VIEW mv2", "DROP VIEW view2_?")
 );

Issue Type

kind/bug


shishir2001-yb commented 7 months ago

Similar issues:
https://github.com/yugabyte/yugabyte-db/issues/16130
https://github.com/yugabyte/yugabyte-db/issues/18891
https://github.com/yugabyte/yugabyte-db/issues/15468

myang2021 commented 7 months ago

I think this is an expected error. The following is the relevant code.

In

Status Tablet::DoHandlePgsqlReadRequest(

We have

  // Assert the table is a Postgres table.
  DCHECK_EQ(table_info->table_type, TableType::PGSQL_TABLE_TYPE);
  if (table_info->schema_version != pgsql_read_request.schema_version()) {
    result->response.Clear();
    result->response.set_status(PgsqlResponsePB::PGSQL_STATUS_SCHEMA_VERSION_MISMATCH);
    result->response.set_error_message(
        Format("schema version mismatch for table $0: expected $1, got $2",
               table_info->table_id,
               table_info->schema_version,
               pgsql_read_request.schema_version()));
    return Status::OK();
  }

The schema version mismatch check above requires a valid table_info for the given table, which means the table has not yet been deleted on this tserver.

If the table has been deleted, then earlier in the same function, DoHandlePgsqlReadRequest,

  const shared_ptr<tablet::TableInfo> table_info =
      VERIFY_RESULT(metadata_->GetTableInfo(pgsql_read_request.table_id()));

would have returned an error: VERIFY_RESULT makes the function return early with the failing status. GetTableInfo fails this way when it cannot find the table_id() carried by pgsql_read_request:

Result<TableInfoPtr> RaftGroupMetadata::GetTableInfo(const TableId& table_id) const {
  std::lock_guard lock(data_mutex_);
  return GetTableInfoUnlocked(table_id);
}

Result<TableInfoPtr> RaftGroupMetadata::GetTableInfoUnlocked(const TableId& table_id) const {
  const auto& tables = kv_store_.tables;

  const auto& id = !table_id.empty() ? table_id : primary_table_id_;
  const auto iter = tables.find(id);
  if (iter == tables.end()) {
    RETURN_TABLE_NOT_FOUND(table_id, tables);
  }
  return iter->second;
}

RETURN_TABLE_NOT_FOUND is what reports the error when the table is not found:

#define RETURN_TABLE_NOT_FOUND(table_id, tables) \
    return MakeTableNotFound((table_id), raft_group_id_, (tables), __FILE__, __LINE__)

template <class TablesMap>
Status MakeTableNotFound(const TableId& table_id, const RaftGroupId& raft_group_id,
                         const TablesMap& tables, const char* file_name, int line_number) {
  std::string table_name = "<unknown_table_name>";
  if (!table_id.empty()) {
    const auto iter = tables.find(table_id);
    if (iter != tables.end()) {
      table_name = iter->second->table_name;
    }
  }
  std::ostringstream string_stream;
  string_stream << "Table " << table_name << " (" << table_id << ") not found in Raft group "
      << raft_group_id;
  std::string msg = string_stream.str();
#ifndef NDEBUG
  // This very large message should be logged instead of being appended to STATUS.
  std::string suffix = Format(". Tables: $0.", tables);
  VLOG(1) << msg << suffix;
#endif
  return Status(Status::kNotFound, file_name, line_number, msg);
}

So "not found in Raft group", the error we see in #21200, is of the same nature as a schema version mismatch. The former happens when the table has been deleted, while the latter happens when the table still exists but has been altered. I say "same nature" because in both cases the table was changed by a concurrent DDL.
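To make the ordering of the two checks concrete, here is a minimal Java sketch that models the control flow described above. It is not the tserver code: the class ReadPathModel, its Outcome enum, and the map standing in for kv_store_.tables are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the read path discussed above.
// It only mirrors the order of the checks, not the real C++ implementation.
final class ReadPathModel {
    enum Outcome { OK, SCHEMA_VERSION_MISMATCH, TABLE_NOT_FOUND }

    // table id -> current schema version; stands in for kv_store_.tables.
    private final Map<String, Integer> tables = new HashMap<>();

    // Models CREATE TABLE or ALTER TABLE becoming visible on this tserver.
    void createOrAlter(String tableId, int schemaVersion) {
        tables.put(tableId, schemaVersion);
    }

    // Models DROP TABLE becoming visible on this tserver.
    void drop(String tableId) {
        tables.remove(tableId);
    }

    Outcome handleRead(String tableId, int requestSchemaVersion) {
        // Step 1: the lookup. A dropped table fails here, before any
        // schema version comparison can take place.
        Integer current = tables.get(tableId);
        if (current == null) {
            return Outcome.TABLE_NOT_FOUND;
        }
        // Step 2: the table still exists, so a stale request can only be
        // detected by comparing schema versions.
        if (current.intValue() != requestSchemaVersion) {
            return Outcome.SCHEMA_VERSION_MISMATCH;
        }
        return Outcome.OK;
    }
}
```

In this model, a request prepared against schema version 1 gets SCHEMA_VERSION_MISMATCH once the table is altered to version 2, and TABLE_NOT_FOUND once the table is dropped, which matches the distinction drawn above.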

I think we should add "not found in Raft group" to the list of allowedDMLExceptions in the sample app, because this error can legitimately appear when a table is deleted concurrently while the DML statement is executing.
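A rough sketch of what that change might look like on the client side follows. The sample app's actual allowedDMLExceptions list and retry logic are not shown in this issue, so the class DmlRetry, the list contents other than the new fragment, and the helper names are assumptions made for illustration. Only java.sql.SQLException and its getNextException() method are standard JDBC.

```java
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper; the real sample app's allow-list may look different.
final class DmlRetry {
    // Message fragments treated as expected while DDLs run concurrently.
    // "not found in Raft group" is the addition proposed in this issue;
    // the other two entries are assumed for the sake of the example.
    static final List<String> ALLOWED_DML_EXCEPTIONS = List.of(
            "Catalog Version Mismatch",
            "schema version mismatch",
            "not found in Raft group");

    // Returns true if the exception, or any exception chained to it through
    // getNextException() (as happens for batch failures), matches the list.
    static boolean isAllowed(SQLException e) {
        for (SQLException cur = e; cur != null; cur = cur.getNextException()) {
            String msg = cur.getMessage();
            if (msg == null) {
                continue;
            }
            for (String fragment : ALLOWED_DML_EXCEPTIONS) {
                if (msg.contains(fragment)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Runs the DML, retrying only on allowed errors, up to maxAttempts times.
    static <T> T runWithRetry(Callable<T> dml, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return dml.call();
            } catch (SQLException e) {
                if (!isAllowed(e)) {
                    throw e; // unexpected error: surface it immediately
                }
                last = e; // expected under concurrent DDL: try again
            }
        }
        throw last;
    }
}
```

Matching on message fragments is brittle, but it mirrors how the error reaches the client in the report above, where the "not found in Raft group" text and the "Catalog Version Mismatch" hint arrive in the same error message.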

tverona1 commented 7 months ago

Resolving as by design, per the explanation above.