jaki opened this issue 4 years ago
@jaki Is this latest master, or 2.1.5? I wonder if it's the batching work that @iSignal did before and we undid recently (see the revert here: c2390a7edb2bc950c3bf5b5501181dbbb79bc30a, the main commit was in 2.1.4 and 2.1.5).
@bmatican, this is pretty much latest master: commit 5ff03f067500584971c522a44c1c4a80213b7c56.
The only other change I can think of in this area is 3b55a784a95ad442eabdbe22fca1f592989da344 from @nspiegelberg, but I would have to see some logs to understand what's happening here.
It could be that the master leader is losing leadership after some part of the sys_catalog write (maybe #4353), but before some more metadata changes that were supposed to come afterwards...
@bmatican, I've already given one master log in the first comment.
@jaki Something fishy is happening here: why is the YSQL layer trying to recreate template1?
I0505 22:13:28.203014 26656 catalog_loaders.cc:221] Loaded metadata for namespace template1 [id=00000001000030008000000000000000]
...
I0505 22:13:30.472867 26391 catalog_manager.cc:4556] CreateNamespace from 127.0.0.1:39948: name: "template1"
database_type: YQL_DATABASE_PGSQL
namespace_id: "00000001000030008000000000000000"
next_pg_oid: 10000
colocated: false
W0505 22:13:30.472929 26391 catalog_manager.cc:4574] Found keyspace: 00000001000030008000000000000000. Failed creating keyspace with error: Already present (yb/master/catalog_manager.cc:4573): Keyspace 'template1' already exists Request:
name: "template1"
database_type: YQL_DATABASE_PGSQL
namespace_id: "00000001000030008000000000000000"
next_pg_oid: 10000
colocated: false
W0505 22:13:30.942656 26484 tablet_service.cc:1788] DoRead: Not found (yb/tablet/tablet_metadata.cc:352): Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000
Seems like that's the real problem.
Alternatively, any chance the read itself is somehow triggering a CreateNamespace?
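One way to check whether something is re-triggering CreateNamespace is to scan the master log for create requests that subsequently hit "Already present". A minimal sketch (the helper and regexes are my own, keyed to the log lines quoted above, not an existing tool):

```python
import re

# Hypothetical helper (not from this thread): scan yb-master log lines for
# CreateNamespace requests that later fail with "Already present", which
# would indicate a retried or duplicate create.
CREATE_RE = re.compile(r'CreateNamespace from [\d.:]+: name: "([^"]+)"')
ALREADY_RE = re.compile(
    r"Failed creating keyspace with error: Already present"
    r".*Keyspace '([^']+)' already exists"
)

def retried_creates(log_lines):
    """Return namespace names whose create later hit 'Already present'."""
    created, retried = set(), []
    for line in log_lines:
        m = CREATE_RE.search(line)
        if m:
            created.add(m.group(1))
            continue
        m = ALREADY_RE.search(line)
        if m and m.group(1) in created:
            retried.append(m.group(1))
    return retried
```

Running this over the snippet above would flag template1, since the CreateNamespace at 22:13:30 is followed by the "Already present" warning for the same keyspace.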
Another user report of a similar issue: https://yugabyte-db.slack.com/archives/CG0KQF0GG/p1589459571466700
<< I'm trying to do a demo of yb in an openshift environment. I've deployed ok using both helm chart and operator. But trying to use the ysqlsh interface, I get a very strange message: ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000 >>
@bmatican, I just saw the same found-keyspace log message

Found keyspace: 00000001000030008000000000000000. Failed creating keyspace with error: Already present (yb/master/catalog_manager.cc:4573): Keyspace 'template1' already exists

during unrelated debugging, and it did not cause any usability issues, so it is probably not the cause here.
Possibly unrelated, but I got
ysqlsh: FATAL: Not found: Error loading table with oid 1260 in database with oid 1: RPC timed out after deadline expired: GetTableSchemaRpc(table_identifier: table_id: "000000010000300080000000000004ec", num_attempts: 2) passed its deadline 7879692.323s (passed: 120.002s): Timed out (yb/rpc/outbound_call.cc:512): GetMasterRegistration RPC (request call id 2) to 127.0.0.71:7100 timed out after 59.997s
when investigating issue #5025. I loaded data with YCSB, then did a copy that OOMed and killed the tserver (rf 1). Then I ran yb-ctl stop; yb-ctl start, and ysqlsh -d ycsb failed with the above message.
I experienced the same issue after deploying rook-yugabytedb. It happened when, in storageClassName, I used a storage class backed by a CephFilesystem (from rook-ceph). The issue did not appear when I used the local-path storage class (from local-static-provisioner) instead.
Jira Link: DB-2072

There is a certain error that doesn't show up often but has shown up at least three times:

while true; do ./bin/yb-ctl destroy; ./bin/yb-ctl create --rf 3; ./bin/ysqlsh -c 'create database co colocated true' || break; done

(as reported in a comment of issue #3354)

while true; do ./bin/yb-ctl destroy; ./bin/yb-ctl create --rf 3; ./bin/ysqlsh -c 'create database d' || break; done

(full log: https://gist.githubusercontent.com/jaki/d86a6b7629402c74ec7c5eec9a6c7350/raw)

Of special note is that the UUIDs seem to be the same throughout all three failures. The table id corresponds to template1.pg_authid; the namespace id corresponds to the master system catalog tablet. Here is an interesting log snippet:
There are some leader changes and async leader stepdown going on in the logs, but they seem to happen in benign cases as well, so that probably isn't the issue.
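Incidentally, the repro loop from the issue description can be wrapped in a small script that reports which iteration failed. A sketch (the run_repro wrapper and the YB_CTL/YSQLSH overrides are my additions; the commands inside the loop are the ones from the issue):

```shell
# Wraps the repro loop so the failing iteration is reported.
# YB_CTL and YSQLSH can be overridden to point at a real install.
run_repro() {
  yb_ctl=${YB_CTL:-./bin/yb-ctl}
  ysqlsh=${YSQLSH:-./bin/ysqlsh}
  i=0
  while :; do
    i=$((i + 1))
    "$yb_ctl" destroy
    "$yb_ctl" create --rf 3
    "$ysqlsh" -c 'create database d' || break
  done
  echo "create database failed on iteration $i"
}
```

With YB_CTL and YSQLSH pointing at a real install, this runs until the first CREATE DATABASE failure and prints how many iterations it took to hit.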