jaki opened this issue 4 years ago
@jaki Is this latest master, or 2.1.5? I wonder if it's the batching work that @iSignal did before and we undid recently (see the revert here: c2390a7edb2bc950c3bf5b5501181dbbb79bc30a, the main commit was in 2.1.4 and 2.1.5).
@bmatican, this is pretty much latest master: commit 5ff03f067500584971c522a44c1c4a80213b7c56.
The only other change I can think of in this area is 3b55a784a95ad442eabdbe22fca1f592989da344 from @nspiegelberg, but I would have to see some logs to understand what's happening here.
It could be that the master leader is losing leadership after some part of the sys_catalog write (maybe #4353), but before some more metadata changes that were supposed to come afterwards...
@bmatican, I've already given one master log in the first comment.
@jaki Something fishy is happening here: why is the YSQL layer trying to recreate template1?
I0505 22:13:28.203014 26656 catalog_loaders.cc:221] Loaded metadata for namespace template1 [id=00000001000030008000000000000000]
...
I0505 22:13:30.472867 26391 catalog_manager.cc:4556] CreateNamespace from 127.0.0.1:39948: name: "template1"
database_type: YQL_DATABASE_PGSQL
namespace_id: "00000001000030008000000000000000"
next_pg_oid: 10000
colocated: false
W0505 22:13:30.472929 26391 catalog_manager.cc:4574] Found keyspace: 00000001000030008000000000000000. Failed creating keyspace with error: Already present (yb/master/catalog_manager.cc:4573): Keyspace 'template1' already exists Request:
name: "template1"
database_type: YQL_DATABASE_PGSQL
namespace_id: "00000001000030008000000000000000"
next_pg_oid: 10000
colocated: false
W0505 22:13:30.942656 26484 tablet_service.cc:1788] DoRead: Not found (yb/tablet/tablet_metadata.cc:352): Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000
Seems like that's the real problem.
Alternatively, any chance the read itself is somehow triggering a CreateNamespace?
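One way to check whether something is re-triggering CreateNamespace is to scan the master log for create requests that subsequently hit "Already present". A minimal sketch (the helper and regexes are my own, keyed to the log lines quoted above, not an existing tool):

```python
import re

# Hypothetical helper (not from this thread): scan yb-master log lines for
# CreateNamespace requests that later fail with "Already present", which
# would indicate a retried or duplicate create.
CREATE_RE = re.compile(r'CreateNamespace from [\d.:]+: name: "([^"]+)"')
ALREADY_RE = re.compile(
    r"Failed creating keyspace with error: Already present"
    r".*Keyspace '([^']+)' already exists"
)

def retried_creates(log_lines):
    """Return namespace names whose create later hit 'Already present'."""
    created, retried = set(), []
    for line in log_lines:
        m = CREATE_RE.search(line)
        if m:
            created.add(m.group(1))
            continue
        m = ALREADY_RE.search(line)
        if m and m.group(1) in created:
            retried.append(m.group(1))
    return retried
```

Running this over the snippet above would flag template1, since the CreateNamespace at 22:13:30 is followed by the "Already present" warning for the same keyspace.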
Another user report of a similar issue: https://yugabyte-db.slack.com/archives/CG0KQF0GG/p1589459571466700
<< I'm trying to do a demo of yb in an openshift environment. I've deployed ok using both helm chart and operator. But trying to use the ysqlsh interface, I get a very strange message: ysqlsh: FATAL: Not found: Table 000000010000300080000000000004ec not found in Raft group 00000000000000000000000000000000 >>
@bmatican, I just saw the same found-keyspace log message

Found keyspace: 00000001000030008000000000000000. Failed creating keyspace with error: Already present (yb/master/catalog_manager.cc:4573): Keyspace 'template1' already exists

during unrelated debugging, and it did not cause any usability issues, so it is probably not the cause here.
Possibly unrelated, but I got
ysqlsh: FATAL: Not found: Error loading table with oid 1260 in database with oid 1: RPC timed out after deadline expired: GetTableSchemaRpc(table_identifier: table_id: "000000010000300080000000000004ec", num_attempts: 2) passed its deadline 7879692.323s (passed: 120.002s): Timed out (yb/rpc/outbound_call.cc:512): GetMasterRegistration RPC (request call id 2) to 127.0.0.71:7100 timed out after 59.997s
when investigating issue #5025. I loaded data with YCSB, then did a copy that OOMed and killed the tserver (rf 1). Then I ran yb-ctl stop; yb-ctl start, and ysqlsh -d ycsb failed with the above message.
I experienced the same issue after deploying rook-yugabytedb. It happened when, in storageClassName, I used a storage class backed by a CephFilesystem (from rook-ceph). The issue did not appear when I used the local-path storage class (from local-static-provisioner) instead.
Jira Link: DB-2072

There is a certain error that doesn't show up often but has shown up at least three times:

while true; do ./bin/yb-ctl destroy; ./bin/yb-ctl create --rf 3; ./bin/ysqlsh -c 'create database co colocated true' || break; done

(as reported in a comment of issue #3354)

while true; do ./bin/yb-ctl destroy; ./bin/yb-ctl create --rf 3; ./bin/ysqlsh -c 'create database d' || break; done

(full log: https://gist.githubusercontent.com/jaki/d86a6b7629402c74ec7c5eec9a6c7350/raw)

Of special note is that the UUIDs seem to be the same throughout all three failures. The table id corresponds to template1.pg_authid; the namespace id corresponds to the master system catalog tablet. Here is an interesting log snippet:
There are some leader changes and async leader stepdown going on in the logs, but they seem to happen in benign cases as well, so that probably isn't the issue.
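Incidentally, the repro loop from the issue description can be wrapped in a small script that reports which iteration failed. A sketch (the run_repro wrapper and the YB_CTL/YSQLSH overrides are my additions; the commands inside the loop are the ones from the issue):

```shell
# Wraps the repro loop so the failing iteration is reported.
# YB_CTL and YSQLSH can be overridden to point at a real install.
run_repro() {
  yb_ctl=${YB_CTL:-./bin/yb-ctl}
  ysqlsh=${YSQLSH:-./bin/ysqlsh}
  i=0
  while :; do
    i=$((i + 1))
    "$yb_ctl" destroy
    "$yb_ctl" create --rf 3
    "$ysqlsh" -c 'create database d' || break
  done
  echo "create database failed on iteration $i"
}
```

With YB_CTL and YSQLSH pointing at a real install, this runs until the first CREATE DATABASE failure and prints how many iterations it took to hit.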