bmatican opened 4 years ago
The permission denied log message is not the issue, as that's expected from the test. The issues seem to be the catalog version mismatch and the timeout on delete namespace.
@jaki Interesting, is the timeout on delete namespace expected? I know it's currently still a slow, synchronous operation, so maybe. Can we just raise the timeouts for it across the board in our tests?
Separately, is the catalog version mismatch something we can guard against better in tests, given that it's a frequent test failure pattern?
I tried testing its close friend org.yb.pgsql.TestPgAuthorization#testMultiDatabaseOwnershipDrop, and that also fails for similar reasons. Things seem to go downhill from the line right after the catalog version mismatch one:
TRAP: FailedAssertion("!(portal)", File: "../../../../../../src/postgres/src/backend/tcop/postgres.c", Line: 4012)
The delete namespace test log is buried somewhere in all my logs. Something else might have caused it rather than it being purely a timeout.
I'm not excited about fighting catalog version mismatch. I'd rather wait for it to get fixed if possible.
Different errors on OwnershipDrop:
@bmatican, this is what happens before the delete namespace timeout: https://gist.github.com/jaki/7ec7d7a383d93dd7cb6ca50a1cd79b56/raw. It says
Failed to heartbeat to 127.211.34.109:11111: Timed out
then, right after,
Connected to a leader master server at 127.211.34.109:11111
This happens multiple times.
There are also two delete namespace requests sent for the same "su_db" database.
Here's an alternate, more comprehensive log snippet: https://gist.github.com/jaki/45ac43ccc5fa6d1183532ef65d5ff853/raw.
I noticed
ts1|pid23511|:17317|http://127.107.80.78:13072 W0528 00:19:46.632740 24195 client-internal.cc:1257] GetTableSchemaRpc(table_identifier: table_id: "0000ffff00003000800000000000ffff", num_attempts: 1) failed: Not found (yb/master/catalog_manager.cc:3764): The object does not exist: table_id: "0000ffff00003000800000000000ffff"
shows up in most logs, but it also does in passing ones. The Failed to heartbeat messages in the previous comment do not show up in the passing logs.
A bunch of confusing/slow things happening there. cc @nspiegelberg
Deleting CDC streams does not factor in YSQL system tables?
m2|pid23453|:16910|http://127.62.12.233:17770 I0528 00:19:46.523736 24040 catalog_manager_ent.cc:1613] Deleting CDC streams for tables: 000030aa00003000800000000000303f 000030aa00003000800000000000303d 000030aa00003000800000000000303b 000030aa000030008000000000003039 000030aa000030008000000000003037 000030aa000030008000000000003035 000030aa000030008000000000003033 000030aa0000300080000000000017da 000030aa0000300080000000000017d8 000030aa0000300080000000000017d6 000030aa000030008000000000000eb4 000030aa000030008000000000000e13 000030aa000030008000000000000e12 000030aa000030008000000000000e11 000030aa000030008000000000000e10 000030aa000030008000000000000e0c 000030aa000030008000000000000df8 000030aa000030008000000000000dd5 000030aa000030008000000000000dad 000030aa000030008000000000000d8a 000030aa000030008000000000000d80 000030aa000030008000000000000d42 000030aa000030008000000000000d35 000030aa000030008000000000000d16 000030aa000030008000000000000cb8 000030aa000030008000000000000c2e 000030aa000030008000000000000c07 000030aa000030008000000000000bb3 000030aa000030008000000000000ac1 000030aa000030008000000000000a3c 000030aa000030008000000000000a3b 000030aa000030008000000000000a3a 000030aa000030008000000000000a39 000030aa000030008000000000000a38 000030aa000030008000000000000a37 000030aa000030008000000000000a35 000030aa000030008000000000000a34 000030aa000030008000000000000a33 000030aa000030008000000000000a32 000030aa000030008000000000000a31 000030aa000030008000000000000a30 000030aa000030008000000000000a2f 000030aa000030008000000000000a2e 000030aa000030008000000000000a2d 000030aa000030008000000000000a2c 000030aa000030008000000000000a2b 000030aa000030008000000000000a2a 000030aa000030008000000000000a29 000030aa000030008000000000000a28 000030aa000030008000000000000918 000030aa0000300080000000000008b0 000030aa00003000800000000000058a 000030aa000030008000000000000589 000030aa0000300080000000000004eb 000030aa0000300080000000000004e7 000030aa0000300080000000000004e1 
000030aa0000300080000000000004df 000030aa00003000800000000000033a 000030aa0000300080000000000000af 000030aa000030008000000000000224 000030aa000030008000000000000225 000030aa00003000800000000000033b 000030aa00003000800000000000088b 000030aa000030008000000000000a13 000030aa000030008000000000000a5b 000030aa000030008000000000000a5d 000030aa000030008000000000000a5e 000030aa000030008000000000000a5f 000030aa000030008000000000000a60 000030aa000030008000000000000a62 000030aa000030008000000000000a65 000030aa000030008000000000000a67 000030aa000030008000000000000a68 000030aa000030008000000000000a69 000030aa000030008000000000000a6a 000030aa000030008000000000000a6c 000030aa000030008000000000000a6d 000030aa000030008000000000000a71 000030aa000030008000000000000a72 000030aa000030008000000000000a76 000030aa000030008000000000000a79 000030aa000030008000000000000a7c 000030aa000030008000000000000a7e 000030aa000030008000000000000a81 000030aa000030008000000000000a83 000030aa000030008000000000000a85 000030aa000030008000000000000a8b 000030aa000030008000000000000a8d 000030aa000030008000000000000a90 000030aa000030008000000000000ac2 000030aa000030008000000000000c09 000030aa000030008000000000000c5c 000030aa000030008000000000000cba 000030aa000030008000000000000d33 000030aa000030008000000000000d7f 000030aa000030008000000000000d8b 000030aa000030008000000000000daf 000030aa000030008000000000000dce 000030aa000030008000000000000df7 000030aa000030008000000000000e14 000030aa000030008000000000000e16 000030aa000030008000000000000e18 000030aa000030008000000000000eb6 000030aa000030008000000000000f9d 000030aa0000300080000000000017df 000030aa0000300080000000000017e1
Background thread is marking system tables as deleted as well, we can probably optimize this, and not even go through an intermediary state for these.
m2|pid23453|:16910|http://127.62.12.233:17770 I0528 00:19:47.066646 23534 catalog_manager.cc:3348] Marking table as DELETED: pg_user_mapping_user_server_index [id=000030aa0000300080000000000000af]
m2|pid23453|:16910|http://127.62.12.233:17770 I0528 00:19:47.071460 23534 catalog_manager.cc:3348] Marking table as DELETED: pg_foreign_data_wrapper_name_index [id=000030aa000030008000000000000224]
Then M2 is the leader, but we see this from M3, a follower:
m3|pid23485|:11276|http://127.245.251.93:10853 W0528 00:20:02.725206 23518 long_operation_tracker.cc:112] UpdateReplica running for 1.000s:
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2ca909211e (unknown) (/tmp/glibc-20181130-26094-cs1x60/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb81e5f74 __GI___tls_get_addr (/tmp/glibc-20181130-26094-cs1x60/glibc-2.23/elf/dl-tls.c:831)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb320ac97 rocksdb::InternalKeyComparator::Compare(yb::Slice const&, yb::Slice const&) const (src/yb/rocksdb/db/dbformat.cc:111)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb328b7ef rocksdb::MemTable::KeyComparator::operator()(char const*, char const*) const (src/yb/rocksdb/db/memtable.cc:215)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32f5de1 rocksdb::SkipListBase<char const*, rocksdb::MemTableRep::KeyComparator const&, rocksdb::SingleWriterInlineSkipListNode>::KeyIsAfterNode(char const*, rocksdb::SingleWriterInlineSkipListNode*) const (src/yb/rocksdb/db/skiplist.h:346)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32f6302 rocksdb::SkipListBase<char const*, rocksdb::MemTableRep::KeyComparator const&, rocksdb::SingleWriterInlineSkipListNode>::PrepareInsert(char const*) (src/yb/rocksdb/db/skiplist.h:457)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32f4d04 rocksdb::SingleWriterInlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert(char const*) (src/yb/rocksdb/db/skiplist.h:642)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32f4d04 Insert (src/yb/rocksdb/memtable/skiplistrep.cc:55)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb328c38e rocksdb::MemTable::Add(unsigned long, rocksdb::ValueType, yb::Slice const&, yb::Slice const&, bool) (src/yb/rocksdb/db/memtable.cc:423)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32dfdef PutCF (src/yb/rocksdb/db/write_batch.cc:650)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32e18d4 rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const (src/yb/rocksdb/db/write_batch.cc:313)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32e1c26 rocksdb::WriteBatchInternal::InsertInto(rocksdb::autovector<rocksdb::WriteThread::Writer*, 8ul> const&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, unsigned long, rocksdb::DB*, yb::EnumBitSet<rocksdb::InsertFlag>) (src/yb/rocksdb/db/write_batch.cc:874)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb32395b8 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*) (src/yb/rocksdb/db/db_impl.cc:5146)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb3239eff rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) (src/yb/rocksdb/db/db_impl.cc:4789)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb6e8b469 yb::tablet::Tablet::WriteToRocksDB(rocksdb::UserFrontiers const*, rocksdb::WriteBatch*, yb::docdb::StorageDbType) (src/yb/tablet/tablet.cc:1108)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb6e9a645 yb::tablet::Tablet::ApplyKeyValueRowOperations(long, yb::docdb::KeyValueWriteBatchPB const&, rocksdb::UserFrontiers const*, yb::HybridTime) (src/yb/tablet/tablet.cc:1082)
m3|pid23485|:11276|http://127.245.251.93:10853 @ 0x7f2cb6e9a80d yb::tablet::Tablet::ApplyOperationState(yb::tablet::OperationState const&, long, yb::docdb::KeyValueWriteBatchPB const&) (src/yb/tablet/tablet.cc:1020)
What is it writing for >1s? Is DROP DATABASE really this expensive?
Then some absurd times for TSHeartbeat:
m2|pid23453|:16910|http://127.62.12.233:17770 W0528 00:20:34.690135 23530 long_operation_tracker.cc:112] TSHeartbeat running for 47.100s:
This suggests we were stuck on whatever previous work (the foreground DeleteNamespace or the background thread marking tables as deleted) for a really long time, likely holding some locks.
Eventually the YSQL level gives up:
ts1|pid23511|:17317|http://127.107.80.78:13072 W0528 00:20:47.133375 24186 client-internal.cc:215] Unable to send the request (namespace { id: "00004003000030008000000000000000" name: "su_db" database_type: YQL_DATABASE_PGSQL } database_type: YQL_DATABASE_PGSQL) to leader Master (127.62.12.233:16910): Timed out (yb/rpc/outbound_call.cc:512): DeleteNamespace RPC (request call id 501) to 127.62.12.233:16910 timed out after 60.000s
It also retries though? That suggests we're not putting a 60s cap on the total operation, since we just spent 60s on the previous attempt:
m2|pid23453|:16910|http://127.62.12.233:17770 I0528 00:20:47.137390 26650 catalog_manager.cc:4870] Servicing DeleteNamespace request from 127.107.80.78:33509: namespace { id: "00004003000030008000000000000000" name: "su_db" database_type: YQL_DATABASE_PGSQL } database_type: YQL_DATABASE_PGSQL
Some things related to commit 6994a85679ef798eb032427a8fef7ff842268e6f:
Deleting CDC streams does not factor in YSQL system tables?
It used to only be for user tables, but no reviewer mentioned anything when that changed. If CDC streams can't be on system tables, we can probably change this.
Background thread is marking system tables as deleted as well, we can probably optimize this, and not even go through an intermediary state for these.
I thought the point was to mark them as deleted. Otherwise, they would be RUNNING or DELETING.
I noticed a lot of these:
Failed to heartbeat to 127.62.12.233:16910: Service unavailable (yb/tserver/heartbeater.cc:400): master is no longer the leader
Connected to a leader master server at 127.62.12.233:16910
This seems to happen for at least ts1 and ts2 (didn't check ts3). Somehow, tservers aren't figuring out the master leader change?
As for
It also retries though? That suggests we're not putting a 60s cap on the total operation, since we just spent 60s on the previous attempt:
the log suggests
if (now < deadline) {
is true. The default should be 60s (from the old "now"). I'll try to figure out what the deadline is.
Here is a full log of the previous comprehensive one: https://gist.github.com/jaki/854f160cc63ba014337d14ed352a8ddc/raw.
The now < deadline is reasonable since rpc_deadline is intentionally set slightly lower than deadline.
// The RPC's deadline is intentionally earlier than the overall
// deadline so that we reserve some time with which to find a new
// leader master and retry before the overall deadline expires.
ts1|pid12307|:22684|http://127.98.111.153:21049 W0530 16:15:23.426978 22201 client-internal.cc:215]
ts1|pid12307|:22684|http://127.98.111.153:21049 now: 2865956.120s
ts1|pid12307|:22684|http://127.98.111.153:21049 deadline: 2866016.117s
ts1|pid12307|:22684|http://127.98.111.153:21049 rpc_deadline: 2865956.117s
ts1|pid12307|:22684|http://127.98.111.153:21049 default_rpc_timeout_: 60.000s
The Failed to heartbeat messages seem closer to this issue.
However, the failures related to this are a tiny fraction of the most prevalent failure: catalog version mismatch.
This test now fails intermittently, only on macOS and release builds, with the following error.
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Error Stack
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:335)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
at org.yb.pgsql.TestPgAuthorization.lambda$testMultiDatabaseOwnershipReassign$110(TestPgAuthorization.java:2660)
at org.yb.pgsql.TestPgAuthorization.withRole(TestPgAuthorization.java:3203)
at org.yb.pgsql.TestPgAuthorization.testMultiDatabaseOwnershipReassign(TestPgAuthorization.java:2659)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
at org.postgresql.core.PGStream.receiveChar(PGStream.java:308)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1952)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:308)
Are these failures due to a test-setup issue on Jenkins?
Jira Link: DB-1428. Failing on 8/10 of the past commits:
https://detective-gcp.dev.yugabyte.com/job/github-yugabyte-db-centos-master-clang-release%2F883%2Fartifact%2Fjava%2Fyb-pgsql%2Ftarget%2Fsurefire-reports_org.yb.pgsql.TestPgAuthorization__testMultiDatabaseOwnershipReassign%2Forg.yb.pgsql.TestPgAuthorization-output.txt?max_lines=3000&start_line=6001