
tserver exit with SIGSEGV #1899

Closed GOGOYAO closed 5 years ago

GOGOYAO commented 5 years ago

On CentOS 7.6, I built a cluster of 3 nodes with DB version 1.3.0.0. While running a benchmark, I found one tserver node dead with the following stack trace:
SIGSEGV (@0x0) received by PID 21386 (TID 0x7f73bdd29700) from PID 0; stack trace:
    @     0x7f73cde98ba0 (unknown)
    @     0x7f73d6935d20 rocksdb::DBImpl::GetMutableMemTableSmallestFrontier()
    @     0x7f73d9ef6995 yb::tablet::Tablet::OldestMutableMemtableWriteHybridTime()
    @     0x7f73da731b67 yb::tserver::TSTabletManager::TabletToFlush()
    @     0x7f73da733d12 yb::tserver::TSTabletManager::MaybeFlushTablet()
    @     0x7f73da743118 yb::BackgroundTask::Run()
    @     0x7f73d3176c18 yb::Thread::SuperviseThread()
    @     0x7f73cde90694 start_thread
    @     0x7f73cd5cd41d __clone
    @                0x0 (unknown)

ttyusupov commented 5 years ago

Hi @GoGoYao, how easy is it to reproduce? Could you share the benchmark?

ttyusupov commented 5 years ago

Also, could you attach the logs (yb-tserver.INFO.xxx) from the node that crashed?

GOGOYAO commented 5 years ago

I found the following error in yb-tserver.ERROR:

E0725 12:04:44.052781 22097 process_context.cc:180] SQL Error: Execution Error. Write(tablet: f62a872c1fcd45699afa5263401b725b, num_ops: 1, num_attempts: 9, txn: 167378f3-d60e-4c77-b43b-18f6165c105d) passed its deadline 115909.394s (passed: 65.305s): Remote error (yb/rpc/outbound_call.cc:402): Service unavailable (yb/rpc/service_pool.cc:207): Write request on yb.tserver.TabletServerService from 10.158.41.140:16923 dropped due to backpressure. The service queue is full, it has 18446744073709551614 items.
BEGIN TRANSACTION  INSERT INTO CassandraTransactionalKeyValue (k, v) VALUES (:k1, :v1);  INSERT INTO CassandraTransactionalKeyValue (k, v) VALUES (:k2, :v2); END TRANSACTION;

I also found these messages via dmesg:

[117413.579099] traps: postgres[22301] general protection ip:9615ec sp:7ffd8b4d2880 error:0 in postgres[400000+7a9000]
[117413.579126] sq_acceptor[22302]: segfault at 8 ip 00000000009615ec sp 00007f98b39b9c30 error 4 in postgres[400000+7a9000]

Then I found the following in postgresql-xxxx.log:

2019-07-25 11:57:32.563 UTC [22074] LOG:  database system is ready to accept connections
2019-07-25 12:28:46.504 UTC [22074] LOG:  received fast shutdown request
2019-07-25 12:28:46.505 UTC [22074] LOG:  aborting any active transactions
2019-07-25 12:28:46.505 UTC [22301] FATAL:  terminating background worker "YSQL webserver" due to administrator command
2019-07-25 12:28:47.407 UTC [22074] LOG:  background worker "YSQL webserver" (PID 22301) was terminated by signal 11: Segmentation fault
2019-07-25 12:28:47.407 UTC [22074] LOG:  terminating any other active server processes
2019-07-25 12:28:47.417 UTC [22074] LOG:  abnormal database system shutdown
2019-07-25 12:28:47.468 UTC [22074] LOG:  database system is shut down

It seems like something happened in the SQL layer?

ndeodhar commented 5 years ago

Is your benchmark for Cassandra or SQL? The tserver error above is from Cassandra benchmarking. Did you also run SQL benchmarking?

ndeodhar commented 5 years ago

Can you also turn on postgres statement logging and run your benchmark again? You can turn on statement logging by editing the $data_dir/node-1/disk-1/pg_data/postgresql.conf file and adding the line log_statement='all'. After that, restart the tserver (you can use ./bin/yb-ctl restart if you're using a local setup) and then run the benchmark again; see the sketch below.
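
A minimal sketch of those steps, assuming the postgresql.conf path mentioned above and a local yb-ctl setup (on a multi-machine cluster, edit the file and restart the yb-tserver process on each node instead):

# Enable YSQL (postgres) statement logging; $data_dir is the data directory
# referenced above and must be set for your deployment.
echo "log_statement='all'" >> $data_dir/node-1/disk-1/pg_data/postgresql.conf

# Restart the tserver so the new setting takes effect (local yb-ctl setup only).
./bin/yb-ctl restart

# Re-run the benchmark, then check postgresql-xxxx.log for the logged statements.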

ttyusupov commented 5 years ago

As for

E0725 12:04:44.052781 22097 process_context.cc:180] SQL Error: Execution Error. Write(tablet: f62a872c1fcd45699afa5263401b725b, num_ops: 1, num_attempts: 9, txn: 167378f3-d60e-4c77-b43b-18f6165c105d) passed its deadline 115909.394s (passed: 65.305s): Remote error (yb/rpc/outbound_call.cc:402): Service unavailable (yb/rpc/service_pool.cc:207): Write request on yb.tserver.TabletServerService from 10.158.41.140:16923 dropped due to backpressure. The service queue is full, it has 18446744073709551614 items.

The message "The service queue is full, it has 18446744073709551614 items." is due to an overflow which was fixed by https://github.com/YugaByte/yugabyte-db/issues/1866.
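
For context, 18446744073709551614 is 2^64 - 2, i.e. the value an unsigned 64-bit counter takes when it is decremented two steps past zero, which is why the queue size is reported as a huge number. A quick illustration (not YugabyteDB code), using bash's printf:

# Reinterpret -2 as an unsigned 64-bit integer:
printf '%u\n' -2    # prints 18446744073709551614, i.e. 2^64 - 2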

But there were probably more issues before the overflow. @GoGoYao, could you attach to the ticket the yb-tserver.INFO.<xxx> log file that corresponds to the core dump time window?

GOGOYAO commented 5 years ago

@ttyusupov I picked some logs from around the first time the "queue is full" error appears.

I0725 11:58:10.688164 29930 tablet_bootstrap.cc:931] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Number of orphaned replicates: 0, last id: 1.5436, commited id: 1.5436
I0725 11:58:10.688897 29930 tablet_bootstrap_if.cc:75] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Bootstrap complete.
I0725 11:58:10.688963 29930 ts_tablet_manager.cc:965] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Time spent bootstrapping tablet: real 0.130s   user 0.089s sys 0.041s
I0725 11:58:10.689718 29930 consensus_meta.cc:248] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Updating active role from UNKNOWN_ROLE to LEARNER. Consensus state: current_term: 1 leader_uuid: "" config { opid_index: 5436 peers { permanent_uuid: "7c5305f849ae40f58885d719b346c0d8" member_type: VOTER last_known_private_addr { host: "10.158.41.141" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "b87d70c678fc49fb8435dbe4afc3db24" member_type: VOTER last_known_private_addr { host: "10.158.41.139" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "3c1da0d1b980407f96d4fd153acfb78c" member_type: PRE_VOTER last_known_private_addr { host: "10.158.41.140" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
I0725 11:58:10.689992 29930 consensus_meta.cc:248] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Updating active role from LEARNER to LEARNER. Consensus state: current_term: 1 leader_uuid: "" config { opid_index: 5436 peers { permanent_uuid: "7c5305f849ae40f58885d719b346c0d8" member_type: VOTER last_known_private_addr { host: "10.158.41.141" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "b87d70c678fc49fb8435dbe4afc3db24" member_type: VOTER last_known_private_addr { host: "10.158.41.139" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "3c1da0d1b980407f96d4fd153acfb78c" member_type: PRE_VOTER last_known_private_addr { host: "10.158.41.140" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
I0725 11:58:10.690049 29930 raft_consensus.cc:354] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c [term 1 LEARNER]: Replica starting. Triggering 0 pending operations. Active config: opid_index: 5436 peers { permanent_uuid: "7c5305f849ae40f58885d719b346c0d8" member_type: VOTER last_known_private_addr { host: "10.158.41.141" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "b87d70c678fc49fb8435dbe4afc3db24" member_type: VOTER last_known_private_addr { host: "10.158.41.139" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "3c1da0d1b980407f96d4fd153acfb78c" member_type: PRE_VOTER last_known_private_addr { host: "10.158.41.140" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } }
I0725 11:58:10.690085 29930 raft_consensus.cc:875] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c [term 1 LEARNER]: Becoming Follower/Learner. State: Replica: 3c1da0d1b980407f96d4fd153acfb78c, State: 1, Role: LEARNER, Watermarks: {Received: { term: 1 index: 5436 } Committed: { term: 1 index: 5436 }} Leader: { term: 0 index: 0 }, new leader: , initial_fd_wait: <uninitialized>
I0725 11:58:10.690130 29930 consensus_meta.cc:248] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Updating active role from LEARNER to LEARNER. Consensus state: current_term: 1 leader_uuid: "" config { opid_index: 5436 peers { permanent_uuid: "7c5305f849ae40f58885d719b346c0d8" member_type: VOTER last_known_private_addr { host: "10.158.41.141" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "b87d70c678fc49fb8435dbe4afc3db24" member_type: VOTER last_known_private_addr { host: "10.158.41.139" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "3c1da0d1b980407f96d4fd153acfb78c" member_type: PRE_VOTER last_known_private_addr { host: "10.158.41.140" port: 9100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } }, has_pending_config = 0
I0725 11:58:10.690163 29930 consensus_queue.cc:204] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated op: 0.0, Majority replicated op: 0.0, Committed index: 0.0, Last appended: 1.5436, Current term: 0, Majority size: -1, State: QUEUE_OPEN, Mode: NON_LEADER
I0725 11:58:10.690181 29930 raft_consensus.cc:2815] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c [term 1 LEARNER]: Calling mark dirty synchronously for reason code CONSENSUS_STARTED
I0725 11:58:10.690373 29930 tablet_peer.cc:982] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c [state=RUNNING]: Changed state from BOOTSTRAPPING to RUNNING
I0725 11:58:10.690438 29930 ts_tablet_manager.cc:1001] T aca45ade0d1343f58c61aa10a18ab2fb P 3c1da0d1b980407f96d4fd153acfb78c: Time spent starting tablet: real 0.001s   user 0.001s sys 0.000s
I0725 11:58:10.731720 29651 ts_tablet_manager.cc:715] T 7fc593f2eab3461abff9e9b5c32d4293 P 3c1da0d1b980407f96d4fd153acfb78c: Initiating remote bootstrap from Peer 7c5305f849ae40f58885d719b346c0d8 (10.158.41.141:9100)
I0725 11:58:10.731775 29651 remote_bootstrap_client.cc:259] T 7fc593f2eab3461abff9e9b5c32d4293 P 1: Remote bootstrap client: Beginning remote bootstrap session from remote peer at address 10.158.41.141:9100
W0725 11:58:10.789455 22096 tablet_rpc.cc:327] Timed out (yb/rpc/rpc.cc:199): Failed UpdateTransaction: tablet_id: "de92a87c5e88486184ecb41843d8b53c" state { transaction_id: "4x\247j\016\030MN\217\207\006\027\230\020Z\001" status: APPLYING tablets: "046cd2a698774ad395ac629662894841" commit_hybrid_time: 6406371512294457344 }, retrier: { task_id: -1 state: kRunning deadline: 115518.488s } to tablet de92a87c5e88486184ecb41843d8b53c on tablet server { uuid: b87d70c678fc49fb8435dbe4afc3db24 private: [host: "10.158.41.139" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 6 attempt(s): UpdateTransaction: tablet_id: "de92a87c5e88486184ecb41843d8b53c" state { transaction_id: "4x\247j\016\030MN\217\207\006\027\230\020Z\001" status: APPLYING tablets: "046cd2a698774ad395ac629662894841" commit_hybrid_time: 6406371512294457344 }, retrier: { task_id: -1 state: kRunning deadline: 115518.488s } passed its deadline 115518.488s (passed: 7.949s): Remote error (yb/rpc/outbound_call.cc:402): Service unavailable (yb/rpc/service_pool.cc:207): UpdateTransaction request on yb.tserver.TabletServerService from 10.158.41.140:15975 dropped due to backpressure. The service queue is full, it has 18446744073709551614 items.
W0725 11:58:10.789451 22086 tablet_rpc.cc:327] Timed out (yb/rpc/rpc.cc:199): Failed UpdateTransaction: tablet_id: "0d1bc413e12d4afc94d14d23ef290330" state { transaction_id: "%\004\350\250EOB\032\211u\237\010\304\346\264\004" status: APPLYING tablets: "046cd2a698774ad395ac629662894841" commit_hybrid_time: 6406371495515217920 }, retrier: { task_id: -1 state: kRunning deadline: 115518.488s } to tablet 0d1bc413e12d4afc94d14d23ef290330 on tablet server { uuid: b87d70c678fc49fb8435dbe4afc3db24 private: [host: "10.158.41.139" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 6 attempt(s): UpdateTransaction: tablet_id: "0d1bc413e12d4afc94d14d23ef290330" state { transaction_id: "%\004\350\250EOB\032\211u\237\010\304\346\264\004" status: APPLYING tablets: "046cd2a698774ad395ac629662894841" commit_hybrid_time: 6406371495515217920 }, retrier: { task_id: -1 state: kRunning deadline: 115518.488s } passed its deadline 115518.488s (passed: 7.949s): Remote error (yb/rpc/outbound_call.cc:402): Service unavailable (yb/rpc/service_pool.cc:207): UpdateTransaction request on yb.tserver.TabletServerService from 10.158.41.140:35740 dropped due to backpressure. The service queue is full, it has 18446744073709551614 items.
GOGOYAO commented 5 years ago

@ndeodhar I built the cluster on 3 machines; it is not a local cluster. I used the benchmark tool and didn't run any CQL or SQL myself; I only ran the CassandraKeyValue, CassandraTransactionalKeyValue and CassandraBatchKeyValue tests.
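
For reference, these workloads are typically launched with the yb-sample-apps benchmark jar roughly as follows; the jar name and flags are the standard sample-apps ones and are assumptions here, not copied from this run:

# Hypothetical invocation of one of the listed workloads against the 3-node cluster
# (IPs taken from the logs above; 9042 is the default YCQL port).
java -jar yb-sample-apps.jar \
  --workload CassandraTransactionalKeyValue \
  --nodes 10.158.41.139:9042,10.158.41.140:9042,10.158.41.141:9042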

ttyusupov commented 5 years ago

@GoGoYao it would be useful if you could attach the whole yb-tserver.INFO.xxx file that corresponds to the tserver run that hit the SIGSEGV, because all the information in that log could be useful for the investigation, not only the part around the queue-full messages.

ndeodhar commented 5 years ago

@GoGoYao Thanks for confirming which benchmarking tests you ran. All three of these tests (CassandraKeyValue, CassandraTransactionalKeyValue and CassandraBatchKeyValue) are Cassandra-based and use YCQL. They don't use postgres (YSQL), so the postgres errors and logs above are unrelated to them.

As @ttyusupov mentioned, to help debug this it would be useful to get the entire yb-tserver.INFO.xxx file corresponding to the tserver crash; one way to locate it is sketched below.
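
A minimal sketch for finding the right file on the crashed node; <tserver-log-dir> is a placeholder for that node's tserver log directory, which depends on how the cluster was deployed:

# Go to the tserver log directory on the node that crashed.
cd <tserver-log-dir>

# glog writes timestamped yb-tserver.INFO.* files; list them by modification
# time to find the one covering the crash window,
ls -lt yb-tserver.INFO.*

# or search for the backpressure messages quoted earlier in this thread.
grep -l "queue is full" yb-tserver.INFO.*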