yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDb] Unable to connect to ysqlsh on a long running universe post upgrade #10410

Open kripasreenivasan opened 2 years ago

kripasreenivasan commented 2 years ago

Jira Link: DB-863

Description

Slack Thread: https://yugabyte.slack.com/archives/C4141D60H/p1634909201108600

On the long-running universe puppy-food-4s-2in (http://portal.dev.yugabyte.com/universes/da5e271d-3d18-448a-a6fa-b73560576ab7), the "Connectivity with ysqlsh" health check is failing with the following error:

Error executing command ['ssh', 'yugabyte@172.151.40.75', '-p', '54422', '-o', 'StrictHostKeyChecking no', '-o', 'ConnectTimeout=10', '-o', 'UserKnownHostsFile /dev/null', '-o', 'LogLevel ERROR', '-i', '/opt/yugabyte/yugaware/data/keys/08c0ba0e-3558-40fc-94e5-a87a627de8c5/yb-15-aws-portal-1-key.pem', 'set -o pipefail; /home/yugabyte/tserver/bin/ysqlsh -h 172.151.40.75 -p 5433 -U yugabyte -c "conninfo"']:
b'ysqlsh: could not connect to server: Connection refused\n\tIs the server running on host "172.151.40.75" and accepting\n\tTCP/IP connections on port 5433?\n'

On the same node, I am unable to connect with ysqlsh:

[yugabyte@ip-172-151-40-75 tserver]$ bin/ysqlsh -h 172.151.40.75
ysqlsh: could not connect to server: Connection refused
        Is the server running on host "172.151.40.75" and accepting
        TCP/IP connections on port 5433?

Notes from:

Looks like we have an issue with a tablet-splitting universe, see the logs below (http://portal.dev.yugabyte.com/universes/da5e271d-3d18-448a-a6fa-b73560576ab7/nodes). Looks like a whole bunch of remote bootstraps got kicked off after the upgrade.

@Kripa Sreenivasan: looks like the issues started after a software upgrade to 2.9.2.0-b47. Did you also start running sample apps at this time?

W1022 16:01:47.821209 31583 db_impl.cc:5782] T bfb3323721074572ae16d62b36caf27b P 7f8cd41c7be0478db27e7c43fe504e66 [I]: DeleteFile /000125.sst failed. File not found
W1022 16:01:47.821231 31583 tablet.cc:875] T bfb3323721074572ae16d62b36caf27b P 7f8cd41c7be0478db27e7c43fe504e66: DoCleanupIntentFiles: Failed to delete { total_size: 66687 base_size: 66385 uncompressed_size: 67080 name: "/000125.sst" db_path: "/mnt/d2/yb-data/tserver/data/rocksdb/table-000030ad000030008000000000004e05/tablet-bfb3323721074572ae16d62b36caf27b.intents" imported: 0 being_compacted: 0 column_family_name: default level: 0 smallest: { seqno: 1125899957496443 user_frontier: 0x000000002302fb40 -> { op_id: 4.5638204 hybrid_time: { physical: 1634918504250862 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } largest: { seqno: 1125899957496451 user_frontier: 0x00000000045a7ac0 -> { op_id: 4.5638222 hybrid_time: { physical: 1634918505728804 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } }, all files [{ total_size: 66687 base_size: 66385 uncompressed_size: 67080 name: "/000125.sst" db_path: "/mnt/d2/yb-data/tserver/data/rocksdb/table-000030ad000030008000000000004e05/tablet-bfb3323721074572ae16d62b36caf27b.intents" imported: 0 being_compacted: 0 column_family_name: default level: 0 smallest: { seqno: 1125899957496443 user_frontier: 0x000000002302fb40 -> { op_id: 4.5638204 hybrid_time: { physical: 1634918504250862 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } largest: { seqno: 1125899957496451 user_frontier: 0x00000000045a7ac0 -> { op_id: 4.5638222 hybrid_time: { physical: 1634918505728804 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } }]: Invalid argument (yb/rocksdb/db/db_impl.cc:5784): File not found
I1022 16:01:47.821590 31583 tablet.cc:864] T bfb3323721074572ae16d62b36caf27b P 7f8cd41c7be0478db27e7c43fe504e66: DoCleanupIntentFiles: Intents SST file will be deleted: { total_size: 66670 base_size: 66385 uncompressed_size: 67080 name: "/000126.sst" db_path: "/mnt/d2/yb-data/tserver/data/rocksdb/table-000030ad000030008000000000004e05/tablet-bfb3323721074572ae16d62b36caf27b.intents" imported: 0 being_compacted: 0 column_family_name: default level: 0 smallest: { seqno: 0 user_frontier: 0x00000000045a7ac0 -> { op_id: 4.5638204 hybrid_time: { physical: 1634918504250862 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } largest: { seqno: 1125899957496451 user_frontier: 0x0000000098289580 -> { op_id: 4.5638222 hybrid_time: { physical: 1634918505728804 } history_cutoff: hybrid_time_filter: max_value_level_ttl_expiration_time: } } }, max ht: { physical: 1634918505728804 }, min running transaction start ht: { physical: 1634918507656006 }
I1022 16:01:47.821730 31583 db_impl.cc:1147] T bfb3323721074572ae16d62b36caf27b P 7f8cd41c7be0478db27e7c43fe504e66 [I]: [JOB 405] Delete /mnt/d2/yb-data/tserver/data/rocksdb/table-000030ad000030008000000000004e05/tablet-bfb3323721074572ae16d62b36caf27b.intents/000126.sst type=2 #126 -- OK
I1022 16:01:47.821748 31583 event_logger.cc:67] T bfb3323721074572ae16d62b36caf27b P 7f8cd41c7be0478db27e7c43fe504e66 [I]: EVENT_LOG_v1 {"time_micros": 1634918507821744, "job": 405, "event": "table_file_deletion", "file_number": 126}
W1022 16:01:48.401001 22813 pg_wrapper.cc:706] Failed when waiting for PostgreSQL server to exit: Illegal state (yb/util/subprocess.cc:398): DoWait called on a process that is not running, waiting a bit
W1022 16:01:49.401114 22813 pg_wrapper.cc:706] Failed when waiting for PostgreSQL server to exit: Illegal state (yb/util/subprocess.cc:398): DoWait called on a process that is not running, waiting a bit
W1022 16:01:50.40122

Observations from:

cc @mihnea, this might be an issue with the TS starting PG after a restart. Startup shows an older PG lingering?
W1022 09:47:00.902312 22777 pg_wrapper.cc:638] Killing older postgres process: 30706
I1022 09:47:00.902508 22777 pg_wrapper.cc:608] Starting PostgreSQL server
W1022 09:47:00.926363 22777 pg_wrapper.cc:303] No hba configuration lines found, defaulting to trust all configuration.
I1022 09:47:00.931166 22777 pg_wrapper.cc:429] PostgreSQL server running as pid 22812

Then, over and over again, we fail right after starting it up:
W1022 09:47:06.668337 22813 pg_wrapper.cc:701] PostgreSQL server exited with code 256
I1022 09:47:06.668391 22813 pg_wrapper.cc:717] Restarting PostgreSQL server
W1022 09:47:06.669970 22813 pg_wrapper.cc:303] No hba configuration lines found, defaulting to trust all configuration.
I1022 09:47:06.675122 22813 pg_wrapper.cc:429] PostgreSQL server running as pid 22890

Eventually we seem to give up:
W1022 09:51:08.471691 22813 pg_wrapper.cc:721] Failed trying to start PostgreSQL server: Runtime error (yb/util/subprocess.cc:307): Unable to fork: Cannot allocate memory (system error 12), waiting a bit
W1022 09:51:09.471838 22813 pg_wrapper.cc:706] Failed when waiting for PostgreSQL server to exit: Illegal state (yb/util/subprocess.cc:398): DoWait called on a process that is not running, waiting a bit

TBD whether the memory error is a cause of the constant failures, or an effect of the constant forking.
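
As a reading aid for the lines above, here is a minimal sketch (not the actual pg_wrapper.cc code) of the supervise/restart/wait pattern those warnings describe; StartPostgres() and WaitForPostgresExit() are hypothetical stand-ins for the real wrapper's helpers:

    // Sketch only: models "exited with code N -> restart" and
    // "failed to start ... waiting a bit" from the log lines above.
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Hypothetical: fork/exec the postmaster; returns false if e.g. fork() fails (ENOMEM).
    bool StartPostgres() { return true; }
    // Hypothetical: block until the postmaster exits and return its status.
    int WaitForPostgresExit() { return 256; }

    void SupervisePostgres(int max_attempts) {
      for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (!StartPostgres()) {
          // Corresponds to "Failed trying to start PostgreSQL server ... waiting a bit".
          std::cerr << "Failed trying to start PostgreSQL server, waiting a bit\n";
          std::this_thread::sleep_for(std::chrono::seconds(1));
          continue;
        }
        // Corresponds to "PostgreSQL server exited with code 256" / "Restarting PostgreSQL server".
        std::cerr << "PostgreSQL server exited with code "
                  << WaitForPostgresExit() << ", restarting\n";
      }
    }

    int main() { SupervisePostgres(/*max_attempts=*/3); }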

Separately, cc @sergei @dmitry: it seems the new pgclient work might end up spamming the logs (also TBD whether this is cause or effect), but we should go over one of these PG logs and see what we can learn about log spew under stress:
2021-10-22 09:46:59.728 UTC [18676] ERROR: Network error: recvmsg error: Connection reset by peer
2021-10-22 09:46:59.728 UTC [18676] STATEMENT: SELECT k, v FROM sqlsecondaryindex WHERE v = $1
2021-10-22 09:46:59.768 UTC [18676] FATAL: terminating connection due to unexpected postmaster exit
I1022 09:46:59.771402 18680 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
2021-10-22 09:47:01.630 UTC [22662] FATAL: terminating connection due to unexpected postmaster exit
I1022 09:47:01.639415 22731 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
W1022 09:47:04.226065 20834 pg_client.cc:100] Heartbeat failed: Invalid argument (yb/tserver/pg_client_service.cc:301): Unknown session: 437
W1022 09:47:13.213145 20832 pg_client.cc:100] Heartbeat failed: Invalid argument (yb/tserver/pg_client_service.cc:301): Unknown session: 437
W1022 09:47:22.250661 20833 pg_client.cc:100] Heartbeat failed: Invalid argument (yb/tserver/pg_client_service.cc:301): Unknown session: 437
W1022 09:47:31.235635 20831 pg_client.cc:100] Heartbeat failed: Invalid argument (yb/tserver/pg_client_service.cc:301): Unknown session: 437

Some more strangeness:
ps aux | grep postgres
...
yugabyte 20724 0.0 0.4 553432 15744 ? Ssl 09:39 0:22 postgres: postgres postgres 172.151.26.211(54130) INSERT
yugabyte 30725 0.0 0.3 296780 10912 ? Ss Oct21 0:00 postgres: logger
So there is a backend still stuck from right after the restart, and the last PG log file (postgresql-2021-10-22_000000.log) is actively being written to, so perhaps this backend is somehow holding things up? Also, what is the "postgres: logger" process, and why is it still up since yesterday?

tedyu commented 2 years ago

For ./src/yb/yql/pggate/pg_client.cc

      if (!status.ok()) {
        LOG(WARNING) << "Heartbeat failed: " << status;

It seems YB_PLOG_EVERY_N can be used to reduce spam.
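
A minimal sketch of that idea, assuming a glog-style LOG_EVERY_N macro (or an equivalent YB rate-limited logging macro such as the one mentioned above) is available at this call site; it throttles the warning so a persistently failing heartbeat emits one line per 100 occurrences instead of one per poll:

    #include <string>
    #include <glog/logging.h>

    // Sketch only: emits the warning on the 1st, 101st, 201st, ... failure;
    // the calls in between are suppressed.
    void ReportHeartbeatFailure(const std::string& status) {
      LOG_EVERY_N(WARNING, 100) << "Heartbeat failed: " << status;
    }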