yugabyte / yugabyte-db


[yugabyted] after a while Yugabyte cluster on Kubernetes starts crashing #19237

Open remeq opened 12 months ago

remeq commented 12 months ago

Jira Link: DB-8028

Description

Hello,

we are deploying YB using the Helm chart into our Kubernetes cluster:

$ helm list
NAME    NAMESPACE       REVISION        UPDATED                                         STATUS          CHART           APP VERSION  
yb-prm1 dev-stage       1               2023-09-21 09:43:49.444627057 +0200 CEST        deployed        yugabyte-2.19.0 2.19.0.0-b190

After a few days of running, the cluster always starts failing. The yb-tserver-0, yb-tserver-1, and yb-tserver-2 pods are crashing and restarting (~60 restarts of a pod within 2 hours). Currently we fix that by completely uninstalling YB and removing all PVCs (it's a development environment).

I am attaching two log files from the stdout of the containers. I am not sure whether what we see in the logs is the root cause or already a consequence.
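
(For reference, a quick way to see what Kubernetes recorded about those restarts; the namespace is taken from the helm output above, the rest is a generic sketch:)

# restart count, last state, and exit reason recorded for each container in the pod
kubectl describe pod yb-tserver-0 -n dev-stage | grep -E 'Restart Count|Last State|Reason|Exit Code'
# cluster events related to the pod (OOM kills, probe failures, evictions, ...)
kubectl get events -n dev-stage --field-selector involvedObject.name=yb-tserver-0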

Perhaps the following is the issue?

I0914 08:53:09.228197    83 raft_consensus.cc:2982] T efbb420532054a84aefd2feed4a8600d P 2d6a8645473b4a37827efb2fe87a88d9 [term 134 LEADER]:  Leader pre-election vote request: Denying vote to candidate 458dd194435f4ec3a1432c7b8aa55374 for term 135 because replica believes a valid leader to be alive (leader uuid: 2d6a8645473b4a37827efb2fe87a88d9). Time left: 9222016986.154s

Note that this happens regularly. After a few days, YB always ends up like this.

Besides the log files, which contain the container stdout, I have also collected YB logs from the containers, but they are too big to upload here.

Regards, Premek

Attachments: tserver-0.log, tserver-1.log


ddorian commented 12 months ago

cc @yogendra

yogendra commented 12 months ago

Would you be able to share the values file used for this? I see a couple of issues in the tserver-1 log.

First, I can see a problem in creating the socket lock file:

2023-09-14 08:52:04.579 UTC [631] FATAL:  lock file "/tmp/.yb.0.0.0.0:5433/.s.PGSQL.5433.lock" already exists

Second, I see that there are Redis-related logs. Are you using, or have you enabled, YEDIS?

I0914 08:52:04.627539   100 client_master_rpc.cc:76] 0x00005617b28261d8 -> GetTableSchemaRpc(table_identifier: table_name: "redis" namespace { name: "system_redis" database_type: YQL_DATABASE_REDIS }, num_attempts: 1): Failed, got resp error: Not found (yb/master/catalog_manager.cc:5265): Keyspace name not found: system_redis (master error 11)
remeq commented 12 months ago

Here is the values file I use to deploy YB via helm:

storage:
  master:
    count: 2
    size: 10Gi
    storageClass: gp2
  tserver:
    count: 2
    size: 10Gi
    storageClass: gp2

resource:
  master:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      cpu: 2
      memory: 8Gi
  tserver:
    requests:
      cpu: 2
      memory: 8Gi
    limits:
      cpu: 2
      memory: 8Gi

replicas:
  master: 3
  tserver: 3

partition:
  master: 3
  tserver: 3

nodeSelector:
  onramp2: bignode

tserver:
  tolerations:
  - key: dedicated
    operator: Equal
    value: experimental
    effect: NoSchedule

master:
  tolerations:
  - key: dedicated
    operator: Equal
    value: experimental
    effect: NoSchedule

I haven't configured anything related to Redis (and didn't know that Redis is being used under the hood).

yogendra commented 12 months ago

The only thing odd in this is the partition values. A value of 3 means that, in case of an upgrade, only pods with ordinal 3 and higher will be upgraded in the StatefulSet.
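
For reference, this is roughly what that value maps to on the StatefulSet (an illustrative sketch, not the chart's exact rendered output):

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 3   # only pods with ordinal >= 3 are updated on a rollout; with 3 replicas (ordinals 0-2) none would roll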

I am assuming:

  1. This is running on AWS
  2. nodeSelector is pointing to a node pool on AWS

Is the node pool multi-AZ? There is a chance that, on a restart/rescheduling, a pod may come up in a different zone: for example, tserver-0 is on a node in AZ1, and on restarting it may come up in AZ2 or AZ3. That may also cause a PV mounting issue.
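
One hedged way to cross-check this (the zone label is the standard topology label; the tserver label selector is assumed from the chart defaults, and for CSI volumes the zone lives in spec.nodeAffinity rather than in labels):

# zone of each Kubernetes node
kubectl get nodes -L topology.kubernetes.io/zone
# which node each tserver pod actually landed on
kubectl get pods -n dev-stage -l app=yb-tserver -o wide
# zone each PV is pinned to
kubectl describe pv | grep -iE '^Name:|zone'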

remeq commented 12 months ago

It is a fully multi-AZ setup. Currently every replica of the tserver is in its own AZ, but I think the PVCs are also created accordingly in their respective AZs. Moreover, after a fresh installation YB runs without issues and we are able to create data in the DB. Only after a few days does it start crashing like that.

yogendra commented 12 months ago

I am trying to check on the error message (long lease time). Is there any way to find out why the first restart happened? Was the host drained? Or was there an out-of-memory event or anything like that?

remeq commented 12 months ago

No, I don't know why the first restart happened. I don't think it is because of the nodes; they all have uptime over 15 days, and the last time YB broke was yesterday. When YB breaks again, I can provide access to our cluster.

yogendra commented 12 months ago

For a multi-AZ deployment, I would recommend using the multi-AZ procedure: https://docs.yugabyte.com/preview/deploy/kubernetes/multi-zone/eks/helm-chart/

I am suspecting that pod rescheduling due to a restart/crash might be causing an AZ/volume mismatch. The initial deployment results in a correct mapping because the PVs are provisioned at that point.
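
One thing that typically prevents this kind of zone/volume mismatch is delayed volume binding, so the PV is provisioned only after the pod has been scheduled to a zone. A minimal sketch, assuming the in-tree EBS provisioner (swap in ebs.csi.aws.com if the cluster uses the EBS CSI driver); the class name is hypothetical and would be referenced from storage.*.storageClass in the values file:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-wait
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer   # provision/bind the EBS volume only once the pod is scheduled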

remeq commented 12 months ago

Thanks, I somehow missed this doc. I will redeploy according to the multi-AZ procedure; it could indeed be the culprit. Thanks again, I will report back.

remeq commented 11 months ago

@yogendra So I have completely redeployed our Yugabyte according to the multi-AZ doc. Unfortunately, after about 2 hours of running, YB broke again. My impression is that it may not be exactly the same type of crash as the one in the original post. What happened is that the currently elected master went down for some reason, and there were about 9 restarts of the container within an hour. Also (not related to those 9 restarts of the pod), from the attached log it is apparent that the YB container has a problem restarting the system inside, because some lock file probably wasn't properly cleaned up from the previous run.

2023-09-24 15:22:01.771 UTC [16896] LOG:  YugaByte is ENABLED in PostgreSQL. Transactions are enabled.
2023-09-24 15:22:01.777 UTC [16896] LOG:  pgaudit extension initialized
2023-09-24 15:22:01.778 UTC [16896] LOG:  listening on IPv4 address "0.0.0.0", port 5433
2023-09-24 15:22:01.779 UTC [16896] FATAL:  lock file "/tmp/.yb.0.0.0.0:5433/.s.PGSQL.5433.lock" already exists
2023-09-24 15:22:01.779 UTC [16896] HINT:  Is another postmaster (PID 68) using socket file "/tmp/.yb.0.0.0.0:5433/.s.PGSQL.5433"?
2023-09-24 15:22:01.779 UTC [16896] LOG:  database system is shut down

The YB software inside tries to run the DB server in a loop, so the log is full of those attempts; those attempts are, however, within the same pod/container run. I don't know what exactly caused the pod itself to be restarted 9 times (after those 9 restarts it kept running, but the software inside was stuck in the loop shown in the attached log).

The somewhat good news is that this kind of crash seems to be "softer". I was able to get YB back online simply by deleting the pod and letting the StatefulSet recreate it from scratch. In the original crash it was much worse (deleting the pod didn't help). Attachment: yb-tserver-0.log

shinji62 commented 11 months ago

Hi @remeq, what we need to understand is why the postgres process is crashing.

We need to get the full logs from the start to the crash if possible, and we need to take a look at the postgres log, which should be in the container as well.

What may be happening in your case is that the process is restarted/crashes and the lock file is not removed properly.

The tserver then tries to restart the postgres process but goes into a loop because of the stale lock file.
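
If it helps to confirm that, the first line of that lock file should be the PID of the postmaster that created it; a quick sketch you could run inside the tserver container (path taken from the error message above):

LOCK="/tmp/.yb.0.0.0.0:5433/.s.PGSQL.5433.lock"
PID=$(head -1 "$LOCK")
# if that postmaster is no longer running, the lock file is left over from a previous attempt
ps -p "$PID" || echo "postmaster $PID is gone; $LOCK looks stale"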

remeq commented 11 months ago

Hi @shinji62, today it crashed again. There were 29 restarts of the tserver pod, and after that it stabilized itself. Here are the logs: https://yblogs123.s3.eu-west-1.amazonaws.com/logs-tserver0-az1.tar.gz

yogendra commented 11 months ago

I can see a bunch of "Scheduler is shutting down (system error 108)" messages leading up to the system shutdown. Can I check what activity was ongoing when the server shut down? Were you running a performance test? Or a high-concurrency job, like >100 concurrent connections?

remeq commented 11 months ago

@yogendra Hmm, I haven't run anything specific. The database is used by the Temporal software as a backend DB (besides a bunch of our own microservices), so I am not sure whether that software did something.

Now it broke again:

(screenshot attached)

remeq commented 11 months ago

@yogendra Here is also a gist which contains a full describe of the yb-master-0 pod and also a full describe of the node this pod is running on, in case resources are somehow the problem.

https://gist.github.com/remeq/5afe5a530b4fa433e060ee430ce7ebc5

remeq commented 11 months ago

@yogendra Actually, we see this 108 error all the time, even after we freshly recreate the pods and the DB appears to be functional, so the question is: does this error indicate a real problem that should be investigated?

I0928 14:43:21.864732 3323 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)

remeq commented 11 months ago

@yogendra I think your question about the number of concurrent connections could be a lead in the right direction. I counted the number of active connections in one of our AZs using the following commands, and the result is frightening:

kubectl exec -it yb-tserver-0  -n yb-eu-west-1a -- /bin/bash
[root@yb-tserver-0 cores]# netstat -an | grep 5433 | grep ESTABLISHED | wc -l
227
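
An alternative count that includes only SQL backends rather than every TCP connection on port 5433 (the ysqlsh path is assumed from the container layout; the host name is the one seen in the tserver logs):

kubectl exec -it yb-tserver-0 -n yb-eu-west-1a -- \
  /home/yugabyte/bin/ysqlsh -h yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local \
  -c "SELECT count(*) FROM pg_stat_activity;"
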
shinji62 commented 11 months ago

@remeq how were the CPU/memory at the time of crashing? It could be that one of the pods is overloaded. How do you connect to the DB from your app?

remeq commented 11 months ago

@shinji62 I don't have numbers for CPU/memory right now, but our current focus is fixing our application so it does not open so many connections to the DB server. I've read in the YB docs that the default limit is 300 connections per node, and we were getting close to it.

Do I assume correctly that, should we need a lot of connections, we can scale this by adding new nodes to the YB cluster and spreading connections over the cluster?

Once we bring the number of connections down, I will report back on whether it improved the situation.

shinji62 commented 11 months ago

@remeq In our SaaS we normally allow 15 connections per vCPU per node. Of course this is quite conservative, but it works for a large range of workloads.

I'm not sure what driver you are using, but you should have a pooling mechanism to reuse connections.
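
As a rough illustration with the resource values shared earlier (2 vCPUs per tserver, 3 tservers), that guideline works out to about 15 × 2 × 3 = 90 connections for the whole cluster, well below the 300-per-node default mentioned above.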

remeq commented 11 months ago

Hmm, so it seems that there are some more issues with our YB. We fixed the number of connections, so currently there are about 25 connections, but YB is still crashing. Currently I am looking at two issues:

1) In the postgres logs I keep seeing the error 108 about the scheduler:

I0928 15:04:02.204464 17087 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I0928 15:04:02.206054 17087 pg_client.cc:162] Session id 1338: Session id acquired. Postgres backend pid: 17087
I0928 15:04:02.248798 17097 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0928 15:04:02.249640 17097 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0928 15:04:02.250339 17097 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I0928 15:04:02.251868 17097 pg_client.cc:162] Session id 1339: Session id acquired. Postgres backend pid: 17097
I0928 15:04:08.764662  4055 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
I0928 15:04:15.566011 17114 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0928 15:04:15.566864 17114 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0928 15:04:15.567566 17114 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I0928 15:04:15.569221 17114 pg_client.cc:162] Session id 1340: Session id acquired. Postgres backend pid: 17114
I0928 15:04:28.478423 17141 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I0928 15:04:28.479322 17141 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I0928 15:04:28.480062 17141 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I0928 15:04:28.481685 17141 pg_client.cc:162] Session id 1341: Session id acquired. Postgres backend pid: 17141
I0928 15:04:28.517724 17151 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes

This error occurs even when I freshly recycle YB (delete the YB pods).

2) Pods crash after a while with the following:

I1003 11:22:40.407141    67 client_master_rpc.cc:76] 0x000055a60dc5e718 -> GetTableSchemaRpc(table_identifier: table_name: "redis" namespace { name: "system_redis" database_type: YQL_DATABASE_REDIS }, num_attempts: 1): Failed, got resp error: Not found (yb/master/catalog_manager.cc:5307): Keyspace name not found: system_redis (master error 11)
W1003 11:22:40.407539    67 client-internal.cc:1408] GetTableSchemaRpc(table_identifier: table_name: "redis" namespace { name: "system_redis" database_type: YQL_DATABASE_REDIS }, num_attempts: 1) failed: Not found (yb/master/catalog_manager.cc:5307): Keyspace name not found: system_redis (master error 11)
*** Aborted at 1696332160 (unix time) try "date -d @1696332160" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x10) received by PID 24 (TID 0x7f6051d0b700) from PID 16; stack trace: ***
    @     0x55a5e4190281 std::__1::__function::__func<>::operator()()
    @     0x55a5e41c5178 yb::redisserver::RedisServiceImpl::Impl::Handle()
    @     0x55a5e41cfa7e yb::redisserver::RedisServiceImpl::Handle()
    @     0x55a5e38eaf8a yb::rpc::ServicePoolImpl::Handle()
    @     0x55a5e382ad7f yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x55a5e38f9b23 yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x55a5e3fcbb82 yb::Thread::SuperviseThread()
W1003 11:22:40.469089    68 transaction_status_cache.cc:133] Intent for transaction w/o metadata: d8cddb69-7582-4734-be64-f9635cc535fb
W1003 11:22:40.469184    68 transaction_status_cache.cc:133] Intent for transaction w/o metadata: 74b31e59-7d34-49d0-9e3c-d1cacaf2abe4
W1003 11:22:40.469202    68 transaction_status_cache.cc:133] Intent for transaction w/o metadata: e27db424-cfae-4a46-84cf-aa7c2d6848b7
yogendra commented 11 months ago

Can you grep on 4055? Based on the following line, I am wondering whether something happened earlier in thread id 4055 to trigger a shutdown:

I0928 15:04:08.764662  4055 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
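
For example, against whichever of the attached tserver logs contains that line:

grep ' 4055 ' yb-tserver-0.log
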
remeq commented 11 months ago

@yogendra I'm not sure it has any significance. It seems to be the ID of the thread/process that hits the 108 error; it is restarted and fails regularly:

I1003 13:08:48.563877   946 pg_client.cc:162] Session id 26: Session id acquired. Postgres backend pid: 946
I1003 13:08:48.638608   950 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
I1003 13:08:49.157478   958 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I1003 13:08:49.158547   958 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I1003 13:08:49.159335   958 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I1003 13:08:49.161031   958 pg_client.cc:162] Session id 27: Session id acquired. Postgres backend pid: 958
I1003 13:08:49.240470   960 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
I1003 13:08:49.317142   969 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I1003 13:08:49.318174   969 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I1003 13:08:49.318938   969 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100
I1003 13:08:49.320544   969 pg_client.cc:162] Session id 28: Session id acquired. Postgres backend pid: 969
I1003 13:08:49.406906   971 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
I1003 13:08:49.438158   980 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I1003 13:08:49.439085   980 thread_pool.cc:167] Starting thread pool { name: pggate_ybclient max_workers: 1024 }
I1003 13:08:49.439265   981 mem_tracker.cc:419] Creating root MemTracker with garbage collection threshold 5242880 bytes
I1003 13:08:49.439823   980 pg_client.cc:155] Using TServer host_port: yb-tserver-0.yb-tservers.yb-eu-west-1a.svc.cluster.local:9100

Here you can see it happened 3 times in a short period of time.

remeq commented 11 months ago

Note also that I am getting this error in a pod which keeps running (has not crashed), so this particular error is definitely not a result of the whole pod going down.

remeq commented 11 months ago

@yogendra @shinji62 - leaving the scheduler error 108 aside, what about the SIGSEGV crash which I also posted above? Our YB pods keep crashing with that, and it seems to be related to the Redis part of YB. Should the Redis functionality be somehow enabled/disabled/configured? When I list our databases, there is no system_redis keyspace that YB is looking for:

                                     List of databases
        Name         |  Owner   | Encoding | Collate |    Ctype    |   Access privileges   
---------------------+----------+----------+---------+-------------+-----------------------
 postgres            | postgres | UTF8     | C       | en_US.UTF-8 | 
 system_platform     | postgres | UTF8     | C       | en_US.UTF-8 | 
 template0           | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                     |          |          |         |             | postgres=CTc/postgres
 template1           | postgres | UTF8     | C       | en_US.UTF-8 | =c/postgres          +
                     |          |          |         |             | postgres=CTc/postgres
 temporal            | yugabyte | UTF8     | C       | en_US.UTF-8 | 
 temporal_visibility | yugabyte | UTF8     | C       | en_US.UTF-8 | 
 yugabyte            | postgres | UTF8     | C       | en_US.UTF-8 | 
(7 rows)
shinji62 commented 11 months ago

At this point it would be nice to get a core dump; you can check here: https://docs.yugabyte.com/preview/yugabyte-platform/troubleshoot/universe-issues/#collect-core-dumps-in-kubernetes-environments

remeq commented 11 months ago

@shinji62 here is the core dump:

https://yblogs123.s3.eu-west-1.amazonaws.com/core.60

and the SIGSEGV trace is:

*** SIGSEGV (@0x10) received by PID 24 (TID 0x7f6051d0b700) from PID 16; stack trace: ***
    @     0x55a5e4190281 std::__1::__function::__func<>::operator()()
    @     0x55a5e41c5178 yb::redisserver::RedisServiceImpl::Impl::Handle()
    @     0x55a5e41cfa7e yb::redisserver::RedisServiceImpl::Handle()
    @     0x55a5e38eaf8a yb::rpc::ServicePoolImpl::Handle()
    @     0x55a5e382ad7f yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x55a5e38f9b23 yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x55a5e3fcbb82 yb::Thread::SuperviseThread()
remeq commented 11 months ago

@shinji62

I am now trying to configure Redis using the following doc, as it doesn't seem that the Helm chart itself supports anything related to the Redis config:

https://docs.yugabyte.com/preview/yedis/quick-start/#kubernetes

I've executed the command

/home/yugabyte/bin/yb-admin --master_addresses yb-master-0.yb-masters.yb-eu-west-1a.svc.cluster.local:7100,yb-master-0.yb-masters.yb-eu-west-1b.svc.cluster.local:7100,yb-master-0.yb-masters.yb-eu-west-1c.svc.cluster.local:7100 setup_redis_table

Which according to the doc should be enough to set up Redis, but using the verification command from the same doc I see that Redis is not available:

kubectl exec -it yb-tserver-0 -n yb-eu-west-1a -- /home/yugabyte/bin/redis-cli
Defaulted container "yb-tserver" out of: yb-tserver, yb-cleanup
Could not connect to Redis at 127.0.0.1:6379: Connection refused
Could not connect to Redis at 127.0.0.1:6379: Connection refused
not connected> 
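
A quick way to confirm whether anything is listening on the YEDIS port inside the tserver container (same netstat approach as used earlier for port 5433):

kubectl exec -it yb-tserver-0 -n yb-eu-west-1a -- /bin/bash -c "netstat -an | grep 6379 | grep LISTEN"
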
ddorian commented 11 months ago

I am now trying to configure Redis using the following doc

Please do not use YEDIS, because it is deprecated: https://docs.yugabyte.com/preview/yedis/

remeq commented 11 months ago

@ddorian no, it's not like that. I don't want to use Redis (YEDIS), but I needed to configure it because apparently without the Redis configuration the normal Yugabyte (YSQL/YCQL) keeps crashing. Since yesterday, when I executed the above command, YB hasn't crashed yet.

yogendra commented 11 months ago

@remeq Would it be possible for you to join our community Slack and ping me? We could converse quickly and might be able to get on a Zoom call at some point.
I have been busy for the past week and couldn't get around to this. I am trying to understand the core dump, which will take a bit.

In the meantime, can you share how many databases, schemas, tables, and tablets there are in this cluster? I suspect that, apart from connections, you might be running into CPU contention due to a high number of tablets. I am trying to check the Temporal docs for the schema.

I am assuming this is still a small (2 vCPU, 8 GB RAM) pod config.

remeq commented 11 months ago

@yogendra Sure, I can join Slack or Zoom or assist in any possible way. Please just note that so far it seems I solved the crash by running the Redis setup mentioned above. Perhaps it's too early to say, but since yesterday YB hasn't crashed (normally it was crashing within a few hours).

shinji62 commented 11 months ago

@remeq I sent the core dump to the DB team, so I am waiting for their response. Sorry, it does take some time, as I was having trouble opening the core dump.

remeq commented 11 months ago

@shinji62 Please read the previous posts. As of now I can report that YB stopped crashing after I created the Redis tables using the command I mentioned above. My conclusion is that the problem is with the YB Helm distribution, which after the initial installation expects the Redis tables to exist.

shinji62 commented 11 months ago

@remeq I did read your previous post, but I think it is still important to understand what is going on.

We found that a flushdb command is sent to YEDIS, and the backend then crashes because the flushdb is run against a non-existent table.

Now, I am not sure what triggered the flushdb; we are still looking into it.

remeq commented 10 months ago

@shinji62 No problem. Let me know should you need any assistance from our side. Our YB has now been running without crashing for 14 days already.