Closed PeeK1e closed 2 years ago
Do you have any idea why this might be happening?
I am pretty sure there are details in postgres logs.
Oh, I forgot to mention: the logs at ~/pgdata/pgroot/pg_log/ are empty, and there is also nothing at /var/log/postgresql. Are there any other locations?
Logs should be written to /home/postgres/pgdata/pgroot/pg_log/. If they are empty, you have to figure out what is wrong with permissions.
I've spent some time looking around and trying to find an issue with the permissions for the postgres user on the pgdata folder inside the container (even with the folder set to 777). I also ran su postgres -c "/usr/lib/postgresql/14/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start", which fails with a su: System Error, and still no log files or anything... I tried to create a test container to see if I could write any data into a PVC with the same storage class, but there is no issue with that either.
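(For anyone debugging the same thing, a quick way to check ownership and free space inside the pod from the outside is roughly the following; pod name and namespace are placeholders:)

kubectl exec -n <namespace> <pod-name> -- ls -la /home/postgres/pgdata/pgroot/
kubectl exec -n <namespace> <pod-name> -- id postgres
kubectl exec -n <namespace> <pod-name> -- df -h /home/postgres/pgdata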
On another note, it seems the endpoints are not available (which is expected), but the endpoint slices are not being created, or only partially created, for the spilo containers. Could that be a reason for the database not starting up?
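(As a side note, one way to see which endpoints and endpoint slices actually exist for the cluster is something like the commands below; the cluster-name label is an assumption based on the labels the operator usually applies:)

kubectl get endpoints,endpointslices -n <namespace> -l cluster-name=<cluster-name>
# or simply grep the cluster name in the full listing
kubectl get endpointslices -n <namespace> | grep <cluster-name>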
Hi, I have exactly the same problem with one of my clusters... Note that it was working perfectly, and after a simple restart (to update the config file) it is now impossible to relaunch the cluster. I tried to recreate a different cluster with different Spilo versions, spilo-13:2.1-p1 or spilo-14:2.1-p5, same problem. Very strange... Does anyone have any idea, please?
My "same log":
sunr-belenos-dev-0   belenos-dev   0   StatefulSet sunr-belenos-dev   scw-ams-dev-1-belenos-1-028a116932b04a84afbf80   Burstable   27m   Running
sunr-belenos-dev-1   belenos-dev   0   StatefulSet sunr-belenos-dev   scw-ams-dev-1-belenos-1-028a116932b04a84afbf80   Burstable   27m   Running

Namespace: belenos-dev
Owner: StatefulSet sunr-belenos-dev
Pod: sunr-belenos-dev-0
Container: postgres
2022-04-19 09:14:37,198 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-04-19 09:14:39,211 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2022-04-19 09:14:39,214 - bootstrapping - INFO - No meta-data available for this provider
2022-04-19 09:14:39,215 - bootstrapping - INFO - Looks like your running local
2022-04-19 09:14:39,371 - bootstrapping - INFO - Configuring pgbouncer
2022-04-19 09:14:39,372 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-04-19 09:14:39,372 - bootstrapping - INFO - Configuring wal-e
2022-04-19 09:14:39,372 - bootstrapping - INFO - Configuring patroni
2022-04-19 09:14:39,390 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-04-19 09:14:39,394 - bootstrapping - INFO - Configuring standby-cluster
2022-04-19 09:14:39,394 - bootstrapping - INFO - Configuring pgqd
2022-04-19 09:14:39,395 - bootstrapping - INFO - Configuring pam-oauth2
2022-04-19 09:14:39,396 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-04-19 09:14:39,397 - bootstrapping - INFO - Configuring log
2022-04-19 09:14:39,397 - bootstrapping - INFO - Configuring crontab
2022-04-19 09:14:39,397 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-04-19 09:14:39,398 - bootstrapping - INFO - Configuring certificate
2022-04-19 09:14:39,398 - bootstrapping - INFO - Generating ssl certificate
2022-04-19 09:14:39,483 - bootstrapping - INFO - Configuring bootstrap
2022-04-19 09:14:40,983 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
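(For reference: the last warning in this log only means Patroni disables its bypass_api_service optimization because it cannot GET the kubernetes endpoint in the default namespace; it does not by itself stop the cluster from starting. If you want to silence it, an RBAC sketch like the one below would grant that access; the role/binding names and the postgres-pod service account are assumptions and may need adjusting:)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: patroni-read-kubernetes-endpoint   # name is arbitrary
  namespace: default
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["kubernetes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: patroni-read-kubernetes-endpoint
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: patroni-read-kubernetes-endpoint
subjects:
- kind: ServiceAccount
  name: postgres-pod        # assumption: the service account used by the Spilo pods
  namespace: belenos-dev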
Would you mind sharing your network setup with us, e.g. if you are using HA-Proxy or Cilium? Trying to figure out if the network may be causing trouble...
@PeeK1e sure, I am on Kubernetes 1.23... which commands should I run?
Networking should be in the kube-system namespace, so the output of kubectl get pods -n kube-system
may be useful. Also look out for pods like cilium etc. that may be running in other namespaces. Thanks :)
We are running on a mix of HA-Proxy and Cilium.
Thanks, here is a list of the pods related to the cluster or the networking :
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-cd559cf6d-jcr58 1/1 Running 0 4h15m
kube-system calico-node-2mctm 1/1 Running 0 4h19m
kube-system calico-node-kz742 1/1 Running 0 4h9m
kube-system calico-node-pqxl9 1/1 Running 0 4h15m
kube-system coredns-64cfd66f7-kdwhz 1/1 Running 0 4h15m
kube-system csi-node-g7n2p 2/2 Running 0 4h7m
kube-system csi-node-khcg4 2/2 Running 0 4h13m
kube-system csi-node-qr99t 2/2 Running 0 4h19m
kube-system konnectivity-agent-52lhb 1/1 Running 2 (4h9m ago) 49d
kube-system konnectivity-agent-pdgjf 1/1 Running 2 (4h19m ago) 49d
kube-system konnectivity-agent-zmlbh 1/1 Running 1 (4h15m ago) 4d3h
kube-system kube-proxy-hchh6 1/1 Running 0 4h13m
kube-system kube-proxy-j99rm 1/1 Running 0 4h16m
kube-system kube-proxy-r2z8r 1/1 Running 0 4h7m
kube-system metrics-server-74548cc7d5-5bjvq 1/1 Running 0 4h15m
kube-system node-problem-detector-8m4l8 1/1 Running 1 (4h15m ago) 4d3h
kube-system node-problem-detector-h7rhz 1/1 Running 5 (4h9m ago) 147d
kube-system node-problem-detector-kvcc8 1/1 Running 6 (4h19m a
--- the operator
postgresql postgres-operator-7986d7df47-ttzxm 1/1 Running 0 47m
postgresql postgres-operator-ui-c5f47f7c4-m9qgd 1/1 Running 0 55m
--- my two spilo nodes
belenos-ams sunr-belenos-clone-0 1/1 Running 0 41m
belenos-ams sunr-belenos-clone-1 1/1 Running 0 40m
I don't see pods like cilium...
@PeeK1e does that ring a bell for you? Any other ideas to investigate?
Sadly not :/
I ran into this, and what helped me solve the issue was adjusting the log level to see "DEBUG" logs. It turned out my problem was that the command to start Postgres attempted to use too much shared_buffers memory: https://dba.stackexchange.com/questions/184951/memory-errors-on-startup-in-postgresql-9-6-log-map-hugetlb-failed
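(For others who need the same visibility: Patroni reads its log level from the environment, so one way to get DEBUG output in a Spilo pod is to add an environment variable to the Spilo container, however you manage your pod spec, and then follow the container log; a sketch with placeholder names:)

# Environment variable picked up by Patroni (set it on the Spilo container):
#   PATRONI_LOG_LEVEL=DEBUG
# Then restart the pod and watch the container log:
kubectl logs -n <namespace> <pod-name> -f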
This seems to have done the trick! Thank you very much! The nodes in other clusters had way less RAM than the nodes in the problematic cluster. Increasing the limit to a couple of gigs let the database start up again.
Thanks very much @LiuShuaiyi, it appears indeed that the max_locks_per_transaction x max_connections setting was too high for the system.
What is weird is that it was working previously...
postgresql:
  version: "13"
  parameters:
    max_locks_per_transaction: "100000" # lowered to 10000 worked
    max_connections: "200"
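(To put numbers on that: the lock table is sized for max_locks_per_transaction x (max_connections + max_prepared_transactions) slots, so the values above force PostgreSQL to reserve a large chunk of shared memory at startup. The ~270 bytes per slot used below is only an approximation and varies by version and platform:)

# Rough back-of-the-envelope estimate, not an exact figure
max_locks_per_transaction=100000
max_connections=200
bytes_per_slot=270                      # assumption / approximation
slots=$((max_locks_per_transaction * max_connections))
echo "$slots lock slots, ~$((slots * bytes_per_slot / 1024 / 1024)) MiB of shared memory"
# prints: 20000000 lock slots, ~5149 MiB of shared memory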
It seems Spilo (or postgres-operator?) sets too high a value for postgresql.parameters.shared_buffers in postgres.yml. On one of my servers it's shared_buffers: 51552MB even though the Pod has a configured memory limit of 1Gi. So I think this issue should be reopened.
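(The likely reason for the discrepancy: on a cgroup v2 node the container memory limit is exposed at a different path, so anything that only reads the cgroup v1 file falls back to the host's total memory. You can check which one your pod sees by running something like this inside the container:)

# cgroup v1 path for the memory limit
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null
# cgroup v2 (unified hierarchy) path
cat /sys/fs/cgroup/memory.max 2>/dev/null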
I think this deserves a separate issue. My original issue is solved and hasn't occurred since I increased the memory limit for the postgres pod. Have you tried decreasing the locks and connections as mentioned before and checking what happens?
It seems that people are slowly starting to switch to cgroup v2. I'll prepare the fix.
@CyberDem0n Thanks. I think it will be very helpful for people who run spilo in Kubernetes, where hosts can have orders of magnitude more memory than individual Pods.
@CyberDem0n thanks. Could you re-open this issue and close it when the fix is merged? This makes it easier for watchers like us to track this.
@haslersn there is a fix - https://github.com/zalando/spilo/releases/tag/2.1-p6 "Compatibility with cgroup v2 when figuring out memory limit and auto-calculating shared_buffers size."
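(If anyone wants to pick the fixed image up through the operator: the Spilo image is set in the operator configuration. Assuming the ConfigMap-based configuration, it would look roughly like the excerpt below; the key is named differently if you use the OperatorConfiguration CRD, and the registry/tag should be adjusted to the release you need:)

# Excerpt from the postgres-operator ConfigMap (sketch)
data:
  docker_image: registry.opensource.zalan.do/acid/spilo-14:2.1-p6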
Hello,
We're trying to deploy Harbor on a Kubernetes cluster using the Zalando Postgres Operator. We deployed the operator and it seems to be running fine; it detects when a new cluster is to be managed.
When we deploy a PostgreSQL HA cluster with Spilo (image -> spilo-14:2.1-p3) and the following config:
it shows both pods are running, but when looking at the logs we can see that it fails to bootstrap the cluster, like here:
The logs say that postmaster is not running, and when trying to get the PID of postmaster it is indeed missing.
Do you have any idea why this might be happening? We're running on Kubernetes 1.23.5. I should also note that we have a very similar setup on a different cluster where everything works fine; we can't figure out why it doesn't deploy on this cluster though.
If you need more information, please let me know.