zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker
Apache License 2.0

Postmaster not starting #716

Closed PeeK1e closed 2 years ago

PeeK1e commented 2 years ago

Hello,

We're trying to deploy Harbor on a Kubernetes cluster using the Zalando postgres-operator. We deployed the operator; it seems to be running fine and detects when a new cluster is to be managed.

When we deploy a PostgreSQL HA cluster with Spilo (image: spilo-14:2.1-p3) and the following config:

apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  creationTimestamp: "2022-04-11T08:39:32Z"
  generation: 1
  name: harbor-postgresql
  namespace: vanillastack-harbor
  resourceVersion: "197604124"
  uid: <redacted>
spec:
  databases:
    harbor: harbor
    notary_server: harbor
    notary_signer: harbor
    registry: harbor
  enableLogicalBackup: true
  logicalBackupSchedule: 30 */12 * * *
  numberOfInstances: 2
  postgresql:
    parameters:
      max_connections: "400"
    version: "14"
  resources:
    limits:
      cpu: 750m
      memory: 1.5Gi
    requests:
      cpu: 100m
      memory: 1Gi
  teamId: harbor
  users:
    harbor: []
    postgres:
    - superuser
    - createdb
  volume:
    size: 20Gi

it shows both pods as running, but the logs show that the cluster fails to bootstrap:

2022-04-11 08:42:52,811 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-04-11 08:42:52,822 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-04-11 08:42:52,824 INFO: Lock owner: None; I am harbor-postgresql-0
2022-04-11 08:42:52,842 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    /usr/lib/postgresql/14/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2022-04-11 08:42:53,887 INFO: postmaster pid=222
2022-04-11 08:42:53 UTC [222]: [1-1] 6253ea0d.de 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2022-04-11 08:42:53 UTC [222]: [2-1] 6253ea0d.de 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
/var/run/postgresql:5432 - no response
/var/run/postgresql:5432 - no response
2022-04-11 08:42:55,973 ERROR: postmaster is not running
2022-04-11 08:42:55,978 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-04-11 08:42:55,985 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2022-04-11-08-42-55
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 171, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1471, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1345, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1238, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1231, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
/run/service/patroni: sleeping 120 seconds

The logs say the postmaster is not running, and when checking the process list the postmaster is indeed missing:

    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.s
      7 ?        S      0:00 /bin/sh /launch.sh
     31 ?        S      0:00 /usr/bin/runsvdir -P /etc/service
     32 ?        Ss     0:00 runsv pgqd
     33 ?        Ss     0:00 runsv patroni
     35 ?        S      0:00 /bin/bash /scripts/patroni_wait.sh --role master -- /usr
    288 pts/0    Ss     0:00 bash
    312 ?        S      0:00 sleep 60
    317 pts/0    R+     0:00 ps -ax

Do you have any idea why this might be happening? We're running Kubernetes 1.23.5. I should also note that we have a very similar setup on a different cluster where everything works fine; we can't figure out why it doesn't deploy on this cluster.

If you need more information please let me know.

CyberDem0n commented 2 years ago

Do you have any idea why this might be happening?

I am pretty sure there are details in the Postgres logs.

PeeK1e commented 2 years ago

I am pretty sure there are details in the Postgres logs.

Oh, I forgot to mention: the logs at ~/pgdata/pgroot/pg_log/ are empty, and there is nothing at /var/log/postgresql either. Are there any other locations?

CyberDem0n commented 2 years ago

Logs should be written to /home/postgres/pgdata/pgroot/pg_log/. If they are empty, you have to figure out what is wrong with the permissions.
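
For example, something like this should show the ownership and permissions from inside the pod (the pod name and namespace are taken from the manifest above; adjust to your cluster):

    kubectl exec -it harbor-postgresql-0 -n vanillastack-harbor -- ls -ld /home/postgres/pgdata/pgroot /home/postgres/pgdata/pgroot/pg_log
    kubectl exec -it harbor-postgresql-0 -n vanillastack-harbor -- id postgres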

PeeK1e commented 2 years ago

I've spent some time looking around, trying to find a permissions issue inside the container.

I also created a test container to check whether I could write data to a PVC with the same storage class, and that works without issues.

On another note, it seems the Endpoints are not available (which is expected), but the EndpointSlices for the Spilo containers are not being created, or only partially created. Could that be a reason for the database not starting up?
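
A quick way to list both resources is something like this (namespace taken from our manifest above):

    kubectl get endpoints,endpointslices -n vanillastack-harbor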

warielon commented 2 years ago

Hi, I have exactly the same problem with one of my clusters. Note that it was working perfectly, and after a simple restart (to update the config file) it is now impossible to relaunch the cluster. I tried to recreate a different cluster with different Spilo versions (spilo-13:2.1-p1, spilo-14:2.1-p5), same problem. Very strange... Does anyone have an idea?

My "same log" :

(Dashboard view: pods sunr-belenos-dev-0 and sunr-belenos-dev-1 in namespace belenos-dev, owned by StatefulSet sunr-belenos-dev, both Running; the log below is from pod sunr-belenos-dev-0, container postgres.)
2022-04-19 09:14:37,198 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-04-19 09:14:39,211 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2022-04-19 09:14:39,214 - bootstrapping - INFO - No meta-data available for this provider
2022-04-19 09:14:39,215 - bootstrapping - INFO - Looks like your running local
2022-04-19 09:14:39,371 - bootstrapping - INFO - Configuring pgbouncer
2022-04-19 09:14:39,372 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-04-19 09:14:39,372 - bootstrapping - INFO - Configuring wal-e
2022-04-19 09:14:39,372 - bootstrapping - INFO - Configuring patroni
2022-04-19 09:14:39,390 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-04-19 09:14:39,394 - bootstrapping - INFO - Configuring standby-cluster
2022-04-19 09:14:39,394 - bootstrapping - INFO - Configuring pgqd
2022-04-19 09:14:39,395 - bootstrapping - INFO - Configuring pam-oauth2
2022-04-19 09:14:39,396 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-04-19 09:14:39,397 - bootstrapping - INFO - Configuring log
2022-04-19 09:14:39,397 - bootstrapping - INFO - Configuring crontab
2022-04-19 09:14:39,397 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-04-19 09:14:39,398 - bootstrapping - INFO - Configuring certificate
2022-04-19 09:14:39,398 - bootstrapping - INFO - Generating ssl certificate
2022-04-19 09:14:39,483 - bootstrapping - INFO - Configuring bootstrap
2022-04-19 09:14:40,983 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
PeeK1e commented 2 years ago

Hi, I have exactly the same problem with one of my clusters. Note that it was working perfectly, and after a simple restart (to update the config file) it is now impossible to relaunch the cluster.

Would you mind sharing your network setup with us, e.g. whether you are using HAProxy or Cilium? I'm trying to figure out whether the network may be causing trouble...

warielon commented 2 years ago

@PeeK1e sure, I am on Kubernetes 1.23. What commands should I run?

PeeK1e commented 2 years ago

Networking components should be in the kube-system namespace, so the output of kubectl get pods -n kube-system may be useful. Also look out for pods like cilium etc. that may be running in other namespaces. Thanks :)

We are running on a mix of HA-Proxy and Cilium.

warielon commented 2 years ago

Thanks, here is a list of the pods related to the cluster or to networking:

NAMESPACE                      NAME                                                 READY   STATUS      RESTARTS         AGE

kube-system                    calico-kube-controllers-cd559cf6d-jcr58              1/1     Running     0                4h15m
kube-system                    calico-node-2mctm                                    1/1     Running     0                4h19m
kube-system                    calico-node-kz742                                    1/1     Running     0                4h9m
kube-system                    calico-node-pqxl9                                    1/1     Running     0                4h15m
kube-system                    coredns-64cfd66f7-kdwhz                              1/1     Running     0                4h15m
kube-system                    csi-node-g7n2p                                       2/2     Running     0                4h7m
kube-system                    csi-node-khcg4                                       2/2     Running     0                4h13m
kube-system                    csi-node-qr99t                                       2/2     Running     0                4h19m
kube-system                    konnectivity-agent-52lhb                             1/1     Running     2 (4h9m ago)     49d
kube-system                    konnectivity-agent-pdgjf                             1/1     Running     2 (4h19m ago)    49d
kube-system                    konnectivity-agent-zmlbh                             1/1     Running     1 (4h15m ago)    4d3h
kube-system                    kube-proxy-hchh6                                     1/1     Running     0                4h13m
kube-system                    kube-proxy-j99rm                                     1/1     Running     0                4h16m
kube-system                    kube-proxy-r2z8r                                     1/1     Running     0                4h7m
kube-system                    metrics-server-74548cc7d5-5bjvq                      1/1     Running     0                4h15m
kube-system                    node-problem-detector-8m4l8                          1/1     Running     1 (4h15m ago)    4d3h
kube-system                    node-problem-detector-h7rhz                          1/1     Running     5 (4h9m ago)     147d
kube-system                    node-problem-detector-kvcc8                          1/1     Running     6 (4h19m a

--- the operator
postgresql                     postgres-operator-7986d7df47-ttzxm                   1/1     Running     0                47m
postgresql                     postgres-operator-ui-c5f47f7c4-m9qgd                 1/1     Running     0                55m
-- my two spilo nodes
belenos-ams                    sunr-belenos-clone-0                                 1/1     Running     0                41m
belenos-ams                    sunr-belenos-clone-1                                 1/1     Running     0                40m

I don't see pods like cilium...

warielon commented 2 years ago

@PeeK1e does that ring a bell for you? Any other ideas for investigation?

PeeK1e commented 2 years ago

@PeeK1e does that ring a bell for you? Any other ideas for investigation?

Sadly not :/

LiuShuaiyi commented 2 years ago

I ran into this, and what helped me solve the issue was adjusting the log level to see DEBUG output. It turned out my problem was that the command to start Postgres attempted to use too much shared_buffers memory: https://dba.stackexchange.com/questions/184951/memory-errors-on-startup-in-postgresql-9-6-log-map-hugetlb-failed
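
For reference, one way to raise verbosity is through the manifest's parameters block (the same mechanism used for max_connections above); log_min_messages is a standard PostgreSQL setting, so this is only a diagnostic sketch and may not catch failures that happen before the postmaster is fully up:

  postgresql:
    parameters:
      log_min_messages: debug1   # more verbose server logging; the default is warning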

PeeK1e commented 2 years ago

I ran into this, and what helped me solve the issue was adjusting the log level to see DEBUG output. It turned out my problem was that the command to start Postgres attempted to use too much shared_buffers memory: https://dba.stackexchange.com/questions/184951/memory-errors-on-startup-in-postgresql-9-6-log-map-hugetlb-failed

This seems to have done the trick, thank you very much! The nodes in our other clusters had far less RAM than the nodes in the problematic cluster. Increasing the memory limit to a couple of gigabytes let the database start up again.
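
For anyone hitting the same thing, the change boils down to raising the memory limit in the cluster manifest's resources block; the values below are illustrative, not the exact ones we used:

  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      cpu: 750m
      memory: 3Gi   # raised from 1.5Gi so the auto-calculated shared_buffers fits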

warielon commented 2 years ago

Thanks very much @LiuShuaiyi, it appears indeed that the max_locks_per_transaction x max_connections setting was too high for the system (the lock table lives in shared memory and is sized roughly as max_locks_per_transaction x (max_connections + max_prepared_transactions) slots, so the product drives shared memory usage). What is weird is that it was working previously...

  postgresql:
    version: "13"
    parameters:
      max_locks_per_transaction: "100000"  # lowering this to 10000 made it work
      max_connections: "200"
haslersn commented 2 years ago

It seems Spilo (or the postgres-operator?) sets too high a value for postgresql.parameters.shared_buffers in postgres.yml. On one of my servers it is

    shared_buffers: 51552MB

even though the Pod has a configured memory limit of 1Gi. So I think this issue should be reopened.
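
For anyone who wants to check their own pods: the bootstrap log earlier in this thread shows the generated configuration being written to /run/postgres.yml, so something like the following should reveal the computed value (the pod name is a placeholder):

    kubectl exec -it <spilo-pod> -- grep shared_buffers /run/postgres.yml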

PeeK1e commented 2 years ago

I think this deserves a separate issue. My original problem is solved and hasn't occurred since I increased the memory limit for the Postgres pod. Have you tried decreasing the locks and connections settings as mentioned above and checking what happens?

CyberDem0n commented 2 years ago

It seems that people are slowly starting to switch to cgroup v2. I'll prepare a fix.
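
For context, whether a node is on cgroup v2 can be checked with a generic Linux command (nothing Spilo-specific); on cgroup v2 the filesystem type of /sys/fs/cgroup is cgroup2fs:

    stat -fc %T /sys/fs/cgroup/
    # cgroup2fs -> cgroup v2 (unified hierarchy)
    # tmpfs     -> cgroup v1 (legacy hierarchy)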

LiuShuaiyi commented 2 years ago

@CyberDem0n Thanks. I think it will be very helpful for people who run spilo in Kubernetes, where hosts can have orders of magnitude more memory than individual Pods.

haslersn commented 2 years ago

@CyberDem0n thanks. Could you re-open this issue and close it when the fix is merged? This makes it easier for watchers like us to track this.

FactorT commented 2 years ago

@haslersn there is a fix: https://github.com/zalando/spilo/releases/tag/2.1-p6 ("Compatibility with cgroup v2 when figuring out memory limit and auto-calculating shared_buffers size.")
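
To pick up the fix, the Spilo image has to be bumped to the 2.1-p6 tag. A sketch of the manifest change, assuming the standard Zalando registry path and that your operator version supports a per-cluster dockerImage override (otherwise the image is set via docker_image in the operator configuration):

  spec:
    dockerImage: registry.opensource.zalan.do/acid/spilo-14:2.1-p6  # tag from the linked release; registry path is an assumption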