zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.32k stars 980 forks source link

Single master fails to restart #2042

Open sumukharaj opened 2 years ago

sumukharaj commented 2 years ago

I have a single member cluster (master only) deployed on opnstack, kubernetes.

  1. After successful installation when i try to edit any parameter in the postgresql CRD configuration, the pod restarts but the database does not start up. It continues to reject connections and repeatedly logs the same data (like in the pod logs below) The pg_log complains of missing history files.
  2. Also for a simple restart without any configuration changes I reduce the "numberOfInstances" in the postgresql config to 0 and again to 1. Is this the right approach or is there a better one?

Please, answer some short questions which should help us to understand your problem / question better?

Some general remarks when posting a bug report:

Operator logs

`time="2022-09-15T11:24:33Z" level=info msg="SYNC event has been queued" cluster-name=insights-gs-execute-staging/ai-postgres pkg=controller worker=7

time="2022-09-15T11:24:33Z" level=info msg="syncing of the cluster started" cluster-name=insights-gs-execute-staging/ai-postgres pkg=controller worker=7 time="2022-09-15T11:24:33Z" level=debug msg="team API is disabled" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:33Z" level=debug msg="team API is disabled" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:33Z" level=info msg="syncing secrets" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:35Z" level=debug msg="secret insights-gs-execute-staging/zalando.ai-postgres.credentials.postgresql.acid.zalan.do already exists, fetching its password" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:38Z" level=debug msg="secret insights-gs-execute-staging/standby.ai-postgres.credentials.postgresql.acid.zalan.do already exists, fetching its password" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:41Z" level=debug msg="secret insights-gs-execute-staging/postgres.ai-postgres.credentials.postgresql.acid.zalan.do already exists, fetching its password" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:41Z" level=debug msg="syncing master service" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:45Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:45Z" level=debug msg="syncing replica service" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:48Z" level=debug msg="final load balancer source ranges as seen in a service spec (not necessarily applied): [\"127.0.0.1/32\"]" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:48Z" level=debug msg="syncing volumes using \"pvc\" storage resize mode" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:49Z" level=info msg="volume claims do not require changes" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:49Z" level=debug msg="syncing statefulsets" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:54Z" level=info msg="Mount additional volumes: [{Name:empty MountPath:/opt/empty SubPath: TargetContainers:[all] VolumeSource:{HostPath:nil EmptyDir:&EmptyDirVolumeSource{Medium:,SizeLimit:,} GCEPersistentDisk:nil AWSElasticBlockStore:nil GitRepo:nil Secret:nil NFS:nil ISCSI:nil Glusterfs:nil PersistentVolumeClaim:nil RBD:nil FlexVolume:nil Cinder:nil CephFS:nil Flocker:nil DownwardAPI:nil FC:nil AzureFile:nil ConfigMap:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil Projected:nil PortworxVolume:nil ScaleIO:nil StorageOS:nil CSI:nil Ephemeral:nil}}]" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:56Z" level=debug msg="making GET http request: http://10.42.14.232:8008/config" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:56Z" level=debug msg="syncing pod disruption budgets" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:56Z" level=debug msg="syncing roles" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:24:57Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:25:12Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:25:27Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:25:42Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:25:57Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:26:12Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:26:27Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:26:42Z" level=warning msg="could not connect to Postgres database: dial tcp 10.43.249.26:5432: connect: connection refused" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:26:42Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=insights-gs-execute-staging/ai-postgres pkg=cluster time="2022-09-15T11:26:42Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=insights-gs-execute-staging/ai-postgres pkg=controller worker=7 time="2022-09-15T11:34:33Z" level=info msg="SYNC event has been queued" cluster-name=insights-gs-execute-staging/ai-postgres pkg=controller worker=7 `

Pod logs

`2022-09-15 10:27:50,656 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) 2022-09-15 10:27:52,047 - bootstrapping - INFO - Looks like your running openstack 2022-09-15 10:27:52,168 - bootstrapping - INFO - Configuring pgbouncer 2022-09-15 10:27:52,169 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping 2022-09-15 10:27:52,169 - bootstrapping - INFO - Configuring wal-e 2022-09-15 10:27:52,169 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT 2022-09-15 10:27:52,170 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT 2022-09-15 10:27:52,171 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN 2022-09-15 10:27:52,187 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR 2022-09-15 10:27:52,187 - bootstrapping - INFO - Configuring crontab 2022-09-15 10:27:52,187 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability 2022-09-15 10:27:52,214 - bootstrapping - INFO - Configuring standby-cluster 2022-09-15 10:27:52,214 - bootstrapping - INFO - Configuring pgqd 2022-09-15 10:27:52,214 - bootstrapping - INFO - Configuring bootstrap 2022-09-15 10:27:52,214 - bootstrapping - INFO - Configuring patroni 2022-09-15 10:27:52,224 - bootstrapping - INFO - Writing to file /run/postgres.yml 2022-09-15 10:27:52,224 - bootstrapping - INFO - Configuring log 2022-09-15 10:27:52,224 - bootstrapping - INFO - Configuring certificate 2022-09-15 10:27:52,224 - bootstrapping - INFO - Generating ssl self-signed certificate 2022-09-15 10:27:52,290 - bootstrapping - INFO - Configuring pam-oauth2 2022-09-15 10:27:52,291 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql 2022-09-15 10:27:54,130 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'. 2022-09-15 10:27:54,191 INFO: No PostgreSQL configuration items changed, nothing to reload. 2022-09-15 10:27:54,264 WARNING: Postgresql is not running. 2022-09-15 10:27:54,265 INFO: Lock owner: None; I am ai-postgres-0 2022-09-15 10:27:54,276 INFO: pg_controldata: pg_control version number: 1100 Catalog version number: 201809051 Database system identifier: 6925770682722574383 Database cluster state: shut down in recovery pg_control last modified: Thu Sep 15 06:04:08 2022 Latest checkpoint location: DC0F/31755350 Latest checkpoint's REDO location: DC0F/31755350 Latest checkpoint's REDO WAL file: 000000020000DC0F00000031 Latest checkpoint's TimeLineID: 2 Latest checkpoint's PrevTimeLineID: 2 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:6236252 Latest checkpoint's NextOID: 1821843 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 561 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 0 Latest checkpoint's newestCommitTsXid: 0 Time of latest checkpoint: Fri Sep 2 07:14:51 2022 Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: logical wal_log_hints setting: on max_connections setting: 200 max_worker_processes setting: 8 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: e1f4672423ecf0ef7f60b88c1eb5650579a43f927807d8be1d19881b5387f934

2022-09-15 10:27:54,329 INFO: Lock owner: None; I am ai-postgres-0 2022-09-15 10:27:54,438 INFO: starting as a secondary 2022-09-15 10:27:54,502 WARNING: Removing unexpected parameter=cron_database_name value=postgres from the config 2022-09-15 10:27:55.020 GMT [71] LOG: skipping missing configuration file "/var/pv/data/user.conf" 2022-09-15 10:27:55.030 GMT [71] LOG: skipping missing configuration file "/var/pv/data/user.conf" 2022-09-15 10:27:55,079 INFO: postmaster pid=71 /var/run/postgresql:5432 - no response 2022-09-15 10:27:55 UTC [71]: [3-1] 6322fe2b.47 0 LOG: listening on IPv4 address "0.0.0.0", port 5432 2022-09-15 10:27:55 UTC [71]: [4-1] 6322fe2b.47 0 LOG: listening on IPv6 address "::", port 5432 2022-09-15 10:27:55 UTC [71]: [5-1] 6322fe2b.47 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2022-09-15 10:27:56 UTC [71]: [6-1] 6322fe2b.47 0 LOG: redirecting log output to logging collector process 2022-09-15 10:27:56 UTC [71]: [7-1] 6322fe2b.47 0 HINT: Future log output will appear in directory "../pg_log". /var/run/postgresql:5432 - rejecting connections /var/run/postgresql:5432 - rejecting connections /var/run/postgresql:5432 - rejecting connections /var/run/postgresql:5432 - no response 2022-09-15 10:28:01,024 WARNING: Postgresql is not running. 2022-09-15 10:28:01,024 INFO: Lock owner: None; I am ai-postgres-0 `

pg_log

ERROR: 2022/09/15 00:00:35.585523 Archive '00000005.history' does not exist. 2022-09-15 00:00:44 UTC [25600]: [8-1] 63226b28.6400 0 LOG: ending log output to stderr 2022-09-15 00:00:44 UTC [25600]: [9-1] 63226b28.6400 0 HINT: Future log output will go to log destination "csvlog". ERROR: 2022/09/15 00:00:47.141790 Archive '00000005.history' does not exist. 2022-09-15 00:00:54 UTC [25729]: [8-1] 63226b32.6481 0 LOG: ending log output to stderr 2022-09-15 00:00:54 UTC [25729]: [9-1] 63226b32.6481 0 HINT: Future log output will go to log destination "csvlog". ERROR: 2022/09/15 00:00:58.157736 Archive '00000005.history' does not exist. 2022-09-15 00:01:05 UTC [25859]: [8-1] 63226b3d.6503 0 LOG: ending log output to stderr 2022-09-15 00:01:05 UTC [25859]: [9-1] 63226b3d.6503 0 HINT: Future log output will go to log destination "csvlog". ERROR: 2022/09/15 00:01:08.793950 Archive '00000005.history' does not exist.

postgres configuration

`bootstrap: dcs: loop_wait: 10 maximum_lag_on_failover: 33554432 postgresql: parameters: archive_mode: 'true' archive_timeout: '160' autovacuum_analyze_scale_factor: '0.02' autovacuum_max_workers: '12' autovacuum_vacuum_cost_delay: 5ms autovacuum_vacuum_scale_factor: '0.05' checkpoint_completion_target: '0.9' checkpoint_timeout: 1d cron_database_name: postgres default_statistics_target: '500' effective_cache_size: 24GB effective_io_concurrency: '200' hot_standby: 'on' log_autovacuum_min_duration: 0 log_checkpoints: 'on' log_connections: 'on' log_disconnections: 'on' log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h ' log_lock_waits: 'on' log_min_duration_statement: 500 log_statement: all log_temp_files: 0 maintenance_work_mem: 2GB max_connections: '200' max_parallel_maintenance_workers: '4' max_parallel_workers: '8' max_parallel_workers_per_gather: '4' max_replication_slots: 10 max_wal_senders: '90' max_wal_size: 16GB max_worker_processes: '8' min_wal_size: 4GB random_page_cost: '1.1' tcp_keepalives_idle: 900 tcp_keepalives_interval: 100 track_functions: all vacuum_cost_limit: '2400' wal_buffers: 16MB wal_compression: 'on' wal_keep_segments: '32' wal_level: logical wal_log_hints: 'on' work_mem: 209kB use_pg_rewind: true use_slots: true retry_timeout: 10 ttl: 30 initdb:

leonz commented 1 year ago

hi @sumukharaj I seem to be having the same problems. Did you figure it out?

Jan-M commented 1 year ago

Looks like Postgres may be stuck in recovery mode and Patroni not deeming it healthy enough. It's a bit odd for single node cluster.

Check what Postgres is doing.

You can pause Patroni and start Postgres manually after making sure it is not running. See if this yields different results.

If you manage to get Postgres running you can shut down clean and start Patroni again.

sumukharaj commented 1 year ago

This issue was resolve in the patroni thread Check this out. https://github.com/zalando/patroni/issues/2415#issue-1377884062