Open · Martin-Hogge opened this issue 3 years ago
Seems similar to https://github.com/zalando/postgres-operator/issues/1201
I tried a fresh install with operator 1.6.2. I created a new cluster, inserted some data, and tried to clone it, but it keeps failing with the same issue:
File "/usr/local/bin/patroni", line 33, in
sys.exit(load_entry_point('patroni==2.0.2', 'console_scripts', 'patroni')()) File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 170, in main return patroni_main() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 138, in patroni_main abstract_main(Patroni, schema) File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main controller.run() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 108, in run super(Patroni, self).run() File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run self._run_cycle() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 111, in _run_cycle logger.info(self.ha.run_cycle()) File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1457, in run_cycle info = self._run_cycle() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1351, in _run_cycle return self.post_bootstrap() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1247, in post_bootstrap self.cancel_initialization() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1240, in cancel_initialization raise PatroniFatalException('Failed to bootstrap cluster') patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
I encounter the same issue with a wal-g clone/restore via S3 (operator v1.6.2). The exit code 1 is caused by the following error during postgres recovery:
consistent recovery state reached at 2/EEAFBED0
redo done at 2/EEAFBED0
last completed transaction was at log time 2021-04-14 23:52:18.45155+00
FATAL,XX000,"recovery ended before configured recovery target was reached
startup process (PID 166) exited with exit code 1
This happens all the time during S3 restores (when clone.timestamp is set in the CRD manifest) and never when cloning a running cluster.
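For context, the clone section of the postgresql manifest that triggers this S3/PITR path looks roughly like the sketch below; the cluster names match the logs further down, while the uid and timestamp are placeholders. As far as I can tell, the operator turns spec.clone.timestamp into the recovery_target_time visible in the generated postgresql.conf that follows:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: test-staging-db-core-clone31
spec:
  clone:
    # name of the source cluster whose WAL-G backups in S3 should be used
    cluster: "test-staging-db-core"
    # uid of the source cluster (part of the S3 prefix); placeholder value
    uid: "00000000-0000-0000-0000-000000000000"
    # timezone-aware target time; ends up as recovery_target_time below
    timestamp: "2030-04-15T09:34:39.835+00:00"
```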
root@test-staging-db-core-clone31-0:/home/postgres# cat /home/postgres/pgdata/pgroot/data/postgresql.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'
archive_command = 'envdir "/run/etc/wal-e.d/env" wal-g wal-push "%p"'
archive_mode = 'on'
archive_timeout = '1800s'
autovacuum_analyze_scale_factor = '0.02'
autovacuum_max_workers = '5'
autovacuum_vacuum_scale_factor = '0.05'
bg_mon.history_buckets = '120'
bg_mon.listen_address = '::'
checkpoint_completion_target = '0.9'
cluster_name = 'test-staging-db-core-clone31'
extwlist.custom_path = '/scripts'
extwlist.extensions = 'btree_gin,btree_gist,citext,hstore,intarray,ltree,pgcrypto,pgq,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,hypopg,timescaledb,pg_partman'
hot_standby = 'off'
listen_addresses = '*'
log_autovacuum_min_duration = '0'
log_checkpoints = 'on'
log_connections = 'on'
log_destination = 'csvlog'
log_directory = '../pg_log'
log_disconnections = 'on'
log_file_mode = '0644'
log_filename = 'postgresql-%u.log'
log_line_prefix = '%t [%p]: [%l-1] %c %x %d %u %a %h '
log_lock_waits = 'on'
log_min_duration_statement = '500'
log_rotation_age = '1d'
log_statement = 'ddl'
log_temp_files = '0'
log_truncate_on_rotation = 'on'
logging_collector = 'on'
max_connections = '136'
max_locks_per_transaction = '64'
max_prepared_transactions = '0'
max_replication_slots = '10'
max_wal_senders = '10'
max_worker_processes = '8'
pg_stat_statements.track_utility = 'off'
port = '5432'
shared_buffers = '409MB'
shared_preload_libraries = 'bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache'
ssl = 'on'
ssl_cert_file = '/run/certs/server.crt'
ssl_key_file = '/run/certs/server.key'
tcp_keepalives_idle = '900'
tcp_keepalives_interval = '100'
track_commit_timestamp = 'off'
track_functions = 'all'
wal_keep_size = '128MB'
wal_level = 'replica'
wal_log_hints = 'on'
hba_file = '/home/postgres/pgdata/pgroot/data/pg_hba.conf'
ident_file = '/home/postgres/pgdata/pgroot/data/pg_ident.conf'
# recovery.conf
recovery_target_action = 'promote'
recovery_target_time = '2030-04-15T09:34:39.835+00:00'
recovery_target_timeline = 'latest'
restore_command = 'envdir "/run/etc/wal-e.d/env-clone-test-staging-db-core" /scripts/restore_command.sh "%f" "%p"'
pg_log output (with debug logging) when starting postgres manually on that pod:
2021-04-15 11:59:35.183 UTC,,,4404,,60782a8e.1134,94,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000033"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 11:59:35.905 UTC,,,4404,,60782a8e.1134,95,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"restored log file ""000000140000000300000033"" from archive",,,,,,,,,"","startup"
2021-04-15 11:59:36.211 UTC,,,4404,,60782a8e.1134,96,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 11:59:36.236 UTC,,,4404,,60782a8e.1134,97,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000034"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,98,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"could not restore file ""000000140000000300000034"" from archive: child process exited with exit code 1",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,99,,2021-04-15 11:59:10 UTC,,0,DEBUG,58P01,"could not open file ""pg_wal/000000140000000300000034"": No such file or directory",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,100,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"redo done at 3/33AB64E8",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,101,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"last completed transaction was at log time 2021-04-15 11:45:26.565047+00",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,102,,2021-04-15 11:59:10 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,103,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"shmem_exit(1): 1 before_shmem_exit callbacks to make",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,104,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"shmem_exit(1): 5 on_shmem_exit callbacks to make",,,,,,,,,"","startup"
My zalando-operator configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator-wal-backup
data:
  ALLOW_NOSSL: 'true'
  AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  AWS_INSTANCE_PROFILE: 'false'
  AWS_REGION: eu-central-1
  AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  BACKUP_NUM_TO_RETAIN: '7' # depending on schedule
  BACKUP_SCHEDULE: 0 1 * * *
  CLONE_AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  CLONE_AWS_INSTANCE_PROFILE: 'false'
  CLONE_AWS_REGION: eu-central-1
  CLONE_AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  CLONE_USE_WALG_RESTORE: 'true'
  STANDBY_AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  STANDBY_AWS_INSTANCE_PROFILE: 'false'
  STANDBY_AWS_REGION: eu-central-1
  STANDBY_AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  STANDBY_USE_WALG_RESTORE: 'true'
  USE_WALG_BACKUP: 'true'
  USE_WALG_RESTORE: 'true'
  WALG_COMPRESSION_METHOD: "brotli"
Update: It works when I manually change the recovery_target_time in postgresql.conf to 2021-04-15T11:45:26+00:00 (taken from the "last completed transaction" log time) and restart the postgres service:
2021-04-15 12:12:52.266 UTC,,,5510,,60782da9.1586,92,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"restored log file ""000000140000000300000032"" from archive",,,,,,,,,"","startup"
2021-04-15 12:12:52.560 UTC,,,5510,,60782da9.1586,93,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 12:12:52.579 UTC,,,5510,,60782da9.1586,94,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000033"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 12:12:53.325 UTC,,,5510,,60782da9.1586,95,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"restored log file ""000000140000000300000033"" from archive",,,,,,,,,"","startup"
2021-04-15 12:12:53.617 UTC,,,5510,,60782da9.1586,96,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,97,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"recovery stopping before commit of transaction 9524, time 2021-04-15 11:45:26.365858+00",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,98,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"redo done at 3/33A5EC30",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,99,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"last completed transaction was at log time 2021-04-15 11:45:25.744419+00",,,,,,,,,"","startup"
I also faced this issue; something strange happens, see these errors:
/var/run/postgresql:5432 - no response
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - no response
2021-05-19 08:57:26,247 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-05-19 08:57:26,301 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2021-05-19-08-57-26
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 33, in <module>
sys.exit(load_entry_point('patroni==2.0.2', 'console_scripts', 'patroni')())
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 170, in main
return patroni_main()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 138, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 108, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 111, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1457, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1351, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1247, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1240, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
but after waiting one to two hours I restarted the pods and it was initialized successfully (you can see "accepting connections" at the end):
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2021-05-19 10:57:47,536 INFO: establishing a new patroni connection to the postgres cluster
2021-05-19 10:57:47,669 INFO: running post_bootstrap
I don't really know what exactly is happening, but it happens every time I want to restore (clone) a cluster.
Are there any updates on this issue? This just hit me and took me some hours of debugging until I stumbled across this issue 😞
The workaround provided by @steav also works if you adjust the timestamp in spec.clone of the postgresql manifest, as sketched below. Thanks for the suggestion, steav!
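Concretely, that means setting spec.clone.timestamp to a point no later than the "last completed transaction was at log time ..." reported by the failed recovery, instead of a far-future value. A rough sketch with illustrative values:

```yaml
spec:
  clone:
    cluster: "test-staging-db-core"
    # at or before the last transaction that actually made it into the WAL archive
    timestamp: "2021-04-15T11:45:26+00:00"
```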
However, it explicitly says "recovery stopping before commit of transaction XYZ", so you definitely lose one transaction, probably more!
I just hit this issue, so it's still relevant.
Update: Here's my log:
```
test-clone-db-0 postgres 2022-01-10 20:49:40,944 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
test-clone-db-0 postgres 2022-01-10 20:49:42,950 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
test-clone-db-0 postgres 2022-01-10 20:49:42,951 - bootstrapping - INFO - No meta-data available for this provider
test-clone-db-0 postgres 2022-01-10 20:49:42,951 - bootstrapping - INFO - Looks like your running local
test-clone-db-0 postgres 2022-01-10 20:49:42,997 - bootstrapping - INFO - Configuring bootstrap
test-clone-db-0 postgres 2022-01-10 20:49:42,998 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALG_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_ACCESS_KEY_ID
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_SECRET_ACCESS_KEY
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_S3_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALG_DISABLE_S3_SSE
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_S3_FORCE_PATH_STYLE
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/USE_WALG_BACKUP
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/USE_WALG_RESTORE
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_LOG_DESTINATION
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/TMPDIR
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Configuring wal-e
test-clone-db-0 postgres 2022-01-10 20:49:43,002 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:43,002 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_COMPRESSION_METHOD
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Configuring pgbouncer
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Configuring certificate
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Generating ssl certificate
test-clone-db-0 postgres 2022-01-10 20:49:43,086 - bootstrapping - INFO - Configuring crontab
test-clone-db-0 postgres 2022-01-10 20:49:43,086 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
test-clone-db-0 postgres 2022-01-10 20:49:43,100 - bootstrapping - INFO - Configuring pgqd
test-clone-db-0 postgres 2022-01-10 20:49:43,101 - bootstrapping - INFO - Configuring pam-oauth2
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring log
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring standby-cluster
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring patroni
test-clone-db-0 postgres 2022-01-10 20:49:43,119 - bootstrapping - INFO - Writing to file /run/postgres.yml
test-clone-db-0 postgres 2022-01-10 20:49:44,472 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
test-clone-db-0 postgres 2022-01-10 20:49:44,494 INFO: No PostgreSQL configuration items changed, nothing to reload.
test-clone-db-0 postgres 2022-01-10 20:49:44,498 INFO: Lock owner: None; I am test-clone-db-0
test-clone-db-0 postgres 2022-01-10 20:49:44,686 INFO: trying to bootstrap a new cluster
test-clone-db-0 postgres 2022-01-10 20:49:44,687 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-test-db" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-02-01T12:00:00+01:00"
test-clone-db-0 postgres 2022-01-10 20:49:44,907 INFO: Trying s3://cluster01-postgres-backups/spilo/test-db/f66acf9e-0541-4419-a4db-542d70a78ef9/wal/13/ for clone
test-clone-db-0 postgres 2022-01-10 20:49:45,100 INFO: cloning cluster test-clone-db using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000130000000000000026
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.146065 Selecting the backup with name base_000000130000000000000026...
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.402127 Finished decompression of part_003.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.402155 Finished extraction of part_003.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.925736 Finished decompression of part_001.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.925801 Finished extraction of part_001.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966480 Finished extraction of pg_control.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966505 Finished decompression of pg_control.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966516
test-clone-db-0 postgres Backup extraction complete.
test-clone-db-0 postgres 2022-01-10 20:49:53,316 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
test-clone-db-0 postgres 2022-01-10 20:49:53,344 maybe_pg_upgrade INFO: Cluster version: 13, bin version: 13
test-clone-db-0 postgres 2022-01-10 20:49:53,345 maybe_pg_upgrade INFO: Trying to perform point-in-time recovery
test-clone-db-0 postgres 2022-01-10 20:49:53,345 maybe_pg_upgrade INFO: Running custom bootstrap script: true
test-clone-db-0 postgres 2022-01-10 20:49:53,626 maybe_pg_upgrade INFO: postmaster pid=133
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [1-1] 61dc9bf1.85 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [2-1] 61dc9bf1.85 0 LOG: pg_stat_kcache.linux_hz is set to 500000
test-clone-db-0 postgres /var/run/postgresql:5432 - no response
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [3-1] 61dc9bf1.85 0 LOG: starting PostgreSQL 13.4 (Ubuntu 13.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [4-1] 61dc9bf1.85 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [5-1] 61dc9bf1.85 0 LOG: listening on IPv6 address "::", port 5432
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [6-1] 61dc9bf1.85 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [135]: [1-1] 61dc9bf1.87 0 LOG: database system was interrupted; last known up at 2022-01-10 19:17:29 UTC
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [135]: [2-1] 61dc9bf1.87 0 LOG: creating missing WAL directory "pg_wal/archive_status"
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:54.064471 Archive '00000014.history' does not exist.
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [135]: [3-1] 61dc9bf1.87 0 LOG: starting point-in-time recovery to 2022-02-01 11:00:00+00
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [135]: [4-1] 61dc9bf1.87 0 LOG: restored log file "00000013.history" from archive
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [221]: [1-1] 61dc9bf2.dd 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [221]: [2-1] 61dc9bf2.dd 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [223]: [1-1] 61dc9bf2.df 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [223]: [2-1] 61dc9bf2.df 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:54,998 INFO: Lock owner: None; I am test-clone-db-0
test-clone-db-0 postgres 2022-01-10 20:49:54,999 INFO: not healthy enough for leader race
test-clone-db-0 postgres 2022-01-10 20:49:55,066 INFO: bootstrap in progress
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [5-1] 61dc9bf1.87 0 LOG: restored log file "000000130000000000000026" from archive
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [6-1] 61dc9bf1.87 0 LOG: redo starts at 0/26000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.316479 WAL-prefetch file: 000000130000000000000027
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [7-1] 61dc9bf1.87 0 LOG: consistent recovery state reached at 0/2606B238
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.326663 WAL-prefetch file: 000000130000000000000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.339615 WAL-prefetch file: 000000130000000000000029
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.396945 WAL-prefetch file: 00000013000000000000002A
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.406930 WAL-prefetch file: 00000013000000000000002B
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.417314 WAL-prefetch file: 00000013000000000000002C
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.431312 WAL-prefetch file: 00000013000000000000002D
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.451452 WAL-prefetch file: 00000013000000000000002E
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.481687 WAL-prefetch file: 00000013000000000000002F
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.491702 WAL-prefetch file: 000000130000000000000030
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604198 Archive '00000013000000000000002C' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604213 Archive '00000013000000000000002A' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604984 Archive '000000130000000000000030' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.605115 Archive '000000130000000000000029' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.605160 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.607426 Archive '00000013000000000000002E' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.609038 Archive '00000013000000000000002F' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.609997 Archive '00000013000000000000002B' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.613621 Archive '00000013000000000000002D' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [272]: [1-1] 61dc9bf3.110 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [272]: [2-1] 61dc9bf3.110 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [8-1] 61dc9bf1.87 0 LOG: restored log file "000000130000000000000027" from archive
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.530118 WAL-prefetch file: 000000130000000000000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.541730 WAL-prefetch file: 000000130000000000000029
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.551746 WAL-prefetch file: 00000013000000000000002A
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.572022 WAL-prefetch file: 00000013000000000000002B
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.596828 WAL-prefetch file: 00000013000000000000002C
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.606892 WAL-prefetch file: 00000013000000000000002D
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.616963 WAL-prefetch file: 00000013000000000000002E
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.627039 WAL-prefetch file: 00000013000000000000002F
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.637117 WAL-prefetch file: 000000130000000000000030
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.657266 WAL-prefetch file: 000000130000000000000031
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [320]: [1-1] 61dc9bf4.140 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [320]: [2-1] 61dc9bf4.140 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813048 Archive '00000013000000000000002B' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813178 Archive '00000013000000000000002F' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813265 Archive '000000130000000000000031' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.814233 Archive '00000013000000000000002C' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.814950 Archive '00000013000000000000002A' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.815073 Archive '00000013000000000000002D' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.815563 Archive '000000130000000000000029' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.817074 Archive '000000130000000000000030' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.818416 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.819396 Archive '00000013000000000000002E' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.941766 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [9-1] 61dc9bf1.87 0 LOG: redo done at 0/27006230
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [10-1] 61dc9bf1.87 0 FATAL: recovery ended before configured recovery target was reached
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [7-1] 61dc9bf1.85 0 LOG: startup process (PID 135) exited with exit code 1
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [8-1] 61dc9bf1.85 0 LOG: terminating any other active server processes
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [9-1] 61dc9bf1.85 0 LOG: database system is shut down
test-clone-db-0 postgres /var/run/postgresql:5432 - no response
test-clone-db-0 postgres Traceback (most recent call last):
test-clone-db-0 postgres File "/scripts/maybe_pg_upgrade.py", line 51, in perform_pitr
test-clone-db-0 postgres raise Exception('Point-in-time recovery failed')
test-clone-db-0 postgres Exception: Point-in-time recovery failed
test-clone-db-0 postgres
test-clone-db-0 postgres During handling of the above exception, another exception occurred:
test-clone-db-0 postgres
test-clone-db-0 postgres Traceback (most recent call last):
test-clone-db-0 postgres File "/scripts/maybe_pg_upgrade.py", line 140, in <module>
```
The message "FATAL: recovery ended before configured recovery target was reached" therein seems relevant.
@FxKu you recently successfully cloned a cluster when trying to reproduce another issue. Did you use wal-g? It seems the issue described here only occurs with wal-g.
Hello, I am unable to start the cluster after recovering from the pg_basebackup. I already have backups configured for the cluster to the Azure storage account. When I try to recover from the base backup, the slave does not get created. The master is in Running mode. It has an error stating:
2022-05-31 11:51:50,361 INFO: Lock owner: pg-cluster-clone-0; I am pg-cluster-clone-1
2022-05-31 11:51:50,361 INFO: starting as a secondary
2022-05-31 11:51:50 UTC [5371]: [1-1] 62960156.14fb 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-05-31 11:51:50 UTC [5371]: [2-1] 62960156.14fb 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2022-05-31 11:51:50,516 INFO: postmaster pid=5371
/var/run/postgresql:5432 - no response
I am trying to achieve what has been mentioned here: https://github.com/zalando/postgres-operator/blob/master/docs/user.md#clone-directly
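For reference, cloning directly from a running cluster as described there only needs the source cluster name in spec.clone (no uid or timestamp) and goes through pg_basebackup; a minimal sketch, with a placeholder cluster name:

```yaml
spec:
  clone:
    # no uid/timestamp here: the operator clones the running cluster via pg_basebackup
    cluster: "pg-cluster"
```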
Running into this still with operator v1.8.2 and spilo-14:2.1-p6 btw.
@steav's workaround is correct in that regard
In my case, waiting a bit also fixed it, because of "maybe_pg_upgrade INFO: Retrying point-in-time recovery without target", as far as I can tell.
Looking at https://github.com/zalando/spilo/blob/cdae614e71b04ccbbd9e53f684c8a5a30afd08fa/postgres-appliance/bootstrap/maybe_pg_upgrade.py#L58-L65, it seems like there could be a race condition that prevents this from reliably triggering? It took something like 20 retries on a pod to get into that state on my end.
Looking further, in https://github.com/zalando/spilo/blob/cdae614e71b04ccbbd9e53f684c8a5a30afd08fa/postgres-appliance/bootstrap/maybe_pg_upgrade.py#L11-L14 the tail used to find the PITR target error only reads 5 lines, so it can quite easily miss it depending on the logging patterns encountered.
One solution could be to extend the lookback, but the cleanest would be to make sure we actually end up at the expected recovery target. However, putting the exact recovery target timestamp fails to find the backup for me, as if it were a strict before-timestamp rather than a before-or-equal lookup...
Unsure overall.
Hi community,
It seems I have caught the same PITR problem, spent a lot of time on it, and found a workaround that I would like to share, because it is still unclear where the problem lies (patroni, wal-g, postgres, etc.) :)
My case is the same: base_cluster: wal-g --> s3 minio --> clone_cluster
My images: postgres-operator:1.8.2-build.2, spilo-cdp:14-2.1-p6-build.7
My problem:
I can clone a cluster from the "initial" backup, but I CANNOT clone a cluster from the next backup, which includes 1 INSERT query.
kubectl exec -it -n pf $PG_CLUSTER-0 -- wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000003 2022-10-05T01:00:05Z 000000010000000000000003 (initial backup - PITR OK)
base_000000010000000000000005 2022-10-05T07:36:16Z 000000010000000000000005 (backup after 1st INSERT - PITR FAILED)
The Patroni logs from the clone are very sparse ("startup process (PID 108) exited with exit code 1") and unclear:
2022-10-06 08:10:22,580 INFO: Trying s3://foundation-pf/spilo/postgres-db-pg-cluster for clone
2022-10-06 08:10:23,231 INFO: cloning cluster postgres-db-pg-cluster using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_00000001000000000000000D
INFO: 2022/10/06 08:10:23.335329 Selecting the backup with name base_00000001000000000000000D...
INFO: 2022/10/06 08:10:23.461990 AO files metadata was not found. Skipping the AO segments unpacking.
INFO: 2022/10/06 08:10:23.480171 Finished extraction of part_003.tar.lz4
INFO: 2022/10/06 08:10:28.850769 Finished extraction of part_001.tar.lz4
INFO: 2022/10/06 08:10:28.862579 Finished extraction of pg_control.tar.lz4
INFO: 2022/10/06 08:10:28.862601
Backup extraction complete.
2022-10-06 08:10:29,855 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Cluster version: 14, bin version: 14
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Trying to perform point-in-time recovery
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Running custom bootstrap script: true
2022-10-06 08:10:30,937 maybe_pg_upgrade INFO: postmaster pid=105
2022-10-06 08:10:30 UTC [105]: [1-1] 633e8d76.69 0 DEBUG: registering background worker "logical replication launcher"
2022-10-06 08:10:31 UTC [105]: [2-1] 633e8d76.69 0 DEBUG: registering background worker "bg_mon"
2022-10-06 08:10:31 UTC [105]: [3-1] 633e8d76.69 0 DEBUG: loaded library "bg_mon"
2022-10-06 08:10:31 UTC [105]: [4-1] 633e8d76.69 0 DEBUG: loaded library "pg_stat_statements"
2022-10-06 08:10:31 UTC [105]: [5-1] 633e8d76.69 0 DEBUG: loaded library "pgextwlist"
2022-10-06 08:10:31 UTC [105]: [6-1] 633e8d76.69 0 DEBUG: loaded library "pg_auth_mon"
2022-10-06 08:10:31 UTC [105]: [7-1] 633e8d76.69 0 DEBUG: loaded library "set_user"
2022-10-06 08:10:31 UTC [105]: [8-1] 633e8d76.69 0 INFO: timescaledb loaded
2022-10-06 08:10:31 UTC [105]: [9-1] 633e8d76.69 0 DEBUG: registering background worker "TimescaleDB Background Worker Launcher"
2022-10-06 08:10:31 UTC [105]: [10-1] 633e8d76.69 0 DEBUG: loaded library "timescaledb"
2022-10-06 08:10:31 UTC [105]: [11-1] 633e8d76.69 0 DEBUG: registering background worker "pg_cron launcher"
2022-10-06 08:10:31 UTC [105]: [12-1] 633e8d76.69 0 DEBUG: loaded library "pg_cron"
2022-10-06 08:10:31 UTC [105]: [13-1] 633e8d76.69 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-10-06 08:10:31 UTC [105]: [14-1] 633e8d76.69 0 LOG: pg_stat_kcache.linux_hz is set to 500000
2022-10-06 08:10:31 UTC [105]: [15-1] 633e8d76.69 0 DEBUG: loaded library "pg_stat_kcache"
/var/run/postgresql:5432 - no response
2022-10-06 08:10:31 UTC [105]: [16-1] 633e8d76.69 0 DEBUG: mmap(117440512) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2022-10-06 08:10:31 UTC [105]: [17-1] 633e8d76.69 0 LOG: redirecting log output to logging collector process
2022-10-06 08:10:31 UTC [105]: [18-1] 633e8d76.69 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-10-06 08:10:32,335 INFO: Lock owner: None; I am postgres-db-pg-cluster-0
2022-10-06 08:10:32,335 INFO: not healthy enough for leader race
2022-10-06 08:10:32,435 INFO: bootstrap in progress
/var/run/postgresql:5432 - no response
Traceback (most recent call last):
File "/scripts/maybe_pg_upgrade.py", line 51, in perform_pitr
raise Exception('Point-in-time recovery failed')
Exception: Point-in-time recovery failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/scripts/maybe_pg_upgrade.py", line 140, in <module>
main()
File "/scripts/maybe_pg_upgrade.py", line 87, in main
perform_pitr(upgrade, cluster_version, bin_version, config['bootstrap'])
File "/scripts/maybe_pg_upgrade.py", line 70, in perform_pitr
raise Exception('Point-in-time recovery failed.\nLOGS:\n--\n' + logs)
Exception: Point-in-time recovery failed.
LOGS:
--
2022-10-06 08:10:32.950 UTC,,,105,,633e8d76.69,7,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"startup process (PID 108) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.950 UTC,,,105,,633e8d76.69,8,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.952 UTC,,,105,,633e8d76.69,9,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.954 UTC,,,105,,633e8d76.69,10,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.956 UTC,,,107,,633e8d77.6b,1,,2022-10-06 08:10:31 UTC,,0,DEBUG,00000,"logger shutting down",,,,,,,,,"","logger",,0
2022-10-06 08:10:33,345 ERROR: /scripts/maybe_pg_upgrade.py script failed
2022-10-06 08:10:33,425 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-10-06 08:10:33,473 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2022-10-06-08-10-33
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 143, in main
return patroni_main()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 135, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 105, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 108, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1514, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1388, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1280, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1273, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
I found the answer in the postgresql logs:
## Clone cluster logs
kubectl exec -it -n $CLONE_NS $CLONE_NAME-0 -- grep --exclude='*.log' -rnw '/home/postgres/pgdata/pgroot/pg_log/' -e '.'
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:102:2022-10-06 08:12:27.436 UTC,,,395,,633e8dea.18b,18,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"starting up replication slots",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:103:2022-10-06 08:12:27.524 UTC,,,395,,633e8dea.18b,19,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"resetting unlogged relations: cleanup 1 init 0",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:104:2022-10-06 08:12:27.526 UTC,,,395,,633e8dea.18b,20,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"redo starts at 0/D000028",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:105:2022-10-06 08:12:27.526 UTC,,,395,,633e8dea.18b,21,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"end of backup reached",,,,,"WAL redo at 0/D0000D8 for XLOG/BACKUP_END: 0/D000028",,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:106:2022-10-06 08:12:27.535 UTC,,,443,"[local]",633e8deb.1bb,1,"",2022-10-06 08:12:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:107:2022-10-06 08:12:27.535 UTC,"postgres","postgres",443,"[local]",633e8deb.1bb,2,"",2022-10-06 08:12:27 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:108:2022-10-06 08:12:27.536 UTC,,,395,,633e8dea.18b,22,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"consistent recovery state reached at 0/D000100",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:109:2022-10-06 08:12:27.738 UTC,,,451,"[local]",633e8deb.1c3,1,"",2022-10-06 08:12:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:110:2022-10-06 08:12:27.739 UTC,"postgres","postgres",451,"[local]",633e8deb.1c3,2,"",2022-10-06 08:12:27 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:111:2022-10-06 08:12:28.278 UTC,,,395,,633e8dea.18b,23,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"redo done at 0/D000100 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.75 s",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:112:2022-10-06 08:12:28.278 UTC,,,395,,633e8dea.18b,24,,2022-10-06 08:12:26 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:113:2022-10-06 08:12:28.280 UTC,,,391,,633e8dea.187,7,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"startup process (PID 395) exited with exit code 1",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:114:2022-10-06 08:12:28.280 UTC,,,391,,633e8dea.187,8,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:115:2022-10-06 08:12:28.284 UTC,,,391,,633e8dea.187,9,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:116:2022-10-06 08:12:28.287 UTC,,,391,,633e8dea.187,10,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:117:2022-10-06 08:12:28.288 UTC,,,394,,633e8dea.18a,1,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"logger shutting down",,,,,,,,,"","logger",,0
It seems postgresql CANNOT upload the "delta" WAL (the backup after the 1st INSERT) because:
FATAL,XX000,"recovery ended before configured recovery target was reached"
My workaround:
Send a 2nd INSERT query, generate the next backup, and after that I can do PITR from the backup after the 1st INSERT:
kubectl exec -it -n pf $PG_CLUSTER-0 -- wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000003 2022-10-05T01:00:05Z 000000010000000000000003 (initial backup - PITR OK)
base_000000010000000000000005 2022-10-05T07:36:16Z 000000010000000000000005 (backup after 1st INSERT - PITR REPAIRED)
base_000000010000000000000007 2022-10-05T07:41:23Z 000000010000000000000007 (backup after 2nd INSERT - PITR FAILED)
It seems the problem is inside the postgres cluster (why did recovery end before the target?), but in any case it would be nice to improve the patroni logging, because it takes time to understand what went wrong with PITR in the clone :)
Thanks.
I installed postgres-operator with helm and configured the pods to back up with WAL-G to an s3 minio bucket (hosted on the same kubernetes cluster):

I managed to create a cluster with 2 instances and enableConnectionPooler: true. In order to test backup/recovery I created a table with 2 rows in it. I ran a backup manually and then tried to restore the cluster by cloning it:

But the master never becomes ready; the UI is stuck on "Waiting for master to become available". I share the logs of the postgresql pods:

I exec into one of the postgresql pods and try to connect to the postgresql server using psql, but I'm facing:

Any idea what's wrong here? Thank you.