Open · Martin-Hogge opened this issue 3 years ago
Seems similar to https://github.com/zalando/postgres-operator/issues/1201
I tried a fresh install with operator 1.6.2. I created a new cluster, inserted some data, and tried to clone it, but it keeps failing with the same issue:
File "/usr/local/bin/patroni", line 33, in
sys.exit(load_entry_point('patroni==2.0.2', 'console_scripts', 'patroni')()) File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 170, in main return patroni_main() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 138, in patroni_main abstract_main(Patroni, schema) File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main controller.run() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 108, in run super(Patroni, self).run() File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run self._run_cycle() File "/usr/local/lib/python3.6/dist-packages/patroni/init.py", line 111, in _run_cycle logger.info(self.ha.run_cycle()) File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1457, in run_cycle info = self._run_cycle() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1351, in _run_cycle return self.post_bootstrap() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1247, in post_bootstrap self.cancel_initialization() File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1240, in cancel_initialization raise PatroniFatalException('Failed to bootstrap cluster') patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
I encounter the same issue with a wal-g clone/restore via S3 (operator v1.6.2). The exit code 1 is caused by the following error during postgres recovery:
consistent recovery state reached at 2/EEAFBED0
redo done at 2/EEAFBED0
last completed transaction was at log time 2021-04-14 23:52:18.45155+00
FATAL,XX000,"recovery ended before configured recovery target was reached
startup process (PID 166) exited with exit code 1
This happens all the time during S3 restores (when clone.timestamp is set in the CRD manifest) and never when cloning a running cluster.
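For context, the clone section of the postgresql manifest that triggers this S3/PITR path looks roughly like the sketch below; the cluster names match the logs further down, while the uid and timestamp are placeholders. As far as I can tell, the operator turns spec.clone.timestamp into the recovery_target_time visible in the generated postgresql.conf that follows:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: test-staging-db-core-clone31
spec:
  clone:
    # name of the source cluster whose WAL-G backups in S3 should be used
    cluster: "test-staging-db-core"
    # uid of the source cluster (part of the S3 prefix); placeholder value
    uid: "00000000-0000-0000-0000-000000000000"
    # timezone-aware target time; ends up as recovery_target_time below
    timestamp: "2030-04-15T09:34:39.835+00:00"
```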
root@test-staging-db-core-clone31-0:/home/postgres# cat /home/postgres/pgdata/pgroot/data/postgresql.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'
archive_command = 'envdir "/run/etc/wal-e.d/env" wal-g wal-push "%p"'
archive_mode = 'on'
archive_timeout = '1800s'
autovacuum_analyze_scale_factor = '0.02'
autovacuum_max_workers = '5'
autovacuum_vacuum_scale_factor = '0.05'
bg_mon.history_buckets = '120'
bg_mon.listen_address = '::'
checkpoint_completion_target = '0.9'
cluster_name = 'test-staging-db-core-clone31'
extwlist.custom_path = '/scripts'
extwlist.extensions = 'btree_gin,btree_gist,citext,hstore,intarray,ltree,pgcrypto,pgq,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,hypopg,timescaledb,pg_partman'
hot_standby = 'off'
listen_addresses = '*'
log_autovacuum_min_duration = '0'
log_checkpoints = 'on'
log_connections = 'on'
log_destination = 'csvlog'
log_directory = '../pg_log'
log_disconnections = 'on'
log_file_mode = '0644'
log_filename = 'postgresql-%u.log'
log_line_prefix = '%t [%p]: [%l-1] %c %x %d %u %a %h '
log_lock_waits = 'on'
log_min_duration_statement = '500'
log_rotation_age = '1d'
log_statement = 'ddl'
log_temp_files = '0'
log_truncate_on_rotation = 'on'
logging_collector = 'on'
max_connections = '136'
max_locks_per_transaction = '64'
max_prepared_transactions = '0'
max_replication_slots = '10'
max_wal_senders = '10'
max_worker_processes = '8'
pg_stat_statements.track_utility = 'off'
port = '5432'
shared_buffers = '409MB'
shared_preload_libraries = 'bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache'
ssl = 'on'
ssl_cert_file = '/run/certs/server.crt'
ssl_key_file = '/run/certs/server.key'
tcp_keepalives_idle = '900'
tcp_keepalives_interval = '100'
track_commit_timestamp = 'off'
track_functions = 'all'
wal_keep_size = '128MB'
wal_level = 'replica'
wal_log_hints = 'on'
hba_file = '/home/postgres/pgdata/pgroot/data/pg_hba.conf'
ident_file = '/home/postgres/pgdata/pgroot/data/pg_ident.conf'
# recovery.conf
recovery_target_action = 'promote'
recovery_target_time = '2030-04-15T09:34:39.835+00:00'
recovery_target_timeline = 'latest'
restore_command = 'envdir "/run/etc/wal-e.d/env-clone-test-staging-db-core" /scripts/restore_command.sh "%f" "%p"'
pg_log output (with debug logging) when starting postgres manually on that pod:
2021-04-15 11:59:35.183 UTC,,,4404,,60782a8e.1134,94,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000033"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 11:59:35.905 UTC,,,4404,,60782a8e.1134,95,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"restored log file ""000000140000000300000033"" from archive",,,,,,,,,"","startup"
2021-04-15 11:59:36.211 UTC,,,4404,,60782a8e.1134,96,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 11:59:36.236 UTC,,,4404,,60782a8e.1134,97,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000034"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,98,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"could not restore file ""000000140000000300000034"" from archive: child process exited with exit code 1",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,99,,2021-04-15 11:59:10 UTC,,0,DEBUG,58P01,"could not open file ""pg_wal/000000140000000300000034"": No such file or directory",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,100,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"redo done at 3/33AB64E8",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,101,,2021-04-15 11:59:10 UTC,,0,LOG,00000,"last completed transaction was at log time 2021-04-15 11:45:26.565047+00",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,102,,2021-04-15 11:59:10 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,103,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"shmem_exit(1): 1 before_shmem_exit callbacks to make",,,,,,,,,"","startup"
2021-04-15 11:59:36.859 UTC,,,4404,,60782a8e.1134,104,,2021-04-15 11:59:10 UTC,,0,DEBUG,00000,"shmem_exit(1): 5 on_shmem_exit callbacks to make",,,,,,,,,"","startup"
My zalando-operator configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator-wal-backup
data:
  ALLOW_NOSSL: 'true'
  AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  AWS_INSTANCE_PROFILE: 'false'
  AWS_REGION: eu-central-1
  AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  BACKUP_NUM_TO_RETAIN: '7' # depending on schedule
  BACKUP_SCHEDULE: 0 1 * * *
  CLONE_AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  CLONE_AWS_INSTANCE_PROFILE: 'false'
  CLONE_AWS_REGION: eu-central-1
  CLONE_AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  CLONE_USE_WALG_RESTORE: 'true'
  STANDBY_AWS_ACCESS_KEY_ID: "${BACKUP_S3_KEY}"
  STANDBY_AWS_INSTANCE_PROFILE: 'false'
  STANDBY_AWS_REGION: eu-central-1
  STANDBY_AWS_SECRET_ACCESS_KEY: "${BACKUP_S3_SECRET}"
  STANDBY_USE_WALG_RESTORE: 'true'
  USE_WALG_BACKUP: 'true'
  USE_WALG_RESTORE: 'true'
  WALG_COMPRESSION_METHOD: "brotli"
Update: It works when I manually change the recovery_target_time in postgresql.conf to 2021-04-15T11:45:26+00:00 (taken from the "last completed transaction" log time) and restart the postgres service:
2021-04-15 12:12:52.266 UTC,,,5510,,60782da9.1586,92,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"restored log file ""000000140000000300000032"" from archive",,,,,,,,,"","startup"
2021-04-15 12:12:52.560 UTC,,,5510,,60782da9.1586,93,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 12:12:52.579 UTC,,,5510,,60782da9.1586,94,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"executing restore command ""envdir ""/run/etc/wal-e.d/env-clone-test-staging-db-core"" /scripts/restore_command.sh ""000000140000000300000033"" ""pg_wal/RECOVERYXLOG""""",,,,,,,,,"","startup"
2021-04-15 12:12:53.325 UTC,,,5510,,60782da9.1586,95,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"restored log file ""000000140000000300000033"" from archive",,,,,,,,,"","startup"
2021-04-15 12:12:53.617 UTC,,,5510,,60782da9.1586,96,,2021-04-15 12:12:25 UTC,,0,DEBUG,00000,"got WAL segment from archive",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,97,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"recovery stopping before commit of transaction 9524, time 2021-04-15 11:45:26.365858+00",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,98,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"redo done at 3/33A5EC30",,,,,,,,,"","startup"
2021-04-15 12:12:53.632 UTC,,,5510,,60782da9.1586,99,,2021-04-15 12:12:25 UTC,,0,LOG,00000,"last completed transaction was at log time 2021-04-15 11:45:25.744419+00",,,,,,,,,"","startup"
I also faced this issue; something strange happens, see these errors:
/var/run/postgresql:5432 - no response
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - no response
2021-05-19 08:57:26,247 INFO: removing initialize key after failed attempt to bootstrap the cluster
2021-05-19 08:57:26,301 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2021-05-19-08-57-26
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 33, in <module>
sys.exit(load_entry_point('patroni==2.0.2', 'console_scripts', 'patroni')())
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 170, in main
return patroni_main()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 138, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 108, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/__init__.py", line 111, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1457, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1351, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1247, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1240, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/run/service/patroni: finished with code=1 signal=0
but after waiting one to two hours I restarted the pods and it was initialized successfully (you can see "accepting connections" at the end):
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2021-05-19 10:57:47,536 INFO: establishing a new patroni connection to the postgres cluster
2021-05-19 10:57:47,669 INFO: running post_bootstrap
I don't really know what exactly is happening, but it happens every time I want to restore (clone) a cluster.
Are there any updates on this issue? This just hit me and took me some hours of debugging until I stumbled across this issue 😞
The workaround provided by @steav also works if you adjust the timestamp in spec.clone of the postgresql manifest, as sketched below. Thanks for the suggestion, steav!
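Concretely, that means setting spec.clone.timestamp to a point no later than the "last completed transaction was at log time ..." reported by the failed recovery, instead of a far-future value. A rough sketch with illustrative values:

```yaml
spec:
  clone:
    cluster: "test-staging-db-core"
    # at or before the last transaction that actually made it into the WAL archive
    timestamp: "2021-04-15T11:45:26+00:00"
```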
However, it explicitly says "recovery stopping before commit of transaction XYZ", so you definitely lose one transaction, probably more!
I just hit this issue, so it's still relevant.
Update: Here's my log:
```
test-clone-db-0 postgres 2022-01-10 20:49:40,944 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
test-clone-db-0 postgres 2022-01-10 20:49:42,950 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
test-clone-db-0 postgres 2022-01-10 20:49:42,951 - bootstrapping - INFO - No meta-data available for this provider
test-clone-db-0 postgres 2022-01-10 20:49:42,951 - bootstrapping - INFO - Looks like your running local
test-clone-db-0 postgres 2022-01-10 20:49:42,997 - bootstrapping - INFO - Configuring bootstrap
test-clone-db-0 postgres 2022-01-10 20:49:42,998 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALG_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_ACCESS_KEY_ID
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_SECRET_ACCESS_KEY
test-clone-db-0 postgres 2022-01-10 20:49:42,999 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_S3_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALG_DISABLE_S3_SSE
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/AWS_S3_FORCE_PATH_STYLE
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/USE_WALG_BACKUP
test-clone-db-0 postgres 2022-01-10 20:49:43,000 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/USE_WALG_RESTORE
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/WALE_LOG_DESTINATION
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-test-db/TMPDIR
test-clone-db-0 postgres 2022-01-10 20:49:43,001 - bootstrapping - INFO - Configuring wal-e
test-clone-db-0 postgres 2022-01-10 20:49:43,002 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:43,002 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,003 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
test-clone-db-0 postgres 2022-01-10 20:49:43,004 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_COMPRESSION_METHOD
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
test-clone-db-0 postgres 2022-01-10 20:49:43,005 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
test-clone-db-0 postgres 2022-01-10 20:49:43,006 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Configuring pgbouncer
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Configuring certificate
test-clone-db-0 postgres 2022-01-10 20:49:43,007 - bootstrapping - INFO - Generating ssl certificate
test-clone-db-0 postgres 2022-01-10 20:49:43,086 - bootstrapping - INFO - Configuring crontab
test-clone-db-0 postgres 2022-01-10 20:49:43,086 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
test-clone-db-0 postgres 2022-01-10 20:49:43,100 - bootstrapping - INFO - Configuring pgqd
test-clone-db-0 postgres 2022-01-10 20:49:43,101 - bootstrapping - INFO - Configuring pam-oauth2
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring log
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring standby-cluster
test-clone-db-0 postgres 2022-01-10 20:49:43,102 - bootstrapping - INFO - Configuring patroni
test-clone-db-0 postgres 2022-01-10 20:49:43,119 - bootstrapping - INFO - Writing to file /run/postgres.yml
test-clone-db-0 postgres 2022-01-10 20:49:44,472 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
test-clone-db-0 postgres 2022-01-10 20:49:44,494 INFO: No PostgreSQL configuration items changed, nothing to reload.
test-clone-db-0 postgres 2022-01-10 20:49:44,498 INFO: Lock owner: None; I am test-clone-db-0
test-clone-db-0 postgres 2022-01-10 20:49:44,686 INFO: trying to bootstrap a new cluster
test-clone-db-0 postgres 2022-01-10 20:49:44,687 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-test-db" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-02-01T12:00:00+01:00"
test-clone-db-0 postgres 2022-01-10 20:49:44,907 INFO: Trying s3://cluster01-postgres-backups/spilo/test-db/f66acf9e-0541-4419-a4db-542d70a78ef9/wal/13/ for clone
test-clone-db-0 postgres 2022-01-10 20:49:45,100 INFO: cloning cluster test-clone-db using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000130000000000000026
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.146065 Selecting the backup with name base_000000130000000000000026...
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.402127 Finished decompression of part_003.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:45.402155 Finished extraction of part_003.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.925736 Finished decompression of part_001.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.925801 Finished extraction of part_001.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966480 Finished extraction of pg_control.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966505 Finished decompression of pg_control.tar.lzma
test-clone-db-0 postgres INFO: 2022/01/10 20:49:52.966516
test-clone-db-0 postgres Backup extraction complete.
test-clone-db-0 postgres 2022-01-10 20:49:53,316 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
test-clone-db-0 postgres 2022-01-10 20:49:53,344 maybe_pg_upgrade INFO: Cluster version: 13, bin version: 13
test-clone-db-0 postgres 2022-01-10 20:49:53,345 maybe_pg_upgrade INFO: Trying to perform point-in-time recovery
test-clone-db-0 postgres 2022-01-10 20:49:53,345 maybe_pg_upgrade INFO: Running custom bootstrap script: true
test-clone-db-0 postgres 2022-01-10 20:49:53,626 maybe_pg_upgrade INFO: postmaster pid=133
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [1-1] 61dc9bf1.85 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [2-1] 61dc9bf1.85 0 LOG: pg_stat_kcache.linux_hz is set to 500000
test-clone-db-0 postgres /var/run/postgresql:5432 - no response
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [3-1] 61dc9bf1.85 0 LOG: starting PostgreSQL 13.4 (Ubuntu 13.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [4-1] 61dc9bf1.85 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [5-1] 61dc9bf1.85 0 LOG: listening on IPv6 address "::", port 5432
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [133]: [6-1] 61dc9bf1.85 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [135]: [1-1] 61dc9bf1.87 0 LOG: database system was interrupted; last known up at 2022-01-10 19:17:29 UTC
test-clone-db-0 postgres 2022-01-10 20:49:53 UTC [135]: [2-1] 61dc9bf1.87 0 LOG: creating missing WAL directory "pg_wal/archive_status"
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:54.064471 Archive '00000014.history' does not exist.
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [135]: [3-1] 61dc9bf1.87 0 LOG: starting point-in-time recovery to 2022-02-01 11:00:00+00
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [135]: [4-1] 61dc9bf1.87 0 LOG: restored log file "00000013.history" from archive
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [221]: [1-1] 61dc9bf2.dd 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [221]: [2-1] 61dc9bf2.dd 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [223]: [1-1] 61dc9bf2.df 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:54 UTC [223]: [2-1] 61dc9bf2.df 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:54,998 INFO: Lock owner: None; I am test-clone-db-0
test-clone-db-0 postgres 2022-01-10 20:49:54,999 INFO: not healthy enough for leader race
test-clone-db-0 postgres 2022-01-10 20:49:55,066 INFO: bootstrap in progress
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [5-1] 61dc9bf1.87 0 LOG: restored log file "000000130000000000000026" from archive
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [6-1] 61dc9bf1.87 0 LOG: redo starts at 0/26000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.316479 WAL-prefetch file: 000000130000000000000027
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [135]: [7-1] 61dc9bf1.87 0 LOG: consistent recovery state reached at 0/2606B238
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.326663 WAL-prefetch file: 000000130000000000000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.339615 WAL-prefetch file: 000000130000000000000029
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.396945 WAL-prefetch file: 00000013000000000000002A
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.406930 WAL-prefetch file: 00000013000000000000002B
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.417314 WAL-prefetch file: 00000013000000000000002C
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.431312 WAL-prefetch file: 00000013000000000000002D
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.451452 WAL-prefetch file: 00000013000000000000002E
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.481687 WAL-prefetch file: 00000013000000000000002F
test-clone-db-0 postgres INFO: 2022/01/10 20:49:55.491702 WAL-prefetch file: 000000130000000000000030
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604198 Archive '00000013000000000000002C' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604213 Archive '00000013000000000000002A' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.604984 Archive '000000130000000000000030' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.605115 Archive '000000130000000000000029' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.605160 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.607426 Archive '00000013000000000000002E' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.609038 Archive '00000013000000000000002F' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.609997 Archive '00000013000000000000002B' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:55.613621 Archive '00000013000000000000002D' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [272]: [1-1] 61dc9bf3.110 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:55 UTC [272]: [2-1] 61dc9bf3.110 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [8-1] 61dc9bf1.87 0 LOG: restored log file "000000130000000000000027" from archive
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.530118 WAL-prefetch file: 000000130000000000000028
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.541730 WAL-prefetch file: 000000130000000000000029
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.551746 WAL-prefetch file: 00000013000000000000002A
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.572022 WAL-prefetch file: 00000013000000000000002B
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.596828 WAL-prefetch file: 00000013000000000000002C
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.606892 WAL-prefetch file: 00000013000000000000002D
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.616963 WAL-prefetch file: 00000013000000000000002E
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.627039 WAL-prefetch file: 00000013000000000000002F
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.637117 WAL-prefetch file: 000000130000000000000030
test-clone-db-0 postgres INFO: 2022/01/10 20:49:56.657266 WAL-prefetch file: 000000130000000000000031
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [320]: [1-1] 61dc9bf4.140 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local]
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [320]: [2-1] 61dc9bf4.140 0 postgres postgres [unknown] [local] FATAL: the database system is starting up
test-clone-db-0 postgres /var/run/postgresql:5432 - rejecting connections
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813048 Archive '00000013000000000000002B' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813178 Archive '00000013000000000000002F' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.813265 Archive '000000130000000000000031' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.814233 Archive '00000013000000000000002C' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.814950 Archive '00000013000000000000002A' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.815073 Archive '00000013000000000000002D' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.815563 Archive '000000130000000000000029' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.817074 Archive '000000130000000000000030' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.818416 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.819396 Archive '00000013000000000000002E' does not exist.
test-clone-db-0 postgres
test-clone-db-0 postgres ERROR: 2022/01/10 20:49:56.941766 Archive '000000130000000000000028' does not exist.
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [9-1] 61dc9bf1.87 0 LOG: redo done at 0/27006230
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [135]: [10-1] 61dc9bf1.87 0 FATAL: recovery ended before configured recovery target was reached
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [7-1] 61dc9bf1.85 0 LOG: startup process (PID 135) exited with exit code 1
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [8-1] 61dc9bf1.85 0 LOG: terminating any other active server processes
test-clone-db-0 postgres 2022-01-10 20:49:56 UTC [133]: [9-1] 61dc9bf1.85 0 LOG: database system is shut down
test-clone-db-0 postgres /var/run/postgresql:5432 - no response
test-clone-db-0 postgres Traceback (most recent call last):
test-clone-db-0 postgres File "/scripts/maybe_pg_upgrade.py", line 51, in perform_pitr
test-clone-db-0 postgres raise Exception('Point-in-time recovery failed')
test-clone-db-0 postgres Exception: Point-in-time recovery failed
test-clone-db-0 postgres
test-clone-db-0 postgres During handling of the above exception, another exception occurred:
test-clone-db-0 postgres
test-clone-db-0 postgres Traceback (most recent call last):
test-clone-db-0 postgres File "/scripts/maybe_pg_upgrade.py", line 140, in <module>
```
The message "FATAL: recovery ended before configured recovery target was reached" therein seems relevant.
@FxKu you recently successfully cloned a cluster when trying to reproduce another issue. Did you use wal-g? It seems the issue described here only occurs with wal-g.
Hello, I am unable to start the cluster after recovering from the pg_basebackup. I already have backups configured for the cluster to the Azure storage account. When I try to recover from the base backup, the slave does not get created. The master is in Running mode. It has an error stating:
2022-05-31 11:51:50,361 INFO: Lock owner: pg-cluster-clone-0; I am pg-cluster-clone-1
2022-05-31 11:51:50,361 INFO: starting as a secondary
2022-05-31 11:51:50 UTC [5371]: [1-1] 62960156.14fb 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-05-31 11:51:50 UTC [5371]: [2-1] 62960156.14fb 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2022-05-31 11:51:50,516 INFO: postmaster pid=5371
/var/run/postgresql:5432 - no response
I am trying to achieve what has been mentioned here: https://github.com/zalando/postgres-operator/blob/master/docs/user.md#clone-directly
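For reference, cloning directly from a running cluster as described there only needs the source cluster name in spec.clone (no uid or timestamp) and goes through pg_basebackup; a minimal sketch, with a placeholder cluster name:

```yaml
spec:
  clone:
    # no uid/timestamp here: the operator clones the running cluster via pg_basebackup
    cluster: "pg-cluster"
```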
Running into this still with operator v1.8.2 and spilo-14:2.1-p6 btw.
@steav's workaround is correct in that regard
In my case, waiting a bit also fixed it, because of "maybe_pg_upgrade INFO: Retrying point-in-time recovery without target", as far as I can tell.
Looking at https://github.com/zalando/spilo/blob/cdae614e71b04ccbbd9e53f684c8a5a30afd08fa/postgres-appliance/bootstrap/maybe_pg_upgrade.py#L58-L65, it seems like there could be a race condition that prevents this from reliably triggering? It took something like 20 retries on a pod to get into that state on my end.
Looking further, in https://github.com/zalando/spilo/blob/cdae614e71b04ccbbd9e53f684c8a5a30afd08fa/postgres-appliance/bootstrap/maybe_pg_upgrade.py#L11-L14 the tail used to find the PITR target error only reads 5 lines, so it can quite easily miss it depending on the logging patterns encountered.
One solution could be to extend the lookback, but the cleanest would be to make sure we actually end up at the expected recovery target. However, putting the exact recovery target timestamp fails to find the backup for me, as if it were a strict before-timestamp rather than a before-or-equal lookup...
Unsure overall.
Hi community,
It seems I have caught the same PITR problem, spent a lot of time on it, and found a workaround that I would like to share, because it is still unclear where the problem lies (patroni, wal-g, postgres, etc.) :)
My case is the same: base_cluster: wal-g --> s3 minio --> clone_cluster
My images: postgres-operator:1.8.2-build.2, spilo-cdp:14-2.1-p6-build.7
My problem:
I can clone a cluster from the "initial" backup, but I CANNOT clone a cluster from the next backup, which includes 1 INSERT query.
kubectl exec -it -n pf $PG_CLUSTER-0 -- wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000003 2022-10-05T01:00:05Z 000000010000000000000003 (initial backup - PITR OK)
base_000000010000000000000005 2022-10-05T07:36:16Z 000000010000000000000005 (backup after 1st INSERT - PITR FAILED)
The Patroni logs from the clone are very sparse ("startup process (PID 108) exited with exit code 1") and unclear:
2022-10-06 08:10:22,580 INFO: Trying s3://foundation-pf/spilo/postgres-db-pg-cluster for clone
2022-10-06 08:10:23,231 INFO: cloning cluster postgres-db-pg-cluster using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_00000001000000000000000D
INFO: 2022/10/06 08:10:23.335329 Selecting the backup with name base_00000001000000000000000D...
INFO: 2022/10/06 08:10:23.461990 AO files metadata was not found. Skipping the AO segments unpacking.
INFO: 2022/10/06 08:10:23.480171 Finished extraction of part_003.tar.lz4
INFO: 2022/10/06 08:10:28.850769 Finished extraction of part_001.tar.lz4
INFO: 2022/10/06 08:10:28.862579 Finished extraction of pg_control.tar.lz4
INFO: 2022/10/06 08:10:28.862601
Backup extraction complete.
2022-10-06 08:10:29,855 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Cluster version: 14, bin version: 14
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Trying to perform point-in-time recovery
2022-10-06 08:10:29,947 maybe_pg_upgrade INFO: Running custom bootstrap script: true
2022-10-06 08:10:30,937 maybe_pg_upgrade INFO: postmaster pid=105
2022-10-06 08:10:30 UTC [105]: [1-1] 633e8d76.69 0 DEBUG: registering background worker "logical replication launcher"
2022-10-06 08:10:31 UTC [105]: [2-1] 633e8d76.69 0 DEBUG: registering background worker "bg_mon"
2022-10-06 08:10:31 UTC [105]: [3-1] 633e8d76.69 0 DEBUG: loaded library "bg_mon"
2022-10-06 08:10:31 UTC [105]: [4-1] 633e8d76.69 0 DEBUG: loaded library "pg_stat_statements"
2022-10-06 08:10:31 UTC [105]: [5-1] 633e8d76.69 0 DEBUG: loaded library "pgextwlist"
2022-10-06 08:10:31 UTC [105]: [6-1] 633e8d76.69 0 DEBUG: loaded library "pg_auth_mon"
2022-10-06 08:10:31 UTC [105]: [7-1] 633e8d76.69 0 DEBUG: loaded library "set_user"
2022-10-06 08:10:31 UTC [105]: [8-1] 633e8d76.69 0 INFO: timescaledb loaded
2022-10-06 08:10:31 UTC [105]: [9-1] 633e8d76.69 0 DEBUG: registering background worker "TimescaleDB Background Worker Launcher"
2022-10-06 08:10:31 UTC [105]: [10-1] 633e8d76.69 0 DEBUG: loaded library "timescaledb"
2022-10-06 08:10:31 UTC [105]: [11-1] 633e8d76.69 0 DEBUG: registering background worker "pg_cron launcher"
2022-10-06 08:10:31 UTC [105]: [12-1] 633e8d76.69 0 DEBUG: loaded library "pg_cron"
2022-10-06 08:10:31 UTC [105]: [13-1] 633e8d76.69 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2022-10-06 08:10:31 UTC [105]: [14-1] 633e8d76.69 0 LOG: pg_stat_kcache.linux_hz is set to 500000
2022-10-06 08:10:31 UTC [105]: [15-1] 633e8d76.69 0 DEBUG: loaded library "pg_stat_kcache"
/var/run/postgresql:5432 - no response
2022-10-06 08:10:31 UTC [105]: [16-1] 633e8d76.69 0 DEBUG: mmap(117440512) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2022-10-06 08:10:31 UTC [105]: [17-1] 633e8d76.69 0 LOG: redirecting log output to logging collector process
2022-10-06 08:10:31 UTC [105]: [18-1] 633e8d76.69 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
2022-10-06 08:10:32,335 INFO: Lock owner: None; I am postgres-db-pg-cluster-0
2022-10-06 08:10:32,335 INFO: not healthy enough for leader race
2022-10-06 08:10:32,435 INFO: bootstrap in progress
/var/run/postgresql:5432 - no response
Traceback (most recent call last):
File "/scripts/maybe_pg_upgrade.py", line 51, in perform_pitr
raise Exception('Point-in-time recovery failed')
Exception: Point-in-time recovery failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/scripts/maybe_pg_upgrade.py", line 140, in <module>
main()
File "/scripts/maybe_pg_upgrade.py", line 87, in main
perform_pitr(upgrade, cluster_version, bin_version, config['bootstrap'])
File "/scripts/maybe_pg_upgrade.py", line 70, in perform_pitr
raise Exception('Point-in-time recovery failed.\nLOGS:\n--\n' + logs)
Exception: Point-in-time recovery failed.
LOGS:
--
2022-10-06 08:10:32.950 UTC,,,105,,633e8d76.69,7,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"startup process (PID 108) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.950 UTC,,,105,,633e8d76.69,8,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.952 UTC,,,105,,633e8d76.69,9,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.954 UTC,,,105,,633e8d76.69,10,,2022-10-06 08:10:30 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
2022-10-06 08:10:32.956 UTC,,,107,,633e8d77.6b,1,,2022-10-06 08:10:31 UTC,,0,DEBUG,00000,"logger shutting down",,,,,,,,,"","logger",,0
2022-10-06 08:10:33,345 ERROR: /scripts/maybe_pg_upgrade.py script failed
2022-10-06 08:10:33,425 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-10-06 08:10:33,473 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2022-10-06-08-10-33
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 143, in main
return patroni_main()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 135, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 100, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 105, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/dist-packages/patroni/daemon.py", line 59, in run
self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/__main__.py", line 108, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1514, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1388, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1280, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1273, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
I found the answer in the postgresql logs:
## Clone cluster logs
kubectl exec -it -n $CLONE_NS $CLONE_NAME-0 -- grep --exclude='*.log' -rnw '/home/postgres/pgdata/pgroot/pg_log/' -e '.'
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:102:2022-10-06 08:12:27.436 UTC,,,395,,633e8dea.18b,18,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"starting up replication slots",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:103:2022-10-06 08:12:27.524 UTC,,,395,,633e8dea.18b,19,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"resetting unlogged relations: cleanup 1 init 0",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:104:2022-10-06 08:12:27.526 UTC,,,395,,633e8dea.18b,20,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"redo starts at 0/D000028",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:105:2022-10-06 08:12:27.526 UTC,,,395,,633e8dea.18b,21,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"end of backup reached",,,,,"WAL redo at 0/D0000D8 for XLOG/BACKUP_END: 0/D000028",,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:106:2022-10-06 08:12:27.535 UTC,,,443,"[local]",633e8deb.1bb,1,"",2022-10-06 08:12:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:107:2022-10-06 08:12:27.535 UTC,"postgres","postgres",443,"[local]",633e8deb.1bb,2,"",2022-10-06 08:12:27 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:108:2022-10-06 08:12:27.536 UTC,,,395,,633e8dea.18b,22,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"consistent recovery state reached at 0/D000100",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:109:2022-10-06 08:12:27.738 UTC,,,451,"[local]",633e8deb.1c3,1,"",2022-10-06 08:12:27 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:110:2022-10-06 08:12:27.739 UTC,"postgres","postgres",451,"[local]",633e8deb.1c3,2,"",2022-10-06 08:12:27 UTC,,0,FATAL,57P03,"the database system is not accepting connections","Hot standby mode is disabled.",,,,,,,,"","client backend",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:111:2022-10-06 08:12:28.278 UTC,,,395,,633e8dea.18b,23,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"redo done at 0/D000100 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.75 s",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:112:2022-10-06 08:12:28.278 UTC,,,395,,633e8dea.18b,24,,2022-10-06 08:12:26 UTC,,0,FATAL,XX000,"recovery ended before configured recovery target was reached",,,,,,,,,"","startup",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:113:2022-10-06 08:12:28.280 UTC,,,391,,633e8dea.187,7,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"startup process (PID 395) exited with exit code 1",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:114:2022-10-06 08:12:28.280 UTC,,,391,,633e8dea.187,8,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:115:2022-10-06 08:12:28.284 UTC,,,391,,633e8dea.187,9,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"shutting down due to startup process failure",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:116:2022-10-06 08:12:28.287 UTC,,,391,,633e8dea.187,10,,2022-10-06 08:12:26 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,"","postmaster",,0
/home/postgres/pgdata/pgroot/pg_log/postgresql-4.csv:117:2022-10-06 08:12:28.288 UTC,,,394,,633e8dea.18a,1,,2022-10-06 08:12:26 UTC,,0,DEBUG,00000,"logger shutting down",,,,,,,,,"","logger",,0
It seems postgresql CANNOT upload the "delta" WAL (the backup after the 1st INSERT) because:
FATAL,XX000,"recovery ended before configured recovery target was reached"
My workaround:
Send a 2nd INSERT query, generate the next backup, and after that I can do PITR from the backup after the 1st INSERT:
kubectl exec -it -n pf $PG_CLUSTER-0 -- wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000003 2022-10-05T01:00:05Z 000000010000000000000003 (initial backup - PITR OK)
base_000000010000000000000005 2022-10-05T07:36:16Z 000000010000000000000005 (backup after 1st INSERT - PITR REPAIRED)
base_000000010000000000000007 2022-10-05T07:41:23Z 000000010000000000000007 (backup after 2nd INSERT - PITR FAILED)
It seems the problem is inside the postgres cluster (why did recovery end before the target?), but in any case it would be nice to improve the patroni logging, because it takes time to understand what went wrong with PITR in the clone :)
Thanks.
I installed postgres-operator with helm and configured the pods to back up with WAL-G to an s3 minio bucket (hosted on the same kubernetes cluster):

I managed to create a cluster with 2 instances and enableConnectionPooler: true. In order to test backup/recovery I created a table with 2 rows in it. I ran a backup manually and then tried to restore the cluster by cloning it:

But the master never becomes ready; the UI is stuck on "Waiting for master to become available". I share the logs of the postgresql pods:

I exec into one of the postgresql pods and try to connect to the postgresql server using psql, but I'm facing:

Any idea what's wrong here? Thank you.