zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

After running base backup standby cluster does not create configs #2460

Closed freedbka closed 1 year ago

freedbka commented 1 year ago


I hope I'm posting in the right place.

Hello. I am trying to migrate PostgreSQL from a VM to the Zalando operator. Both sides run PostgreSQL 12, and both use asynchronous streaming replication. I am trying to use the standby cluster functionality: https://github.com/zalando/postgres-operator/blob/master/docs/user.md#setting-up-a-standby-cluster

In the first case, replicating from S3:

spec:
  standby:
    s3_wal_path: "s3://<bucketname>/spilo/<source_db_cluster>/<UID>/wal/<PGVERSION>"

And in the second case, replicating directly from the source host:

spec:
  standby:
    standby_host: "acid-minimal-cluster.default"
    standby_port: "5433"

I get the same error in both cases. The base backup completes successfully, but when Patroni then tries to start PostgreSQL, it cannot find postgresql.conf. It then renames the data directory to data.failed and repeats the base backup, up to 5 times. My attempts to supply the config file myself were unsuccessful: it reports that the configuration is incorrect, but does not say on which line.
Questions:

  1. Should what I'm trying to do work?
  2. Maybe there is some other way to migrate from legacy postgres to zalando postgres?

I would be very grateful for any answer.

Restore from standby host:

2023-10-24 11:38:34,003 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2023-10-24 11:38:34,013 - bootstrapping - INFO - Looks like you are running aws
2023-10-24 11:38:34,036 - bootstrapping - INFO - Configuring pgqd
2023-10-24 11:38:34,036 - bootstrapping - INFO - Configuring log
2023-10-24 11:38:34,036 - bootstrapping - INFO - Configuring certificate
2023-10-24 11:38:34,036 - bootstrapping - INFO - Generating ssl self-signed certificate
2023-10-24 11:38:34,198 - bootstrapping - INFO - Configuring standby-cluster
2023-10-24 11:38:34,198 - bootstrapping - INFO - Configuring patroni
2023-10-24 11:38:34,205 - bootstrapping - INFO - Writing to file /run/postgres.yml
2023-10-24 11:38:34,206 - bootstrapping - INFO - Configuring crontab
2023-10-24 11:38:34,206 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2023-10-24 11:38:34,211 - bootstrapping - INFO - Configuring pgbouncer
2023-10-24 11:38:34,211 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2023-10-24 11:38:34,211 - bootstrapping - INFO - Configuring pam-oauth2
2023-10-24 11:38:34,211 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2023-10-24 11:38:34,212 - bootstrapping - INFO - Configuring wal-e
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2023-10-24 11:38:34,212 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2023-10-24 11:38:34,213 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2023-10-24 11:38:34,214 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2023-10-24 11:38:34,214 - bootstrapping - INFO - Configuring bootstrap
2023-10-24 11:38:35,396 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2023-10-24 11:38:35,432 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-10-24 11:38:35,434 INFO: Lock owner: None; I am s3-leg-0
2023-10-24 11:38:35,675 INFO: trying to bootstrap a new standby leader
1024+0 records in
1024+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 0.0112936 s, 1.5 GB/s
2023-10-24 11:38:45,934 INFO: Lock owner: None; I am s3-leg-0
2023-10-24 11:38:45,934 INFO: not healthy enough for leader race
2023-10-24 11:38:45,962 INFO: bootstrap_standby_leader in progress
2023-10-24 11:38:55,934 INFO: Lock owner: None; I am s3-leg-0
2023-10-24 11:51:35,934 INFO: bootstrap_standby_leader in progress
NOTICE:  all required WAL segments have been archived
2023-10-24 11:51:45,922 INFO: replica has been created using basebackup_fast_xlog
2023-10-24 11:51:45,923 INFO: bootstrapped clone from remote member postgresql://192.168.43.46:5432
2023-10-24 11:51:45,925 ERROR: Exception during execution of long running task bootstrap_standby_leader
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/async_executor.py", line 97, in run
    wakeup = func(*args) if args else func()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 405, in bootstrap_standby_leader
    result = self.clone(clone_source, msg)
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 359, in clone
    return self.state_handler.follow(node_to_follow) is not False
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 958, in follow
    ret = self.start(timeout=timeout, block_callbacks=change_role, role=role) or None
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 594, in start
    self.config.write_postgresql_conf(configuration)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/config.py", line 383, in write_postgresql_conf
    os.rename(self._postgresql_conf, self._postgresql_base_conf)
FileNotFoundError: [Errno 2] No such file or directory: '/home/postgres/pgdata/pgroot/data/postgresql.conf' -> '/home/postgres/pgdata/pgroot/data/postgresql.base.conf'
/var/run/postgresql:5432 - no response
2023-10-24 11:51:45,930 INFO: removing initialize key after failed attempt to bootstrap the cluster
2023-10-24 11:51:45,982 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data.failed
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.10/dist-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1771, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1593, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1484, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.10/dist-packages/patroni/ha.py", line 1477, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
/etc/runit/runsvdir/default/patroni: finished with code=1 signal=0
/etc/runit/runsvdir/default/patroni: sleeping 30 seconds
2023-10-24 11:52:16,586 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2023-10-24 11:52:16,626 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-10-24 11:52:16,627 INFO: Lock owner: None; I am s3-leg-0
2023-10-24 11:52:16,820 INFO: trying to bootstrap a new standby leader
1024+0 records in
1024+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 0.0113348 s, 1.5 GB/s
2023-10-24 11:52:27,127 INFO: Lock owner: None; I am s3-leg-0
2023-10-24 11:52:27,128 INFO: not healthy enough for leader race
2023-10-24 11:52:27,194 INFO: bootstrap_standby_leader in progress
pg_receivewal: error: could not open file "/home/postgres/pgdata/pgroot/wal_fast/00000001000001CD000000EA.partial": No such file or directory
pg_receivewal: error: could not close file "00000001000001CD000000EA.partial": No such file or directory
pg_receivewal: disconnected; waiting 5 seconds to try again
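
Reading the first traceback: the base backup itself finishes ("replica has been created using basebackup_fast_xlog"), and the failure comes afterwards, when Patroni renames postgresql.conf to postgresql.base.conf inside the data directory and the source file is not there. A minimal Python sketch of that one step follows. It assumes, without confirmation from the logs, that the source server keeps its postgresql.conf outside the data directory, so the base backup never copies it. The function name `rename_conf` and the temporary directories are hypothetical stand-ins, not Patroni code.

```python
import os
import tempfile


def rename_conf(data_dir: str) -> str:
    """Mimic the rename step from the traceback; return 'ok' or the error name."""
    conf = os.path.join(data_dir, "postgresql.conf")
    base_conf = os.path.join(data_dir, "postgresql.base.conf")
    try:
        os.rename(conf, base_conf)
    except FileNotFoundError:
        # Same exception type as in the log above.
        return "FileNotFoundError"
    return "ok"


# Case 1: data directory without postgresql.conf (the situation in the log).
with tempfile.TemporaryDirectory() as empty_data_dir:
    result_missing = rename_conf(empty_data_dir)

# Case 2: data directory that does contain postgresql.conf.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "postgresql.conf"), "w") as f:
        f.write("# placeholder config\n")
    result_present = rename_conf(data_dir)

print(result_missing, result_present)
```
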
freedbka commented 1 year ago

(two images attached)

freedbka commented 1 year ago

Solution, in case it is useful to someone: while the base backup is running, we create the config files postgresql.base.conf and postgresql.conf.

We copy their contents from a working cluster.

After that, everything follows the instructions: https://github.com/zalando/postgres-operator/blob/master/docs/user.md#setting-up-a-standby-cluster
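
The workaround above could be scripted roughly as follows. This is only a sketch under assumptions: the helper `ensure_conf_files`, the template file name, and the temporary directory are hypothetical, and the real target would presumably be the data directory from the traceback (/home/postgres/pgdata/pgroot/data). Whether Patroni accepts the copied files is based solely on the report in this thread.

```python
import os
import shutil
import tempfile

# The two files named in the workaround.
CONF_NAMES = ("postgresql.conf", "postgresql.base.conf")


def ensure_conf_files(data_dir: str, template_conf: str) -> list:
    """Copy template_conf to each config file missing from data_dir.

    Returns the names of the files that were created. Existing files
    are left untouched.
    """
    created = []
    for name in CONF_NAMES:
        target = os.path.join(data_dir, name)
        if not os.path.exists(target):
            shutil.copyfile(template_conf, target)
            created.append(name)
    return created


# Demonstration in a throwaway directory (stand-in for the real data directory).
with tempfile.TemporaryDirectory() as workdir:
    data_dir = os.path.join(workdir, "data")
    os.mkdir(data_dir)

    # Stand-in for a config file taken from a working cluster.
    template = os.path.join(workdir, "working-cluster.conf")
    with open(template, "w") as f:
        f.write("listen_addresses = '*'\n")

    first_run = ensure_conf_files(data_dir, template)
    second_run = ensure_conf_files(data_dir, template)  # nothing left to create
    files_present = sorted(os.listdir(data_dir))

print(first_run, second_run, files_present)
```

Because existing files are skipped, the helper can be run repeatedly while the base backup is in progress without overwriting a config that has already been put in place.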