zalando / spilo

Highly available elephant herd: HA PostgreSQL cluster using Docker

PostgreSQL Doesn't Initialize when S3 is Down #685

Closed haslersn closed 2 years ago

haslersn commented 2 years ago

When Spilo is used with wal-g backups enabled and the S3 endpoint is down, PostgreSQL does not initialize. Instead, the container log spams

default-db-0 postgres /var/run/postgresql:5432 - rejecting connections

and the PostgreSQL log spams:

2022-01-08 17:19:39.018 UTC,,,546,"[local]",61d9c7ab.222,1,"",2022-01-08 17:19:39 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized"
2022-01-08 17:19:39.018 UTC,"postgres","postgres",546,"[local]",61d9c7ab.222,2,"",2022-01-08 17:19:39 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,"","client backend"

(I don't know whether wal-e is affected, too, but I guess so.)

To Reproduce

  1. Create a PostgreSQL cluster with AWS_ENDPOINT pointing to your own S3-compatible object storage. (When using postgres-operator, this can be done using the ConfigMap referred to as pod_environment_configmap; see the sketch after this list.)
  2. Take the PostgreSQL cluster down.
  3. Take the object storage down.
  4. Take the PostgreSQL cluster up again.
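
For reference, a minimal sketch of such a ConfigMap (the name and endpoint URL are made up; postgres-operator injects the data keys as environment variables into the Spilo pods):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config        # hypothetical name referenced by pod_environment_configmap
  namespace: default
data:
  AWS_ENDPOINT: "https://s3.example.internal:9000"  # your own S3-compatible endpoint
  AWS_S3_FORCE_PATH_STYLE: "true"                   # typically needed for non-AWS S3
```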

Expected behavior

If the object storage is not available, then the PostgreSQL cluster's instances should use their local WAL data in order to recover from the downtime.

It is somewhat unexpected that enabling backups can lead to reduced availability. If backups aren't enabled, the instances can recover using their local WAL data. Why should this be impossible when backups are enabled?

Environment

CyberDem0n commented 2 years ago

Opening yet another issue will not change the behavior of Postgres.

haslersn commented 2 years ago

Opening yet another issue will not change the behavior of Postgres.

I “moved” it here.

Maybe the problem is solved by setting recovery_target = 'immediate' (Documentation). Is this safe to enable, or could it lead to data loss due to not recovering all transactions from the backup?

haslersn commented 2 years ago

I tested it, and adding

postgresql:
  parameters:
    recovery_target: immediate

does indeed solve the problem. However, the question remains: Is this safe?

haslersn commented 2 years ago

I tried to research the matter in order to answer the question myself, but there are several things I don't know/understand:

CyberDem0n commented 2 years ago

but what's the definition of consistent state?

Data on the filesystem is consistent in only one case: the primary was shut down cleanly. In all other cases, starting Postgres requires applying Write-Ahead Logs until a consistent state is reached.
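
A quick way to see which case applies, sketched here with the Spilo data directory path (adjust if yours differs):

```sh
# "shut down" = clean shutdown; anything else (e.g. "in production",
# "in archive recovery", "in crash recovery") means WAL must be replayed
# before the on-disk data is consistent.
pg_controldata /home/postgres/pgdata/pgroot/data | grep 'Database cluster state'
```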

means that no WAL files need to be fetched from S3, right?

No, https://www.depesz.com/2014/01/27/waiting-for-9-4-add-recovery_targetimmediate-option/ explains a bit better what this option does, and having it set in an HA system is undesirable.

All relevant WAL files that are present in the S3 backup are also present in at least one instance, right?

What made you think so? We are living in a distributed world; there are many nodes (pods) in the cluster, and they can all have different state, including the presence or absence of WAL files.

After a shutdown or crash of an instance, can/will it ever be in a non-consistent state?

Replicas are nearly always inconsistent. The primary can be consistent only if it was shut down cleanly, but in an HA system it is expected that one replica or another will immediately take over and become the new primary.

If yes, and assuming USE_WALE is disabled (and hence there's no recovery_conf in the Patroni config), could the instance recover from this non-consistent state via streaming replication from the primary? If yes, then why does Spilo even enable recovery_conf if USE_WALE is enabled?

It is not about wal-e/wal-g or any other backup solution. It is not even about recovery_conf in the Patroni config, which, by the way, is translated either to the recovery.conf file or to postgresql.conf + standby.signal. Patroni will create correct Postgres configs even if there is no recovery_conf section in the Patroni config.
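
For illustration only (a sketch, not Patroni's exact output): on PostgreSQL 12+ the recovery settings end up in the Patroni-managed part of postgresql.conf plus an empty standby.signal file; the restore_command below matches the wrapper visible in the ps output further down, while the connection string is a placeholder.

```
primary_conninfo = 'host=... port=5432 user=standby ...'   # placeholder
restore_command = 'envdir "/run/etc/wal-e.d/env" /scripts/restore_command.sh "%f" "%p"'
# plus an empty standby.signal in the data directory
# (on PostgreSQL 11 and older the same settings go into recovery.conf)
```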

I repeat: the world becomes much more complex once we start running more than one Postgres node:

Without continuous archiving and PITR these issues are not possible to solve.

haslersn commented 2 years ago

https://www.depesz.com/2014/01/27/waiting-for-9-4-add-recovery_targetimmediate-option/ explains a bit better what this option does, and having it set in an HA system is undesirable.

I've already read this, but it doesn't explain why having it set in an HA system is undesirable.

All relevant WAL files that are present in the S3 backup are also present in at least one instance, right?

What made you think so?

One instance must have uploaded them to the S3 backup. But I think you're saying that some WAL files may have already been deleted from the instances, so they're only left in the S3 backup. Thus, I was wrong.

Without continuous archiving and PITR these issues are not possible to solve.

How are these issues solved in Spilo HA clusters if continuous archiving (i.e. USE_WALE) is disabled? If they are not solved, then what are the implications and are they documented somewhere?

I run clusters with 3 instances and synchronous_mode: true and synchronous_mode_strict: true. In this configuration, there are always at least two instances which aren't behind. I think I understand that a full recovery might be necessary for the third instance, which might be behind. But it would be nice if the two other instances (previous primary and previous sync standby) can start, even if the object storage is down. Is this possible to achieve with some configuration?

CyberDem0n commented 2 years ago

I've already read this, but it doesn't explain why having it set in an HA system is undesirable.

The article didn't say it; I did. In an HA system we are better off applying all available WAL files from the archive, while this setting forces promotion as soon as consistency is reached.

One instance must have uploaded them to the S3 backup.

WAL archiving is an asynchronous process; there are no guarantees.

How are these issues solved in Spilo HA clusters if continuous archiving (i.e. USE_WALE) is disabled?

It is not possible to solve them.

what are the implications and are they documented somewhere?

Sometimes replicas won't be able to rejoin the cluster after it was down. It's more or less the same behavior as with vanilla Postgres. I don't see a point in copy-pasting the Postgres documentation.

I run clusters with 3 instances and synchronous_mode: true and synchronous_mode_strict: true. In this configuration, there are always at least two instances which aren't behind

Postgres synchronous replication doesn't give you such guarantees. WAL is written not only by client activity, but also, for example, by autovacuum. Even client transactions are first written/flushed to the WAL on the primary and only afterwards sent to replicas. Effectively, even synchronous replicas are always slightly behind the primary. And, surprise, the failed primary may need to be rewound when you want to join it back to the cluster as a replica.
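
A sketch of how one could observe this on the primary (not from the original reply):

```sh
# even a synchronous standby usually reports replay_lsn slightly behind
# the primary's current WAL position
psql -c "SELECT application_name, sync_state, pg_current_wal_lsn() AS primary_lsn, replay_lsn FROM pg_stat_replication;"
```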

But it would be nice if the two other instances (previous primary and previous sync standby) can start, even if the object storage is down

First of all, you have to understand what is going on there. Have you checked the Postgres logs? Could it be, for example, that the restore_command is hanging? Have you checked that?

haslersn commented 2 years ago

First of all, you have to understand what is going on there. Have you checked the Postgres logs? Could it be, for example, that the restore_command is hanging? Have you checked that?

My original post contains the Postgres logs. There are only those two lines that get logged over and over again (with ascending timestamps). I suspect that the restore_command is hanging, but I'm not sure. How to check this? Does the restore_command produce any log?

CyberDem0n commented 2 years ago

My original post contains the Postgres logs.

Two random lines are not enough. And it seems that you didn't even spend time scrolling up to the place where they start. On startup it writes waaaaay more interesting stuff.

I suspect that the restore_command is hanging, but I'm not sure. How to check this?

Come on! These are sysadmin basics: ps auxwwf.

Does the restore_command produce any log?

Whatever the tool writes to stderr is redirected to the *.log files.
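
For completeness, a sketch of how one might locate those files inside the container (the exact log_directory depends on the configuration, so search rather than assume a path):

```sh
# list Postgres log files written in the last 15 minutes under the data volume
find /home/postgres/pgdata -name '*.log' -mmin -15
```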

haslersn commented 2 years ago

And it seems that you didn't even spend time scrolling up to the place where they start. On startup it writes waaaaay more interesting stuff.

Tailing the log only showed those two lines over and over again. In order to get the full log, I now redirected the PostgreSQL log to stderr (and thus to the container log) via:

postgresql:
  parameters:
    log_destination: stderr
    logging_collector: off

For the sake of an easy example, I used a cluster with only one instance, so there's only the primary. Even in this configuration the problem occurs. Here's the full container log:

Show container log

``` test-db-0 postgres 2022-01-10 12:39:46,808 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) test-db-0 postgres 2022-01-10 12:39:48,815 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup test-db-0 postgres 2022-01-10 12:39:48,816 - bootstrapping - INFO - No meta-data available for this provider test-db-0 postgres 2022-01-10 12:39:48,816 - bootstrapping - INFO - Looks like your running local test-db-0 postgres 2022-01-10 12:39:48,866 - bootstrapping - INFO - Configuring log test-db-0 postgres 2022-01-10 12:39:48,866 - bootstrapping - INFO - Configuring pgbouncer test-db-0 postgres 2022-01-10 12:39:48,867 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping test-db-0 postgres 2022-01-10 12:39:48,867 - bootstrapping - INFO - Configuring crontab test-db-0 postgres 2022-01-10 12:39:48,867 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability test-db-0 postgres 2022-01-10 12:39:48,883 - bootstrapping - INFO - Configuring wal-e test-db-0 postgres 2022-01-10 12:39:48,884 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX test-db-0 postgres 2022-01-10 12:39:48,884 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX test-db-0 postgres 2022-01-10 12:39:48,884 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID test-db-0 postgres 2022-01-10 12:39:48,885 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY test-db-0 postgres 2022-01-10 12:39:48,885 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT test-db-0 postgres 2022-01-10 12:39:48,885 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT test-db-0 postgres 2022-01-10 12:39:48,886 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE test-db-0 postgres 2022-01-10 12:39:48,886 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE test-db-0 postgres 2022-01-10 12:39:48,886 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY test-db-0 postgres 2022-01-10 12:39:48,886 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY test-db-0 postgres 2022-01-10 12:39:48,887 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_COMPRESSION_METHOD test-db-0 postgres 2022-01-10 12:39:48,887 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP test-db-0 postgres 2022-01-10 12:39:48,887 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE test-db-0 postgres 2022-01-10 12:39:48,888 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION test-db-0 postgres 2022-01-10 12:39:48,888 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT test-db-0 postgres 2022-01-10 12:39:48,888 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN test-db-0 postgres 2022-01-10 12:39:48,888 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR test-db-0 postgres 2022-01-10 12:39:48,889 - bootstrapping - INFO - Configuring bootstrap test-db-0 postgres 2022-01-10 12:39:48,889 - bootstrapping - INFO - Configuring pam-oauth2 test-db-0 postgres 2022-01-10 12:39:48,889 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql test-db-0 postgres 2022-01-10 12:39:48,889 - bootstrapping - INFO - Configuring pgqd test-db-0 
postgres 2022-01-10 12:39:48,890 - bootstrapping - INFO - Configuring standby-cluster test-db-0 postgres 2022-01-10 12:39:48,890 - bootstrapping - INFO - Configuring patroni test-db-0 postgres 2022-01-10 12:39:48,905 - bootstrapping - INFO - Writing to file /run/postgres.yml test-db-0 postgres 2022-01-10 12:39:48,905 - bootstrapping - INFO - Configuring certificate test-db-0 postgres 2022-01-10 12:39:48,905 - bootstrapping - INFO - Generating ssl certificate test-db-0 postgres 2022-01-10 12:39:49,627 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'. test-db-0 postgres 2022-01-10 12:39:49,648 INFO: No PostgreSQL configuration items changed, nothing to reload. test-db-0 postgres 2022-01-10 12:39:49,662 WARNING: Postgresql is not running. test-db-0 postgres 2022-01-10 12:39:49,663 INFO: Lock owner: test-db-0; I am test-db-0 test-db-0 postgres 2022-01-10 12:39:49,732 INFO: pg_controldata: test-db-0 postgres pg_control version number: 1300 test-db-0 postgres Catalog version number: 202007201 test-db-0 postgres Database system identifier: 7051553984695054418 test-db-0 postgres Database cluster state: shut down test-db-0 postgres pg_control last modified: Mon Jan 10 12:34:38 2022 test-db-0 postgres Latest checkpoint location: 0/5000028 test-db-0 postgres Latest checkpoint's REDO location: 0/5000028 test-db-0 postgres Latest checkpoint's REDO WAL file: 000000020000000000000005 test-db-0 postgres Latest checkpoint's TimeLineID: 2 test-db-0 postgres Latest checkpoint's PrevTimeLineID: 2 test-db-0 postgres Latest checkpoint's full_page_writes: on test-db-0 postgres Latest checkpoint's NextXID: 0:1078 test-db-0 postgres Latest checkpoint's NextOID: 16720 test-db-0 postgres Latest checkpoint's NextMultiXactId: 1 test-db-0 postgres Latest checkpoint's NextMultiOffset: 0 test-db-0 postgres Latest checkpoint's oldestXID: 479 test-db-0 postgres Latest checkpoint's oldestXID's DB: 1 test-db-0 postgres Latest checkpoint's oldestActiveXID: 0 test-db-0 postgres Latest checkpoint's oldestMultiXid: 1 test-db-0 postgres Latest checkpoint's oldestMulti's DB: 1 test-db-0 postgres Latest checkpoint's oldestCommitTsXid: 0 test-db-0 postgres Latest checkpoint's newestCommitTsXid: 0 test-db-0 postgres Time of latest checkpoint: Mon Jan 10 12:34:38 2022 test-db-0 postgres Fake LSN counter for unlogged rels: 0/3E8 test-db-0 postgres Minimum recovery ending location: 0/0 test-db-0 postgres Min recovery ending loc's timeline: 0 test-db-0 postgres Backup start location: 0/0 test-db-0 postgres Backup end location: 0/0 test-db-0 postgres End-of-backup record required: no test-db-0 postgres wal_level setting: replica test-db-0 postgres wal_log_hints setting: on test-db-0 postgres max_connections setting: 100 test-db-0 postgres max_worker_processes setting: 8 test-db-0 postgres max_wal_senders setting: 10 test-db-0 postgres max_prepared_xacts setting: 0 test-db-0 postgres max_locks_per_xact setting: 64 test-db-0 postgres track_commit_timestamp setting: off test-db-0 postgres Maximum data alignment: 8 test-db-0 postgres Database block size: 8192 test-db-0 postgres Blocks per segment of large relation: 131072 test-db-0 postgres WAL block size: 8192 test-db-0 postgres Bytes per WAL segment: 16777216 test-db-0 postgres Maximum length of identifiers: 64 test-db-0 postgres Maximum columns in an index: 32 test-db-0 postgres Maximum size of a TOAST chunk: 1996 test-db-0 postgres Size of a large-object chunk: 2048 test-db-0 postgres Date/time type 
storage: 64-bit integers test-db-0 postgres Float8 argument passing: by value test-db-0 postgres Data page checksum version: 0 test-db-0 postgres Mock authentication nonce: 2daf7915373c4403767c7c3876aa10e4dec56113950f98a81183600be7854426 test-db-0 postgres test-db-0 postgres 2022-01-10 12:39:49,734 INFO: Lock owner: test-db-0; I am test-db-0 test-db-0 postgres 2022-01-10 12:39:49,853 INFO: starting as readonly because i had the session lock test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [1-1] 61dc2916.46 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter... test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [2-1] 61dc2916.46 0 LOG: pg_stat_kcache.linux_hz is set to 1000000 test-db-0 postgres 2022-01-10 12:39:50,115 INFO: postmaster pid=70 test-db-0 postgres /var/run/postgresql:5432 - no response test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [3-1] 61dc2916.46 0 LOG: starting PostgreSQL 13.4 (Ubuntu 13.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [4-1] 61dc2916.46 0 LOG: listening on IPv4 address "0.0.0.0", port 5432 test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [5-1] 61dc2916.46 0 LOG: listening on IPv6 address "::", port 5432 test-db-0 postgres 2022-01-10 12:39:50 UTC [70]: [6-1] 61dc2916.46 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" test-db-0 postgres 2022-01-10 12:39:50 UTC [72]: [1-1] 61dc2916.48 0 LOG: database system was shut down at 2022-01-10 12:34:38 UTC test-db-0 postgres 2022-01-10 12:39:51 UTC [102]: [1-1] 61dc2917.66 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:51 UTC [102]: [2-1] 61dc2917.66 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:51 UTC [104]: [1-1] 61dc2917.68 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:51 UTC [104]: [2-1] 61dc2917.68 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:52 UTC [106]: [1-1] 61dc2918.6a 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:52 UTC [106]: [2-1] 61dc2918.6a 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:53 UTC [108]: [1-1] 61dc2919.6c 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:53 UTC [108]: [2-1] 61dc2919.6c 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:54 UTC [110]: [1-1] 61dc291a.6e 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:54 UTC [110]: [2-1] 61dc291a.6e 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:55 UTC [112]: [1-1] 61dc291b.70 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:55 UTC [112]: [2-1] 
61dc291b.70 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:56 UTC [114]: [1-1] 61dc291c.72 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:56 UTC [114]: [2-1] 61dc291c.72 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:57 UTC [116]: [1-1] 61dc291d.74 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:57 UTC [116]: [2-1] 61dc291d.74 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:58 UTC [118]: [1-1] 61dc291e.76 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:58 UTC [118]: [2-1] 61dc291e.76 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections test-db-0 postgres 2022-01-10 12:39:59 UTC [120]: [1-1] 61dc291f.78 0 [unknown] [unknown] [unknown] [local] LOG: connection received: host=[local] test-db-0 postgres 2022-01-10 12:39:59 UTC [120]: [2-1] 61dc291f.78 0 postgres postgres [unknown] [local] FATAL: the database system is starting up test-db-0 postgres /var/run/postgresql:5432 - rejecting connections ```

Come on! These are sysadmin basics: ps auxwwf.

Thanks. It hangs in wal-g wal-fetch ...:

Show ps auxwwf output

``` root@test-db-0:/home/postgres# ps auxwwf USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 866 0.3 0.0 22040 4160 pts/1 Ss 12:49 0:00 bash root 884 0.0 0.0 37808 3372 pts/1 R+ 12:49 0:00 \_ ps auxwwf root 1 0.0 0.0 4392 764 ? Ss 12:39 0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh root 7 0.0 0.0 4636 1744 ? S 12:39 0:00 /bin/sh /launch.sh root 27 0.0 0.0 4560 736 ? S 12:39 0:00 \_ /usr/bin/runsvdir -P /etc/service root 28 0.0 0.0 4408 768 ? Ss 12:39 0:00 \_ runsv cron root 32 0.0 0.0 28364 2752 ? S 12:39 0:00 | \_ /usr/sbin/cron -f root 29 0.0 0.0 4408 840 ? Ss 12:39 0:00 \_ runsv patroni postgres 31 0.2 0.0 755420 35172 ? Sl 12:39 0:01 | \_ /usr/bin/python3 /usr/local/bin/patroni /home/postgres/postgres.yml root 30 0.0 0.0 4408 788 ? Ss 12:39 0:00 \_ runsv pgqd postgres 33 0.0 0.0 108020 7840 ? S 12:39 0:00 \_ /usr/bin/pgqd /home/postgres/pgq_ticker.ini postgres 70 0.0 0.0 433352 35660 ? S 12:39 0:00 /usr/lib/postgresql/13/bin/postgres -D /home/postgres/pgdata/pgroot/data --config-file=/home/postgres/pgdata/pgroot/data/postgresql.conf --listen_addresses=* --port=5432 --cluster_name=test-db --wal_level=replica --hot_standby=on --max_connections=100 --max_wal_senders=10 --max_prepared_transactions=0 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on postgres 72 0.0 0.0 433352 6884 ? Ss 12:39 0:00 \_ postgres: test-db: startup postgres 74 0.0 0.0 4636 824 ? S 12:39 0:00 | \_ sh -c envdir "/run/etc/wal-e.d/env" /scripts/restore_command.sh "00000003.history" "pg_wal/RECOVERYHISTORY" postgres 75 0.0 0.0 2613468 59624 ? Sl 12:39 0:00 | \_ wal-g wal-fetch 00000003.history pg_wal/RECOVERYHISTORY postgres 73 0.2 0.0 533208 17392 ? Ssl 12:39 0:01 \_ postgres: test-db: bg_mon ```

Does the restore_command produce any log?

Whatever the tool writes to stderr is redirected to the *.log files.

I was referring to the concrete restore_command in this issue's context, i.e. /scripts/restore_command.sh.

CyberDem0n commented 2 years ago

Show container log

I am talking about the Postgres logs.

It hangs in wal-g wal-fetch ...

See? It is not really Patroni/Spilo that is to blame. It seems that wal-g doesn't properly handle timeouts, or maybe doesn't have sane default values.

I was referring to the concrete restore_command in this issue's context, i.e. /scripts/restore_command.sh.

It only wraps wal-e/wal-g in this case.

haslersn commented 2 years ago

I am talking about the Postgres logs.

As stated in my post, I redirected the postgres logs to stderr, so they appear in the container log.

haslersn commented 2 years ago

When adding a timeout

--- a/postgres-appliance/scripts/restore_command.sh
+++ b/postgres-appliance/scripts/restore_command.sh
@@ -37,7 +37,7 @@ else
     POOL_SIZE=$WALG_DOWNLOAD_CONCURRENCY
 fi

-[[ "$USE_WALG_RESTORE" == "true" ]] && exec wal-g wal-fetch "${wal_filename}" "${wal_destination}"
+[[ "$USE_WALG_RESTORE" == "true" ]] && exec /usr/bin/timeout 60 wal-g wal-fetch "${wal_filename}" "${wal_destination}"

 [[ $POOL_SIZE -gt 8 ]] && POOL_SIZE=8

then PostgreSQL does start after 3 minutes, but only in read-only mode. This is with a single instance (no HA) that was shut down cleanly.
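
For context, my understanding of why the timeout helps (an illustration, not verified against wal-g): GNU timeout sends the child SIGTERM after 60 s and itself exits with status 124, so PostgreSQL sees a non-zero restore_command exit status and treats the requested WAL segment as unavailable instead of waiting forever.

```sh
# illustration of the exit status the patched restore_command produces on a hang
timeout 60 sleep 3600; echo "exit status: $?"   # prints: exit status: 124
```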

Update: After another 5 minutes (so 8 minutes in total) the instance becomes writable and at around that time there's the following log output:

test-db-0 postgres 2022-01-10 18:07:23.016 42 LOG {ticks: 0, maint: 0, retry: 0}
test-db-0 postgres 2022-01-10 18:07:23 UTC [685]: [1-1] 61dc75db.2ad 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [685]: [2-1] 61dc75db.2ad 0 db postgres [unknown] [local] LOG:  connection authorized: user=postgres database=db application_name=pgq ticker
test-db-0 postgres 2022-01-10 18:07:23 UTC [686]: [1-1] 61dc75db.2ae 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [687]: [1-1] 61dc75db.2af 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [686]: [2-1] 61dc75db.2ae 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
test-db-0 postgres 2022-01-10 18:07:23 UTC [687]: [2-1] 61dc75db.2af 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
test-db-0 postgres 2022-01-10 18:07:23 UTC [685]: [3-1] 61dc75db.2ad 0 db postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.003 user=postgres database=db host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [686]: [3-1] 61dc75db.2ae 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.003 user=postgres database=postgres host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [687]: [3-1] 61dc75db.2af 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.004 user=postgres database=postgres host=[local]
test-db-0 postgres 2022-01-10 18:07:23 UTC [495]: [1-1] 61dc74af.1ef 0     LOG:  checkpoint starting: force
test-db-0 postgres 2022-01-10 18:07:23 UTC [74]: [8-1] 61dc73fb.4a 0     LOG:  database system is ready to accept connections
test-db-0 postgres 2022-01-10 18:07:23 UTC [692]: [1-1] 61dc75db.2b4 0     LOG:  TimescaleDB background worker launcher connected to shared catalogs
test-db-0 postgres 2022-01-10 18:07:23 UTC [691]: [1-1] 61dc75db.2b3 0     LOG:  pg_cron scheduler started
test-db-0 postgres 2022-01-10 18:07:23 UTC [495]: [2-1] 61dc74af.1ef 0     LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.031 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
test-db-0 postgres INFO: 2022/01/10 18:07:24.130521 FILE PATH: 0000000A.history.lzma
test-db-0 postgres 2022-01-10 18:07:24 UTC [721]: [1-1] 61dc75dc.2d1 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
test-db-0 postgres 2022-01-10 18:07:24 UTC [721]: [2-1] 61dc75dc.2d1 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=Patroni
test-db-0 postgres 2022-01-10 18:07:24 UTC [495]: [3-1] 61dc74af.1ef 0     LOG:  checkpoint starting: immediate force wait
test-db-0 postgres 2022-01-10 18:07:24 UTC [495]: [4-1] 61dc74af.1ef 0     LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.027 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
test-db-0 postgres 2022-01-10 18:07:24 UTC [721]: [3-1] 61dc75dc.2d1 0 postgres postgres Patroni [local] LOG:  disconnection: session time: 0:00:00.030 user=postgres database=postgres host=[local]
test-db-0 postgres 2022-01-10 18:07:25,018 INFO: no action. I am (test-db-0) the leader with the lock
test-db-0 postgres 2022-01-10 18:07:25,199 INFO: no action. I am (test-db-0) the leader with the lock

haslersn commented 2 years ago

if the replica was down during significant time the primary most likely will miss some WAL files and therefore having restore_command is a must

Is there a way to configure PostgreSQL such that it first tries to stream WAL files from the primary and only uses the restore_command if necessary (e.g. if the primary is down or if WAL files are missing on the primary)?

haslersn commented 2 years ago

While you say that having a restore_command is a must, I'm currently re-evaluating the pros and cons of removing the restore_command in a setup with synchronous_mode_strict. (Also because in the PostgreSQL Slack I was advised to remove the restore_command.) It would be very helpful if you could elaborate a bit more on the following two of your earlier statements:

if there are no nodes running and we are starting up how can we know that this node is eligible to become the new primary? What if it is too much behind? It is barely possible to tell the current WAL position without performing full recovery in read-only mode.

I assume that, with synchronous_mode_strict, Patroni will only promote the previous sync standby to the new primary; is this correct? The sync standby should not be behind (in terms of committed transactions), right? Then this is not a concern.
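
One way to check which member currently holds that role (a sketch; the config path is the one Patroni is started with in the ps output above):

```sh
# with synchronous_mode enabled, `patronictl list` marks the current
# synchronous standby, i.e. the promotion candidate in strict mode
patronictl -c /home/postgres/postgres.yml list
```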

What if the node we start was the replica and there are no other nodes currently running? By starting up as a standby and replaying WAL files from the archive it can reach a position that would be acceptable to become the new leader.

If my assumption above is correct, then the only candidate for a new leader is the previous sync standby, which is always acceptable as the new leader.