zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.35k stars 980 forks

Major version upgrade fails intermittently #1460

Open bchrobot opened 3 years ago

bchrobot commented 3 years ago

Please answer some short questions which should help us to understand your problem / question better.

Relevant operator config and logs are here: https://gist.github.com/bchrobot/78be1494857fb98f602557a7e0dc15d7

We have a number of production clusters running PG 12 and recently began updating them to PG 13 using the new major version upgrade feature in postgres-operator. Some of the upgrades have gone smoothly but others have gotten stuck. Attempts to then run the inplace_upgrade.py script manually as described here have resulted in a broken replica-replica state.

The issue seems to be that postgres-operator updates the pods with the new PG 13 environment variable but then runs inplace_upgrade.py on a replica rather than the master. This fails, but postgres-operator treats it as a success (or perhaps doesn't care, since it plans to retry later anyway) and kicks off a new base backup.

The replica-replica situation may be admin error due to running inplace_upgrade.py manually before the postgres-operator-initiated base backup completed. The logs in the linked gist are from an attempt this morning where I waited until that basebackup completed before running inplace_upgrade.py manually. This seems to have completed successfully without ending up in the replica-replica state.
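Since the failures above come from the upgrade being attempted on a replica, it may help to double-check which pod is actually the primary before running the script manually. A minimal sketch of that selection logic, assuming Spilo's `spilo-role` pod label marks the primary with `master` (the pod dicts below are made-up examples, not operator API objects):

```python
# Sketch only: pick the pod to run inplace_upgrade.py on.
# Assumes Spilo labels the Patroni primary with spilo-role=master.
def find_primary(pods):
    """Return the name of the first pod labeled as the primary, or None."""
    for pod in pods:
        if pod.get("labels", {}).get("spilo-role") == "master":
            return pod["name"]
    return None

# Hypothetical two-pod cluster:
pods = [
    {"name": "acid-test-0", "labels": {"spilo-role": "replica"}},
    {"name": "acid-test-1", "labels": {"spilo-role": "master"}},
]
print(find_primary(pods))  # acid-test-1
```

In practice the same check is just a label selector, e.g. listing pods filtered by cluster name and spilo-role=master.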

eskornev commented 3 years ago

I also cannot perform a rolling upgrade from version 12 to 13: PGVERSION changes accordingly, but the actual version inside the pod stays the same. Pods restart beginning with one of the replicas, then the other replica, and the ex-primary after them. The operator parameter major_version_upgrade_mode is set to "manual".
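For reference, this mode is set in the operator configuration. A sketch of the relevant fragment, assuming the OperatorConfiguration CRD layout (names of the metadata and the exact comments are illustrative):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-configuration
configuration:
  major_version_upgrade:
    # "off" disables in-place major upgrades; "manual" upgrades only when
    # the version in the cluster manifest is raised; "full" also upgrades
    # clusters running below the configured minimal major version
    major_version_upgrade_mode: "manual"
```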

stromvirvel commented 3 years ago

> Also cannot make rolling upgrade from 12 to 13 version, PGVERSION changes accordingly but actual version inside pod stays the same. Pods restarts beginning from one of replicas, than other replica, and ex-primary after them. Operator parameter major_version_upgrade_mode is "manual".

Seeing the exact same behaviour on AKS.

slavniyteo commented 2 years ago

Same issue with postgres-operator 1.7.1 when it tries to upgrade a postgresql cluster from v13 to v14. Every sync period (10m in my case) postgres-operator initiates the major version upgrade on a secondary (out of three pods: 1 primary + 2 secondaries) and consequently fails.

The content of last-upgrade.log is (timestamps removed for brevity):

inplace_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
inplace_upgrade WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
inplace_upgrade INFO: establishing a new patroni connection to the postgres cluster
inplace_upgrade ERROR: PostgreSQL is not running or in recovery

After that, postgres-operator creates an event "Upgrade from 130004 to 140000 finished" and after 10 minutes starts again from the beginning.

After I ran python3 /scripts/inplace_upgrade.py 2 manually on the master pod, the upgrade finished successfully.
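The "PostgreSQL is not running or in recovery" error above is the script refusing to upgrade a standby: on a replica, pg_is_in_recovery() returns true. A quick way to verify you are on the primary before a manual run is to check that value; a small sketch of the check, assuming the textual output of `psql -tAc "SELECT pg_is_in_recovery();"`:

```python
# Sketch: decide whether this pod is safe for a manual inplace_upgrade.py run.
# Assumes the raw output of: psql -tAc "SELECT pg_is_in_recovery();"
# which prints "f" on the primary and "t" on a replica.
def safe_to_upgrade(pg_is_in_recovery_output: str) -> bool:
    return pg_is_in_recovery_output.strip() == "f"

print(safe_to_upgrade("f\n"))  # True  -> primary, OK to run the script here
print(safe_to_upgrade("t\n"))  # False -> replica, do not run it here
```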