zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

major version upgrade failed #1689

Open monotek opened 2 years ago

monotek commented 2 years ago

We just tried to update the major postgres version of one of our patroni clusters.

For this we've updated the following fields in the CR:

spec.dockerImage: from: "registry.opensource.zalan.do/acid/spilo-12:1.6-p5" to "registry.opensource.zalan.do/acid/spilo-13:2.1-p1"

spec.postgresql.version: from "12" to "13"
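
For illustration, the same change expressed as a kubectl merge patch (cluster name and namespace taken from the operator logs below; a sketch of the change, not necessarily how we applied it):

kubectl -n auth patch postgresql auth-postgres-db --type merge \
  -p '{"spec": {"dockerImage": "registry.opensource.zalan.do/acid/spilo-13:2.1-p1", "postgresql": {"version": "13"}}}'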

At first everything looked fine, as the pods were successfully restarted, but the operator logs show that the upgrade fails:

time="2021-11-15T16:26:43Z" level=debug msg="making GET http request: http://172.25.78.171:8008/patroni" cluster-name=auth/auth-postgres-db pkg=cluster
time="2021-11-15T16:26:43Z" level=info msg="healthy cluster ready to upgrade, current: 120008 desired: 130000" cluster-name=auth/auth-postgres-db pkg=cluster
time="2021-11-15T16:26:43Z" level=info msg="triggering major version upgrade on pod auth-postgres-db-2 of 3 pods" cluster-name=auth/auth-postgres-db pkg=cluster
time="2021-11-15T16:26:45Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=auth/auth-postgres-db pkg=cluster
time="2021-11-15T16:26:45Z" level=info msg="cluster has been synced" cluster-name=auth/auth-postgres-db pkg=controller worker=0
time="2021-11-15T16:26:45Z" level=info msg="recieved add event for already existing Postgres cluster" cluster-name=auth/auth-postgres-db pkg=controller worker=0

What can we do to get the upgrade to run?

FxKu commented 2 years ago

Can you check the Postgres logs (not the pod logs) of the master where the upgrade was attempted? Sometimes an upgrade can fail if there is data corruption.

monotek commented 2 years ago

Imho the data is ok. At least the application using the database works without problems, and "patronictl list" doesn't report any problems either.

Where can I find the non-pod / PostgreSQL logs? Are they saved inside the container?

Edit: I've found some log and csv files in /pgdata/pgroot/pg_log. What should I look for in them? For me there is nothing obvious.

Is there any way I can trigger the upgrade from inside the container so I can follow the log directly?

FxKu commented 2 years ago

Yes, you find the Postgres logs under /pgdata/pgroot/pg_log (you can also use the $PGLOG shortcut). The .log files show the backup logs and the .csv files the Postgres stdout/stderr logs. You should see ERROR entries around the time of the upgrade. If the upgrade fails, you should also find the folder /home/postgres/pgdata/pgroot/data_upgrade with a pg_upgrade_server.log containing the upgrade and error logs. It could be, for example, that you have to remove certain stored procedures first - or, as I said, that there is some data corruption somewhere.
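
For example, from inside the pod the operator targeted (pod and namespace names taken from your logs above; just a sketch, adjust as needed):

kubectl -n auth exec -ti auth-postgres-db-2 -- bash

# CSV logs hold the Postgres stdout/stderr output - look for ERROR entries around the upgrade time
grep ',ERROR,' "$PGLOG"/postgresql-*.csv

# pg_upgrade logs only exist if an upgrade was attempted and failed
cat /home/postgres/pgdata/pgroot/data_upgrade/pg_upgrade_server.log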

And yes, you can run the upgrade script from within the master pod. See the admin docs.

monotek commented 2 years ago

Thanks for your help :) I wasn't able to find a pg_upgrade_server.log, but triggering the upgrade manually from the master worked :)

We have some more instances with the same problem. If I can help find out why it doesn't run automatically, let me know...

FxKu commented 2 years ago

If triggering the upgrade manually works, it should also have worked automatically. If you can't find the pg_upgrade log (I think it's only there right after the upgrade failed), you should still be able to find ERRORs in the postgresql-<1-7>.csv log files.

monotek commented 2 years ago

We have one instance left where we need to do the upgrade. I'll let you know if there is anything obvious in the logs once it's done.

Samusername commented 2 years ago

We got a similar error with two of our test DB clusters.

In logs of postgres-operator pod:

time="2021-12-13T08:25:03Z" level=info msg="triggering major version upgrade on pod acid-upgrade-test-1 of 2 pods" cluster-name=rdbms-pg/acid-upgrade-test pkg=cluster worker=1 time="2021-12-13T08:25:03Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=rdbms-pg/acid-upgrade-test pkg=cluster worker=1

There are probably no meaningful ERROR lines in the DB pod outputs.

The following folder does not exist:

/home/postgres/pgdata/pgroot/data_upgrade

postgres@acid-upgrade-test-1:~/pgdata/pgroot$ ls -ltrh
total 12K
drwxr-sr-x.  2 postgres postgres 4.0K Dec 13 08:23 pg_log
drwxr-sr-x.  2 postgres postgres 4.0K Dec 13 08:23 wal_fast
drwx--S---. 19 postgres postgres 4.0K Dec 13 08:24 data

postgres@acid-upgrade-test-1:~/pgdata/pgroot$ env | grep PGVE
PGVERSION=14

postgres@acid-upgrade-test-1:~/pgdata/pgroot$ cat data/PG_VERSION
13

We are trying to test with versions of postgres-operator and spilo from: https://github.com/zalando/postgres-operator/releases/tag/v1.7.1

Samusername commented 2 years ago

In our case, the following seems to be problematic: https://github.com/zalando/postgres-operator/blob/f9150aa6db85e6b7d18553f2d104dcb7b43d69dc/pkg/cluster/majorversionupgrade.go#L92

--> We tried to replace "/bin/su postgres -c" with "/bin/bash -c":

// result, err := c.ExecCommand(podName, "/bin/su", "postgres", "-c", upgradeCommand)
result, err := c.ExecCommand(podName, "/bin/bash", "-c", upgradeCommand)

With such a modification it seems to work, at least in a simple, empty DB cluster case. (To be tested more...)

How should it be done in a proper way?

RogierO commented 2 years ago

I have the same issue when doing an in-place upgrade from PostgreSQL version 12 to 13. When triggering the in-place upgrade manually on the primary/master, the upgrade seems to work fine.

time="2022-01-06T08:12:13Z" level=info msg="healthy cluster ready to upgrade, current: 130004 desired: 140000" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=info msg="triggering major version upgrade on pod dba-rogierdb-2 of 3 pods" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=info msg="cluster has been synced" cluster-name=dba-zalando-testdb/dba-testdb pkg=controller worker=0

Using the following images:

registry.opensource.zalan.do/acid/postgres-operator:v1.7.1
registry.opensource.zalan.do/acid/spilo-14:2.1-p3

CyberDem0n commented 2 years ago

Our upgrade scripts do the heavy lifting: executing pg_upgrade and upgrading replicas with rsync. In addition, the script takes care of very specific Spilo configuration. If your configuration is far off the standard one, the upgrade might fail.

Actually, there are many reasons why a major upgrade could fail. Some of them, for example, could be related to your database schema. And, as you may guess, our scripts will not mess with it.

There is only one recipe if the upgrade doesn't work automatically:

  1. Exec into the master pod as the postgres user: kubectl exec -ti my-pod-0 -- su postgres
  2. Call the upgrade script manually: python3 /scripts/inplace_upgrade.py <NUM>, where <NUM> is the number of pods in your cluster.
  3. Check the upgrade logs in /home/postgres/pgdata/pgroot/data_upgrade.
  4. Fix the reported issues and go back to step 2.

RogierO commented 2 years ago

In our case, the following seems to be problematic:

https://github.com/zalando/postgres-operator/blob/f9150aa6db85e6b7d18553f2d104dcb7b43d69dc/pkg/cluster/majorversionupgrade.go#L92

--> We tried to replace "/bin/su postgres -c" with "/bin/bash -c":

// result, err := c.ExecCommand(podName, "/bin/su", "postgres", "-c", upgradeCommand)
result, err := c.ExecCommand(podName, "/bin/bash", "-c", upgradeCommand)

With such a modification it seems to work, at least in a simple, empty DB cluster case. (To be tested more...)

How should it be done in a proper way?

I can confirm that making this adjustment to the operator works. With that adjustment, the operator starts the major upgrade automatically after the existing pods have been updated with the new StatefulSet.

CyberDem0n commented 2 years ago

Something is fishy. The backup script and pg_upgrade are supposed to be called as the postgres user. Do you happen to run the containers without root? If yes, then the fix would be a bit more complex and must distinguish two cases: when the container is running with root and when it is not.
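
Roughly, the two invocations would differ like this (script path and pod count taken from this thread; only a sketch, not the final fix):

# container runs as root (default Spilo setup): switch to the postgres user first
su postgres -c "python3 /scripts/inplace_upgrade.py 3"

# container already runs as a non-root postgres user: su will fail,
# so the script has to be invoked directly
python3 /scripts/inplace_upgrade.py 3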

RogierO commented 2 years ago

Something is fishy. The backup script and pg_upgrade are supposed to be called as the postgres user. Do you happen to run the containers without root? If yes, then the fix would be a bit more complex and must distinguish two cases: when the container is running with root and when it is not.

Yes, I'm running the operator as a non-root user, like how it's set up by default. My PostgreSQL cluster is simply the example minimal-postgres-manifest.yaml cluster definition.

CyberDem0n commented 2 years ago

Yes, I'm running the operator as a non-root user, like how it's set up by default.

I am talking about Spilo

RogierO commented 2 years ago

Yes, I'm running the operator as a non-root user, like how it's set up by default.

I am talking about Spilo

Within our organization we're not allowed to run containers as root. Therefore we've updated our postgresql cluster definition to run as a different user as follows:

apiVersion: "acid.zalan.do/v1"  
kind: postgresql  
metadata:  
  name: dba-testdb
spec:
  teamId: "dba"
  spiloRunAsUser: 101
  spiloRunAsGroup: 103
  spiloFSGroup: 103

Is this what makes the automatic major upgrade fail @CyberDem0n ?

neelasha-09 commented 2 years ago

@CyberDem0n @FxKu :

We are facing a similar issue while performing an upgrade from version "12" to "13". The images used are:

Version 12:

OPR: v1.6.2
Cluster: acid/spilo-13:2.0-p6

Version 13:

OPR: v1.7.1-14-g411abbe3
Cluster: acid/spilo-13:2.0-p6

We even tried with the latest cluster image acid/spilo-14:2.1-p3, but the upgrade still fails.

We could see the below error in the OPR logs:

time="2022-01-31T06:52:14Z" level=info msg="healthy cluster ready to upgrade, current: 120006 desired: 130000" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster
time="2022-01-31T06:52:14Z" level=info msg="triggering major version upgrade on pod postgres-operator-cluster-1 of 2 pods" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster
time="2022-01-31T06:52:14Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster

We checked for ERRORs in the .csv logs and could see the following:

2022-01-31 06:29:45.873 UTC,"postgres","postgres",171,"[local]",61f781d9.ab,43,"ALTER EXTENSION",2022-01-31 06:29:45 UTC,10/45,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:29:46.557 UTC,"postgres","a3s",190,"[local]",61f781da.be,43,"ALTER EXTENSION",2022-01-31 06:29:46 UTC,8/126,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:29:47.492 UTC,"postgres","template1",213,"[local]",61f781da.d5,43,"ALTER EXTENSION",2022-01-31 06:29:46 UTC,10/119,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:48.851 UTC,"postgres","postgres",159,"[local]",61f784e8.9f,5,"ALTER EXTENSION",2022-01-31 06:42:48 UTC,8/8,0,ERROR,22023,"extension ""pg_auth_mon"" has no update path from version ""1.1"" to version ""1.0""",,,,,,"ALTER EXTENSION pg_auth_mon UPDATE;",,,"psql"
2022-01-31 06:42:49.972 UTC,"postgres","postgres",199,"[local]",61f784e9.c7,43,"ALTER EXTENSION",2022-01-31 06:42:49 UTC,9/51,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:50.753 UTC,"postgres","a3s",222,"[local]",61f784ea.de,43,"ALTER EXTENSION",2022-01-31 06:42:50 UTC,10/43,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:51.556 UTC,"postgres","template1",237,"[local]",61f784eb.ed,43,"ALTER EXTENSION",2022-01-31 06:42:51 UTC,8/132,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
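
For reference, the installed and shipped extension versions (and whether an update path exists) can be inspected with something like the following; only a sketch, run as the postgres user on the master:

psql -d postgres -c "SELECT extname, extversion FROM pg_extension WHERE extname IN ('set_user', 'pg_auth_mon');"
psql -d postgres -c "SELECT name, default_version, installed_version FROM pg_available_extensions WHERE name IN ('set_user', 'pg_auth_mon');"
psql -d postgres -c "SELECT * FROM pg_extension_update_paths('set_user');"

The errors above suggest the installed versions (set_user 3.0, pg_auth_mon 1.1) are newer than the defaults shipped in the target image, so ALTER EXTENSION ... UPDATE finds no path.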

However, running the upgrade script manually works absolutely fine.

Please note: this is an issue in an OpenShift environment.

neelasha-09 commented 2 years ago

Thanks @RogierO for helping spot the issue :)