monotek opened this issue 2 years ago
Can you check Postgres logs (not pod logs) of the master where the upgrade was attempted? Sometimes an update can fail if there is data corruption.
Imho the data is ok. At least the application using the database works without problems, and "patronictl list" does not report any problems either.
Where can I find the non-pod / PostgreSQL logs? Are they saved inside the container?
Edit: I've found some log and csv files in /pgdata/pgroot/pg_log. What should I look for in them? Nothing obvious stands out to me.
Is there any way I can trigger the upgrade from inside the container so I can follow the log directly?
Yes, you find the Postgres logs under /pgdata/pgroot/pg_log (you can also use the $PGLOG shortcut). The .log files show the backup logs and the .csv files the Postgres stdout/stderr logs. You should see ERROR entries around the time of the upgrade. If the upgrade fails, you should also find the folder /home/postgres/pgdata/pgroot/data_upgrade with a pg_upgrade_server.log containing the upgrade and error logs. For example, it can be that you have to remove certain stored procedures first - or, as I said, that there is some data corruption somewhere.
And yes, you can run the upgrade script from within the master pod. See admin docs
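A rough sketch of those steps, based on the paths and commands mentioned in this thread (the pod name is a placeholder, <NUM> is the number of pods in the cluster):

```bash
# shell into the master pod as the postgres user (pod name is a placeholder)
kubectl exec -ti my-pod-0 -- su postgres

# look for errors around the time of the failed upgrade
grep ERROR "$PGLOG"/postgresql-*.csv
# only present right after a failed upgrade attempt
cat /home/postgres/pgdata/pgroot/data_upgrade/pg_upgrade_server.log

# trigger the major upgrade manually; <NUM> is the number of pods in the cluster
python3 /scripts/inplace_upgrade.py <NUM>
```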
Thanks for your help :) I wasn't able to find a pg_upgrade_server.log, but triggering the upgrade manually from the master worked :)
We have some more instances with the same problem. If I can help to find out why it doesn't work automatically, let me know...
If triggering the upgrade manually works, it should also have worked automatically. If you can't find the pg_upgrade log (I think it's only there right after the upgrade failed), you should still be able to find ERRORs in the postgresql-<1-7>.csv log files.
We have one instance left where we need to do the update. I'll let you know if there is anything obvious in the logs once the update is done.
We got a similar error with two of our test DB clusters.
In the logs of the postgres-operator pod:
time="2021-12-13T08:25:03Z" level=info msg="triggering major version upgrade on pod acid-upgrade-test-1 of 2 pods" cluster-name=rdbms-pg/acid-upgrade-test pkg=cluster worker=1 time="2021-12-13T08:25:03Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=rdbms-pg/acid-upgrade-test pkg=cluster worker=1
There are probably no meaningful ERROR lines in the DB pod output.
The following folder does not exist:
/home/postgres/pgdata/pgroot/data_upgrade
postgres@acid-upgrade-test-1:~/pgdata/pgroot$ ls -ltrh
total 12K
drwxr-sr-x. 2 postgres postgres 4.0K Dec 13 08:23 pg_log
drwxr-sr-x. 2 postgres postgres 4.0K Dec 13 08:23 wal_fast
drwx--S---. 19 postgres postgres 4.0K Dec 13 08:24 data
postgres@acid-upgrade-test-1:~/pgdata/pgroot$ env | grep PGVE
PGVERSION=14
postgres@acid-upgrade-test-1:~/pgdata/pgroot$ cat data/PG_VERSION
13
We are trying to test with versions of postgres-operator and spilo from: https://github.com/zalando/postgres-operator/releases/tag/v1.7.1
In our case, the following seems to be problematic: https://github.com/zalando/postgres-operator/blob/f9150aa6db85e6b7d18553f2d104dcb7b43d69dc/pkg/cluster/majorversionupgrade.go#L92
--> We tried to replace "/bin/su postgres -c" with "/bin/bash -c":
// result, err := c.ExecCommand(podName, "/bin/su", "postgres", "-c", upgradeCommand)
result, err := c.ExecCommand(podName, "/bin/bash", "-c", upgradeCommand)
With this modification it seems to work, at least for a simple, empty DB cluster (to be tested more...).
What would be the proper way to do this?
I have the same issue when doing an in-place upgrade from PostgreSQL version 12 to 13. When triggering the in-place upgrade manually on the primary/master, the upgrade seems to work fine.
time="2022-01-06T08:12:13Z" level=info msg="healthy cluster ready to upgrade, current: 130004 desired: 140000" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=info msg="triggering major version upgrade on pod dba-rogierdb-2 of 3 pods" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=dba-zalando-testdb/dba-testdb pkg=cluster time="2022-01-06T08:12:13Z" level=info msg="cluster has been synced" cluster-name=dba-zalando-testdb/dba-testdb pkg=controller worker=0
Using the following images:
registry.opensource.zalan.do/acid/postgres-operator:v1.7.1
registry.opensource.zalan.do/acid/spilo-14:2.1-p3
Our upgrade scripts are doing the heavy lifting: executing pg_upgrade and upgrading replicas with rsync. In addition to that, the script takes care of very specific Spilo configuration. If your configuration is far off the standard one, the upgrade might fail.
Actually, there are many reasons why a major upgrade could fail. Some of them, for example, could be related to your database schema. And, as you may guess, our scripts will not mess with it.
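For schema-related failures it can help to see what pg_upgrade itself complains about. A generic pre-flight check, independent of the Spilo scripts, would look roughly like this (version numbers and paths are placeholders, and the new data directory must already be initialized with the new version's initdb):

```bash
# run only pg_upgrade's consistency checks, without performing the upgrade
# (binary and data directory paths below are placeholders, not the exact Spilo layout)
/usr/lib/postgresql/14/bin/pg_upgrade \
  --old-bindir  /usr/lib/postgresql/13/bin \
  --new-bindir  /usr/lib/postgresql/14/bin \
  --old-datadir /home/postgres/pgdata/pgroot/data \
  --new-datadir /home/postgres/pgdata/pgroot/data_new \
  --check
```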
There is only one recipe if the upgrade doesn't work automatically:
kubectl exec -ti my-pod-0 -- su postgres
python3 /scripts/inplace_upgrade.py <NUM>
where <NUM> is the number of pods in your cluster.

In our case, the following seems to be problematic:
--> We tried to replace "/bin/su postgres -c" with "/bin/bash -c":
// result, err := c.ExecCommand(podName, "/bin/su", "postgres", "-c", upgradeCommand)
result, err := c.ExecCommand(podName, "/bin/bash", "-c", upgradeCommand)
With this modification it seems to work, at least for a simple, empty DB cluster (to be tested more...).
What would be the proper way to do this?
I can confirm that making this adjustment to the operator works. With that adjustment the operator will start doing the major upgrade automatically after the existing pods have been updated with the new statefulset.
Something is fishy. The backup script and pg_upgrade are supposed to be called as the postgres user.
Do you happen to run containers without root? If yes, then the fix would be a bit more complex and must distinguish two cases: when the container is running with root and without.
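A quick way to check which of the two cases applies (pod name is a placeholder):

```bash
# which UID does the Spilo container run as?
kubectl exec -ti my-pod-0 -- id -u
# 0        -> running as root: "su postgres -c ..." can work
# non-zero -> running as non-root (e.g. via spiloRunAsUser): "su" fails,
#             which matches the "/bin/bash -c" workaround above
```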
Something is fishy. The backup script and pg_upgrade are supposed to be called as the postgres user. Do you happen to run containers without root? If yes, then the fix would be a bit more complex and must distinguish two cases: when the container is running with root and without.
Yes, I'm running the operator as a non-root user. That's how it's set up by default. My PostgreSQL cluster is simply the example minimal-postgres-manifest.yaml cluster definition.
Yes, I'm running the operator as a non-root user. That's how it's set up by default.
I am talking about Spilo.
Yes, I'm running the operator as a non-root user. That's how it's set up by default.
I am talking about Spilo.
Within our organization we're not allowed to run containers as root. Therefore we've updated our PostgreSQL cluster definition to run as a different user, as follows:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: dba-testdb
spec:
  teamId: "dba"
  spiloRunAsUser: 101
  spiloRunAsGroup: 103
  spiloFSGroup: 103
Is this what makes the automatic major upgrade fail @CyberDem0n ?
@CyberDem0n @FxKu :
We are facing a similar issue while performing an upgrade from version "12" to "13". The images used are:
Version 12:
OPR: v1.6.2
Cluster: acid/spilo-13:2.0-p6
Version 13:
OPR: v1.7.1-14-g411abbe3
Cluster: acid/spilo-13:2.0-p6
We even tried with the latest cluster image acid/spilo-14:2.1-p3, but the upgrade still fails.
We could see the below error in the OPR logs:
time="2022-01-31T06:52:14Z" level=info msg="healthy cluster ready to upgrade, current: 120006 desired: 130000" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster
time="2022-01-31T06:52:14Z" level=info msg="triggering major version upgrade on pod postgres-operator-cluster-1 of 2 pods" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster
time="2022-01-31T06:52:14Z" level=error msg="major version upgrade failed: could not execute: command terminated with exit code 1" cluster-name=ocp-install1-postgres4/postgres-operator-cluster pkg=cluster
We tried to check for ERROR entries in the .csv logs and could see the below:
2022-01-31 06:29:45.873 UTC,"postgres","postgres",171,"[local]",61f781d9.ab,43,"ALTER EXTENSION",2022-01-31 06:29:45 UTC,10/45,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:29:46.557 UTC,"postgres","a3s",190,"[local]",61f781da.be,43,"ALTER EXTENSION",2022-01-31 06:29:46 UTC,8/126,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:29:47.492 UTC,"postgres","template1",213,"[local]",61f781da.d5,43,"ALTER EXTENSION",2022-01-31 06:29:46 UTC,10/119,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:48.851 UTC,"postgres","postgres",159,"[local]",61f784e8.9f,5,"ALTER EXTENSION",2022-01-31 06:42:48 UTC,8/8,0,ERROR,22023,"extension ""pg_auth_mon"" has no update path from version ""1.1"" to version ""1.0""",,,,,,"ALTER EXTENSION pg_auth_mon UPDATE;",,,"psql"
2022-01-31 06:42:49.972 UTC,"postgres","postgres",199,"[local]",61f784e9.c7,43,"ALTER EXTENSION",2022-01-31 06:42:49 UTC,9/51,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:50.753 UTC,"postgres","a3s",222,"[local]",61f784ea.de,43,"ALTER EXTENSION",2022-01-31 06:42:50 UTC,10/43,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
2022-01-31 06:42:51.556 UTC,"postgres","template1",237,"[local]",61f784eb.ed,43,"ALTER EXTENSION",2022-01-31 06:42:51 UTC,8/132,0,ERROR,22023,"extension ""set_user"" has no update path from version ""3.0"" to version ""2.0""",,,,,,"ALTER EXTENSION set_user UPDATE;",,,"psql"
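The ERRORs above mean that ALTER EXTENSION ... UPDATE cannot find an update path between the installed extension version and the version shipped in the image. One way to compare both sides, for example from inside the master pod (the database name is a placeholder):

```bash
# extension versions currently installed in the database
psql -d postgres -c "SELECT extname, extversion FROM pg_extension;"
# versions available in the image, i.e. valid ALTER EXTENSION targets
psql -d postgres -c "SELECT name, version FROM pg_available_extension_versions WHERE name IN ('set_user', 'pg_auth_mon');"
```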
However, running the upgrade script manually works absolutely fine.
Please note: this is in an OpenShift environment.
Thanks @RogierO for helping spot the issue :)
We just tried to update the major postgres version of one of our patroni clusters.
For this we've updated the following fields in the CR:
spec.dockerImage: from: "registry.opensource.zalan.do/acid/spilo-12:1.6-p5" to "registry.opensource.zalan.do/acid/spilo-13:2.1-p1"
spec.postgresql.version: from "12" to "13"
At first everything looked fine, as the pods were successfully restarted, but the operator logs show that it fails:
What can we do to get the upgrade to run?