Which image of the operator are you using? e.g. ghcr.io/zalando/postgres-operator:v1.12.2
ghcr.io/zalando/postgres-operator:v1.12.2
Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s]
Bare Metal K8s.
Are you running Postgres Operator in production? [yes | no]
Yes
Type of issue? [Bug report, question, feature request, etc.]
Bug Report / Request for Help
This is happening to me on v1.12.2. Since it is a FATAL error, it appears to be what makes my cluster fail every time I deploy (bare metal). I tried deleting the whole cluster, and the failure still reproduces consistently.
# kubectl logs -n postgres-operator postgres-cluster-1 | head -n 150
...
2024-07-31 11:06:52,579 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "standby" does not exist
2024-07-31 11:06:52,631 INFO: promoted self to leader by acquiring session lock
2024-07-31 11:06:52,678 INFO: Lock owner: postgres-cluster-1; I am postgres-cluster-1
2024-07-31 11:06:52,740 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "standby" does not exist...
...
The cluster's resources are sufficient (memory, CPU and storage).
+ Cluster: postgres-cluster (7397754784576811067) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------------+--------------+---------+---------+----+-----------+
| postgres-cluster-0 | 10.99.247.35 | Replica | stopped | | unknown |
| postgres-cluster-1 | 10.99.247.38 | Replica | stopped | | unknown |
| postgres-cluster-2 | 10.99.247.34 | Replica | stopped | | unknown |
+--------------------+--------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2024-08-01T14:11) [now]:
Are you sure you want to restart members postgres-cluster-0, postgres-cluster-1, postgres-cluster-2? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Failed: restart for member postgres-cluster-0, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-1, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-2, status code=503, (postgres is still starting)
The repeating sequence of ERROR and FATAL log lines is the kubectl logs excerpt above.
Failed attempted fixes:
kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl restart postgres-cluster
Its output is the member table and the three failed restarts (status code 503) shown above.
Please let me know what other information I could attach to help with understanding this issue.
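In case it helps with triage, here is a minimal sketch of how the repeating ERROR and FATAL lines could be collapsed into counts, so a longer log can be attached in condensed form. It assumes the log has been saved to a local file first; the helper name `summarize_errors` and the file name `patroni.log` are hypothetical and not part of Patroni or the operator.

```python
import re
from collections import Counter

# Patroni log lines start with a timestamp such as "2024-07-31 11:06:52,579".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def summarize_errors(log_text: str) -> Counter:
    """Count distinct lines containing ERROR or FATAL, ignoring timestamps."""
    counts = Counter()
    for line in log_text.splitlines():
        if "ERROR" not in line and "FATAL" not in line:
            continue
        # Strip the leading timestamp so repeated messages collapse together.
        counts[TIMESTAMP.sub("", line.strip())] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file, e.g. saved with:
    #   kubectl logs -n postgres-operator postgres-cluster-1 > patroni.log
    with open("patroni.log", encoding="utf-8") as fh:
        for message, count in summarize_errors(fh.read()).most_common():
            print(f"{count:5d}  {message}")
```

The match is a plain case-sensitive substring check, so traceback frames and INFO lines are skipped, while the final `psycopg2.OperationalError` line is kept because it contains `FATAL`.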