zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

password authentication failed for user "postgres" #1115

Closed grapemix closed 1 year ago

grapemix commented 4 years ago

Hi,

I have been stuck on a DB connection problem for a while. I have noticed a few issues with symptoms similar to mine, but none of their solutions works in my case. Can someone please point me in the right direction?

I have successfully installed the postgres-operator through Helm, created my cluster, and the cluster runs fine. But I cannot connect to the DB even from within the Kubernetes cluster, so I exec'd into the DB pod with kubectl exec -it pod/test-testdb-cluster-0 -- bash and connected to the DB with psql -U postgres. However, I don't see the role and database I specified in my YAML (testdbname and testdbowner).

                                    List of roles
 Role name  |                         Attributes                         | Member of 
------------+------------------------------------------------------------+-----------
 admin      | Create DB, Cannot login                                    | {}
 postgres   | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 robot_zmon | Cannot login                                               | {}
 standby    | Replication                                                | {}
 zalandos   | Create DB, Cannot login                                    | {}

                      List of relations
 Schema |          Name           |     Type      |  Owner   
--------+-------------------------+---------------+----------
 public | failed_authentication_0 | view          | postgres
 public | failed_authentication_1 | view          | postgres
 public | failed_authentication_2 | view          | postgres
 public | failed_authentication_3 | view          | postgres
 public | failed_authentication_4 | view          | postgres
 public | failed_authentication_5 | view          | postgres
 public | failed_authentication_6 | view          | postgres
 public | failed_authentication_7 | view          | postgres
 public | pg_auth_mon             | view          | postgres
 public | pg_stat_kcache          | view          | postgres
 public | pg_stat_kcache_detail   | view          | postgres
 public | pg_stat_statements      | view          | postgres
 public | postgres_log            | table         | postgres
 public | postgres_log_0          | foreign table | postgres
 public | postgres_log_1          | foreign table | postgres
 public | postgres_log_2          | foreign table | postgres
 public | postgres_log_3          | foreign table | postgres
 public | postgres_log_4          | foreign table | postgres
 public | postgres_log_5          | foreign table | postgres
 public | postgres_log_6          | foreign table | postgres
 public | postgres_log_7          | foreign table | postgres
(21 rows)
kubectl get all
NAME                         READY   STATUS    RESTARTS   AGE
pod/nginx-584d4f8b45-jwczj   1/1     Running   0          5d13h
pod/test-testdb-cluster-0    1/1     Running   0          5d13h

NAME                                 TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)        AGE
service/nginx                        LoadBalancer   172.17.154.178   192.168.1.247   80:30193/TCP   5d13h
service/test-testdb-cluster          ClusterIP      172.16.104.168   <none>          5432/TCP       5d13h
service/test-testdb-cluster-config   ClusterIP      None             <none>          <none>         5d13h
service/test-testdb-cluster-repl     ClusterIP      172.17.204.69    <none>          5432/TCP       5d13h

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx   1/1     1            1           5d13h

NAME                               DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-584d4f8b45   1         1         1       5d13h

NAME                                   READY   AGE
statefulset.apps/test-testdb-cluster   1/1     5d13h

Log from the cluster:

kubectl logs pod/test-testdb-cluster-0
2020-08-17 09:54:30,947 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2020-08-17 09:54:32,951 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2020-08-17 09:54:32,952 - bootstrapping - INFO - No meta-data available for this provider
2020-08-17 09:54:32,952 - bootstrapping - INFO - Looks like your running local
2020-08-17 09:54:32,981 - bootstrapping - INFO - Configuring bootstrap
2020-08-17 09:54:32,981 - bootstrapping - INFO - Configuring crontab
2020-08-17 09:54:32,982 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2020-08-17 09:54:32,982 - bootstrapping - INFO - Configuring pgbouncer
2020-08-17 09:54:32,982 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2020-08-17 09:54:32,982 - bootstrapping - INFO - Configuring wal-e
2020-08-17 09:54:32,982 - bootstrapping - INFO - Configuring pam-oauth2
2020-08-17 09:54:32,982 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping
2020-08-17 09:54:32,982 - bootstrapping - INFO - Configuring pgqd
2020-08-17 09:54:32,983 - bootstrapping - INFO - Configuring certificate
2020-08-17 09:54:32,983 - bootstrapping - INFO - Generating ssl certificate
2020-08-17 09:54:33,073 - bootstrapping - INFO - Configuring patroni
2020-08-17 09:54:33,081 - bootstrapping - INFO - Writing to file /home/postgres/postgres.yml
2020-08-17 09:54:33,081 - bootstrapping - INFO - Configuring log
2020-08-17 09:54:33,081 - bootstrapping - INFO - Configuring standby-cluster
2020-08-17 09:54:34,367 INFO: No PostgreSQL configuration items changed, nothing to reload.
2020-08-17 09:54:34,390 WARNING: Postgresql is not running.
2020-08-17 09:54:34,390 INFO: Lock owner: None; I am test-testdb-cluster-0
2020-08-17 09:54:34,394 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6861875284178866242
  Database cluster state: in production
  pg_control last modified: Mon Aug 17 09:45:25 2020
  Latest checkpoint location: 0/30C2A38
  Latest checkpoint's REDO location: 0/30C2A00
  Latest checkpoint's REDO WAL file: 000000020000000000000003
  Latest checkpoint's TimeLineID: 2
  Latest checkpoint's PrevTimeLineID: 2
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:882
  Latest checkpoint's NextOID: 32768
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 882
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Mon Aug 17 09:45:11 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 70a4642487623f0d3db3448c6cdccb40f243ae5656da8a85aa9f58c7992db5bc

2020-08-17 09:54:34,403 INFO: doing crash recovery in a single user mode
2020-08-17 09:54:37,807 WARNING: Postgresql is not running.
2020-08-17 09:54:37,807 INFO: Lock owner: None; I am test-testdb-cluster-0
2020-08-17 09:54:37,811 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6861875284178866242
  Database cluster state: shut down
  pg_control last modified: Mon Aug 17 09:54:37 2020
  Latest checkpoint location: 0/4000028
  Latest checkpoint's REDO location: 0/4000028
  Latest checkpoint's REDO WAL file: 000000020000000000000004
  Latest checkpoint's TimeLineID: 2
  Latest checkpoint's PrevTimeLineID: 2
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:882
  Latest checkpoint's NextOID: 32768
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Mon Aug 17 09:54:37 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 70a4642487623f0d3db3448c6cdccb40f243ae5656da8a85aa9f58c7992db5bc

2020-08-17 09:54:37,811 INFO: Lock owner: None; I am test-testdb-cluster-0
2020-08-17 09:54:37,823 INFO: Lock owner: None; I am test-testdb-cluster-0
2020-08-17 09:54:37,824 INFO: starting as a secondary
2020-08-17 09:54:38,050 INFO: postmaster pid=61
/var/run/postgresql:5432 - no response
2020-08-17 09:54:38 UTC [61]: [1-1] 5f3a53de.3d 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-08-17 09:54:38 UTC [61]: [2-1] 5f3a53de.3d 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2020-08-17 09:54:38 UTC [61]: [3-1] 5f3a53de.3d 0     LOG:  starting PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
2020-08-17 09:54:38 UTC [61]: [4-1] 5f3a53de.3d 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
2020-08-17 09:54:38 UTC [61]: [5-1] 5f3a53de.3d 0     LOG:  listening on IPv6 address "::", port 5432
2020-08-17 09:54:38 UTC [61]: [6-1] 5f3a53de.3d 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-08-17 09:54:38 UTC [61]: [7-1] 5f3a53de.3d 0     LOG:  redirecting log output to logging collector process
2020-08-17 09:54:38 UTC [61]: [8-1] 5f3a53de.3d 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2020-08-17 09:54:39,091 INFO: establishing a new patroni connection to the postgres cluster
2020-08-17 09:54:39,142 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2020-08-17 09:54:39,152 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 685, in get_local_timeline_lsn_from_replication_connection
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 679, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  password authentication failed for user "standby"
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off

2020-08-17 09:54:39,164 INFO: promoted self to leader by acquiring session lock
server promoting
2020-08-17 09:54:39,169 INFO: cleared rewind state after becoming the leader
2020-08-17 09:54:40,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:54:40,185 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:54:40,247 INFO: no action.  i am the leader with the lock
SET
DO
DO
DO
NOTICE:  extension "pg_auth_mon" already exists, skipping
CREATE EXTENSION
NOTICE:  version "1.0" of extension "pg_auth_mon" is already installed
ALTER EXTENSION
GRANT
NOTICE:  extension "pg_cron" already exists, skipping
CREATE EXTENSION
NOTICE:  version "1.2" of extension "pg_cron" is already installed
ALTER EXTENSION
ALTER POLICY
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
GRANT
NOTICE:  extension "file_fdw" already exists, skipping
CREATE EXTENSION
DO
NOTICE:  relation "postgres_log" already exists, skipping
CREATE TABLE
GRANT
NOTICE:  relation "postgres_log_0" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_1" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_2" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_3" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_4" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_5" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_6" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
NOTICE:  relation "postgres_log_7" already exists, skipping
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
RESET
SET
NOTICE:  drop cascades to 3 other objects
DETAIL:  drop cascades to type zmon_utils.system_information
drop cascades to function zmon_utils.get_database_cluster_information()
drop cascades to function zmon_utils.get_database_cluster_system_information()
DROP SCHEMA
NOTICE:  extension "plpython3u" already exists, skipping
DO
NOTICE:  language "plpythonu" does not exist, skipping
DROP LANGUAGE
NOTICE:  function plpython_call_handler() does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_inline_handler(internal) does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_validator(oid) does not exist, skipping
DROP FUNCTION
CREATE SCHEMA
GRANT
SET
CREATE TYPE
CREATE FUNCTION
CREATE FUNCTION
GRANT
You are now connected to database "postgres" as user "postgres".
NOTICE:  schema "user_management" already exists, skipping
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
NOTICE:  extension "pg_stat_statements" already exists, skipping
CREATE EXTENSION
NOTICE:  extension "pg_stat_kcache" already exists, skipping
CREATE EXTENSION
NOTICE:  extension "set_user" already exists, skipping
CREATE EXTENSION
NOTICE:  version "1.6" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
NOTICE:  schema "metric_helpers" already exists, skipping
CREATE SCHEMA
GRANT
GRANT
SET
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
RESET
You are now connected to database "template1" as user "postgres".
NOTICE:  schema "user_management" already exists, skipping
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
NOTICE:  extension "pg_stat_statements" already exists, skipping
CREATE EXTENSION
NOTICE:  extension "pg_stat_kcache" already exists, skipping
CREATE EXTENSION
NOTICE:  extension "set_user" already exists, skipping
CREATE EXTENSION
NOTICE:  version "1.6" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
NOTICE:  schema "metric_helpers" already exists, skipping
CREATE SCHEMA
GRANT
GRANT
SET
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
RESET
2020-08-17 09:54:50,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:54:50,234 INFO: no action.  i am the leader with the lock
2020-08-17 09:55:00,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:55:00,233 INFO: no action.  i am the leader with the lock
2020-08-17 09:55:10,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:55:10,192 INFO: no action.  i am the leader with the lock
2020-08-17 09:55:20,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:55:20,184 INFO: no action.  i am the leader with the lock
2020-08-17 09:55:30,179 INFO: Lock owner: test-testdb-cluster-0; I am test-testdb-cluster-0
2020-08-17 09:55:30,226 INFO: no action.  i am the leader with the lock

My yaml:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: test-testdb-cluster
  namespace: test
spec:
  teamId: "test"
  volume:
    size: 1Gi
  numberOfInstances: 1
  enableMasterLoadBalancer: false
  enableReplicaLoadBalancer: false
  users:
    # database owner
    testdbowner:
    - superuser
    - createdb
    testdb2owner:
    - createdb

    # role for application foo
    # foo_user: # or 'foo_user: []'

  #databases: name->owner
  databases:
    testdbname: testdbowner
    testdb2name: testdb2owner
  preparedDatabases:
    testdbname:
      extensions:
        postgis: data
  postgresql:
    version: "12"

Helm 3 installation command:

helm install postgres --namespace postgres charts/postgres-operator --set configKubernetes.cluster_domain=MY_DOMAIN --set configLoadBalancer.db_hosted_zone=db.MY_DOMAIN  -f ./charts/postgres-operator/values-crd.yaml

I have tried 1.4.0, 1.5.0, and the master branch/tag. I have also tried installing the chart without values-crd.yaml, but none of them works. Any ideas?

Thanks a lot.

Jan-M commented 4 years ago

You posted a lot of information, yet the Postgres operator logs are missing.

Also, the Postgres logs would be worth checking for error messages related to user creation and authentication.

Users are stored in K8s secrets; can you list the secrets labeled (by default) with application=spilo? It is odd to see the "standby" user error in the pod logs.
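
For example, something along these lines should list them and show the generated postgres password (illustrative only; the namespace and secret name are assumed from the cluster manifest and the naming pattern visible later in this thread):

kubectl get secrets -n test -l application=spilo
kubectl get secret postgres.test-testdb-cluster.credentials.postgresql.acid.zalan.do -n test \
  -o 'jsonpath={.data.password}' | base64 -d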

grapemix commented 4 years ago

Thanks for replying and pointing me in the right debugging direction. I didn't know there were logs in the pg operator pod. In short, I fixed a misconfiguration problem in the postgres operator, but then I hit the password authentication failure listed below.

The first problem I experienced was a cluster domain mismatch between configKubernetes.cluster_domain and kube-dns.

time="2020-08-30T22:16:40Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"test-testdb-cluster\", UID:\"e438c66b-dabc-4ecb-ab95-a23839cc23ab\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3939103\", FieldPath:\"\"}): type: 'Normal' reason: 'StatefulSet' Pods are ready"
time="2020-08-30T22:16:41Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:16:56Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:17:10Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:17:25Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:17:41Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.phoenix on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:17:56Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:18:10Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:18:26Z" level=error msg="could not connect to PostgreSQL database: dial tcp: lookup test-testdb-cluster.test.svc.MY_DOMAIN on 172.16.0.10:53: no such host" cluster-name=test/test-testdb-cluster pkg=cluster worker=0
time="2020-08-30T22:18:26Z" level=error msg="could not create cluster: could not create users: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=test/test-testdb-cluster pkg=controller worker=0

I found a similar issue to this; after syncing up the cluster domain name, this problem went away.
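
For reference, one way to check which domain kube-dns is actually serving (assuming CoreDNS is in use) is to look at its Corefile and compare the domain with configKubernetes.cluster_domain:

kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' | grep kubernetes
# the domain on the "kubernetes ..." line (default: cluster.local) must match
# the cluster_domain passed to the operator chart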

Unfortunately, I hit another problem: password authentication failure.

time="2020-08-31T01:16:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'Create' Started creation of new cluster resources"
time="2020-08-31T01:16:31Z" level=info msg="endpoint \"test/acid-minimal-cluster\" has been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'Endpoints' Endpoint \"test/acid-minimal-cluster\" has been successfully created"
time="2020-08-31T01:16:32Z" level=info msg="master service \"test/acid-minimal-cluster\" has been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:32Z" level=debug msg="No load balancer created for the replica service" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:32Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'Services' The service \"test/acid-minimal-cluster\" for role master has been successfully created"
time="2020-08-31T01:16:32Z" level=info msg="replica service \"test/acid-minimal-cluster-repl\" has been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:32Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'Services' The service \"test/acid-minimal-cluster-repl\" for role replica has been successfully created"
time="2020-08-31T01:16:32Z" level=debug msg="team API is disabled, returning empty list of members for team \"acid\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:32Z" level=info msg="users have been initialized" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:32Z" level=debug msg="created new secret \"test/postgres.acid-minimal-cluster.credentials.postgresql.acid.zalan.do\", uid: \"872a575c-80b2-488e-a160-9f847b19bf51\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=debug msg="created new secret \"test/standby.acid-minimal-cluster.credentials.postgresql.acid.zalan.do\", uid: \"34029784-96e5-4e33-afb3-ed05882142ab\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=debug msg="created new secret \"test/foo-user.acid-minimal-cluster.credentials.postgresql.acid.zalan.do\", uid: \"3e36273b-913f-4c1c-bf1e-0d998359a808\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=debug msg="created new secret \"test/zalando.acid-minimal-cluster.credentials.postgresql.acid.zalan.do\", uid: \"35dbcfd7-ad4f-4326-b3e2-4b2e02eca9f5\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=info msg="secrets have been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'Secrets' The secrets have been successfully created"
time="2020-08-31T01:16:33Z" level=info msg="pod disruption budget \"test/postgres-acid-minimal-cluster-pdb\" has been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=debug msg="Generating Spilo container, environment variables: [{SCOPE acid-minimal-cluster nil} {PGROOT /home/postgres/pgdata/pgroot nil} {POD_IP  &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {POD_NAMESPACE  &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {PGUSER_SUPERUSER postgres nil} {KUBERNETES_SCOPE_LABEL cluster-name nil} {KUBERNETES_ROLE_LABEL spilo-role nil} {PGPASSWORD_SUPERUSER  &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:postgres.acid-minimal-cluster.credentials.postgresql.acid.zalan.do,},Key:password,Optional:nil,},}} {PGUSER_STANDBY standby nil} {PGPASSWORD_STANDBY  &EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:&SecretKeySelector{LocalObjectReference:LocalObjectReference{Name:standby.acid-minimal-cluster.credentials.postgresql.acid.zalan.do,},Key:password,Optional:nil,},}} {PAM_OAUTH2  nil} {HUMAN_ROLE zalandos nil} {KUBERNETES_LABELS {\"application\":\"spilo\"} nil} {SPILO_CONFIGURATION {\"postgresql\":{\"bin_dir\":\"/usr/lib/postgresql/12/bin\"},\"bootstrap\":{\"initdb\":[{\"auth-host\":\"md5\"},{\"auth-local\":\"trust\"}],\"users\":{\"zalandos\":{\"password\":\"\",\"options\":[\"CREATEDB\",\"NOLOGIN\"]}},\"dcs\":{}}} nil} {DCS_ENABLE_KUBERNETES_API true nil}]" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=debug msg="created new statefulset \"test/acid-minimal-cluster\", uid: \"e8cafed9-6f93-4963-b03e-4118731bac58\"" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=info msg="statefulset \"test/acid-minimal-cluster\" has been successfully created" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=info msg="waiting for the cluster being ready" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:33Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'StatefulSet' Statefulset \"test/acid-minimal-cluster\" has been successfully created"
time="2020-08-31T01:16:36Z" level=debug msg="Waiting for 1 pods to become ready" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:45Z" level=info msg="pods are ready" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:45Z" level=info msg="Create roles" cluster-name=test/acid-minimal-cluster pkg=cluster worker=1
time="2020-08-31T01:16:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981847\", FieldPath:\"\"}): type: 'Normal' reason: 'StatefulSet' Pods are ready"
time="2020-08-31T01:16:45Z" level=error msg="could not create cluster: could not create users: could not init db connection: could not init db connection: pq: password authentication failed for user \"postgres\"" cluster-name=test/acid-minimal-cluster pkg=controller worker=1
time="2020-08-31T01:16:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"postgresql\", Namespace:\"test\", Name:\"acid-minimal-cluster\", UID:\"698d1fc8-d34c-49c0-8f69-ffe37bb98890\", APIVersion:\"acid.zalan.do/v1\", ResourceVersion:\"3981959\", FieldPath:\"\"}): type: 'Warning' reason: 'Create' could not create cluster: could not create users: could not init db connection: could not init db connection: pq: password authentication failed for user \"postgres\""

I have tried my setup as well as the setup I grabbed from the docs:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  volume:
    size: 1Gi
  numberOfInstances: 1
  users:
    # database owner
    zalando:
    - superuser
    - createdb

    # role for application foo
    foo_user: # or 'foo_user: []'

  #databases: name->owner
  databases:
    foo: zalando
  postgresql:
    version: "12"
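
For completeness, this is roughly how the generated credentials can be tested from inside the cluster (a sketch only; the secret name pattern is taken from the operator logs above, and the master service name is the cluster name):

export PGPASSWORD=$(kubectl get secret -n test zalando.acid-minimal-cluster.credentials.postgresql.acid.zalan.do \
  -o 'jsonpath={.data.password}' | base64 -d)
kubectl run psql-client --rm -it --restart=Never --image=postgres:12 -n test \
  --env="PGPASSWORD=$PGPASSWORD" -- psql -h acid-minimal-cluster -U zalando -d foo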

Any ideas? Thanks.

kkendzia commented 3 years ago

Well, we have the same problem and I probably figured out why - at least for our case.

We use Longhorn for storage provisioning and want to keep the data no matter what and reuse volumes, so we can shut down the whole cluster and set it up again. The users get created the first time in our database, and a secret postgres.project-postgres.credentials.postgresql.acid.zalan.do is provided whose password matches the entry in pg_shadow. When we restart the whole cluster/namespace, the "old" password of postgres (and of course of standby and all other roles) is still there because of the reused volume, but a new password is generated and provided as a new secret. What's missing here (I think) is that initdb should alter the user password when the user already exists. Does this make sense? https://github.com/zalando/postgres-operator/blob/692c721854e4667102778e8e69e3dd12e47984b5/pkg/util/users/users.go#L51 should handle this (it should also be called with EventSync at cluster startup).

Can you give me some advice, @Jan-M?

kkendzia commented 3 years ago

So yeah, looking into the operator logs, we also see this:

time="2021-02-09T04:08:30Z" level=debug msg="syncing roles" cluster-name=artiscluster-dev/artis-postgres pkg=cluster worker=0
time="2021-02-09T04:08:30Z" level=warning msg="error while syncing cluster state: could not sync roles: could not init db connection: could not init db connection: pq: password authentication failed for user \"postgres\"" cluster-name=artiscluster-dev/artis-postgres pkg=cluster worker=0
time="2021-02-09T04:08:30Z" level=error msg="could not sync cluster: could not sync roles: could not init db connection: could not init db connection: pq: password authentication failed for user \"postgres\"" cluster-name=artiscluster-dev/artis-postgres pkg=controller worker=0

After restarting, the newly generated postgres user password obviously doesn't match the one saved in the database, so the operator cannot change anything else.
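
A quick way to confirm that mismatch (a sketch, using the default secret name pattern and the cluster/namespace names from the logs above) is to try logging in over TCP with exactly the password stored in the secret:

PGPASSWORD=$(kubectl get secret postgres.artis-postgres.credentials.postgresql.acid.zalan.do \
  -n artiscluster-dev -o 'jsonpath={.data.password}' | base64 -d)
kubectl exec -it artis-postgres-0 -n artiscluster-dev -- \
  env PGPASSWORD="$PGPASSWORD" psql -h localhost -U postgres -c 'SELECT 1'
# if this fails with "password authentication failed", the secret and the password
# stored in the database are out of sync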

Starefossen commented 3 years ago

Ran into this issue as well. I recreated the postgresql cluster with an existing volume. The new cluster started up, but the postgres operator is failing with the following error log: pq: password authentication failed for user "postgres"

Recovery procedure

  1. get the new postgres user password:
export NAME=my-db
export NAMESPACE=my-ns

kubectl get secrets -o yaml postgres.${NAME}.credentials.postgresql.acid.zalan.do -n ${NAMESPACE} \
  | grep password \
  | head -n 1 \
  | awk '{ print $2 }' \
  | base64 --decode
  2. exec into the database pod and change the password manually:
kubectl exec -it ${NAME}-0 -- /bin/bash

$ psql postgres postgres
postgres=# \password
-- type in the password from step 1
  3. restart the postgres-operator:
kubectl delete pod -n postgres-operator postgres-operator-xxx-yyy
kubectl logs -n postgres-operator svc/postgres-operator -f
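
The same reset can also be done non-interactively, which is handy when scripting it (a sketch; not verified in every setup):

NEW_PASSWORD=$(kubectl get secret postgres.${NAME}.credentials.postgresql.acid.zalan.do -n ${NAMESPACE} \
  -o 'jsonpath={.data.password}' | base64 --decode)
kubectl exec -it ${NAME}-0 -n ${NAMESPACE} -- \
  psql -U postgres -c "ALTER USER postgres WITH PASSWORD '${NEW_PASSWORD}'"
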
FxKu commented 3 years ago

or the other way around:

  1. exec into database pod
  2. env | grep PASS
  3. echo -n "grepped-password" | base64
  4. update secret with base64 encoded password
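
Roughly, step 4 can be done with kubectl patch (a sketch, assuming the default secret name pattern):

# inside the pod: env | grep PGPASSWORD_SUPERUSER, then copy that value
ENCODED=$(echo -n "grepped-password" | base64)
kubectl patch secret postgres.${NAME}.credentials.postgresql.acid.zalan.do -n ${NAMESPACE} \
  -p "{\"data\":{\"password\":\"${ENCODED}\"}}"
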
anthr76 commented 3 years ago

Any updates on the issue not requiring manual intervention?

avikjis27 commented 3 years ago

Any update on this?

fritz-net commented 2 years ago

This seems to be the same issue as https://github.com/zalando/postgres-operator/issues/524, but with a different reason for it to happen.

The manual workaround by @Starefossen is good to know, but it is not a solution for my use case. I have this running in a CI environment, so it has to work without manual intervention.

microyahoo commented 2 years ago

I have hit the same issue, any update on this? @Jan-M @FxKu postgres-operator.tar.gz

microyahoo commented 2 years ago

Ran into this issue as well. I recreated the postgresql cluster with an existing volume. The new cluster started up, but the postgres operator is failing with the following error log: pq: password authentication failed for user "postgres"

Recovery procedure

  1. get the new postgres user password:
export NAME=my-db
export NAMESPACE=my-ns

kubectl get secrets -o yaml postgres.${NAME}.credentials.postgresql.acid.zalan.do -n ${NAMESPACE} \
  | grep password \
  | head -n 1 \
  | awk '{ print $2 }' \
  | base64 --decode
  2. exec into the database pod and change the password manually:
kubectl exec -it ${NAME}-0 -- /bin/bash

$ psql postgres postgres
postgres=# \password
-- type in the password from step 1
  3. restart the postgres-operator:
kubectl delete pod -n postgres-operator postgres-operator-xxx-yyy
kubectl logs -n postgres-operator svc/postgres-operator -f

I have followed this guide, but replication between the master and the replicas is broken, as follows:

2022-03-02 14:57:00 UTC [60]: [4-1] 621f85bc.3c 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2022-03-02 14:57:00,244 INFO: establishing a new patroni connection to the postgres cluster
2022-03-02 14:57:00,327 WARNING: Request failed to acid-my-postgres-cluster-1: GET http://10.244.2.118:8008/patroni (HTTPConnectionPool(host='10.244.2.118', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ac431ab00>: Failed to establish a new connection: [Errno 111] Connection refused',)))
2022-03-02 14:57:00,327 WARNING: Request failed to acid-my-postgres-cluster-2: GET http://10.244.1.117:8008/patroni (HTTPConnectionPool(host='10.244.1.117', port=8008): Max retries exceeded with url: /patroni (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ac431afd0>: Failed to establish a new connection: [Errno 111] Connection refused',)))
2022-03-02 14:57:00,470 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2022-03-02 14:57:00,482 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 821, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 816, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    conn = psycopg2.connect(**kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  password authentication failed for user "standby"
connection to server at "localhost" (127.0.0.1), port 5432 failed: FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", no encryption

2022-03-02 14:57:00,534 INFO: promoted self to leader by acquiring session lock
server promoting
2022-03-02 14:57:00,541 INFO: cleared rewind state after becoming the leader
root@acid-my-postgres-cluster-1:/home/postgres# patronictl list
+ Cluster: acid-my-postgres-cluster (7070449311109967941) ------+----+-----------+
| Member                     | Host         | Role    | State   | TL | Lag in MB |
+----------------------------+--------------+---------+---------+----+-----------+
| acid-my-postgres-cluster-0 | 10.244.0.61  | Leader  | running |  4 |           |
| acid-my-postgres-cluster-1 | 10.244.2.120 | Replica | running |    |        49 |
| acid-my-postgres-cluster-2 | 10.244.1.118 | Replica | running |    |        17 |
+----------------------------+--------------+---------+---------+----+-----------+
GRomR1 commented 2 years ago

I have got the same issue for the standby user. Replication doesn't work with the password stored in the Secret.

psycopg2.OperationalError: connection to server at "10.220.14.88", port 5432 failed: FATAL:  password authentication failed for user "standby"
connection to server at "10.220.14.88", port 5432 failed: FATAL:  password authentication failed for user "standby"

Only when I change the password through psql on the master node does it start working.

export PGPASSWORD=$(kubectl get secret postgres.acid-minimal-cluster.credentials.standby.acid.zalan.do -o 'jsonpath={.data.password}' | base64 -d)
...
ALTER USER standby WITH PASSWORD '$PGPASSWORD';
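
Spelled out end to end, that workaround looks roughly like this (a sketch; the secret name follows the default pattern seen in the operator logs earlier in this thread and may differ in your setup):

PGPASSWORD_STANDBY=$(kubectl get secret standby.acid-minimal-cluster.credentials.postgresql.acid.zalan.do \
  -o 'jsonpath={.data.password}' | base64 -d)
kubectl exec -it acid-minimal-cluster-0 -- \
  psql -U postgres -c "ALTER USER standby WITH PASSWORD '${PGPASSWORD_STANDBY}'"
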
FxKu commented 2 years ago

When facing connection issues with the standby, you need to copy the secrets of the source cluster.
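
For example, for a standby cluster in a different namespace, carrying over the standby password could look like this (illustrative names only; both secret names are assumed to follow the default pattern, and the base64 value is copied as-is):

SRC_PASSWORD=$(kubectl get secret standby.acid-source-cluster.credentials.postgresql.acid.zalan.do \
  -n source-ns -o 'jsonpath={.data.password}')
kubectl patch secret standby.acid-standby-cluster.credentials.postgresql.acid.zalan.do \
  -n standby-ns -p "{\"data\":{\"password\":\"${SRC_PASSWORD}\"}}"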

Dmiakishev commented 2 years ago

When facing connection issues with the standby, you need to copy the secrets of the source cluster.

For example, I want to use variant b; is it possible to specify this in the operator settings, and if so, in which section?

grapemix commented 1 year ago

My problem has been resolved by upgrading the package. Thanks everyone for your input.

P.S.: I should have closed this ticket sooner. My bad.

zekicaneksi commented 1 year ago

I don't know much about GitHub, so I'm not sure whether I should open a new issue, but I had the exact same problem with syncing after creation.

This resolved my issue: https://github.com/zalando/postgres-operator/issues/1115#issuecomment-786204248

Is this still the way to do it, manually? Or is there a better way?

Thank you