Open devopsenggineer opened 2 years ago
The Postgres Operator is not able to perform this activity here, although it does so perfectly in AKS and GKE clusters:
time="2022-06-06T15:11:06Z" level=debug msg="closing database connection" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=info msg="users have been successfully created" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=info msg="creating database \"fx\" owner \"fx_admin\"" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=warning msg="closing an existing connection before opening a new one to fx" cluster-name=default/fx-postgres pkg=cl>
time="2022-06-06T15:11:06Z" level=debug msg="closing database connection" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=debug msg="closing database connection" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=debug msg="syncing prepared database \"fx\"" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-06T15:11:06Z" level=info msg="creating database schema \"data\" owner \"fx_data_owner\"" cluster-name=default/fx-postgres pkg=cluster w>
These are the errors we get when setting it up in bare-metal clusters running Ubuntu in VirtualBox/cloud VMs:
time="2022-06-20T02:56:46Z" level=info msg="Create roles" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:56:47Z" level=warning msg="could not connect to Postgres database: dial tcp 195.201.199.239:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:57:02Z" level=warning msg="could not connect to Postgres database: dial tcp 195.201.199.239:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:57:16Z" level=warning msg="could not connect to Postgres database: dial tcp 195.201.199.239:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:57:32Z" level=warning msg="could not connect to Postgres database: dial tcp 195.201.199.239:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:57:46Z" level=warning msg="could not connect to Postgres database: dial tcp 85.10.194.207:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:58:01Z" level=warning msg="could not connect to Postgres database: dial tcp 85.10.194.207:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:58:16Z" level=warning msg="could not connect to Postgres database: dial tcp 85.10.194.207:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:58:31Z" level=warning msg="could not connect to Postgres database: dial tcp 85.10.194.207:5432: connect: connection refused" cluster-name=default/fx-postgres pkg=cluster worker=0
time="2022-06-20T02:58:31Z" level=error msg="could not create cluster: could not create users: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=default/fx-postgres pkg=controller worker=0
Why does this strange behavior occur? Could anyone help here?
I am getting this exact same issue on a bare metal cluster
I am having this same issue on k3d.
Same here!
@devopsenggineer I think the IP 85.10.194.207 is a kind of "default" answer returned by the DNS configured on your hosts.
To me, the issue is that the operator tries to resolve the full name, built as fmt.Sprintf("%s.%s.svc.%s", c.Name, c.Namespace, c.OpConfig.ClusterDomain), where it shouldn't. The ClusterDomain value is already handled by default by the search entry in the resolv.conf used in the containers. So if you try to resolve the "full name" directly (including the ClusterDomain suffix), the resolver will also use the search domain configured on the hosts (so the kube DNS will forward the query to the host nameservers) before actually resolving the name itself.
I would suggest removing the usage of OpConfig.ClusterDomain everywhere, but I don't know this operator well enough to judge the impact of such a change.
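The lookup order described above can be sketched in Go. This is a simplified model of how a resolv.conf-style resolver expands a name, not the operator's code: candidateNames is a hypothetical helper, the search list is an assumed pod resolv.conf (three cluster entries plus a host-inherited "domains" entry, as in the example further down the thread), and ndots is assumed to be 5, the usual value for pods using the cluster DNS.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames returns the fully qualified names a resolv.conf-style
// resolver would try, in order, for name with the given search list and
// ndots value. Simplified model: timeouts and retries are ignored.
func candidateNames(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as absolute, so the search list is skipped.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var out []string
	// With at least ndots dots, the name is tried as-is before the search list.
	asIsFirst := strings.Count(name, ".") >= ndots
	if asIsFirst {
		out = append(out, name+".")
	}
	for _, s := range search {
		out = append(out, name+"."+s+".")
	}
	// Otherwise the name is tried as-is only after every search entry.
	if !asIsFirst {
		out = append(out, name+".")
	}
	return out
}

func main() {
	// Assumed search list: cluster entries first, then the host's "domains".
	search := []string{
		"default.svc.cluster.local",
		"svc.cluster.local",
		"cluster.local",
		"domains",
	}
	name := "postgresql-cluster.default.svc.cluster.local"
	for _, n := range candidateNames(name, search, 5) {
		fmt.Println(n)
	}
}
```

Under these assumptions the name has only four dots, so every search entry is tried before the name itself. A resolver stops at the first candidate that returns an answer, so if the host search domain answers for arbitrary names, the real service name is never queried.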
The issue happens because of the search DOMAINS entry in your OS's /etc/resolv.conf, for example:
nameserver 127.0.0.53
options edns0 trust-ad
search DOMAINS
The resolver appends 'domains' to the postgresql-cluster.default.svc.cluster.local domain name:
# k exec -ti postgres-operator-6c5657ccd6-ldtv5 -- /bin/sh
/ $ nslookup postgresql-cluster.default.svc.cluster.local.domains
Server: 10.96.0.10
Address: 10.96.0.10:53
Non-authoritative answer:
Name: postgresql-cluster.default.svc.cluster.local.domains
Address: 3.64.163.50
That is why you get such strange resolution results.
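One quick way to spot this kind of misresolution is to check whether the address a lookup returns lies inside the cluster's service range. The sketch below assumes a service CIDR of 10.96.0.0/12 (the kubeadm default, which matches the 10.96.0.10 DNS server above, but your cluster may differ); inServiceRange is a hypothetical helper, not part of the operator.

```go
package main

import (
	"fmt"
	"net"
)

// inServiceRange reports whether ip lies inside the given service CIDR.
func inServiceRange(ip, cidr string) (bool, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	parsed := net.ParseIP(ip)
	if parsed == nil {
		return false, fmt.Errorf("invalid IP address %q", ip)
	}
	return network.Contains(parsed), nil
}

func main() {
	// Assumption: kubeadm's default service CIDR; adjust for your cluster.
	const serviceCIDR = "10.96.0.0/12"

	// The DNS server address and the addresses seen in this thread.
	for _, ip := range []string{"10.96.0.10", "3.64.163.50", "195.201.199.239", "85.10.194.207"} {
		ok, err := inServiceRange(ip, serviceCIDR)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-16s in %s: %v\n", ip, serviceCIDR, ok)
	}
}
```

An address outside the service range suggests the name was answered by an external nameserver rather than the cluster DNS. A headless service would resolve to pod IPs instead, so treat a negative result as a hint, not proof.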
Thanks @oneumyvakin, I solved this by modifying values.yaml for the Helm chart. The value I modified was cluster_domain, located under configKubernetes. After doing so, everything works as expected.
Postgres Operator unable to connect to Postgres database in a bare-metal cluster setup, running in VirtualBox/cloud VMs.
Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.8.1 Ans: registry.opensource.zalan.do/acid/postgres-operator:v1.8.0
Where do you run it - cloud or metal? Kubernetes or OpenShift? [AWS K8s | GCP ... | Bare Metal K8s] Ans: Bare-Metal K8s in virtualbox/cloud VMs
Are you running Postgres Operator in production? [yes | no] Ans: Yes
Type of issue? [Bug report, question, feature request, etc.] Ans: Question
I have deployed a highly available PostgreSQL cluster using the Postgres Operator, following the method described in the operator docs.
All pods of Postgres are running fine.
Postgres Pod Logs
Postgres-Operator Logs
We are getting the errors below in the Postgres Operator pod; they also appear in the logs above.
Because of the above error, the operator is not able to create the specified roles, DB user, and database (see the spec file below).
The same Postgres Operator spec works fine on the Azure, GCP, and AWS managed Kubernetes services, and even in a minikube cluster running locally on my laptop with the same NFS dynamic storage class.
But in the bare-metal k8s cluster, it fails as shown in the operator logs. Here are the steps we followed to set up the bare-metal k8s cluster:
Does the Zalando Postgres Operator work in a bare-metal cluster or not?