Side info: I looked over issue #927, but I am not running any sidecars.
Having the same issue after upgrading from 1.6.0 to 1.6.1. The relevant log messages seem to be these:
time="2021-03-12T09:54:32Z" level=info msg="SYNC event has been queued" cluster-name=namespace/cluster-name pkg=controller worker=0
time="2021-03-12T09:54:32Z" level=info msg="there are 1 clusters running" pkg=controller
time="2021-03-12T09:54:32Z" level=info msg="syncing of the cluster started" cluster-name=namespace/cluster-name pkg=controller worker=0
[...]
time="2021-03-12T09:54:33Z" level=debug msg="set statefulset's rolling update annotation to false: caller/reason from cache" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="set statefulset's rolling update annotation to true: caller/reason statefulset changes" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=info msg="statefulset namespace/cluster-name is not in the desired state and needs to be updated" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- terminationMessagePath: /dev/termination-log," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- terminationMessagePolicy: File," cluster-name=namespace/namespace-db pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- restartPolicy: Always," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- dnsPolicy: ClusterFirst," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- serviceAccount: postgres-pod," cluster-name=namespace/namespace-db pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- schedulerName: default-scheduler," cluster-name=namespace/namespace-db pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- kind: PersistentVolumeClaim," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- apiVersion: v1," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- status: {" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- phase: Pending" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- }" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="+ status: {}" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- }," cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- revisionHistoryLimit: 10" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="+ }" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="metadata.annotation are different" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="- zalando-postgres-operator-rolling-update-required: false" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="+ zalando-postgres-operator-rolling-update-required: true" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=info msg="reason: new statefulset containers's postgres (index 0) security context does not match the current one" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="updating statefulset" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="patching statefulset annotations" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="patching statefulset annotations" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="calling Patroni API on a pod namespace/cluster-name-0 to set the following Postgres options: map[max_connections:300]" cluster-name=namespace/cluster-namepkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="making PATCH http request: http://10.56.7.44:8008/config" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=debug msg="performing rolling update" cluster-name=namespace/cluster-name pkg=cluster
time="2021-03-12T09:54:33Z" level=info msg="there are 2 pods in the cluster to recreate" cluster-name=namespace/cluster-name pkg=cluster
[...]
I think I figured out the issue, and it seems to have been introduced by https://github.com/zalando/postgres-operator/pull/1380.
With no capabilities set, the securityContext of the postgres container in my StatefulSet currently is
securityContext:
  allowPrivilegeEscalation: false
  capabilities: {}
  privileged: false
  readOnlyRootFilesystem: false
so I guess capabilities defaults to {}.
Now with #1380, generateCapabilities() was changed to return nil when there are no capabilities set, which then makes the check
newCheck("new statefulset %s's %s (index %d) security context does not match the current one",
	func(a, b v1.Container) bool { return !reflect.DeepEqual(a.SecurityContext, b.SecurityContext) }),
fail, because capabilities is {} in the cluster and nil in the definition generated by the operator.
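To illustrate the mismatch, here is a minimal sketch (my own, not the operator's code), assuming the k8s.io/api/core/v1 types:

package main

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// securityContext as stored in the running StatefulSet: capabilities is an
	// empty struct ({}), as in the YAML above.
	current := &v1.SecurityContext{Capabilities: &v1.Capabilities{}}

	// securityContext generated by the operator after #1380: generateCapabilities()
	// returns nil when no capabilities are configured.
	desired := &v1.SecurityContext{Capabilities: nil}

	// reflect.DeepEqual treats a nil pointer and a pointer to an empty struct as
	// different, so the security context check reports a mismatch on every sync.
	fmt.Println(reflect.DeepEqual(current, desired)) // false
}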
I think this is a critical issue, as everyone starting with or upgrading to 1.6.1 will end up with all database cluster nodes being restarted every ~30m.
To confirm my previous assumption, I set additional_pod_capabilities: "SYS_NICE", and now everything is back to normal.
So the current workaround for this issue is to set additional_pod_capabilities to have at least one capability; a minimal config sketch follows below.
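For reference, a sketch of that workaround in the chart values (the configKubernetes key is the same one used in a later comment in this thread; adjust to your setup):

configKubernetes:
  additional_pod_capabilities:
    - "SYS_NICE"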
Hello,
I'm testing with OpenShift and I get this error:
create Pod test-db-0 in StatefulSet tedial-astdb failed error: pods "test-db-0" is forbidden: unable to validate against any security context constraint: [capabilities.add: Invalid value: "SYS_NICE": capability may not be added]
Any idea?
@jamorales85 in this case SYS_NICE is not allowed in your infrastructure. In our case it's added to a PodSecurityPolicy which we use. You could go back to v1.6.0 or use this image: v1.6.1-2-gca968ca1, which contains the fix for empty capabilities.
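For example, a sketch of pinning that image through the chart values (assuming the same image.tag key used elsewhere in this thread):

image:
  tag: v1.6.1-2-gca968ca1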
Duplicate of #1377, and #1380 actually fixes this (sorry, I was under the impression that #1380 got into 1.6.1).
@gruppferi I guess you can close this.
@FxKu It would be nice to link to the duplicated issue when adding the duplicate label.
Just got hit with this issue as well. I am a bit surprised that @FxKu did not roll out 1.6.2 with the fix, as 1.6.1 is pretty much broken unless you apply the workaround of adding SYS_NICE, which might not work in all systems/environments. The first-time user experience would also not be great if someone rolls out the postgres operator and it keeps restarting their DB every 30 minutes. @FxKu, any reason for not releasing 1.6.2 with the fix for this issue?
I'm running into this bug too!
My Kubernetes provider forces me, via PodSecurityPolicy, to drop capabilities:
requiredDropCapabilities:
  - MKNOD
which automatically generates this securityContext in the pod:
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - MKNOD
  privileged: false
  readOnlyRootFilesystem: false
Please pay attention to the other field (drop) as well, or ignore it completely.
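To make the suggestion concrete, here is a sketch (my own, not the operator's actual code) of a comparison that only looks at the Add list the operator manages and ignores admission-injected drop rules:

package main

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// addedCapabilities extracts only the Add list; nil-safe for missing securityContext.
func addedCapabilities(sc *v1.SecurityContext) []v1.Capability {
	if sc == nil || sc.Capabilities == nil {
		return nil
	}
	return sc.Capabilities.Add
}

// capabilitiesMatch ignores Drop entirely, so a PSP-injected "drop: [MKNOD]"
// would no longer trigger a rolling update.
func capabilitiesMatch(current, desired *v1.SecurityContext) bool {
	return reflect.DeepEqual(addedCapabilities(current), addedCapabilities(desired))
}

func main() {
	// Running pod, mutated by the admission controller.
	current := &v1.SecurityContext{Capabilities: &v1.Capabilities{Drop: []v1.Capability{"MKNOD"}}}
	// Operator-generated spec without any capabilities.
	desired := &v1.SecurityContext{}
	fmt.Println(capabilitiesMatch(current, desired)) // true
}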
@FxKu is it known when we can expect a new release that includes the fix for this one as well?
Seeing the exact same issue in our clusters.
Downgrading to registry.opensource.zalan.do/acid/postgres-operator:v1.6.0 and setting in my values-crd.yaml:
image:
  tag: v1.6.0
configKubernetes:
  additional_pod_capabilities:
    - "SYS_NICE"
I am still getting:
could not sync cluster: could not sync statefulsets: could not recreate pods: could not recreate replica pod "default/acid-foo-1": pod label wait timeout
My deploy script is:
git clone https://github.com/zalando/postgres-operator.git
helm upgrade pg ./postgres-operator/charts/postgres-operator -f values-crd.yaml --install --wait
We have finally released the 1.6.2 bugfix release. It took a bit too long; sorry for the inconvenience. Closing this issue now.
Sorry for digging up an old issue, but I'm experiencing something similar in v1.10.1. Any ideas?
I am using the Postgres Operator Helm template 1.6.1 with two Postgres DB clusters, installed brand new. About every 30 minutes, the pods of the Postgres clusters get restarted.
msg="could not get connection pooler secret pooler.postgres-default.credentials.postgresql.acid.zalan.do: secrets \"pooler.postgres-default.credentials.postgresql.acid.zalan.do\" not found" cluster-name=default/postgres-default pkg=cluster worker=0
DB cluster template:
configPostgresPodResources:
  default_cpu_limit: "4"
  default_cpu_request: 100m
  default_memory_limit: 4Gi
  default_memory_request: 100Mi
resources:
  limits:
    cpu: "500m"
    memory: "500Mi"
  requests:
    cpu: "100m"
    memory: "250Mi"
crd:
  create: false
/run/service/patroni: finished with code=0 signal=0 stopping /run/service/patroni timeout: finish: .: (pid 194) 1749s, want down ok: down: patroni: 1s, normally up ok: down: /etc/service/patroni: 1s, normally up ok: down: /etc/service/pgqd: 0s, normally up