zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Implement generic scheme for avoiding comparison of automatically added/modified properties during update/sync examination #1646

Open RossVertizan opened 2 years ago

RossVertizan commented 2 years ago


Background

In #1638 I described in detail the process I went through to install postgres-operator on GKE using Autopilot. Autopilot is a newer capability offered by GKE that takes care of a number of Kubernetes configuration tasks, such as scaling and security. In order to do this it modifies some properties/attributes of the containers/pods/statefulsets as they are launched. This means that when the comparison is done to figure out whether the containers/statefulsets need to be updated, they always appear to need an update, and so the postgres-operator restarts the cluster every sync period.

This is exactly the same mechanism as described for Rancher in #1482, #1478 and #1485. I can imagine that there are other frameworks that do similar things.

In those other issues, e.g. #1485, the emphasis seems to be on ignoring annotations, but in the setup with GKE Autopilot there were also modifications to the SecurityContext. I'll post all of the details below, along with how I have fixed/hacked the issue, but I believe this example points to the need for a more generic scheme to:

  1. Identify the specific differences that are causing the update to be triggered
  2. Provide a mechanism to capture these differences so that they do not cause a re-start

My specific case

In my particular case I was seeing this output in the postgres-operator logs:

time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset's annotations do not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset containers's postgres (index 0) resources do not match the current ones" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset containers's postgres (index 0) security context does not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset's pod template security context in spec does not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1

Solution implemented

This is what was finally implemented to make this work ... but it's not pretty.

I forked the postgres-operator repo and fired up a Go IDE (I'm not familiar with Go, but I have access to the JetBrains GoLand IDE), which was a great help in understanding the unfamiliar errors I created.

I fired up the deis/docker-go-dev Docker container as a Go development environment and mounted my development disk into the container, following the instructions provided there.

First, a diagnostics block was added to cluster.go -> compareStatefulSetWith(), which ended up looking like this:

    // Add a whole block of diagnostics
    if diff := deep.Equal(c.Statefulset.Annotations, statefulSet.Annotations); diff != nil {
        reasons = append(reasons, "Annotations: ")
        reasons = append(reasons, diff...)
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Annotations)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Annotations)))
        // Note: string(len(...)) converts the int to a rune rather than a decimal string,
        // which is why the logs further down show \x00 and \x01; strconv.Itoa would be clearer.
        reasons = append(reasons, string(len(statefulSet.Annotations)))
        reasons = append(reasons, string(len(c.Statefulset.Annotations)))
        reasons = append(reasons, c.Statefulset.Annotations["autopilot.gke.io/resource-adjustment"])
    }

    if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.Containers[0].Resources, statefulSet.Spec.Template.Spec.Containers[0].Resources); diff != nil {
        reasons = append(reasons, "Spec.Template.Spec.Containers[0].Resources: ")
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.Containers[0].Resources)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.Containers[0].Resources)))
        reasons = append(reasons, diff...)
    }

    if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.Containers[0].SecurityContext, statefulSet.Spec.Template.Spec.Containers[0].SecurityContext); diff != nil {
        reasons = append(reasons, "Spec.Template.Spec.Containers[0].SecurityContext: ")
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.Containers[0].SecurityContext)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.Containers[0].SecurityContext)))
        reasons = append(reasons, diff...)
    }

    if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.SecurityContext, statefulSet.Spec.Template.Spec.SecurityContext); diff != nil {
        reasons = append(reasons, "Spec.Template.Spec.SecurityContext: ")
        reasons = append(reasons, diff[0])
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.SecurityContext)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.SecurityContext)))
        reasons = append(reasons, fmt.Sprintf("%v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.SecurityContext.SeccompProfile).IsNil()))
        reasons = append(reasons, diff...)
    }
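
For reference, here is a minimal, standalone sketch of how the deep diffing used above behaves (assuming the deep package is github.com/go-test/deep, which is what the diff strings in the logs suggest): deep.Equal returns nil when the two values match and otherwise a slice of human-readable difference strings, which is why they can simply be appended to reasons.

package main

import (
    "fmt"

    "github.com/go-test/deep"
)

func main() {
    // Two annotation maps standing in for the running and the newly generated statefulset.
    running := map[string]string{"autopilot.gke.io/resource-adjustment": "{...}"}
    desired := map[string]string{}

    // Prints something like:
    // map[autopilot.gke.io/resource-adjustment]: {...} != <does not have key>
    if diff := deep.Equal(running, desired); diff != nil {
        for _, d := range diff {
            fmt.Println(d)
        }
    }
}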

There are several things to note about the diagnostics block, which I will cover by taking each of its sections in turn.

1. "new statefulset's annotations do not match the current one"

Original reason printed was:
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset's annotations do not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
Relevant section of diagnostics
if diff := deep.Equal(c.Statefulset.Annotations, statefulSet.Annotations); diff != nil {
        reasons = append(reasons, "Annotations: ")
        reasons = append(reasons, diff...)
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Annotations)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Annotations)))
        reasons = append(reasons, string(len(statefulSet.Annotations)))
        reasons = append(reasons, string(len(c.Statefulset.Annotations)))
        reasons = append(reasons, c.Statefulset.Annotations["autopilot.gke.io/resource-adjustment"])
    }
Output from diagnostics (after some changes were made)
time="2021-10-13T09:49:41Z" level=info msg="reason: Annotations: " cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: map[autopilot.gke.io/resource-adjustment:{\"input\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"output\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"modified\":true}] != <nil map>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Running: map[autopilot.gke.io/resource-adjustment:{\"input\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"output\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"modified\":true}]\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: New: map[]\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: \x00" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: \x01" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: {\"input\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"output\":{\"containers\":[{\"limits\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"requests\":{\"cpu\":\"250m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"512Mi\"},\"name\":\"postgres\"}]},\"modified\":true}" cluster-name=default/vitaq-coverage-cluster pkg=cluster
Reason for the update

The reason that this is triggering an update is that the running instance has the autopilot.gke.io/resource-adjustment annotation, which was added by GKE Autopilot, while the newly generated statefulset has no annotations at all (New: map[]).

Fix/hack implemented to avoid this triggering an update

I modified the code that performs the comparison so that it looks for the case where the new statefulset has an empty annotations map and the running one has a single annotation, namely the autopilot.gke.io/resource-adjustment annotation.

Obviously this is a dirty hack: it only works in this particular circumstance, does not scale, and is hard to maintain, so a better solution needs to be found. But for now it works and I no longer see this as an issue.

if !reflect.DeepEqual(c.Statefulset.Annotations, statefulSet.Annotations) {
        // Ignore the case where the new statefulSet has a nil map and the existing one only has an 'autopilot.gke.io/resource-adjustment' key.
        // All a bit hokey, should negate the logic, rather than having empty if clause
        if len(statefulSet.Annotations) == 0 && len(c.Statefulset.Annotations) == 1 {
            if val, ok := c.Statefulset.Annotations["autopilot.gke.io/resource-adjustment"]; ok {
                // Find a way to consume val
                fmt.Println(val)
            } else {
                match = false
                needsReplace = true
                reasons = append(reasons, "new statefulset's annotations do not match the current one")
            }
        } else {
            match = false
            needsReplace = true
            reasons = append(reasons, "new statefulset's annotations do not match the current one")
        }
    }
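
A more generic version of the same idea, purely as a hedged sketch (this is not existing operator code, and annotationsMatch is a name I made up): compare the annotation maps after stripping a configurable set of platform-injected keys, instead of hard-coding the one Autopilot key.

package cluster

import "reflect"

// annotationsMatch is a hypothetical helper: it compares two annotation maps
// while ignoring a configurable set of keys that are known to be injected by
// the platform (e.g. "autopilot.gke.io/resource-adjustment").
func annotationsMatch(current, desired map[string]string, ignored map[string]struct{}) bool {
    strip := func(in map[string]string) map[string]string {
        out := make(map[string]string, len(in))
        for k, v := range in {
            if _, skip := ignored[k]; !skip {
                out[k] = v
            }
        }
        return out
    }
    return reflect.DeepEqual(strip(current), strip(desired))
}

With ignored supplied from configuration, the Autopilot case above reduces to annotationsMatch(c.Statefulset.Annotations, statefulSet.Annotations, ignored) returning true.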

2. "new statefulset containers's postgres (index 0) resources do not match the current ones"

Original reason printed was:
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset containers's postgres (index 0) resources do not match the current ones" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
Relevant section of diagnostics
if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.Containers[0].Resources, statefulSet.Spec.Template.Spec.Containers[0].Resources); diff != nil {
        reasons = append(reasons, "Spec.Template.Spec.Containers[0].Resources: ")
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.Containers[0].Resources)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.Containers[0].Resources)))
        reasons = append(reasons, diff...)
    }
Output from diagnostics (after some changes were made)
time="2021-10-13T09:49:41Z" level=info msg="reason: Spec.Template.Spec.Containers[0].Resources: " cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Running: {map[cpu:{{250 -3} {<nil>} 250m DecimalSI} ephemeral-storage:{{1073741824 0} {<nil>} 1Gi BinarySI} memory:{{536870912 0} {<nil>}  BinarySI}] map[cpu:{{250 -3} {<nil>} 250m DecimalSI} ephemeral-storage:{{1073741824 0} {<nil>} 1Gi BinarySI} memory:{{536870912 0} {<nil>}  BinarySI}]}\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: New: {map[cpu:{{250 -3} {<nil>} 250m DecimalSI} memory:{{536870912 0} {<nil>}  BinarySI}] map[cpu:{{250 -3} {<nil>} 250m DecimalSI} memory:{{536870912 0} {<nil>}  BinarySI}]}\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Limits.map[ephemeral-storage]: {{1073741824 0} {<nil>} 1Gi BinarySI} != <does not have key>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Requests.map[ephemeral-storage]: {{1073741824 0} {<nil>} 1Gi BinarySI} != <does not have key>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
Reason for the update

The resources of the container at index 0 (the postgres container) do not match: the running statefulset has ephemeral-storage requests and limits that were added by Autopilot, while the new one does not.

Fix/hack implemented to avoid this triggering an update

After changing the default values of the postgres_pod_resources in the file postgresql-operator-default-configuration.yaml to:

postgres_pod_resources:
    default_cpu_limit: "250m"
    default_cpu_request: 250m
    default_memory_limit: 512Mi
    default_memory_request: 512Mi

which were the values set by GKE Autopilot, it only complained about the ephemeral storage, which is what is shown above. Initially I tried adding ephemeral storage as a new key that I could set in the postgresql-operator-default-configuration.yaml file, but I failed with this. So in the end I modified the location where it is checked, so that I ended up with:

func compareResourcesAssumeFirstNotNil(a *v1.ResourceRequirements, b *v1.ResourceRequirements) bool {
    if b == nil || (len(b.Requests) == 0) {
        return len(a.Requests) == 0
    }
    for k, v := range a.Requests {
        // Check if key is ephemeral storage and if it is then ignore it.
        if k != "ephemeral-storage" {
            if (&v).Cmp(b.Requests[k]) != 0 {
                return false
            }
        }
    }
    for k, v := range a.Limits {
        if k != "ephemeral-storage" {
            if (&v).Cmp(b.Limits[k]) != 0 {
                return false
            }
        }
    }
    return true
}

This ignores the difference if the key is "ephemeral-storage". Again, this works for now, but is not scalable or maintainable in the longer term.
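
A hedged sketch of how this could be generalised (again hypothetical code, not something that exists in the operator): take the set of resource names to ignore as a parameter, so additional platform-injected resources could come from configuration rather than being hard-coded, and compare in both directions so extra keys on either side are detected.

package cluster

import v1 "k8s.io/api/core/v1"

// compareResourcesIgnoring is a hypothetical generalisation of the hack above.
// Nil handling is omitted for brevity; the caller is assumed to pass non-nil
// ResourceRequirements, as compareResourcesAssumeFirstNotNil does for its
// first argument.
func compareResourcesIgnoring(a, b *v1.ResourceRequirements, ignored map[v1.ResourceName]bool) bool {
    listsEqual := func(x, y v1.ResourceList) bool {
        for name, qty := range x {
            if ignored[name] {
                continue
            }
            other, ok := y[name]
            if !ok || qty.Cmp(other) != 0 {
                return false
            }
        }
        return true
    }
    // Compare both ways round so that keys present on only one side are caught.
    return listsEqual(a.Requests, b.Requests) && listsEqual(b.Requests, a.Requests) &&
        listsEqual(a.Limits, b.Limits) && listsEqual(b.Limits, a.Limits)
}

Calling this with ignored containing only "ephemeral-storage" reproduces the behaviour of the hack while keeping the ignore list in one place.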

3. "new statefulset containers's postgres (index 0) security context does not match the current one"

Original reason printed was:
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset containers's postgres (index 0) security context does not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
Relevant section of diagnostics
if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.Containers[0].SecurityContext, statefulSet.Spec.Template.Spec.Containers[0].SecurityContext); diff != nil {
    reasons = append(reasons, "Spec.Template.Spec.Containers[0].SecurityContext: ")
    reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.Containers[0].SecurityContext)))
    reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.Containers[0].SecurityContext)))
    reasons = append(reasons, diff...)
}
Output from diagnostics (after some changes were made)
time="2021-10-13T09:49:41Z" level=info msg="reason: Spec.Template.Spec.Containers[0].SecurityContext: " cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Running: &SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[NET_RAW],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*false,AllowPrivilegeEscalation:*true,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,}\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: New: &SecurityContext{Capabilities:nil,Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*false,AllowPrivilegeEscalation:*true,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,}\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Capabilities: v1.Capabilities != <nil pointer>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
Reason for the update

The Capabilities field of the container's SecurityContext is set (a v1.Capabilities value with NET_RAW dropped) in the running container, but is nil in the new container.

Fix/hack implemented to avoid this triggering an update
newCheck("new statefulset %s's %s (index %d) security context does not match the current one",
            func(a, b v1.Container) bool { return !compareSecurityContexts(a.SecurityContext, b.SecurityContext) }),

and implemented the compareSecurityContexts function:

func compareSecurityContexts(a *v1.SecurityContext, b *v1.SecurityContext) bool {
    if reflect.ValueOf(b.Capabilities).IsNil() {
        return true
    } else {
        return reflect.DeepEqual(a, b)
    }
}

which allowed me to avoid comparison of containers which have nil Capabilities.
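
An alternative, hedged sketch that avoids skipping the whole comparison (hypothetical code; compareSecurityContextsMasked is a name I invented): blank out the fields that the operator does not set itself, which in the Autopilot diffs above were Capabilities and SeccompProfile, and compare everything else.

package cluster

import (
    "reflect"

    v1 "k8s.io/api/core/v1"
)

// compareSecurityContextsMasked compares two container security contexts while
// ignoring fields that the platform may inject (here Capabilities and
// SeccompProfile, based on the diffs seen on GKE Autopilot).
func compareSecurityContextsMasked(a, b *v1.SecurityContext) bool {
    mask := func(in *v1.SecurityContext) *v1.SecurityContext {
        if in == nil {
            return nil
        }
        out := in.DeepCopy()
        out.Capabilities = nil
        out.SeccompProfile = nil
        return out
    }
    return reflect.DeepEqual(mask(a), mask(b))
}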

4. "new statefulset's pod template security context in spec does not match the current one"

Original reason printed was:
time="2021-10-07T16:03:26Z" level=info msg="reason: new statefulset's pod template security context in spec does not match the current one" cluster-name=default/vitaq-coverage-cluster pkg=cluster worker=1
Relevant section of diagnostics
if diff := deep.Equal(c.Statefulset.Spec.Template.Spec.SecurityContext, statefulSet.Spec.Template.Spec.SecurityContext); diff != nil {
        reasons = append(reasons, "Spec.Template.Spec.SecurityContext: ")
        reasons = append(reasons, diff[0])
        reasons = append(reasons, fmt.Sprintf("Running: %v\n", reflect.ValueOf(c.Statefulset.Spec.Template.Spec.SecurityContext)))
        reasons = append(reasons, fmt.Sprintf("New: %v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.SecurityContext)))
        reasons = append(reasons, fmt.Sprintf("%v\n", reflect.ValueOf(statefulSet.Spec.Template.Spec.SecurityContext.SeccompProfile).IsNil()))
        reasons = append(reasons, diff...)
    }
Output from diagnostics (after some changes were made)
time="2021-10-13T09:49:41Z" level=info msg="reason: Spec.Template.Spec.SecurityContext: " cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: SeccompProfile: v1.SeccompProfile != <nil pointer>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: Running: &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:&SeccompProfile{Type:RuntimeDefault,LocalhostProfile:nil,},}\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: New: &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}\n"
 cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: true\n" cluster-name=default/vitaq-coverage-cluster pkg=cluster
time="2021-10-13T09:49:41Z" level=info msg="reason: SeccompProfile: v1.SeccompProfile != <nil pointer>" cluster-name=default/vitaq-coverage-cluster pkg=cluster
Reason for the update

The SeccompProfile field of the pod template's SecurityContext is set (Type: RuntimeDefault) in the current pod, but is nil in the new pod.

Fix/hack implemented to avoid this triggering an update
if !reflect.DeepEqual(c.Statefulset.Spec.Template.Spec.SecurityContext, statefulSet.Spec.Template.Spec.SecurityContext) {
        // Ignore the case where the new statefulSet has a nil pointer and the existing one only has an 'SeccompProfile' key.
        // All a bit hokey, should negate the logic, rather than having empty if clause
        if reflect.ValueOf(statefulSet.Spec.Template.Spec.SecurityContext.SeccompProfile).IsNil() {
        } else {
            match = false
            needsReplace = true
            needsRollUpdate = true
            reasons = append(reasons, "new statefulset's pod template security context in spec does not match the current one")
        }
    }

In the code that does the comparison of the SecurityContexts, I added code to avoid doing the comparison if the new statefulset has a nil SeccompProfile.
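
As the comment in the hack itself says, the empty if branch could be removed by negating the condition. A hedged sketch of that cleanup (same behaviour; the helper name is invented and it assumes, like the original, that both pod security contexts are non-nil, with reflect and k8s.io/api/core/v1 imported as v1):

// podSecurityContextNeedsUpdate expresses the check above with the logic
// negated: a difference is only reported when the contexts differ and the new
// statefulset actually sets a SeccompProfile.
func podSecurityContextNeedsUpdate(current, desired *v1.PodSecurityContext) bool {
    return !reflect.DeepEqual(current, desired) && desired.SeccompProfile != nil
}

In compareStatefulSetWith, the whole if/else above would then collapse to a single call guarding the match/needsReplace/needsRollUpdate assignments.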

Conclusion

There are many more properties in the postgres-operator that are compared to determine whether an update is required, and potentially any of them may be modified by other frameworks. Consequently, I would suggest that a mechanism is needed to help users identify the exact details of the difference that triggers the update, plus a generic mechanism to record those differences so that they are skipped in the comparison.
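
To make the proposal a little more concrete, here is one possible shape such a mechanism could take, purely as an illustration (none of these names or options exist in the operator today): an operator-level ignore list covering annotation keys, resource names and statefulset field paths, consulted when the diff reasons are produced.

package cluster

import "strings"

// IgnoredDiffs is a purely hypothetical configuration structure illustrating
// the kind of generic mechanism proposed above.
type IgnoredDiffs struct {
    AnnotationKeys []string // e.g. "autopilot.gke.io/resource-adjustment"
    ResourceNames  []string // e.g. "ephemeral-storage"
    FieldPaths     []string // e.g. "SecurityContext.SeccompProfile"
}

// shouldIgnore reports whether a single diff line (such as
// "Limits.map[ephemeral-storage]: {{1073741824 0} {<nil>} 1Gi BinarySI} != <does not have key>")
// matches one of the configured entries and can therefore be dropped before
// deciding whether the statefulset needs to be replaced.
func (i IgnoredDiffs) shouldIgnore(diff string) bool {
    for _, patterns := range [][]string{i.AnnotationKeys, i.ResourceNames, i.FieldPaths} {
        for _, p := range patterns {
            if strings.Contains(diff, p) {
                return true
            }
        }
    }
    return false
}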

Thoughts?

FxKu commented 2 years ago

First of all, thanks @RossVertizan for this detailed description.

Regarding annotations: have you tried the changes from #1485 to see if they would help you out? I have pinged the author again for updates.

Regarding resources: I can see that our compareResourcesAssumeFirstNotNil only diffs on the requests and limits level - not cpu and memory. It should be updated if more frameworks start adding extra fields like ephemeral-storage by default. Only fields from the manifest should be compared.

Maybe the same applies to securityContext then. The generateStatefulSet function needs to set all fields. For those fields not configurable from the Postgres manifest or configuration, it could just take the values from the existing statefulset, e.g. SeccompProfile{Type:RuntimeDefault,LocalhostProfile:nil,}.
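
A hedged sketch of what that suggestion could look like (hypothetical code; carryOverPodSecurityDefaults is not an existing operator function): before comparing, copy the fields that are not configurable through the Postgres manifest from the running statefulset into the freshly generated one, so platform-injected defaults such as the SeccompProfile no longer show up as diffs.

package cluster

import appsv1 "k8s.io/api/apps/v1"

// carryOverPodSecurityDefaults copies the pod-level SeccompProfile from the
// running statefulset into the desired one when the desired one does not set
// it, mirroring the suggestion above.
func carryOverPodSecurityDefaults(current, desired *appsv1.StatefulSet) {
    curCtx := current.Spec.Template.Spec.SecurityContext
    newCtx := desired.Spec.Template.Spec.SecurityContext
    if curCtx != nil && newCtx != nil && newCtx.SeccompProfile == nil {
        newCtx.SeccompProfile = curCtx.SeccompProfile.DeepCopy()
    }
}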

In general, it looks like the diff logs have to be enhanced with more details.

RossVertizan commented 2 years ago

@FxKu Sorry for the slow response - I had to immediately move on to another piece of work.

I did not try the #1485 suggestion for ignoring annotations, as it isn't clear that it will make it into a release, and it only seemed to solve part of my problems, i.e. it doesn't fix the securityContext.

I agree with your comments. I like the idea of only checking the fields that are defined in the manifest; I think that makes for a robust implementation, as any other annotations would be silently ignored. Of course, in some cases some of those annotations may need to be supported in the manifest so that users can specify them and they ARE checked.

I think it would be useful to be able to add known parameters explicitly; I always prefer an explicit over an implicit approach (this comes from the Zen of Python). I tried adding ephemeral storage, but did not get that to work.

I'm not so clear about your suggestion for the securityContext. I think I would rather it were handled in the same way as the annotations, to keep a consistent approach (though I am no Kubernetes security expert and that may make no sense), and that implementation may be a lot of effort. I suspect your suggestion would be easier to implement. Is this really just the same as saying that any parameter with a value of nil should be ignored, or is it more detailed than that?

Finally, I agree that the diff logs need to be improved. I would propose a configuration setting that enables "verbose"/"debug" output. I can't think why, but there may be some users who see these messages occasionally and ignore them. For me, identifying the actual difference took the majority of the time, so something that helps a user pinpoint the exact difference easily would be very valuable.
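
For example, a hedged sketch of what such verbose output could look like around the existing comparison (illustrative only; it assumes logrus and go-test/deep are imported, and "logger" stands in for the operator's per-cluster *logrus.Entry):

// Hypothetical: only emit the full field-by-field diff when debug logging is
// enabled, instead of surfacing every difference at info level.
if logger.Logger.IsLevelEnabled(logrus.DebugLevel) {
    for _, d := range deep.Equal(c.Statefulset.Spec, statefulSet.Spec) {
        logger.Debugf("statefulset diff: %s", d)
    }
}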

I am keen to get away from my hacked version and back onto the main branch, so if we could agree on the details of the implementation, I would be happy to have a shot at implementing this, though as I already stated I am new to Go.

I guess I also need to go and study the implementation in #1485 a little more as well.

@moshloop do you have any comments?

jknetl commented 1 year ago

Thanks @RossVertizan for the analysis. I hit the very same issue on GKE Autopilot recently, so I implemented your suggestions in my fork: https://github.com/jknetl/postgres-operator/pull/1. Maybe this will be useful to someone.

I didn't include suggestion 1 ("new statefulset's annotations do not match the current one") because it can now be solved through configuration (using the ignored_annotations option in the OperatorConfiguration resource).

Also, I hit a problem on GKE Autopilot with non-matching pod tolerations. In the logs I saw:

time="2023-09-18T09:13:56Z" level=info msg="reason: new statefulset's pod tolerations does not match the current one" cluster-name=essential/rieter-db-cluster pkg=cluster

But it can also be solved by configuration. I changed the configuration of my cluster to include the same tolerations as the ones GKE Autopilot was adding automatically (see https://postgres-operator.readthedocs.io/en/latest/reference/cluster_manifest/). In my case that meant adding the following lines to the definition of my db cluster:

  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "amd64"
      effect: "NoSchedule"