zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License
4.26k stars 970 forks source link

RFC: operator should register all PG manifests #382

Open sdudoladov opened 6 years ago

sdudoladov commented 6 years ago

Problem

The operator does not register a Postgres cluster if the cluster's manifest appears to be malformed. It becomes then significantly harder for users to understand what went wrong because the PG cluster does not show up in the operator logs.

Example

If a single team has two names, the first name may be used in team label of the manifest, and second name may be used as a prefix in the name of the PG cluster. If the operator for example restarts, it will refuse to register such cluster by skipping the AddEvent due to the name check . The log message from the check will be the only message about this PG cluster actually visible in the logs.

Proposed solution

  1. Ensure all consistency checks that an operator itself carries out happen after the operator registers a PG cluster.
  2. Make the Sync skip operations for clusters with malformed manifests and instead log warnings such as:

    Warning: the manifest of the cluster foo-bar is invalid. Likely issues: 'issue 1', 'issue2'

  3. Suggestion from @alexeyklyukin

    every sync or repair even reports the number of clusters in so-called invalid state. The cluster is in such a state if the Error field in the operator cluster structure is not empty. One can easily print the name of the cluster as well (postgresql.go:85).

To make it clear, this approach means that there can be discrepancy between the cluster state as describe in the (malformed) manifest known to the k8s, and the actual cluster state that will match the old version of the manifest stored in the operator's internal data structures. One way to make it easier to see is to add the /status subresource to the postgresqls CRD. The /status may also show errors found during manifest verification.

Related issues

The operator and PG clusters it manages are quite customizable and thus have very many parameters that can be (mis)configured in a manifest. Such misconfigurations take a long time to debug and hence frustrate users. In the future we would like to:

  1. Provide the feedback on the configuration as early as possible. We plan to achieve that by refactoring all the configuration checks into a separate module that can be called, for example, from the UI while a user types in a manifest. The PRs #336 and #385 are first steps in this direction.

  2. Make the operator internal state easier to observe for users. The feature proposed here falls into this category.

Questions to the community

  1. Please provide feedback
  2. Please provide more examples of the same behavior if you have any

More details to follow.

Jan-M commented 6 years ago

Sounds good to me.

alexeyklyukin commented 6 years ago

The operator does not register a Postgres cluster if the cluster's manifest appears to be malformed. It becomes then significantly harder for users to understand what went wrong because the PG cluster does not show up in the operator logs.

Such clusters should be known to the operator: every sync or repair even reports the number of clusters in so-called invalid state. The cluster is in such a state if the Error field in the operator cluster structure is not empty. One can easily print the name of the cluster as well (postgresql.go:85).

Would it make sense to implement CRD validation, preventing the whole situation of the operator seeing an invalid manifest in the first place?

sdudoladov commented 6 years ago

The cluster is in such a state if the Error field in the operator cluster structure is not empty. One can easily print the name of the cluster as well (postgresql.go:85).

Thanks for the hint, Oleksii, that we will add.

Would it make sense to implement CRD validation, preventing the whole situation of the operator seeing an invalid manifest in the first place?

Do you mean enabling this k8s-originated validation for the postgresql CRDs ?

alexeyklyukin commented 6 years ago

Do you mean enabling this k8s-originated validation for the postgresql CRDs ?

Exactly. This would save us from the task of processing invalid manifests in the operator and will outright reject such manifests even if the operator is not active.

As a a side note, there is also metadata object called 'finalizer' on that page. We could use that in order to protect certain PostgreSQL clusters from accidental deletion by definition finalizers on them.

sdudoladov commented 6 years ago

This would save us from the task of processing invalid manifests in the operator and will outright reject such manifests even if the operator is not active

I'll file a separate issue for that to keep the current one limited. Note that enabling such validation may not catch the example mentioned in this issue's description since both team names are valid identifiers and I am unsure if the JSON validation can enforce the equality of two fields in a manifest.