Open sdudoladov opened 6 years ago
Sounds good to me.
The operator does not register a Postgres cluster if the cluster's manifest appears to be malformed. It becomes then significantly harder for users to understand what went wrong because the PG cluster does not show up in the operator logs.
Such clusters should be known to the operator: every sync or repair even reports the number of clusters in so-called invalid state. The cluster is in such a state if the Error field in the operator cluster structure is not empty. One can easily print the name of the cluster as well (postgresql.go:85).
Would it make sense to implement CRD validation, preventing the whole situation of the operator seeing an invalid manifest in the first place?
The cluster is in such a state if the Error field in the operator cluster structure is not empty. One can easily print the name of the cluster as well (postgresql.go:85).
Thanks for the hint, Oleksii, that we will add.
Would it make sense to implement CRD validation, preventing the whole situation of the operator seeing an invalid manifest in the first place?
Do you mean enabling this k8s-originated validation for the postgresql
CRDs ?
Do you mean enabling this k8s-originated validation for the postgresql CRDs ?
Exactly. This would save us from the task of processing invalid manifests in the operator and will outright reject such manifests even if the operator is not active.
As a a side note, there is also metadata object called 'finalizer' on that page. We could use that in order to protect certain PostgreSQL clusters from accidental deletion by definition finalizers on them.
This would save us from the task of processing invalid manifests in the operator and will outright reject such manifests even if the operator is not active
I'll file a separate issue for that to keep the current one limited. Note that enabling such validation may not catch the example mentioned in this issue's description since both team names are valid identifiers and I am unsure if the JSON validation can enforce the equality of two fields in a manifest.
Problem
The operator does not register a Postgres cluster if the cluster's manifest appears to be malformed. It becomes then significantly harder for users to understand what went wrong because the PG cluster does not show up in the operator logs.
Example
If a single team has two names, the first name may be used in
team
label of the manifest, and second name may be used as a prefix in the name of the PG cluster. If the operator for example restarts, it will refuse to register such cluster by skipping theAddEvent
due to the name check . The log message from the check will be the only message about this PG cluster actually visible in the logs.Proposed solution
Sync
skip operations for clusters with malformed manifests and instead log warnings such as:To make it clear, this approach means that there can be discrepancy between the cluster state as describe in the (malformed) manifest known to the
k8s
, and the actual cluster state that will match the old version of the manifest stored in the operator's internal data structures. One way to make it easier to see is to add the/status
subresource to thepostgresqls
CRD. The/status
may also show errors found during manifest verification.Related issues
The operator and PG clusters it manages are quite customizable and thus have very many parameters that can be (mis)configured in a manifest. Such misconfigurations take a long time to debug and hence frustrate users. In the future we would like to:
Provide the feedback on the configuration as early as possible. We plan to achieve that by refactoring all the configuration checks into a separate module that can be called, for example, from the UI while a user types in a manifest. The PRs #336 and #385 are first steps in this direction.
Make the operator internal state easier to observe for users. The feature proposed here falls into this category.
Questions to the community
More details to follow.