zalando-incubator / kopf

A Python framework to write Kubernetes operators in just few lines of code.
https://kopf.readthedocs.io
MIT License
969 stars 89 forks source link

Crash the whole operator on unrecoverable errors in watchers/workers #350

Closed nolar closed 4 years ago

nolar commented 4 years ago

What do these changes do?

When a fatal error happens in the operator's watching, queueing, multiplexing, or processing, including API PATCH'ing, then stop the whole operator instead of ignoring and continuing.

Description

This issue was detected in an incident when PATCH request failed due to HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down any attempts, the operator continued handling repeatedly with 1-2 attempts per second.

On a wider scope, if anything goes wrong in the top-level processing, i.e. before handlers (which have their own error handling and backoff intervals), then crash the whole operator, and let Kubernetes to deal with a broken pod.

This does not prevent incidents with repeated handling completely, but will slow them down at least (restarts are not fast).

All in all, this should protect the users from the framework/operators misbehaviour in some rare cases. In all other cases, nothing changes for the users.


Note: A separate fix will be made (#351) with throttling of unrecoverable errors on a per-resource basis from approximately when the processing begins, and until the handlers (this covers resource PATCH'ing). The operator will stop anyway for errors from watching to that point of processing, but this is a much more narrow scope.

Implementation note: there is already a safety net for the root tasks, such as watchers: if they fail, the operator stops. But the workers are not covered by this, since they are fire-and-forget kind of tasks. So, they should "escalate" the errors their own way — via fatal-flag-setting and own stack trace dumping.


Side-changes:

Issues/PRs

Issues: #346

Related: #331

Type of changes

Checklist

zincr[bot] commented 4 years ago

🤖 zincr found 1 problem , 1 warning

❌ Approvals
ℹī¸ Dependency Licensing
✅ Large Commits
✅ Specification

Details on how to resolve are provided below


Approvals

All proposed changes must be reviewed by project maintainers before they can be merged

Not enough people have approved this pull request - please ensure that 1 additional user, who have not contributed to this pull request approve the changes.

Please ensure that only dependencies with licenses compatible with the license of this project is included in the pull request.

nolar commented 4 years ago

Closed in favor of nolar/kopf#509