Closed: nolar closed this 4 years ago
Closed in favor of nolar/kopf#509
What do these changes do?
When a fatal error happens in the operator's watching, queueing, multiplexing, or processing (including API PATCH'ing), stop the whole operator instead of ignoring the error and continuing.
Description
This issue was detected in an incident when a PATCH request failed with HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down its attempts, the operator continued re-handling the resource repeatedly, 1-2 attempts per second.
On a wider scope, if anything goes wrong in the top-level processing, i.e. before the handlers (which have their own error handling and backoff intervals), then crash the whole operator and let Kubernetes deal with the broken pod.
This does not completely prevent incidents of repeated handling, but it at least slows them down (restarts are not fast).
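The crash-instead-of-continue idea can be sketched roughly as follows. This is an illustrative sketch, not Kopf's actual internals: `watch_and_process`, `process_next_event`, and the `stop_event` flag are hypothetical names.

```python
import asyncio

async def process_next_event() -> None:
    # Stand-in for one cycle of watching/queueing/multiplexing/processing;
    # here it fails the same way a rejected PATCH (HTTP 422) would.
    raise RuntimeError("HTTP 422: Unprocessable Entity")

async def watch_and_process(stop_event: asyncio.Event) -> None:
    while not stop_event.is_set():
        try:
            await process_next_event()
        except asyncio.CancelledError:
            raise  # cooperative cancellation is not an error
        except Exception:
            # Do NOT ignore-and-continue: flag the whole operator to stop,
            # and let Kubernetes restart the broken pod.
            stop_event.set()
            raise
```

The point is the `except Exception` branch: the old behaviour effectively swallowed the error and looped again; the new behaviour escalates it and exits.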
All in all, this should protect the users from the framework's or an operator's misbehaviour in some rare cases. In all other cases, nothing changes for the users.
Note: A separate fix (#351) will add throttling of unrecoverable errors on a per-resource basis, covering the span from approximately when the processing begins until the handlers (this includes resource PATCH'ing). The operator will still stop for errors from watching up to that point of processing, but that is a much narrower scope.
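For illustration only (the actual throttling lands in the separate fix, #351, and its implementation may differ), per-resource throttling could look like an escalating-delay table keyed by resource, so a persistently failing resource is retried ever more slowly instead of 1-2 times per second. The `Throttler` class and its method names here are hypothetical.

```python
import asyncio

class Throttler:
    """Escalating per-resource delays after repeated unrecoverable errors."""

    def __init__(self, delays=(1, 10, 60, 600)):
        self.delays = delays
        self.failures = {}  # resource key -> consecutive error count

    async def wait(self, key: str) -> None:
        # Sleep longer for each consecutive failure of the same resource;
        # resources with no recorded failures are not delayed at all.
        count = self.failures.get(key, 0)
        if count:
            delay = self.delays[min(count, len(self.delays)) - 1]
            await asyncio.sleep(delay)

    def failed(self, key: str) -> None:
        self.failures[key] = self.failures.get(key, 0) + 1

    def succeeded(self, key: str) -> None:
        self.failures.pop(key, None)  # success resets the throttling
```

A success clears the counter, so only persistent failures keep escalating.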
Implementation note: there is already a safety net for the root tasks, such as watchers: if they fail, the operator stops. But the workers are not covered by this, since they are fire-and-forget tasks. So, they have to "escalate" the errors in their own way: by setting a fatal flag and dumping their own stack trace.
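Since nobody awaits a fire-and-forget task, its exception would otherwise go unnoticed. A minimal sketch of the escalation pattern described above (the names `worker`, `handle`, and `fatal_flag` are illustrative, not Kopf's actual internals):

```python
import asyncio
import logging

logger = logging.getLogger("operator")

async def handle(event: object) -> None:
    # Stand-in for per-event processing; fails on a "bad" event.
    if event == "bad":
        raise ValueError("unprocessable event")

async def worker(queue: asyncio.Queue, fatal_flag: asyncio.Event) -> None:
    # Fire-and-forget: no root task awaits this coroutine's result,
    # so it must report failures itself.
    while True:
        event = await queue.get()
        try:
            await handle(event)
        except asyncio.CancelledError:
            raise
        except Exception:
            # Dump our own stack trace and set the fatal flag that the
            # root tasks watch, so the whole operator stops.
            logger.exception("Worker failed; escalating to stop the operator.")
            fatal_flag.set()
            return
        finally:
            queue.task_done()
```

The root tasks then only need to watch `fatal_flag` alongside their own failures to decide when to shut everything down.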
Side-changes:
- `asyncio.CancelledError` raised from inside an `except:` block is re-raised instead of being ignored. This is unlikely to happen, but just in case.
- `functools.partial` objects (processors) are no longer logged with all their arguments, as this could eventually lead to some data leaks into the logs.

Issues/PRs