Closed: nolar closed this 4 years ago
Closed in favor of nolar/kopf#509
What do these changes do?
When a fatal error happens in the operator's watching, queueing, multiplexing, or processing (including API PATCH'ing), stop the whole operator instead of ignoring the error and continuing.
Description
This issue was detected in an incident when a PATCH request failed with HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down its attempts, the operator continued re-handling the resource repeatedly, 1-2 attempts per second.
On a wider scope, if anything goes wrong in the top-level processing, i.e. before the handlers (which have their own error handling and backoff intervals), then crash the whole operator and let Kubernetes deal with the broken pod.
This does not completely prevent incidents of repeated handling, but it at least slows them down (restarts are not fast).
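The crash-instead-of-continue idea can be sketched roughly as follows. This is an illustrative sketch, not Kopf's actual internals: `watch_and_process`, `process_next_event`, and the `stop_event` flag are hypothetical names.

```python
import asyncio

async def process_next_event() -> None:
    # Stand-in for one cycle of watching/queueing/multiplexing/processing;
    # here it fails the same way a rejected PATCH (HTTP 422) would.
    raise RuntimeError("HTTP 422: Unprocessable Entity")

async def watch_and_process(stop_event: asyncio.Event) -> None:
    while not stop_event.is_set():
        try:
            await process_next_event()
        except asyncio.CancelledError:
            raise  # cooperative cancellation is not an error
        except Exception:
            # Do NOT ignore-and-continue: flag the whole operator to stop,
            # and let Kubernetes restart the broken pod.
            stop_event.set()
            raise
```

The point is the `except Exception` branch: the old behaviour effectively swallowed the error and looped again; the new behaviour escalates it and exits.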
All in all, this should protect the users from the framework's or an operator's misbehaviour in some rare cases. In all other cases, nothing changes for the users.
Note: A separate fix (#351) will add throttling of unrecoverable errors on a per-resource basis, covering the span from approximately when the processing begins until the handlers (this includes resource PATCH'ing). The operator will still stop for errors from watching up to that point of processing, but that is a much narrower scope.
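For illustration only (the actual throttling lands in the separate fix, #351, and its implementation may differ), per-resource throttling could look like an escalating-delay table keyed by resource, so a persistently failing resource is retried ever more slowly instead of 1-2 times per second. The `Throttler` class and its method names here are hypothetical.

```python
import asyncio

class Throttler:
    """Escalating per-resource delays after repeated unrecoverable errors."""

    def __init__(self, delays=(1, 10, 60, 600)):
        self.delays = delays
        self.failures = {}  # resource key -> consecutive error count

    async def wait(self, key: str) -> None:
        # Sleep longer for each consecutive failure of the same resource;
        # resources with no recorded failures are not delayed at all.
        count = self.failures.get(key, 0)
        if count:
            delay = self.delays[min(count, len(self.delays)) - 1]
            await asyncio.sleep(delay)

    def failed(self, key: str) -> None:
        self.failures[key] = self.failures.get(key, 0) + 1

    def succeeded(self, key: str) -> None:
        self.failures.pop(key, None)  # success resets the throttling
```

A success clears the counter, so only persistent failures keep escalating.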
Implementation note: there is already a safety net for the root tasks, such as watchers: if they fail, the operator stops. But the workers are not covered by this, since they are fire-and-forget tasks. So, they have to "escalate" the errors in their own way: by setting a fatal flag and dumping their own stack trace.
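Since nobody awaits a fire-and-forget task, its exception would otherwise go unnoticed. A minimal sketch of the escalation pattern described above (the names `worker`, `handle`, and `fatal_flag` are illustrative, not Kopf's actual internals):

```python
import asyncio
import logging

logger = logging.getLogger("operator")

async def handle(event: object) -> None:
    # Stand-in for per-event processing; fails on a "bad" event.
    if event == "bad":
        raise ValueError("unprocessable event")

async def worker(queue: asyncio.Queue, fatal_flag: asyncio.Event) -> None:
    # Fire-and-forget: no root task awaits this coroutine's result,
    # so it must report failures itself.
    while True:
        event = await queue.get()
        try:
            await handle(event)
        except asyncio.CancelledError:
            raise
        except Exception:
            # Dump our own stack trace and set the fatal flag that the
            # root tasks watch, so the whole operator stops.
            logger.exception("Worker failed; escalating to stop the operator.")
            fatal_flag.set()
            return
        finally:
            queue.task_done()
```

The root tasks then only need to watch `fatal_flag` alongside their own failures to decide when to shut everything down.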
Side-changes:
- `asyncio.CancelledError` raised from inside an `except:` block is re-raised instead of being ignored. This is unlikely to happen, but just in case.
- `functools.partial` objects (processors) are no longer logged with all their arguments, as this could eventually lead to some data leaks into the logs.

Issues/PRs