zalando-incubator / kopf

A Python framework to write Kubernetes operators in just few lines of code.
https://kopf.readthedocs.io
MIT License

kopf subhandlers miss some events #279

Open pshchelo opened 4 years ago

pshchelo commented 4 years ago

Expected Behavior

When the watched resource is updated several times in quick succession, all subhandlers should run for at least the latest state of the watched resource.

Actual Behavior

Some subhandlers are not executed at all when the watched resource is updated several times quickly enough.

Steps to Reproduce the Problem

(using kopf namespace and kopfexamples custom resource)

  1. minimal controller (`subhandlers.py`):

```python
import functools
import kopf

N_HANDLERS = 5

@kopf.on.update("zalando.org", "v1", "kopfexamples")
@kopf.on.create("zalando.org", "v1", "kopfexamples")
async def ensure(body, logger, event, **kwargs):
    fns = {s: functools.partial(handle, s) for s in range(N_HANDLERS)}
    await kopf.execute(fns=fns)
    return {"message": f"{event}d"}

async def handle(s, logger, event, spec, **kwargs):
    logger.info(f"Handler {s} for event {event} with field {spec['field']}")
```

  2. run it with `kopf run --standalone -n kopf subhandlers.py`
  3. create example resource

```yaml
apiVersion: zalando.org/v1
kind: KopfExample
metadata:
  name: kopf-example-1
  namespace: kopf
spec:
  field: value
```

  4. quickly patch it with 2 different values for spec.field:

```shell
kubectl -n kopf patch kopfexamples.zalando.org kopf-example-1 --type json \
  -p='[{"op": "replace", "path": "/spec/field", "value": "value1"}]' && \
kubectl -n kopf patch kopfexamples.zalando.org kopf-example-1 --type json \
  -p='[{"op": "replace", "path": "/spec/field", "value": "value2"}]'
```

  5. see controller logs:

```
[2019-12-23 17:38:58,993] kopf.reactor.activit [INFO ] Initial authentication has been initiated.
[2019-12-23 17:38:59,003] kopf.activities.auth [INFO ] Handler 'login_via_pykube' succeeded.
[2019-12-23 17:38:59,004] kopf.reactor.activit [INFO ] Initial authentication has finished.
[2019-12-23 17:39:28,932] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 0 for event create with field value
[2019-12-23 17:39:28,933] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/0' succeeded.
[2019-12-23 17:39:29,056] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 1 for event create with field value
[2019-12-23 17:39:29,057] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/1' succeeded.
[2019-12-23 17:39:29,167] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 2 for event create with field value
[2019-12-23 17:39:29,169] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/2' succeeded.
[2019-12-23 17:39:29,279] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 3 for event create with field value
[2019-12-23 17:39:29,280] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/3' succeeded.
[2019-12-23 17:39:29,389] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 4 for event create with field value
[2019-12-23 17:39:29,392] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/4' succeeded.
[2019-12-23 17:39:29,392] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure' succeeded.
[2019-12-23 17:39:29,393] kopf.objects [INFO ] [kopf/kopf-example-1] All handlers succeeded for creation.
[2019-12-23 17:39:45,963] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 0 for event update with field value1
[2019-12-23 17:39:45,970] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/0' succeeded.
[2019-12-23 17:39:46,094] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 1 for event update with field value1
[2019-12-23 17:39:46,096] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/1' succeeded.
[2019-12-23 17:39:46,245] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 2 for event update with field value2
[2019-12-23 17:39:46,247] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/2' succeeded.
[2019-12-23 17:39:46,361] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 3 for event update with field value2
[2019-12-23 17:39:46,363] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/3' succeeded.
[2019-12-23 17:39:46,474] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 4 for event update with field value2
[2019-12-23 17:39:46,475] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure/4' succeeded.
[2019-12-23 17:39:46,476] kopf.objects [INFO ] [kopf/kopf-example-1] Handler 'ensure' succeeded.
[2019-12-23 17:39:46,477] kopf.objects [INFO ] [kopf/kopf-example-1] All handlers succeeded for update.
```


Note that subhandlers 0 and 1 were never called with `spec.field = value2`, so they missed that update entirely.
This causes real trouble when the subhandlers create or update other resources derived from the spec of the watched resource, as those derived resources can silently miss updates.
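The skipped updates are consistent with per-subhandler progress checkpointing: once a subhandler is recorded as succeeded, later processing cycles skip it, even if the spec changed in between. Below is a minimal, self-contained simulation of that idea; all names (`process`, `progress`, `budget`) are invented for illustration, and this is not kopf's actual internal code.

```python
# Hypothetical simulation of per-subhandler checkpointing (invented names;
# NOT kopf's actual code). Once a subhandler is recorded as succeeded, it
# is skipped in later cycles, even if the spec has changed in between.

def process(spec, progress, budget, n_handlers=5):
    """One processing cycle: run at most `budget` still-pending subhandlers
    against `spec`, and return which spec value each of them saw."""
    seen = {}
    for s in range(n_handlers):
        if s in progress:      # succeeded earlier: skipped forever,
            continue           # even though `spec` may be newer now
        if budget == 0:        # a new watch event interrupts the cycle
            break
        seen[s] = spec["field"]
        progress.add(s)        # checkpoint: never re-run this subhandler
        budget -= 1
    return seen

progress = set()
# Cycle 1: only subhandlers 0-1 finish before the next patch arrives.
first = process({"field": "value1"}, progress, budget=2)
# Cycle 2: the remaining subhandlers 2-4 run against the newer spec.
second = process({"field": "value2"}, progress, budget=5)
# Subhandlers 0 and 1 never see "value2" -- matching the logs above.
```

Under this model, the "missed" subhandlers are exactly the ones checkpointed before the second patch arrived.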

## Specifications

  - Platform: Ubuntu 18.04
  - Kubernetes version: *v1.15.3*
  - Python version: *3.7.3*
  - Python packages installed: *aiohttp==3.6.2, aiojobs==0.2.2, async-timeout==3.0.1, attrs==19.3.0, certifi==2019.11.28, chardet==3.0.4, Click==7.0, idna==2.8, iso8601==0.1.12, kopf==0.24, multidict==4.7.2, pykube-ng==19.12.0, PyYAML==5.2, requests==2.22.0, typing-extensions==3.7.4.1, urllib3==1.25.7, yarl==1.4.2*
pshchelo commented 4 years ago

When calling the 'subhandler' functions manually (without `kopf.execute`), an interesting behaviour still occurs: each retry of the main handler uses the latest state of the watched object, and this may be the reason the behaviour described in this issue manifests. Consider the following handler code (a slight modification of the previous example):

```python
import functools
import kopf

N_HANDLERS = 5

@kopf.on.update("zalando.org", "v1", "kopfexamples")
@kopf.on.create("zalando.org", "v1", "kopfexamples")
async def ensure(body, logger, event, **kwargs):
    fns = {s: functools.partial(handle, s) for s in range(N_HANDLERS)}
    for n, f in fns.items():
        await f(logger, event, body, **kwargs)
    return {"message": f"{event}d"}

async def handle(s, logger, event, body, spec, **kwargs):
    if s == 1 and kwargs['retry'] == 0:
        raise kopf.TemporaryError(f"Handler {s} failed, will retry")
    logger.info(f"Handler {s} for event {event} with field {spec['field']}")
```
Running the same repro procedure as described above produces the following logs:

```
[2019-12-24 09:04:09,829] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 0 for event update with field value1
[2019-12-24 09:04:09,829] kopf.objects         [ERROR   ] [kopf/kopf-example-1] Handler 'ensure' failed temporarily: Handler 1 failed, will retry
[2019-12-24 09:05:09,994] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 0 for event update with field value2
[2019-12-24 09:05:09,994] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 1 for event update with field value2
[2019-12-24 09:05:09,994] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 2 for event update with field value2
[2019-12-24 09:05:09,995] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 3 for event update with field value2
[2019-12-24 09:05:09,995] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 4 for event update with field value2
[2019-12-24 09:05:09,996] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'ensure' succeeded.
[2019-12-24 09:05:09,996] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for update.
```

Since the exception was not raised a second time (for value2 in 'subhandler' 1), it looks like the handler was already in 'retry' mode due to the previous exception, yet it was already using the new values from the latest spec of the watched object.
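That observation can be illustrated with a small stand-alone simulation: a per-handler retry counter that persists across processing cycles, while each cycle reads the latest object state. All names here (`FakeLogger`, `run_cycle`) are invented for illustration; this is not kopf's internal code.

```python
# Hypothetical sketch of why the TemporaryError is not raised again:
# the retry counter survives across cycles, while each cycle re-reads
# the *latest* object state. Invented names; NOT kopf's actual code.

class FakeLogger:
    def __init__(self):
        self.lines = []
    def info(self, msg):
        self.lines.append(msg)

def run_cycle(spec, retry, logger, n_handlers=5):
    """One processing cycle: run every subhandler against `spec`."""
    for s in range(n_handlers):
        if s == 1 and retry == 0:
            raise RuntimeError(f"Handler {s} failed, will retry")
        logger.info(f"Handler {s} with field {spec['field']}")

logger = FakeLogger()
retry = 0
try:
    run_cycle({"field": "value1"}, retry, logger)  # fails at handler 1
except RuntimeError:
    retry += 1                                     # counter persists

# The object is patched to value2 before the retry fires; the retry
# cycle therefore sees value2 everywhere, the `retry == 0` guard no
# longer trips, and handler 0 never re-logs value1.
run_cycle({"field": "value2"}, retry, logger)
```

Under this model, only one log line ever mentions value1, matching the logs above.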

nolar commented 4 years ago

Looks like a known problem (with no issue for it yet, so this one will be it). I've made a few attempts to solve this problem previously: #9, and then again in #163, but the latter failed spectacularly after the merge (I had to revert it in #182).

The fix should be re-implemented sooner or later, but extra care is needed for the cases where the previous attempts failed.

Thanks for this detailed report and the sample reproduction scenario; it will help when the fix is implemented the next time.

chainhead commented 4 years ago

Just curious... would you say this is a Python issue or a Kubernetes issue?

jumpojoy commented 4 years ago

It looks like this bug prevents reliably watching for changes of specific fields of an object, such as its status: we might miss a specific status-update event, and the latest change will then not be handled at all.