zigpy / zigpy-znp

TI CC2531, CC13x2, CC26x2 radio support for Zigpy and ZHA
GNU General Public License v3.0
145 stars 40 forks source link

Add asyncio.exceptions.CancelledError as accepted error for retryable_request #141

Closed mdeweerd closed 2 years ago

mdeweerd commented 2 years ago

I got a "asyncio.CanceledError" on an outgoing command, and that reminded me that this error is not catched in the retryable_request :

https://github.com/zigpy/zigpy/blob/fc8e923e2f2482810b9643c0b2bde149a56a1930/zigpy/util.py#L145

There are quite a few methods that are not defined as retryable - I do not know if you want to update thos. I am implementing the retryable functionnality in zha-toolkit when I need it and when it's missing in zigpy.

puddly commented 2 years ago

If an outgoing command is cancelled, it shouldn't be retried. Preventing cancellation doesn't seem like a good idea.

mdeweerd commented 2 years ago

It happened during a script run where I do 101 iterations and this exception stopped the loop.

I do not have an apparent reason why this occurred.

After adding Cancelled to the accepted exceptions list, it passed.

For the record, this is an exception trace (HA2021.12.10):

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1368, in _send_request_raw
    response = await asyncio.shield(
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/core.py", line 1511, in catch_exceptions
    await coro_or_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 1530, in _execute_service
    await handler.job.target(service_call)
  File "/config/custom_components/zha_toolkit/__init__.py", line 415, in toolkit_service
    raise handler_exception
  File "/config/custom_components/zha_toolkit/__init__.py", line 379, in toolkit_service
    await handler(
  File "/config/custom_components/zha_toolkit/zcl_attr.py", line 255, in attr_write
    result_read = await cluster.read_attributes(
  File "/usr/local/lib/python3.9/site-packages/zigpy/zcl/__init__.py", line 297, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
  File "/usr/local/lib/python3.9/site-packages/zigpy/device.py", line 265, in request
    result, msg = await self._application.request(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 453, in request
    return await self._send_request(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1483, in _send_request
    response = await self._send_request_raw(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1386, in _send_request_raw
    raise InvalidCommandResponse(
  File "/usr/local/lib/python3.9/site-packages/async_timeout/__init__.py", line 128, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.9/site-packages/async_timeout/__init__.py", line 203, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

You can close this for me - I am handling it locally anyway.

puddly commented 2 years ago

What function call is generating CancelledError? If you're not explicitly cancelling it then this may be a bug in zigpy-znp.

mdeweerd commented 2 years ago
await cluster._write_attributes(attrs, manufacturer=manufacturer)

with manufacturer None, writing 11 to endpoint 1, cluster 513, attribute 8, attr_type 32.

That succeeded in 10 loop cycles before that (with other attribute values), and 101 times after accepting CancelledError in the retryable_request exceptions (which does not mean that CancelledError occurred).

As far as I know I did nothing what could have cancelled this.

puddly commented 2 years ago

Can you provide debug logs of zigpy-znp==0.7.0 sending these requests?

mdeweerd commented 2 years ago

Ok, my setups are HAOS based - 2022.2.3 has this version so I'll upgrade my setup.

What do you want me to acdtivate for debugging? An HA logger setup like this is ok (you know, but just to make sure you notice it: the # lines are not activated).

default: warning
logs:
  #homeassistant.core: debug
  #homeassistant.components.zha: debug
  #bellows.zigbee.application: debug
  #bellows.ezsp: debug
  custom_components.zha: debug
  zigpy: debug
  zigpy_znp: debug
  #zigpy_deconz.zigbee.application: debug
  #zigpy_deconz.api: debug
  ##zigpy_xbee.zigbee.application: debug
  ##zigpy_xbee.api: debug
  #zigpy_zigate: debug
  #zhaquirks: debug
  ## Extra, not sure this works:
  #zigpy.zdo: debug
  #zigpy.device: debug
  #zigpy_znp.api: debug
  #zigpy_znp.zigbee.zdo_converters: debug
  #zigpy_znp.zigbee.application: debug
  #zigpy_znp.commands.application: debug
  #zigpy_znp.commands.zdo: debug
  #zigpy_znp.commands.znp: debug
  #zigpy_znp.logger: debug
  #zigpy.util: debug
mdeweerd commented 2 years ago

A TimeoutError following an attribute read seems to be converted in a Cancelled Error zigpy 0.43.0 zigpy-znp 0.7.0:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1286, in _send_request_raw
    response = await asyncio.shield(
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 372, in _async_step
    await getattr(self, handler)()
  File "/usr/src/homeassistant/homeassistant/helpers/script.py", line 575, in _async_call_service_step
    await service_task
  File "/usr/src/homeassistant/homeassistant/core.py", line 1630, in async_call
    task.result()
  File "/usr/src/homeassistant/homeassistant/core.py", line 1667, in _execute_service
    await cast(Callable[[ServiceCall], Awaitable[None]], handler.job.target)(
  File "/config/custom_components/zha_toolkit/__init__.py", line 561, in toolkit_service
    raise handler_exception
  File "/config/custom_components/zha_toolkit/__init__.py", line 525, in toolkit_service
    await handler(
  File "/config/custom_components/zha_toolkit/__init__.py", line 608, in command_handler_default
    await default.default(
  File "/config/custom_components/zha_toolkit/default.py", line 33, in default
    await handler(app, listener, ieee, cmd, data, service, params, event_data)
  File "/config/custom_components/zha_toolkit/zcl_attr.py", line 392, in attr_write
    result_read = await cluster_read_attributes(
  File "/config/custom_components/zha_toolkit/zcl_attr.py", line 252, in cluster_read_attributes
    return await cluster.read_attributes(attrs, manufacturer=manufacturer)
  File "/usr/local/lib/python3.9/site-packages/zigpy/zcl/__init__.py", line 297, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
  File "/usr/local/lib/python3.9/site-packages/zigpy/device.py", line 265, in request
    result, msg = await self._application.request(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 439, in request
    return await self._send_request(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1401, in _send_request
    response = await self._send_request_raw(
  File "/usr/local/lib/python3.9/site-packages/zigpy_znp/zigbee/application.py", line 1304, in _send_request_raw
    raise InvalidCommandResponse(
  File "/usr/local/lib/python3.9/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.9/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
puddly commented 2 years ago

I've looked a bit closer at this problem and I think the extra traceback context is causing confusion. asyncio.CancelledError is never raised, only asyncio.TimeoutError:

import asyncio
import async_timeout

async def inner():
    async with async_timeout.timeout(0.1):
        await asyncio.sleep(1)

async def main():
    try:
        await asyncio.shield(inner())
    except Exception as e:
        print(f'!!! Caught exception: {e!r}')
        raise

if __name__ == "__main__":
    asyncio.run(main())

Output:

!!! Caught exception: TimeoutError()
Traceback (most recent call last):
  File "/tmp/test.py", line 7, in inner
    await asyncio.sleep(1)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
    return await future
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/test.py", line 18, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
    return future.result()
  File "/tmp/test.py", line 12, in main
    await asyncio.shield(inner())
  File "/tmp/test.py", line 6, in inner
    async with async_timeout.timeout(0.1):
  File "/tmp/venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/tmp/venv/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

In short, catching asyncio.CancelledError within any asyncio code without a very good reason will not solve this problem and would introduce other more serious bugs.

mdeweerd commented 2 years ago

Hi @puddly

Baseed on your last message, I found in python3.9 code (asyncio/tasks.py):

async def sleep(delay, result=None, *, loop=None):
    """Coroutine that completes after a given time (in seconds)."""
    if delay <= 0:
        await __sleep0()
        return result

    if loop is None:
        loop = events.get_running_loop()
    else:
        warnings.warn("The loop argument is deprecated since Python 3.8, "
                      "and scheduled for removal in Python 3.10.",
                      DeprecationWarning, stacklevel=2)

    future = loop.create_future()
    h = loop.call_later(delay,
                        futures._set_result_unless_cancelled,
                        future, result)
    try:
        return await future
    finally:
        h.cancel()

and in futures.py:

    def cancel(self):
        """Cancel the future and schedule callbacks.

        If the future is already done or cancelled, return False.  Otherwise,
        change the future's state to cancelled, schedule the callbacks and
        return True.
        """
        self.__log_traceback = False
        if self._state != _PENDING:
            return False
        self._state = _CANCELLED
        self.__schedule_callbacks()
        return True

[...]

    def result(self):
        """Return the result this future represents.

        If the future has been cancelled, raises CancelledError.  If the
        future's result isn't yet available, raises InvalidStateError.  If
        the future is done and has an exception set, this exception is raised.
        """
        if self._state == _CANCELLED:
            raise exceptions.CancelledError

While I haven't checked in detail if those items are part of the execution path in this case, it looks like a reasonnable assumption that a timeout on a loop results in a CancelledError exception.

puddly commented 2 years ago

The cancellation error does occur but it never leaves the timeout library, as expected: https://github.com/aio-libs/async-timeout/blob/12624c224268a0d24437739b995c601d4f47887f/async_timeout/__init__.py#L210

I'm unable to reproduce this issue myself. The only way I can see catching the cancellation error helping things is if it's preventing a higher level call from cancelling the request, which is unexpected behavior.