zigpy / bellows

A Python 3 project to implement EZSP for EmberZNet devices
GNU General Public License v3.0

Prevent task cancellation from propagating to ASH #628

Closed: puddly closed this 3 weeks ago

puddly commented 3 weeks ago

Root cause of https://github.com/home-assistant/core/issues/119424.

During device joining, zigpy cancels a scheduled initialization task if the device re-joins while initialization is still running (this is pretty common). Unfortunately, this cancellation propagates all the way down to the ASH sending task, causing the TX sequence number to increment without waiting for an acknowledgement. There is currently a firmware bug in EmberZNet: although ASH can in principle support multiple pending frames at a time, in practice the stack crashes if more than one frame is pending 😄.
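To make the failure mode concrete, here is a minimal sketch of how cancelling an outer task reaches the innermost `await`. `ToySender`, `initialize_device`, and the `finally`-based increment are hypothetical stand-ins chosen to reproduce the symptom; they are not the bellows ASH implementation.

```python
import asyncio


class ToySender:
    """Hypothetical ASH-like sender (illustration only, not bellows code)."""

    def __init__(self) -> None:
        self.tx_seq = 0  # next TX sequence number
        self.acked = 0   # frames for which an ACK was actually received

    async def send(self) -> None:
        try:
            # Stand-in for "frame written, now waiting for the ACK".
            await asyncio.sleep(0.05)
            self.acked += 1
        finally:
            # Assumed behaviour: the sequence number advances on every exit
            # path, including when the wait above is cancelled.
            self.tx_seq = (self.tx_seq + 1) % 8


async def initialize_device(sender: ToySender) -> None:
    # Stand-in for a higher-level task that ends up sending a frame.
    await sender.send()


async def demo() -> tuple:
    sender = ToySender()
    task = asyncio.create_task(initialize_device(sender))
    await asyncio.sleep(0.01)  # let send() start waiting for its ACK
    task.cancel()              # e.g. the device re-joined
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sender.tx_seq, sender.acked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this toy model the cancelled task leaves `tx_seq` advanced even though no ACK was counted, so the next send would start while the previous frame is still outstanding.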

This fix prevents an ASH send from being cancelled by wrapping it in `asyncio.shield`, which schedules the send in its own task so that cancelling the caller does not cancel the send itself. Incrementing the TX sequence number only after a frame has been sent would have similar issues, because our last send may not have been ACKed. An alternative to this approach would be to avoid using coroutines entirely for sending and make the ASH protocol implementation rely on event loop callbacks for timeouts.
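The shielding idea can be sketched as follows. This is an illustration under assumptions, not the code from this PR: `ShieldedSender` and `_send_and_wait_for_ack` are hypothetical names, and only `asyncio.shield`, `asyncio.create_task`, and `Task.add_done_callback` are real asyncio APIs.

```python
import asyncio


class ShieldedSender:
    """Hypothetical toy sender (illustration only, not bellows code)."""

    def __init__(self) -> None:
        self.tx_seq = 0
        self.acked = 0
        self._pending = set()  # strong references to in-flight send tasks

    async def _send_and_wait_for_ack(self) -> None:
        await asyncio.sleep(0.05)  # stand-in for waiting on the ACK
        self.acked += 1
        self.tx_seq = (self.tx_seq + 1) % 8

    async def send(self) -> None:
        # Run the send in its own task and keep a reference so it is not
        # garbage collected while still running.
        task = asyncio.create_task(self._send_and_wait_for_ack())
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        # shield() lets the caller be cancelled without cancelling `task`.
        await asyncio.shield(task)


async def demo() -> tuple:
    sender = ShieldedSender()
    caller = asyncio.create_task(sender.send())
    await asyncio.sleep(0.01)  # let the inner send start
    caller.cancel()
    try:
        await caller
        caller_cancelled = False
    except asyncio.CancelledError:
        caller_cancelled = True
    await asyncio.sleep(0.1)   # give the shielded send time to finish
    return caller_cancelled, sender.tx_seq, sender.acked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller still observes `CancelledError`, but the inner send runs to completion, so the sequence number only advances once the simulated ACK has arrived.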

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.72%. Comparing base (09cf7ce) to head (e000846).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##              dev     #628   +/-   ##
=======================================
  Coverage   99.72%   99.72%
=======================================
  Files          75       75
  Lines        5002     5016   +14
=======================================
+ Hits         4988     5002   +14
  Misses         14       14
```


tube0013 commented 3 weeks ago

2 hours in this is working well, no more NCP failures.