zeromq / pyzmq

PyZMQ: Python bindings for zeromq
http://zguide.zeromq.org/py:all
BSD 3-Clause "New" or "Revised" License

Asyncio Race Condition Leading to Infinite Loop #2001

Open TheTechromancer opened 1 month ago

TheTechromancer commented 1 month ago

This is a pyzmq bug

What pyzmq version?

26.0.3

What libzmq version?

4.3.5

Python version (and how it was installed)

Python 3.9 via apt

OS

Debian

What happened?

Recently I ran into a bug in CPython that directly affects ZMQ. It triggers whenever asyncio debugging is enabled and a ZMQ future blocks for more than 0.1 seconds, which is the point where asyncio's debug mode logs its slow-callback warning:

https://github.com/python/cpython/blob/7c2921844f9fa713f93152bf3a569812cee347a0/Lib/asyncio/base_events.py#L2021-L2023
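For anyone unfamiliar with that code path, here is a minimal sketch of the warning itself, without ZMQ and without the circular reference, so it is safe to run on any Python version. The `ListHandler` class and the `blocking_step` coroutine are names made up for this illustration; the only asyncio behavior relied on is that debug mode logs to the `asyncio` logger when a single step runs longer than `loop.slow_callback_duration` (0.1 seconds by default).

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Hypothetical helper: collects formatted log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


async def blocking_step():
    # Blocks the event loop for longer than the 0.1 s default threshold.
    time.sleep(0.2)


handler = ListHandler()
asyncio_logger = logging.getLogger("asyncio")
asyncio_logger.addHandler(handler)
try:
    asyncio.run(blocking_step(), debug=True)
finally:
    asyncio_logger.removeHandler(handler)

# The warning text embeds the repr() of the task that was slow.
slow_warnings = [m for m in handler.messages if "took" in m]
print(slow_warnings)
```

The important detail is that the warning message contains the `repr()` of the slow task, which is where the recursion described in this issue starts.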

The bug is an unintended recursion that happens when repr() is called on an asyncio task. ZMQ's future stores references to other futures, including itself, which creates a circular reference. Because each new layer of recursion has to iterate over multiple futures, the work multiplies at every level and a RecursionError is never reached in practice. The result looks like a deadlock, with the CPU stuck at 100%.
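To illustrate why the recursion limit never rescues the process, here is a toy model, not asyncio or pyzmq code. `Unguarded`, `Guarded`, and `count_calls` are hypothetical names, and the explicit `depth_left` budget stands in for the interpreter's recursion limit. With one self-reference the number of calls grows linearly with depth; with two it doubles at every level. The `Guarded` class shows the kind of protection `reprlib.recursive_repr` provides; whether that matches the exact change made in CPython 3.11 is an assumption on my part.

```python
import reprlib


class Unguarded:
    """Toy object whose rendering visits refs that point back at itself."""

    calls = 0

    def __init__(self, fanout):
        self.refs = [self] * fanout

    def render(self, depth_left):
        # depth_left plays the role of the recursion limit.
        type(self).calls += 1
        if depth_left == 0:
            return "..."
        inner = ", ".join(r.render(depth_left - 1) for r in self.refs)
        return "<" + inner + ">"


def count_calls(fanout, depth):
    """Return how many render() calls one top-level render() needs."""
    Unguarded.calls = 0
    Unguarded(fanout).render(depth)
    return Unguarded.calls


class Guarded:
    """Same shape, but repr() stops as soon as it re-enters itself."""

    def __init__(self, fanout):
        self.refs = [self] * fanout

    @reprlib.recursive_repr(fillvalue="...")
    def __repr__(self):
        return "<Guarded refs=%r>" % (self.refs,)


print(count_calls(1, 16))  # 17
print(count_calls(2, 16))  # 131071, i.e. 2**17 - 1
print(repr(Guarded(2)))    # <Guarded refs=[..., ...]>
```

With a fan-out of 2 and Python's default recursion limit of 1000, the unguarded version would need on the order of 2**1000 calls before any branch could raise RecursionError, which is why it presents as a hang.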

This is mainly a bug in CPython, and it was fixed in 3.11. However, 3.10 and earlier are still vulnerable, and based on the feedback on the CPython issue, the fix will not be backported to those older versions.

https://github.com/python/cpython/issues/122296

I'm creating this issue so you're aware of it, and so anyone else googling for the problem can find it. This one was a beast to track down, since it only happens when PYTHONASYNCIODEBUG=1 is set and the ZMQ future blocks for more than 0.1 seconds. Hopefully it's helpful to someone.
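Given those two conditions, one possible mitigation on affected interpreters is to keep debug mode on but stop the slow-callback warning from firing. This is only a sketch and has not been tested against pyzmq: `is_affected` and `silence_slow_callback_warning` are hypothetical helpers, and other debug-mode code paths that repr() a task may still hit the same recursion. The only real API used is `loop.get_debug()` and the `loop.slow_callback_duration` attribute.

```python
import asyncio
import sys


def is_affected(loop):
    """True when the interpreter predates the 3.11 fix and debug mode is on."""
    return sys.version_info < (3, 11) and loop.get_debug()


def silence_slow_callback_warning(loop):
    """Hypothetical workaround: never treat a callback as 'slow'.

    Returns True if the threshold was changed.
    """
    if is_affected(loop):
        # With an infinite threshold the warning (and its repr) is skipped.
        loop.slow_callback_duration = float("inf")
        return True
    return False


async def main():
    loop = asyncio.get_running_loop()
    changed = silence_slow_callback_warning(loop)
    return changed, loop.slow_callback_duration


changed, duration = asyncio.run(main(), debug=True)
print(changed, duration)
```

The trade-off is that genuinely slow callbacks are no longer reported, so this only makes sense as a stopgap on Python 3.10 and earlier.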

Full traceback: python_traceback.txt

Code to reproduce bug

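import asyncio
import time
import functools

async def slow_callback():
    await asyncio.sleep(.1)
    time.sleep(.2)  # Blocking sleep to trigger the warning
    await asyncio.sleep(.1)

async def main():
    task = asyncio.create_task(slow_callback())
    # The callback holds two references back to the task itself, so
    # repr(task) -> repr(callback) -> repr(task) branches at every level.
    task.add_done_callback(
        functools.partial(print, [task, task])
    )
    await task

if __name__ == "__main__":
    # Hangs at 100% CPU on Python 3.10 and earlier; completes on 3.11+.
    asyncio.run(main(), debug=True)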
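The reproducer works because a pending task's repr() includes its done callbacks. Here is a harmless sketch of that behavior, using a callback that does not reference the task, so it is safe on any Python version. `on_done` and `work` are placeholder names for this example.

```python
import asyncio


def on_done(task):
    """Placeholder done callback; holds no reference to the task."""


async def work():
    await asyncio.sleep(0)


async def main():
    task = asyncio.create_task(work())
    task.add_done_callback(on_done)
    # The task is still pending here, so its callbacks show up in the repr.
    text = repr(task)
    await task
    return text


task_repr = asyncio.run(main())
print(task_repr)
```

The printed repr contains a `cb=[...]` section naming `on_done`. Replace that callback with one whose arguments include the task itself and the repr has to describe the task again, which is the loop the reproducer above sets up.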

Traceback, if applicable

No response

More info

No response

minrk commented 1 month ago

Thanks for the report! I'm not sure there's an easy fix, but if one turns up I'll give it a try. Hopefully this will help others find out what's going on, at least.