zerotier / libzt

Encrypted P2P sockets over ZeroTier
https://zerotier.com
Other
183 stars 54 forks source link

segfault when running simultaneous client/server processes using libzt in Python on M1 Mac #142

Open bostonrwalker opened 2 years ago

bostonrwalker commented 2 years ago

I am experiencing segmentation faults when attempting to run simultaneous client/server processes in Python on an M1 Mac.

This issue only happens when running a simultaneous client and server, i.e. the issue does not occur when running multiple server processes at the same time with no client process.

libzt version 1.4.2 Python version: 3.7.9 Architecture: M1 Mac

Here is the crash dump:


Translated Report (Full Report Below)

Process: Python [27458] Path: /Library/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python Identifier: org.python.python Version: 3.7.9 (3.7.9) Code Type: X86-64 (Translated) Parent Process: pycharm [563] Responsible: pycharm [563] User ID: 501

Date/Time: 2021-12-06 17:35:29.3678 -0500 OS Version: macOS 12.0.1 (21A559) Report Version: 12 Anonymous UUID: ...

Sleep/Wake UUID: ...

Time Awake Since Boot: 100000 seconds Time Since Wake: 7751 seconds

System Integrity Protection: enabled

Crashed Thread: 3

Exception Type: EXC_BAD_ACCESS (SIGSEGV) Exception Codes: KERN_INVALID_ADDRESS at 0x0000d7b29c65c050 -> 0x000057b29c65c050 (possible pointer authentication failure) Exception Codes: 0x0000000000000001, 0x0000d7b29c65c050 Exception Note: EXC_CORPSE_NOTIFY

VM Region Info: 0x57b29c65c050 is not in any region. Bytes after previous region: 95943603372113 Bytes before following region: 9128476557232 REGION TYPE START - END [ VSIZE] PRT/MAX SHRMOD REGION DETAIL commpage (reserved) 1000000000-7000000000 [384.0G] ---/--- SM=NUL ...(unallocated) ---> GAP OF 0x5f9000000000 BYTES MALLOC_NANO 600000000000-600008000000 [128.0M] rw-/rwx SM=PRV

Thread 0:: Dispatch queue: com.apple.main-thread 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043df506 psynch_cvwait + 10 2 Python 0x108d3c2dc take_gil + 252 3 Python 0x108d3c88f PyEval_RestoreThread + 31 4 Python 0x108da2a43 os_open + 227 5 Python 0x108c86a79 _PyMethodDef_RawFastCallKeywords + 457 6 Python 0x108c85f6a _PyCFunction_FastCallKeywords + 42 7 Python 0x108d46004 call_function + 724 8 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 9 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 10 Python 0x108c85b0b _PyFunction_FastCallDict + 523 11 Python 0x108c86e2f _PyObject_Call_Prepend + 143 12 Python 0x108c85889 _PyObject_FastCallDict + 297 13 Python 0x108c87006 _PyObject_CallFunctionVa + 230 14 Python 0x108c871c2 _PyObject_CallFunction_SizeT + 146 15 Python 0x108dd2cea _ioFileIOinit__ + 1082 16 Python 0x108cd1199 type_call + 297 17 Python 0x108c85889 _PyObject_FastCallDict + 297 18 Python 0x108c87006 _PyObject_CallFunctionVa + 230 19 Python 0x108c871c2 _PyObject_CallFunction_SizeT + 146 20 Python 0x108dd0f18 _io_open + 1000 21 Python 0x108c86a79 _PyMethodDef_RawFastCallKeywords + 457 22 Python 0x108c85f6a _PyCFunction_FastCallKeywords + 42 23 Python 0x108d46004 call_function + 724 24 Python 0x108d43124 _PyEval_EvalFrameDefault + 25508 25 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 26 Python 0x108c85f31 _PyFunction_FastCallKeywords + 257 27 Python 0x108d46012 call_function + 738 28 Python 0x108d42fc3 _PyEval_EvalFrameDefault + 25155 29 Python 0x108c863a0 function_code_fastcall + 128 30 Python 0x108d46012 call_function + 738 31 Python 0x108d42fc3 _PyEval_EvalFrameDefault + 25155 32 Python 0x108c863a0 function_code_fastcall + 128 33 Python 0x108c85994 _PyFunction_FastCallDict + 148 34 Python 0x108d432ee _PyEval_EvalFrameDefault + 25966 35 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 36 Python 0x108c85f31 _PyFunction_FastCallKeywords + 257 37 Python 0x108d46012 call_function + 738 38 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 39 Python 0x108c863a0 function_code_fastcall + 128 40 Python 0x108d46012 call_function + 738 41 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 42 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 43 Python 0x108c85b0b _PyFunction_FastCallDict + 523 44 Python 0x108c86e2f _PyObject_Call_Prepend + 143 45 Python 0x108cd4376 slot_tp_call + 150 46 Python 0x108c85d81 _PyObject_FastCallKeywords + 433 47 Python 0x108d45ed4 call_function + 420 48 Python 0x108d43124 _PyEval_EvalFrameDefault + 25508 49 Python 0x108c863a0 function_code_fastcall + 128 50 Python 0x108d46012 call_function + 738 51 Python 0x108d4307e _PyEval_EvalFrameDefault + 25342 52 Python 0x108c863a0 function_code_fastcall + 128 53 Python 0x108c85994 _PyFunction_FastCallDict + 148 54 Python 0x108d432ee _PyEval_EvalFrameDefault + 25966 55 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 56 Python 0x108c85f31 _PyFunction_FastCallKeywords + 257 57 Python 0x108d46012 call_function + 738 58 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 59 Python 0x108c863a0 function_code_fastcall + 128 60 Python 0x108d46012 call_function + 738 61 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 62 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 63 Python 0x108c85b0b _PyFunction_FastCallDict + 523 64 Python 0x108c86e2f _PyObject_Call_Prepend + 143 65 Python 0x108cd4376 slot_tp_call + 150 66 Python 0x108c85d81 _PyObject_FastCallKeywords + 433 67 Python 0x108d45ed4 call_function + 420 68 Python 0x108d43124 _PyEval_EvalFrameDefault + 25508 69 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 70 Python 0x108c85f31 _PyFunction_FastCallKeywords + 257 71 Python 0x108d46012 call_function + 738 72 Python 0x108d42fdd _PyEval_EvalFrameDefault + 25181 73 Python 0x108d46b63 _PyEval_EvalCodeWithName + 2467 74 Python 0x108d3cca4 PyEval_EvalCode + 100 75 Python 0x108d7a3b1 PyRun_FileExFlags + 209 76 Python 0x108d79c2a PyRun_SimpleFileExFlags + 890 77 Python 0x108d990c3 pymain_main + 6915 78 Python 0x108d9962a _Py_UnixMain + 58 79 dyld 0x2003924fe start + 462 80 dyld 0x20038d000 ???

Thread 1:: com.apple.rosetta.exceptionserver 0 runtime 0x7ff7ffc688e4 0x7ff7ffc64000 + 18660 1 runtime 0x7ff7ffc75928 0x7ff7ffc64000 + 71976 2 runtime 0x7ff7ffc770a4 0x7ff7ffc64000 + 77988

Thread 2: 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043df3da __semwait_signal + 10 2 libsystem_c.dylib 0x7ff8042f47df usleep + 53 3 _libzt.cpython-37m-darwin.so 0x109f3203b ZeroTier::zts_main_lwip_driver_loop(void*) + 91 4 _libzt.cpython-37m-darwin.so 0x109fda30f thread_wrapper + 15 5 libsystem_pthread.dylib 0x7ff804419514 _pthread_start + 125 6 libsystem_pthread.dylib 0x7ff80441502f thread_start + 15

Thread 3 Crashed: 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043e3112 __pthread_kill + 10 2 libsystem_c.dylib 0x7ff8043281c4 raise + 26 3 libsystem_platform.dylib 0x7ff80442ee2d _sigtramp + 29 4 ??? 0x0 ??? 5 Python 0x108d3c401 take_gil + 545 6 Python 0x108d3c8aa PyEval_RestoreThread + 58 7 Python 0x108d73d15 PyGILState_Ensure + 133 8 _libzt.cpython-37m-darwin.so 0x109f41f8a ZeroTier::Events::run() + 426 9 _libzt.cpython-37m-darwin.so 0x109f2c510 cbRun + 16 10 libsystem_pthread.dylib 0x7ff804419514 _pthread_start + 125 11 libsystem_pthread.dylib 0x7ff80441502f thread_start + 15

Thread 4: 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043de20a __ulock_wait + 10 2 _libzt.cpython-37m-darwin.so 0x109f3083e ZeroTier::VirtualTap::~VirtualTap() + 222 3 _libzt.cpython-37m-darwin.so 0x109f388cd ZeroTier::NodeService::run() + 5661 4 _libzt.cpython-37m-darwin.so 0x109f2d402 _runNodeService + 18 5 libsystem_pthread.dylib 0x7ff804419514 _pthread_start + 125 6 libsystem_pthread.dylib 0x7ff80441502f thread_start + 15

Thread 5: 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043df506 __psynch_cvwait + 10 2 _libzt.cpython-37m-darwin.so 0x109fda717 sys_arch_sem_wait + 247 3 _libzt.cpython-37m-darwin.so 0x109fdaea8 sys_arch_mbox_fetch + 456 4 _libzt.cpython-37m-darwin.so 0x109fbedb2 tcpip_thread + 162 5 _libzt.cpython-37m-darwin.so 0x109fda30f thread_wrapper + 15 6 libsystem_pthread.dylib 0x7ff804419514 _pthread_start + 125 7 libsystem_pthread.dylib 0x7ff80441502f thread_start + 15

Thread 6: 0 ??? 0x7ff89489e940 ??? 1 libsystem_kernel.dylib 0x7ff8043df3da semwait_signal + 10 2 _libzt.cpython-37m-darwin.so 0x109f31e84 ZeroTier::VirtualTap::threadMain() + 340 3 libzt.cpython-37m-darwin.so 0x109f32919 void* ZeroTier::zt_threadMain(void*) + 9 4 libsystem_pthread.dylib 0x7ff804419514 _pthread_start + 125 5 libsystem_pthread.dylib 0x7ff80441502f thread_start + 15

Thread 3 crashed with X86 Thread State (64-bit): rax: 0x0000000000000000 rbx: 0x000000030185e000 rcx: 0x0000000000000000 rdx: 0x0000000000000000 rdi: 0xd15af4767077b3e9 rsi: 0x000000010032e088 rbp: 0xd15af4767077b3e9 rsp: 0x000000010032e088 r8: 0x000000000000000b r9: 0xd15af4767077b3e9 r10: 0x000000030185e000 r11: 0x00007f960df0b6a0 r12: 0x000000000000041b r13: 0x0000000108ec4640 r14: 0x000000000000000b r15: 0x0000000000000016 rip: rfl: 0x0000000000000283 tmp0: 0xffffffffffffffff tmp1: 0x00007ff89489e914 tmp2: 0x00007ff8043e3108

Binary Images: 0x0 - 0xffffffffffffffff ??? () <00000000-0000-0000-0000-000000000000> ??? 0x7ff8043dc000 - 0x7ff804412fff libsystem_kernel.dylib () <12bd6f13-c452-35ee-9069-51befef29f1a> /usr/lib/system/libsystem_kernel.dylib 0x108c64000 - 0x108e3ffff org.python.python (3.7.9, (c) 2001-2019 Python Software Foundation.) <98c0575f-9521-3790-9428-8a6a0bfa8161> /Library/Frameworks/Python.framework/Versions/3.7/Python 0x20038d000 - 0x2003f8fff dyld () <1a6ae033-9438-33c0-8077-988fd885250a> /usr/lib/dyld 0x7ff7ffc64000 - 0x7ff7ffc93fff runtime () <9f5d65be-d8d0-3979-bb05-e651a67e785c> /usr/libexec/rosetta/runtime 0x7ff8042e4000 - 0x7ff80436cfff libsystem_c.dylib () <991f58b7-b4c0-3c3a-84a8-c9c571de5a27> /usr/lib/system/libsystem_c.dylib 0x109f05000 - 0x10a010fff _libzt.cpython-37m-darwin.so () /Users/USER//_libzt.cpython-37m-darwin.so 0x7ff804413000 - 0x7ff80441efff libsystem_pthread.dylib () <29a2750e-f31b-3630-8761-242a6bc3e99e> /usr/lib/system/libsystem_pthread.dylib 0x7ff80442b000 - 0x7ff804434fff libsystem_platform.dylib (*) /usr/lib/system/libsystem_platform.dylib

External Modification Summary: Calls made by other processes targeting this process: task_for_pid: 0 thread_create: 0 thread_set_state: 0 Calls made by this process: task_for_pid: 0 thread_create: 0 thread_set_state: 0 Calls made by all processes on this machine: task_for_pid: 0 thread_create: 0 thread_set_state: 0

VM Region Summary: ReadOnly portion of Libraries: Total=462.2M resident=0K(0%) swapped_out_or_unallocated=462.2M(100%) Writable regions: Total=1.5G written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=1.5G(100%)

                            VIRTUAL   REGION 

REGION TYPE SIZE COUNT (non-coalesced) =========== ======= ======= Kernel Alloc Once 8K 1 MALLOC 239.1M 29 MALLOC guard page 96K 4 MALLOC_MEDIUM (reserved) 720.0M 6 reserved VM address space (unallocated) MALLOC_NANO (reserved) 384.0M 1 reserved VM address space (unallocated) Rosetta Arena 4096K 2 Rosetta Generic 924K 228 Rosetta IndirectBranch 128K 1 Rosetta JIT 128.0M 1 Rosetta Return Stack 120K 12 Rosetta Thread Context 120K 12 Stack 19.0M 6 Stack Guard 36K 6 VM_ALLOCATE 10.9M 40 VM_ALLOCATE (reserved) 16K 1 reserved VM address space (unallocated) DATA 3614K 160 DATA_CONST 5132K 82 DATA_DIRTY 208K 50 LINKEDIT 369.9M 45 OBJC_RO 81.6M 1 OBJC_RW 3120K 2 TEXT 92.4M 168 UNICODE 588K 1 dyld private memory 1024K 1 mapped file 5.4G 286 shared memory 32K 2 unshared pmap 2304K 2 =========== ======= ======= TOTAL 7.4G 1150 TOTAL, minus reserved VM space 6.4G 1150

Here is a code snippet to reproduce:

import queue
from multiprocessing import Process, Queue
from time import sleep

import libzt

NET_ID = 0x0123456789abcdef  # Substitute your own
PORT = 8000

def test_get_helloworld():

    _queue = Queue()

    def _listen(_queue: Queue, net_id: int, port: int):
        """Asynchronous server process"""

        # Initialize node
        node = libzt.ZeroTierNode()

        try:
            node.node_start()
            while not node.node_is_online():
                sleep(0.02)
            node.net_join(net_id)
            while not node.net_transport_is_ready(net_id):
                sleep(0.02)

            # Bind to port
            serv = libzt.socket(libzt.ZTS_AF_INET, libzt.ZTS_SOCK_STREAM, 0)
            serv.bind(('0.0.0.0', port))
            serv.listen(1)

            # Get ip address and return to main process
            ip_addr = node.addr_get_ipv4(net_id)
            _queue.put(ip_addr)

            # Handle incoming connections
            while True:
                conn, addr = serv.accept()
                print(f'Accepted connection from {addr}')

                data_rx = conn.recv(4096)
                message_rx = data_rx.decode('iso-8859-1')
                assert message_rx == f"""GET http://{ip_addr}:{port} HTTP/1.1"""

                message_tx = """Hello, world!"""
                data_tx = message_tx.encode('iso-8859-1')
                conn.send(data_tx)
                print(f'Sent message to {addr}')

                conn.close()
        finally:
            # Bring down node
            node.node_stop()
            node.node_free()

    server = Process(target=_listen, args=(_queue, NET_ID, PORT))

    try:
        server.start()

        try:
            ip_addr: str = _queue.get(block=True, timeout=60.)
        except queue.Empty as e:
            raise TimeoutError('Timed out waiting for server to come online') from e

        print(f'Server listening at IP address: {ip_addr}')

        # Initialize node
        node = libzt.ZeroTierNode()
        try:
            node.node_start()
            while not node.node_is_online():
                sleep(0.02)
            node.net_join(NET_ID)
            while not node.net_transport_is_ready(NET_ID):
                sleep(0.02)

            # Connect to server and send data
            client = libzt.socket(libzt.ZTS_AF_INET, libzt.ZTS_SOCK_STREAM, 0)
            client.connect((ip_addr, PORT))

            message_tx = f"GET http://{ip_addr}:{PORT} HTTP/1.1"
            data_tx = message_tx.encode('iso-8859-1')
            client.send(data_tx)

            data_rx = client.recv(1024)
            message_rx = data_rx.decode('iso-8859-1')
            client.close()

        finally:
            # Bring down node
            node.node_stop()
            node.node_free()

        assert message_rx == 'Hello, world!'

    finally:
        server.terminate()
        server.join(timeout=2.)
joseph-henry commented 2 years ago

Our library doesn't officially support running multiple nodes in a single process if that's what you're doing. We can consider changing that in the future but that's likely pretty far down the road.

In the meantime can you run these nodes as separate processes?

bostonrwalker commented 2 years ago

Thanks for the reply. This code should run the nodes in separate processes, although I'm not entirely sure how Python multiprocessing interacts with native libraries.

joseph-henry commented 2 years ago

Hmm. Interesting. The only thing I can think of is that its usage of the Python GIL isn't appropriate for use with multiprocessing import Process. I've tested this many times running a client and server on the same machine but when I do I'd independently start multiple python processes. Try that if you haven't.

Regardless I'd still consider this a bug so I will keep this ticket open.

bostonrwalker commented 2 years ago

Thanks. I tried importing libzt locally within the listen() callable (as per https://stackoverflow.com/questions/34860875/tkinter-spinbox-and-multiprocessing-causes-mac-segfault) and the segfaults are less frequent now.

def _listen(_queue: Queue, net_id: int, port: int):
    """Asynchronous server process"""
    # Local import within process
    import libzt
    ...

I will keep chasing the segfaults and try and provide another replicating example if I can.

bostonrwalker commented 2 years ago

Incidentally, this exact same code fails frequently with ConnectionAbortedError. From what I can tell, this happens whenever the connection process takes more than 60 seconds. Two issues:

  1. If the connection process is timing out, shouldn't the error be a TimeoutError?
  2. The API for setting the timeout should be implemented
bostonrwalker commented 2 years ago

See https://github.com/zerotier/libzt/issues/143

joseph-henry commented 2 years ago

Thank you. I'll take a closer look at these tickets next week.

bostonrwalker commented 2 years ago

Update: I was able to make two observations:

  1. The segfaults only happen when there is an error from the socket API (e.g. ZTS_ERR_ECONNABORTED)
  2. The segfaults don't happen when I step through in debug mode

This leads me to think there is a race condition and it is present in the error handling code. I had a look at handle_error() and it looks like it is calling errno() which is not thread-safe. I put a delay at the top of this function and it seems to fix the issue:

def handle_error(err):
    """Convert libzt error code to exception"""
    time.sleep(0.1)
    ...

So I suspect the issue is within the errno() function.