xorbitsai / xoscar

Python actor framework for heterogeneous computing.
https://xoscar.dev
Apache License 2.0

ENH: fix non-local client connection problem when server listen on 0.0.0.0 #92

Closed frostyplanet closed 2 months ago

frostyplanet commented 2 months ago

Scenario: the server (supervisor) listens on 0.0.0.0, perhaps because it has multiple IP addresses or because it sits behind an L4 load balancer. A client (worker) that connects to the server from another host (by IP or hostname) then hits the following error.

  File "/root/inference/xinference/deploy/worker.py", line 65, in _start_worker
    await start_worker_components(
  File "/root/inference/xinference/deploy/worker.py", line 43, in start_worker_components
    await xo.create_actor(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 78, in create_actor
    return await ctx.create_actor(actor_cls, *args, uid=uid, address=address, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 143, in create_actor
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 598, in create_actor
    await self._run_coro(message.message_id, actor.__post_create__())
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/root/inference/xinference/core/worker.py", line 192, in __post_create__
    await self._supervisor_ref.add_worker(self.address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 224, in send
    future = await self._call(actor_ref.address, send_message, wait=False)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 77, in _call
    return await self._caller.call(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 180, in call
    client = await self.get_client(router, dest_address)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 68, in get_client
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 143, in get_client
    client = await self._create_client(client_type, address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 157, in _create_client
    return await client_type.connect(address, local_address=local_address, **kw)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/communication/socket.py", line 255, in connect
    (reader, writer) = await asyncio.open_connection(host=host, port=port, **kwargs)
  File "/usr/lib/python3.10/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1076, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1060, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 969, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
    return await fut
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [address=172.31.23.86:9020, pid=1126889] [Errno 111] Connect call failed ('0.0.0.0', 9090)

The root cause is that xo.actor_ref() returns an ActorRef whose address is 0.0.0.0, as reported by the server side, even though the address the client originally specified for the server was not 0.0.0.0. The next call made through that actor ref raises an exception because the client side treats 0.0.0.0 as 127.0.0.1.

To fix the problem, we can split ActorRef.address into IP and port, detect a zero address, and replace it with the correct IP.
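The idea above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes addresses are plain `host:port` strings (no scheme prefix, no bracketed IPv6), and the helper names `split_address`, `is_zero_host`, and `fix_zero_address` are hypothetical and not part of xoscar's API. The "correct IP" is assumed here to be the host the client originally used to reach the server.

```python
import ipaddress


def split_address(address: str) -> tuple[str, int]:
    """Split a 'host:port' string into (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def is_zero_host(host: str) -> bool:
    """Return True if host is the unspecified address (e.g. 0.0.0.0)."""
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # Not an IP literal (e.g. a hostname), so it cannot be a zero address.
        return False


def fix_zero_address(ref_address: str, requested_address: str) -> str:
    """Replace a zero host in ref_address with the host the client requested.

    ref_address: address reported back by the server (may be 0.0.0.0:port).
    requested_address: address the client originally used to reach the server.
    """
    host, port = split_address(ref_address)
    if not is_zero_host(host):
        return ref_address
    requested_host, _ = split_address(requested_address)
    # Keep the port reported by the server, swap in the reachable host.
    return f"{requested_host}:{port}"


# Hypothetical usage: the server reports 0.0.0.0, the client dialed 10.0.0.5.
fixed = fix_zero_address("0.0.0.0:9090", "10.0.0.5:9090")
```

Addresses that already carry a concrete host are returned unchanged, so the rewrite only affects refs that came back with a zero address.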

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.34%. Comparing base (d6465c9) to head (4b93854). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
- Coverage   88.97%   88.34%   -0.64%
==========================================
  Files          48       48
  Lines        4038     4040       +2
  Branches      770      771       +1
==========================================
- Hits         3593     3569      -24
- Misses        358      380      +22
- Partials       87       91       +4
```

| [Flag](https://app.codecov.io/gh/xorbitsai/xoscar/pull/92/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/xorbitsai/xoscar/pull/92/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai) | `88.21% <100.00%> (-0.59%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai#carryforward-flags-in-the-pull-request-comment) to find out more.
