xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

CONNECTION ERROR #1786

Open completeqwq opened 1 month ago

completeqwq commented 1 month ago

    raise RuntimeError("Cluster is not available after multiple attempts")
RuntimeError: Cluster is not available after multiple attempts
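For context, the reporter did not include the command that produced this, but the identical error later in this thread comes from a plain local start, i.e. something like:

```
xinference-local --host 0.0.0.0 --port 9997
```

The message itself only says that the local cluster (supervisor/worker) did not become reachable within the allowed number of attempts; the rest of the thread is about finding out why.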

nikelius commented 1 month ago

I ran into a similar problem on Windows (exception stack below). My guess at the cause: the default firewall or other network restrictions block access to some of the ports, so the management service cannot connect to the worker process. (Strictly speaking, this is not Xinference's fault.)
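A quick way to check this diagnosis before changing anything (standard Windows commands, not part of the original comment; substitute your own host and port, and run from an elevated prompt):

```
:: Show whether the firewall is enabled for each profile
netsh advfirewall show allprofiles state

:: PowerShell: test whether the Xinference API port is reachable from another machine
Test-NetConnection -ComputerName 10.10.83.104 -Port 9997
```

If the port test fails while the service is running, a firewall or network rule is the likely blocker.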

Environment at the time: a Win10 server. Full exception stack:

2024-07-04 14:34:53,905 xinference.core.worker 19788 INFO Purge cache directory: C:\Users\admin\.xinference\cache
Traceback (most recent call last):
  File "G:\miniconda\envs\py310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "G:\miniconda\envs\py310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "G:\miniconda\envs\py310\Scripts\xinference-local.exe\__main__.py", line 7, in <module>
    sys.exit(local())
  File "G:\miniconda\envs\py310\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "G:\miniconda\envs\py310\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "G:\miniconda\envs\py310\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "G:\miniconda\envs\py310\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "G:\miniconda\envs\py310\lib\site-packages\xinference\deploy\cmdline.py", line 225, in local
    start_local_cluster(
  File "G:\miniconda\envs\py310\lib\site-packages\xinference\deploy\cmdline.py", line 112, in start_local_cluster
    main(
  File "G:\miniconda\envs\py310\lib\site-packages\xinference\deploy\local.py", line 122, in main
    raise RuntimeError("Cluster is not available after multiple attempts")
RuntimeError: Cluster is not available after multiple attempts

completeqwq commented 1 month ago

> (nikelius's comment above, quoted in full: same symptom on Windows, firewall suspected, same traceback)

How did you solve it?

nikelius commented 1 month ago

> How did you solve it?

Actually it is not solved; Windows throws up too many odd problems, so I plan to try WSL instead. Overall, startup involves three ports (supervisor, server, metrics); opening those ports in the firewall should be enough. My ideas at the time were: 1. turn off the Windows firewall entirely; 2. open the specific ports individually. Option 1 was not possible in my case (other teams share this host and will not allow the firewall to be disabled). Option 2 works if the ports are fixed; a specific port can be opened with the following command:

netsh advfirewall firewall add rule name=WSL2api dir=in action=allow protocol=TCP localport=9997
xinference-local --host 0.0.0.0 --port 9997

But Xinference uses dynamic ports, so the internal ports change on every startup, which makes this approach unworkable.

First run: 64719, 65264
2024-07-10 14:56:23,283 xinference.core.supervisor 25764 INFO Xinference supervisor 0.0.0.0:64719 started
2024-07-10 14:56:23,621 xinference.core.worker 25764 INFO Starting metrics export server at 0.0.0.0:None
2024-07-10 14:56:23,624 xinference.core.worker 25764 INFO Checking metrics export server...
2024-07-10 14:56:31,034 xinference.core.worker 25764 INFO Metrics server is started at: http://0.0.0.0:65264
2024-07-10 14:56:31,037 xinference.core.worker 25764 INFO Xinference worker 0.0.0.0:64719 started
2024-07-10 14:56:31,038 xinference.core.worker 25764 INFO Purge cache directory: C:\Users\admin\.xinference\cache

Second run: 53127, 65282
2024-07-10 14:59:09,471 xinference.core.supervisor 14648 INFO Xinference supervisor 0.0.0.0:53127 started
2024-07-10 14:59:09,722 xinference.core.worker 14648 INFO Starting metrics export server at 0.0.0.0:None
2024-07-10 14:59:09,737 xinference.core.worker 14648 INFO Checking metrics export server...
2024-07-10 14:59:16,456 xinference.core.worker 14648 INFO Metrics server is started at: http://0.0.0.0:65282
2024-07-10 14:59:16,459 xinference.core.worker 14648 INFO Xinference worker 0.0.0.0:53127 started
2024-07-10 14:59:16,461 xinference.core.worker 14648 INFO Purge cache directory: C:\Users\admin\.xinference\cache
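If the firewall cannot be turned off and the internal ports really are picked dynamically, one possible workaround (my suggestion, not something tried in this thread; check it against your security policy) is to allow inbound TCP on the whole high-port range that the dynamic ports above are drawn from:

```
:: Broad rule covering the high ports seen in the logs above (roughly 48000-65535); adjust the range as needed
netsh advfirewall firewall add rule name=XinferenceDynamicPorts dir=in action=allow protocol=TCP localport=48000-65535
```

This is much coarser than opening single ports, so it only makes sense on hosts where such a rule is acceptable.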

nikelius commented 1 month ago

Addendum:

Local mode (only the metrics port is fixed; the supervisor and worker ports are still dynamic, so this does not solve the problem)

Command: xinference-local --host 0.0.0.0 --port 9997 -MH 0.0.0.0 -mp 9996

2024-07-10 15:59:13,111 xinference.core.supervisor 26396 INFO Xinference supervisor 0.0.0.0:48462 started
2024-07-10 15:59:13,330 xinference.core.worker 26396 INFO Starting metrics export server at 0.0.0.0:9996
2024-07-10 15:59:13,345 xinference.core.worker 26396 INFO Checking metrics export server...
2024-07-10 15:59:20,456 xinference.core.worker 26396 INFO Metrics server is started at: http://0.0.0.0:9996
2024-07-10 15:59:20,460 xinference.core.worker 26396 INFO Xinference worker 0.0.0.0:48462 started
2024-07-10 15:59:20,461 xinference.core.worker 26396 INFO Purge cache directory: C:\Users\admin\.xinference\cache

Distributed mode (the supervisor starts, but the worker fails to start)

Command: xinference-supervisor --host 10.10.83.104 --port 9997 --supervisor-port 9996

2024-07-10 16:00:11,564 xinference.core.supervisor 19516 INFO Xinference supervisor 10.10.83.104:9996 started
2024-07-10 16:00:18,706 xinference.api.restful_api 8172 INFO Starting Xinference at endpoint: http://10.10.83.104:9997

Command: xinference-worker -e "http://10.10.83.104:9997" --metrics-exporter-host 0.0.0.0 --metrics-exporter-port 9995

2024-07-10 16:01:24,061 xinference.core.worker 6844 INFO Starting metrics export server at 0.0.0.0:9995
2024-07-10 16:01:24,064 xinference.core.worker 6844 INFO Checking metrics export server...
2024-07-10 16:01:31,205 xinference.core.worker 6844 INFO Metrics server is started at: http://0.0.0.0:9995
OSError: [address=10.10.83.104:9996, pid=19516] [WinError 1214] The format of the specified network name is invalid.

nikelius commented 1 month ago

> How did you solve it?

Solved, @completeqwq: change the startup command to use the machine's concrete IP address instead of 0.0.0.0.

Before: xinference-local --host 0.0.0.0 --port 9997
After: xinference-local --host 10.10.83.104 --port 9997

And open the firewall port: netsh advfirewall firewall add rule name=WSL2api dir=in action=allow protocol=TCP localport=9997
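Collected in one place, the working local-mode startup from this comment is (the IP is this particular host's address; substitute your own):

```
:: Open the API port in Windows Firewall (elevated prompt)
netsh advfirewall firewall add rule name=WSL2api dir=in action=allow protocol=TCP localport=9997

:: Bind to the host's concrete IP instead of 0.0.0.0
xinference-local --host 10.10.83.104 --port 9997
```

Presumably the same change, concrete IPs instead of 0.0.0.0 for the supervisor and worker hosts, would also help the distributed-mode failure shown earlier, but that is not confirmed in this thread.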

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

lhs0627 commented 2 weeks ago

@nikelius Why does xinference-local sometimes start up normally for me and sometimes fail with this error? Any ideas?