xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
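For readers skimming this issue, the "single line" swap the description refers to is pointing an OpenAI-compatible client at a running Xinference server. A minimal sketch, not taken from the project docs: the endpoint assumes the xinference-local command shown later in this issue, and the model name assumes the Yi-1.5-chat model the reporter tries to launch.

```python
# Hypothetical usage sketch: talk to a local Xinference server through its
# OpenAI-compatible /v1 API instead of api.openai.com.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:9997/v1",  # the one changed line (assumed local server)
    api_key="not-used",                   # ignored when auth is disabled
)

resp = client.chat.completions.create(
    model="Yi-1.5-chat",                  # model UID launched in Xinference (assumption)
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```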
https://inference.readthedocs.io
Apache License 2.0

Loading model error, server error 500 #2334

Open yanguangcang2019 opened 2 hours ago

yanguangcang2019 commented 2 hours ago

System Info / 系統信息

CUDA 12.1, Python version 3.10.12

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

Both the latest version I just downloaded and earlier versions have the same problem.

The command used to start Xinference / 用以启动 xinference 的命令

xinference-local --host 0.0.0.0 --port 9997

Reproduction / 复现过程

I launch models from the Xinference web UI. My GPU has 4096 MB of memory. Launching qwen7b and Yi-1.5 6B both hit the same error: loading gets to less than 50% and then fails:

2024-09-20 10:06:21,267 xinference.core.worker 336466 INFO [request ee3a5cc8-76f4-11ef-ada4-878765a2943a] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7fc5ac9039c0>, kwargs: model_uid=Yi-1.5-chat-1-0,model_name=Yi-1.5-chat,model_size_in_billions=6,model_format=pytorch,quantization=none,model_engine=Transformers,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=None,download_hub=None,model_path=None
2024-09-20 10:06:25,325 xinference.model.llm.llm_family 336466 INFO Caching from Modelscope: 01ai/Yi-1.5-6B-Chat
2024-09-20 10:06:25,625 xinference.model.llm.llm_family 336466 INFO Cache /root/.xinference/cache/Yi-1.5-chat-pytorch-6b exists
/root/xinference3_env/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
2024-09-20 10:06:25,681 transformers.tokenization_utils_base 340793 INFO loading file tokenizer.model
2024-09-20 10:06:25,681 transformers.tokenization_utils_base 340793 INFO loading file tokenizer.json
2024-09-20 10:06:25,681 transformers.tokenization_utils_base 340793 INFO loading file added_tokens.json
2024-09-20 10:06:25,681 transformers.tokenization_utils_base 340793 INFO loading file special_tokens_map.json
2024-09-20 10:06:25,681 transformers.tokenization_utils_base 340793 INFO loading file tokenizer_config.json
2024-09-20 10:06:26,316 transformers.configuration_utils 340793 INFO loading configuration file /root/.xinference/cache/Yi-1.5-chat-pytorch-6b/config.json
2024-09-20 10:06:26,317 transformers.configuration_utils 340793 INFO Model config LlamaConfig {
  "_name_or_path": "/root/.xinference/cache/Yi-1.5-chat-pytorch-6b",
  "architectures": ["LlamaForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 4,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 5000000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "use_cache": false,
  "vocab_size": 64000
}

2024-09-20 10:06:26,656 transformers.modeling_utils 340793 INFO loading weights file /root/.xinference/cache/Yi-1.5-chat-pytorch-6b/model.safetensors.index.json
2024-09-20 10:06:26,657 transformers.modeling_utils 340793 INFO Instantiating LlamaForCausalLM model under default dtype torch.float32.
2024-09-20 10:06:26,657 transformers.generation.configuration_utils 340793 INFO Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "use_cache": false
}

Loading checkpoint shards:  33%|███████████████████▎ | 1/3 [00:22<00:44, 22.31s/it]
2024-09-20 10:07:20,809 xinference.core.supervisor 336466 ERROR Worker timeout. address: 0.0.0.0:46554, check count remaining 4...
2024-09-20 10:07:21,223 xinference.core.worker 336466 ERROR Failed to load model Yi-1.5-chat-1-0
Traceback (most recent call last):
  File "/root/xinference3_env/lib/python3.10/site-packages/xinference/core/worker.py", line 893, in launch_builtin_model
    await model_ref.load()
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 230, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///44101271552 closed
2024-09-20 10:07:21,329 xinference.core.worker 336466 ERROR [request ee3a5cc8-76f4-11ef-ada4-878765a2943a] Leave launch_builtin_model, error: Remote server unixsocket:///44101271552 closed, elapsed time: 60 s
Traceback (most recent call last):
  File "/root/xinference3_env/lib/python3.10/site-packages/xinference/core/utils.py", line 69, in wrapped
    ret = await func(*args, **kwargs)
  File "/root/xinference3_env/lib/python3.10/site-packages/xinference/core/worker.py", line 893, in launch_builtin_model
    await model_ref.load()
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 230, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///44101271552 closed
2024-09-20 10:07:21,362 xinference.api.restful_api 336157 ERROR [address=0.0.0.0:46554, pid=336466] Remote server unixsocket:///44101271552 closed
Traceback (most recent call last):
  File "/root/xinference3_env/lib/python3.10/site-packages/xinference/api/restful_api.py", line 967, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/root/xinference3_env/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/root/xinference3_env/lib/python3.10/site-packages/xinference/core/supervisor.py",

Expected behavior / 期待表现

Please help me troubleshoot this issue. Thanks.

yanguangcang2019 commented 2 hours ago

Previous versions all ran fine. It broke like this after a reboot yesterday. I updated to the latest Xinference version and still get the same error.
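"Worked before, broke after a reboot" is the classic Error 804 pattern: the NVIDIA driver or kernel packages were updated, the running kernel module and the user-space CUDA libraries no longer match, and CUDA falls back to a forward-compatibility path that is not supported on consumer GPUs. A sketch for checking the mismatch from Python (cuDriverGetVersion is a standard CUDA driver API call; the Linux library name and the interpretation comments are assumptions about this machine):

```python
# Compare the CUDA version the installed driver supports against the CUDA
# version this PyTorch build was compiled for. A driver value lower than the
# build value after an update/reboot is consistent with Error 804.
import ctypes

import torch

libcuda = ctypes.CDLL("libcuda.so.1")  # assumes Linux with the NVIDIA driver installed
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
driver = version.value                 # encoded as 1000 * major + 10 * minor
print(f"driver supports CUDA: {driver // 1000}.{(driver % 1000) // 10}")
print(f"torch built for CUDA: {torch.version.cuda}")
```

If the two disagree, or nvidia-smi reports a driver/library version mismatch, reinstalling the NVIDIA driver (or rebooting again once the driver packages have finished updating) usually clears Error 804; no Xinference-side change is needed.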