xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

[BUG] Error When Launching Custom Models with Xinference #872

Open MrBrabus75 opened 6 months ago

MrBrabus75 commented 6 months ago

I executed the following command to start Xinference:

xinference-local --host 0.0.0.0 --port 9997

I successfully started a rerank model, bge-reranker-large, and everything worked fine. However, when I attempt to launch custom embedding models, specifically e5-mistral-7b-instruct or a custom LLM model like MetaMath-Cybertron, I encounter the error detailed below:

(inference) gpt-lab@gpt-lab-tower:~$ xinference-local --host 0.0.0.0 --port 9997
2024-01-08 19:23:07,544 xinference.core.supervisor 12284 INFO Xinference supervisor 0.0.0.0:12783 started
2024-01-08 19:23:07,606 xinference.core.worker 12284 INFO Xinference worker 0.0.0.0:12783 started
2024-01-08 19:23:07,607 xinference.core.worker 12284 INFO Purge cache directory: /home/gpt-lab/.xinference/cache
2024-01-08 19:23:12,153 xinference.api.restful_api 12171 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2024-01-08 19:23:31,724 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-01-08 19:23:31,724 - modelscope - INFO - Loading ast index from /home/gpt-lab/.cache/modelscope/ast_indexer
2024-01-08 19:23:31,852 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 68c0b64cc2fe9141b85988e677fba775 and a total number of 946 components indexed
2024-01-08 19:24:06,531 xinference.model.embedding.core 12284 INFO Embedding model caching from URI: /home/gpt-lab/Bureau/models/e5-mistral-7b-instruct
2024-01-08 19:24:06,532 xinference.model.embedding.core 12284 INFO Embedding cache /home/gpt-lab/Bureau/models/e5-mistral-7b-instruct exists
Loading checkpoint shards:  50%|███████████████████████████ | 1/2 [00:20<00:20, 20.92s/it]
2024-01-08 19:25:47,995 xinference.core.worker 12284 ERROR Failed to load model e5-mistral-7b-instruct-1-0
Traceback (most recent call last):
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/core/worker.py", line 381, in launch_builtin_model
    await model_ref.load()
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///3220176896 closed
2024-01-08 19:25:48,381 xinference.api.restful_api 12171 ERROR [address=0.0.0.0:12783, pid=12284] Remote server unixsocket:///3220176896 closed
Traceback (most recent call last):
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 444, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 573, in launch_builtin_model
    await _launch_one_model(rep_model_uid)
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 542, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/core/utils.py", line 35, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xinference/core/worker.py", line 381, in launch_builtin_model
    await model_ref.load()
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/home/gpt-lab/miniconda3/envs/inference/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: [address=0.0.0.0:12783, pid=12284] Remote server unixsocket:///3220176896 closed

For context, I have CUDA 12.3, PyTorch with CUDA support, and two GTX 1080 Ti GPUs, each with 12GB VRAM.

Is the issue caused by insufficient VRAM, or could there be another problem? Any insights or suggestions would be greatly appreciated.
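(Editorial note: a back-of-the-envelope check suggests VRAM is a plausible culprit. e5-mistral-7b-instruct has roughly 7 billion parameters, so the fp16 weights alone approach the capacity of a single card. The parameter count and overhead factor below are rough assumptions for illustration, not measured values.)

```python
def fp16_vram_gib(n_params: float, overhead: float = 1.2) -> float:
    """Approximate GiB needed to hold n_params fp16 weights (2 bytes each),
    with a rough multiplier for buffers and activation overhead."""
    return n_params * 2 * overhead / 1024**3

needed = fp16_vram_gib(7.1e9)  # ~7.1B params assumed for e5-mistral-7b-instruct
per_gpu = 12                   # GiB per card as reported above
print(f"~{needed:.1f} GiB needed vs {per_gpu} GiB per GPU")
print("fits on one GPU:", needed <= per_gpu)
```

At roughly 16 GiB needed against 12 GiB per card, a plain single-device load would run out of memory unless the checkpoint is sharded across both GPUs, which would be consistent with the crash occurring at 50% of the checkpoint shards.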

aresnow1 commented 6 months ago
from sentence_transformers import SentenceTransformer

# Load the model directly, outside Xinference, to check whether the
# weights themselves load; your_model_path is the local model directory.
model = SentenceTransformer(your_model_path)

Run the code above and check if it works.

MrBrabus75 commented 5 months ago


It works fine on its own, but when it runs simultaneously with another model, such as bge-reranker-large, it fails, as shown in the image below. I have the impression that the VRAM of the two graphics cards is not being fully or correctly utilized.

[Screenshot: Screenshot_20240110_122802]
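(Editorial note: one general point about multi-GPU setups, independent of Xinference: VRAM across two cards does not pool into a single 24 GiB space. Unless a model is explicitly sharded across devices, each model must fit entirely on one GPU. A rough accounting, with approximate fp16 model sizes as assumptions:)

```python
GIB = 1024**3

# Approximate fp16 weight sizes (2 bytes per parameter); parameter
# counts are assumptions: ~560M for bge-reranker-large (XLM-R large
# based), ~7.1B for e5-mistral-7b-instruct.
models = {
    "bge-reranker-large": 0.56e9 * 2,
    "e5-mistral-7b-instruct": 7.1e9 * 2,
}
per_gpu_bytes = 12 * GIB  # one GTX 1080 Ti as reported above

for name, size in models.items():
    print(f"{name}: ~{size / GIB:.1f} GiB, "
          f"fits on one 12 GiB GPU: {size <= per_gpu_bytes}")
```

Under these assumptions the reranker fits comfortably on one card, while the 7B embedder does not fit on either card by itself, regardless of how much total memory the two GPUs hold together.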