xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language model, speech recognition model, or multimodal model, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

With the vLLM engine the model fails to start on two GPUs but works on one; with the Transformers engine both single-GPU and dual-GPU work fine #2256

Closed. zxx20231119 closed this issue 1 month ago.

zxx20231119 commented 1 month ago

System Info / 系統信息

(Screenshot 5.PNG with GPU information; the image upload did not complete.)

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Yes: Xinference runs inside a Docker container (see the startup command below).

Version info / 版本信息

Xinference: 0.13.3

The command used to start Xinference / 用以启动 xinference 的命令

Started inside the container with `xinference --host 0.0.0.0`.
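Since NCCL's "unhandled system error" inside containers is very often caused by a too-small /dev/shm or missing IPC/GPU flags, it may help to compare against a container start along these lines. This is only an illustrative sketch: the image tag, port, and volume path are assumptions, not taken from this issue.

```bash
# Illustrative sketch only: image tag, port and volume path are assumptions, not from this issue.
# NCCL needs a reasonably large shared-memory segment for multi-GPU communication inside Docker,
# so either --shm-size or --ipc=host is usually required for tensor parallelism to work.
docker run -d --name xinference \
  --gpus all \
  --shm-size=16g \
  -v /root/llm:/root/llm \
  -p 9997:9997 \
  xprobe/xinference:v0.13.3 \
  xinference --host 0.0.0.0
```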

Reproduction / 复现过程

Launched from the web UI. Error messages:

2024-09-09 02:11:16,076 xinference.model.llm.llm_family 117 INFO Caching from URI: /root/llm/Qwen2-57B-A14B-Instruct
2024-09-09 02:11:16,076 xinference.model.llm.llm_family 117 INFO Cache /root/llm/Qwen2-57B-A14B-Instruct exists
2024-09-09 02:11:16,101 xinference.model.llm.vllm.core 3592 INFO Loading Qwen2-57B-A14B-Instruct with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096} Enable lora: False. Lora count: 0.
2024-09-09 02:11:16,103 transformers.configuration_utils 3592 INFO loading configuration file /root/llm/Qwen2-57B-A14B-Instruct/config.json
2024-09-09 02:11:16,104 transformers.configuration_utils 3592 INFO Model config Qwen2MoeConfig { "_name_or_path": "/root/llm/Qwen2-57B-A14B-Instruct", "architectures": [ "Qwen2MoeForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "decoder_sparse_step": 1, "eos_token_id": 151643, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 32768, "max_window_layers": 28, "mlp_only_layers": [], "model_type": "qwen2_moe", "moe_intermediate_size": 2560, "norm_topk_prob": false, "num_attention_heads": 28, "num_experts": 64, "num_experts_per_tok": 8, "num_hidden_layers": 28, "num_key_value_heads": 4, "output_router_logits": false, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "router_aux_loss_coef": 0.001, "shared_expert_intermediate_size": 20480, "sliding_window": 65536, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.42.4", "use_cache": true, "use_sliding_window": false, "vocab_size": 151936 }

2024-09-09 02:11:16,109 INFO worker.py:1596 -- Connecting to existing Ray cluster at address: 192.168.78.243:6379...
2024-09-09 02:11:16,121 INFO worker.py:1781 -- Connected to Ray cluster.
2024-09-09 02:11:16,322 vllm.engine.llm_engine 3592 INFO Initializing an LLM engine (v0.4.3) with config: model='/root/llm/Qwen2-57B-A14B-Instruct', speculative_config=None, tokenizer='/root/llm/Qwen2-57B-A14B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/root/llm/Qwen2-57B-A14B-Instruct)
2024-09-09 02:11:16,327 transformers.tokenization_utils_base 3592 INFO loading file vocab.json
2024-09-09 02:11:16,328 transformers.tokenization_utils_base 3592 INFO loading file merges.txt
2024-09-09 02:11:16,328 transformers.tokenization_utils_base 3592 INFO loading file tokenizer.json
2024-09-09 02:11:16,328 transformers.tokenization_utils_base 3592 INFO loading file added_tokens.json
2024-09-09 02:11:16,328 transformers.tokenization_utils_base 3592 INFO loading file special_tokens_map.json
2024-09-09 02:11:16,328 transformers.tokenization_utils_base 3592 INFO loading file tokenizer_config.json
2024-09-09 02:11:16,539 transformers.tokenization_utils_base 3592 WARNING Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-09-09 02:11:16,556 transformers.generation.configuration_utils 3592 INFO loading configuration file /root/llm/Qwen2-57B-A14B-Instruct/generation_config.json
2024-09-09 02:11:16,556 transformers.generation.configuration_utils 3592 INFO Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

2024-09-09 02:11:21,786 vllm.utils 3592 INFO Found nccl from library libnccl.so.2
2024-09-09 02:11:21,787 vllm.distributed.device_communicators.pynccl 3592 INFO vLLM is using nccl==2.20.5
2024-09-09 02:11:21,828 vllm.worker.worker_base 3592 ERROR Error executing method init_device. This might cause deadlock in distributed execution.
Traceback (most recent call last):
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model_parallel
    _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
2024-09-09 02:11:21,836 xinference.core.worker 117 ERROR Failed to load model Qwen2-57B-A14B-Instruct-1-0
Traceback (most recent call last):
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 841, in launch_builtin_model
    await model_ref.load()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 295, in load
    self._model.load()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 241, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
    engine = cls(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
    self.model_executor = executor_class(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
    super().__init__(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
    self._init_workers_ray(placement_group)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
    self._run_workers("init_device")
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
    raise e
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model_parallel
    _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: [address=0.0.0.0:46170, pid=3592] NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
2024-09-09 02:11:21,933 xinference.api.restful_api 28 ERROR [address=0.0.0.0:46170, pid=3592] NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 848, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 988, in launch_builtin_model
    await _launch_model()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 952, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 932, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 841, in launch_builtin_model
    await model_ref.load()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 295, in load
    self._model.load()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 241, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
    engine = cls(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
    self.model_executor = executor_class(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
    super().__init__(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
    self._init_workers_ray(placement_group)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
    self._run_workers("init_device")
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
    raise e
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 114, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 349, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 239, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 191, in initialize_model_parallel
    _TP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 94, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/root/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: [address=0.0.0.0:46170, pid=3592] NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

Expected behavior / 期待表现

The model should start with the vLLM engine on two GPUs.
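For reference, a dual-GPU vLLM launch is normally requested by passing the GPU count at launch time, for example via the CLI. The endpoint, model name, and size below are assumptions for illustration only; check `xinference launch --help` for the exact flags of your version.

```bash
# Illustrative sketch: endpoint, model name and size are assumptions, not taken from this issue.
xinference launch \
  --endpoint http://127.0.0.1:9997 \
  --model-engine vllm \
  --model-name qwen2-moe-instruct \
  --size-in-billions 57_14 \
  --model-format pytorch \
  --n-gpu 2
```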

zxx20231119 commented 1 month ago

Screenshot 5: GPU information.

zxx20231119 commented 1 month ago

Screenshot 4: launching the model from the web UI.

qinxuye commented 1 month ago

NCCL is throwing an error; something is misconfigured somewhere.
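In this situation the NCCL log is the fastest way to narrow it down. Below is a hedged sketch of the usual debugging steps; the variables are standard NCCL knobs and common container workarounds, not something this thread confirmed as the fix.

```bash
# Rerun the worker with NCCL logging enabled to see which transport fails during ncclCommInitRank.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# Common workarounds for "unhandled system error" on multi-GPU boxes / inside containers
# (try one at a time and relaunch the model):
# export NCCL_P2P_DISABLE=1   # disable GPU peer-to-peer if the PCIe/NVLink topology is the problem
# export NCCL_IB_DISABLE=1    # disable the InfiniBand transport on machines without IB
# Also consider restarting the container with a larger /dev/shm (--shm-size=16g) or --ipc=host.

xinference --host 0.0.0.0
```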

zxx20231119 commented 1 month ago

A quick question: with the latest Xinference image, when launching qwen2-72b only the "transformers" inference engine option is offered. Does that mean vLLM is not supported, or can it be added through configuration?
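The web UI generally only offers the vLLM engine when the `vllm` package is importable inside the running image and supports the chosen model format and quantization, so a quick check is to verify that vLLM is actually installed there. The container name below is a placeholder.

```bash
# Hypothetical check: <xinference-container> is a placeholder for your container name.
docker exec -it <xinference-container> python3 -c "import vllm; print(vllm.__version__)"
```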

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 5 days since being marked as stale.

DankerMu commented 3 weeks ago

I ran into the same problem: multi-GPU inference works with the Transformers engine, but not with vLLM. I updated the GPU driver to 550 and CUDA to 12.4.1, and the problem remains the same.
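One way to tell whether this is a vLLM/Xinference problem or a host-level NCCL problem is to run a tiny all_reduce across the two GPUs with plain PyTorch. This is a hedged sanity-check sketch; it assumes PyTorch with CUDA and `torchrun` are available in the same environment.

```bash
# Minimal NCCL sanity check, independent of Xinference/vLLM.
cat > /tmp/nccl_check.py <<'EOF'
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # rank/world size come from torchrun env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.ones(1, device=f"cuda:{rank}")
dist.all_reduce(t)                             # expect 2.0 on both ranks if NCCL works
print(f"rank {rank}: all_reduce -> {t.item()}")
dist.destroy_process_group()
EOF

NCCL_DEBUG=INFO torchrun --nproc_per_node=2 /tmp/nccl_check.py
```

If this fails with the same "unhandled system error", the problem lies in the container, driver, or NCCL setup rather than in Xinference or vLLM.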

SDAIer commented 3 weeks ago

For me multi-GPU does not work even with the Transformers engine: I clearly specified GPUs 2 and 3, yet it complains that GPU 0 has no free resources.
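A common way to pin a worker to specific physical GPUs is to restrict visibility before the process starts; whether that applies to this setup is an assumption, since the exact launch settings were not shared.

```bash
# Hypothetical workaround: expose only physical GPUs 2 and 3 to the Xinference process,
# so that "GPU 0" inside the process maps to physical GPU 2.
CUDA_VISIBLE_DEVICES=2,3 xinference --host 0.0.0.0
```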
