Open JinCheng666 opened 3 months ago
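Has the follow-up problem been resolved?
Not yet. The machine is a virtual machine. Before the reboot, multi-GPU inference of a single model worked normally; after the reboot, multiple GPUs can no longer serve one model together. What other information should I collect? I can gather it on my side. @qinxuye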
Does the problem still occur with the latest version?
@qinxuye I updated to 0.12.3 and the same problem still occurs.
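After updating to 0.14.3, single-GPU runs work normally. Multi-GPU runs still fail, but the error has changed; see the log below. @qinxuye, could you please take a look? The machine's hardware environment has not changed.
Launch script: nohup xinference-local --host 0.0.0.0 --port 9997 --log-level DEBUG &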
2024-08-26 15:37:47,330 xinference.core.supervisor 1397843 INFO Xinference supervisor 0.0.0.0:46142 started
2024-08-26 15:37:48,794 xinference.core.worker 1397843 INFO Starting metrics export server at 0.0.0.0:None
2024-08-26 15:37:48,796 xinference.core.worker 1397843 INFO Checking metrics export server...
2024-08-26 15:37:50,101 xinference.core.worker 1397843 INFO Metrics server is started at: http://0.0.0.0:41521
2024-08-26 15:37:50,102 xinference.core.worker 1397843 INFO Purge cache directory: /home/gx01/.xinference/cache
2024-08-26 15:37:50,103 xinference.core.supervisor 1397843 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, '0.0.0.0:46142'), kwargs: {}
2024-08-26 15:37:50,104 xinference.core.supervisor 1397843 DEBUG Worker 0.0.0.0:46142 has been added successfully
2024-08-26 15:37:50,104 xinference.core.supervisor 1397843 DEBUG Leave add_worker, elapsed time: 0 s
2024-08-26 15:37:50,104 xinference.core.worker 1397843 INFO Connected to supervisor as a fresh worker
2024-08-26 15:37:50,114 xinference.core.worker 1397843 INFO Xinference worker 0.0.0.0:46142 started
2024-08-26 15:37:50,116 xinference.core.supervisor 1397843 DEBUG Worker 0.0.0.0:46142 resources: {'cpu': ResourceStatus(usage=0.0, total=32, memory_used=2375675904, memory_available=131400032256, memory_total=135059939328), 'gpu-0': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728), 'gpu-1': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728)}
2024-08-26 15:37:52,322 xinference.core.supervisor 1397843 DEBUG Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>,), kwargs: {}
2024-08-26 15:37:52,322 xinference.core.supervisor 1397843 DEBUG Leave get_status, elapsed time: 0 s
2024-08-26 15:37:53,186 xinference.api.restful_api 1397771 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO Started server process [1397771]
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO Waiting for application startup.
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO Application startup complete.
2024-08-26 15:37:53,319 uvicorn.error 1397771 INFO Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2024-08-26 15:48:26,826 xinference.core.supervisor 1397843 DEBUG Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {}
2024-08-26 15:48:26,828 xinference.api.restful_api 1397771 ERROR [address=0.0.0.0:46142, pid=1397843] Model not found in the model list, uid: qwen:72b
Traceback (most recent call last):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1660, in create_chat_completion
model = await (await self._get_supervisor_ref()).get_model(model_uid)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1124, in get_model
raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=0.0.0.0:46142, pid=1397843] Model not found in the model list, uid: qwen:72b
2024-08-26 15:48:26,831 uvicorn.access 1397771 INFO 10.4.134.11:64957 - "POST /v1/chat/completions HTTP/1.1" 400
2024-08-26 15:48:50,350 uvicorn.access 1397771 INFO 10.4.134.25:11074 - "GET / HTTP/1.1" 307
2024-08-26 15:48:50,637 uvicorn.access 1397771 INFO 10.4.134.25:11074 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-08-26 15:48:50,670 xinference.core.supervisor 1397843 DEBUG Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM'), kwargs: {'detailed': True}
2024-08-26 15:48:50,670 uvicorn.access 1397771 INFO 10.4.134.25:11074 - "GET /v1/cluster/devices HTTP/1.1" 200
2024-08-26 15:48:50,815 xinference.core.supervisor 1397843 DEBUG Leave list_model_registrations, elapsed time: 0 s
2024-08-26 15:48:50,826 uvicorn.access 1397771 INFO 10.4.134.25:11075 - "GET /v1/model_registrations/LLM?detailed=true HTTP/1.1" 200
2024-08-26 15:48:57,861 xinference.core.supervisor 1397843 DEBUG Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM'), kwargs: {'detailed': False}
2024-08-26 15:48:57,862 xinference.core.supervisor 1397843 DEBUG Leave list_model_registrations, elapsed time: 0 s
2024-08-26 15:48:57,863 uvicorn.access 1397771 INFO 10.4.134.25:11076 - "GET /v1/model_registrations/LLM HTTP/1.1" 200
2024-08-26 15:48:57,871 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'glm-4-9b'), kwargs: {}
2024-08-26 15:48:57,871 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,872 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'glm-4v-9b'), kwargs: {}
2024-08-26 15:48:57,872 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,872 uvicorn.access 1397771 INFO 10.4.134.25:11076 - "GET /v1/model_registrations/LLM/glm-4-9b HTTP/1.1" 200
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'llama3:70b'), kwargs: {}
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen1.5:14b'), kwargs: {}
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen1.5:72b'), kwargs: {}
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,875 uvicorn.access 1397771 INFO 10.4.134.25:11077 - "GET /v1/model_registrations/LLM/glm-4v-9b HTTP/1.1" 200
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen2:7b'), kwargs: {}
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen:110b'), kwargs: {}
2024-08-26 15:48:57,876 uvicorn.access 1397771 INFO 10.4.134.25:11078 - "GET /v1/model_registrations/LLM/llama3%3A70b HTTP/1.1" 200
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,877 uvicorn.access 1397771 INFO 10.4.134.25:11079 - "GET /v1/model_registrations/LLM/qwen1.5%3A14b HTTP/1.1" 200
2024-08-26 15:48:57,877 uvicorn.access 1397771 INFO 10.4.134.25:11080 - "GET /v1/model_registrations/LLM/qwen1.5%3A72b HTTP/1.1" 200
2024-08-26 15:48:57,878 xinference.core.supervisor 1397843 DEBUG Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen:72b'), kwargs: {}
2024-08-26 15:48:57,878 uvicorn.access 1397771 INFO 10.4.134.25:11081 - "GET /v1/model_registrations/LLM/qwen2%3A7b HTTP/1.1" 200
2024-08-26 15:48:57,878 xinference.core.supervisor 1397843 DEBUG Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,878 uvicorn.access 1397771 INFO 10.4.134.25:11076 - "GET /v1/model_registrations/LLM/qwen%3A110b HTTP/1.1" 200
2024-08-26 15:48:57,879 uvicorn.access 1397771 INFO 10.4.134.25:11077 - "GET /v1/model_registrations/LLM/qwen%3A72b HTTP/1.1" 200
2024-08-26 15:49:05,752 xinference.core.supervisor 1397843 DEBUG Enter query_engines_by_model_name, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {}
2024-08-26 15:49:05,753 xinference.core.worker 1397843 DEBUG Enter query_engines_by_model_name, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>, 'qwen:72b'), kwargs: {}
2024-08-26 15:49:05,753 xinference.core.worker 1397843 DEBUG Leave query_engines_by_model_name, elapsed time: 0 s
2024-08-26 15:49:05,753 xinference.core.supervisor 1397843 DEBUG Leave query_engines_by_model_name, elapsed time: 0 s
2024-08-26 15:49:05,753 uvicorn.access 1397771 INFO 10.4.134.25:11085 - "GET /v1/engines/qwen%3A72b HTTP/1.1" 200
2024-08-26 15:49:19,139 xinference.core.supervisor 1397843 DEBUG Enter launch_builtin_model, model_uid: qwen:72b, model_name: qwen:72b, model_size: 72, model_format: gptq, quantization: Int4, replica: 1, kwargs: {}
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>,), kwargs: {}
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG Leave get_model_count, elapsed time: 0 s
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>,), kwargs: {'model_uid': 'qwen:72b-1-0', 'model_name': 'qwen:72b', 'model_size_in_billions': 72, 'model_format': 'gptq', 'quantization': 'Int4', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 2, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'download_hub': None, 'model_path': None}
2024-08-26 15:49:19,141 xinference.core.worker 1397843 DEBUG GPU selected: [0, 1] for model qwen:72b-1-0
2024-08-26 15:49:25,934 xinference.model.llm.core 1397843 DEBUG Launching qwen:72b-1-0 with VLLMChatModel
2024-08-26 15:49:25,934 xinference.model.llm.llm_family 1397843 INFO Caching from URI: /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4
2024-08-26 15:49:25,935 xinference.model.llm.llm_family 1397843 INFO Cache /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4 exists
2024-08-26 15:49:25,952 xinference.model.llm.vllm.core 1399591 INFO Loading qwen:72b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-26 15:49:25,954 transformers.configuration_utils 1399591 INFO loading configuration file /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4/config.json
2024-08-26 15:49:25,955 transformers.configuration_utils 1399591 INFO Model config Qwen2Config {
"_name_or_path": "/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 8192,
"initializer_range": 0.02,
"intermediate_size": 29696,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"quantization_config": {
"batch_size": 1,
"bits": 4,
"block_name_to_quantize": null,
"cache_block_outputs": true,
"damp_percent": 0.01,
"dataset": null,
"desc_act": false,
"exllama_config": {
"version": 1
},
"group_size": 128,
"max_input_length": null,
"model_seqlen": null,
"module_name_preceding_first_block": null,
"modules_in_block_to_quantize": null,
"pad_token_id": null,
"quant_method": "gptq",
"sym": true,
"tokenizer": null,
"true_sequential": true,
"use_cuda_fp16": false,
"use_exllama": true
},
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.43.4",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
2024-08-26 15:49:25,955 transformers.models.auto.image_processing_auto 1399591 INFO Could not locate the image processor configuration file, will try to use the model config instead.
2024-08-26 15:49:25,964 vllm.model_executor.layers.quantization.gptq_marlin 1399591 INFO The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-26 15:49:25,976 vllm.config 1399591 INFO Defaulting to use mp for distributed inference
2024-08-26 15:49:25,979 vllm.engine.llm_engine 1399591 INFO Initializing an LLM engine (v0.5.5) with config: model='/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file vocab.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file merges.txt
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file tokenizer.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file added_tokens.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file special_tokens_map.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO loading file tokenizer_config.json
2024-08-26 15:49:26,226 transformers.tokenization_utils_base 1399591 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-26 15:49:26,245 transformers.generation.configuration_utils 1399591 INFO loading configuration file /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4/generation_config.json
2024-08-26 15:49:26,246 transformers.generation.configuration_utils 1399591 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
2024-08-26 15:49:26,246 vllm.executor.multiproc_gpu_executor 1399591 WARNING Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2024-08-26 15:49:26,263 vllm.triton_utils.custom_cache_manager 1399591 INFO Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2024-08-26 15:49:26,513 vllm.executor.multiproc_worker_utils 1399694 INFO Worker ready; awaiting tasks
2024-08-26 15:49:26,913 vllm.distributed.parallel_state 1399591 DEBUG world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:54733 backend=nccl
2024-08-26 15:49:26,959 vllm.distributed.parallel_state 1399694 DEBUG world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:54733 backend=nccl
2024-08-26 15:49:26,985 vllm.utils 1399591 INFO Found nccl from library libnccl.so.2
2024-08-26 15:49:26,985 vllm.utils 1399694 INFO Found nccl from library libnccl.so.2
2024-08-26 15:49:26,985 vllm.distributed.device_communicators.pynccl 1399591 INFO vLLM is using nccl==2.20.5
2024-08-26 15:49:26,985 vllm.distributed.device_communicators.pynccl 1399694 INFO vLLM is using nccl==2.20.5
2024-08-26 15:49:27,237 vllm.distributed.device_communicators.custom_all_reduce_utils 1399591 INFO generating GPU P2P access cache in /home/gx01/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
2024-08-26 15:49:42,218 xinference.core.worker 1397843 ERROR Failed to load model qwen:72b-1-0
Traceback (most recent call last):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/worker.py", line 888, in launch_builtin_model
await model_ref.load()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/model.py", line 303, in load
self._model.load()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 239, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
engine = cls(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
self.model_executor = executor_class(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 46, in __init__
self._init_executor()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
self._run_workers("init_device")
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 175, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
initialize_model_parallel(tensor_model_parallel_size,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
_TP = init_model_parallel_group(group_ranks,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
return GroupCoordinator(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
self.ca_comm = CustomAllreduce(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
if not _can_p2p(rank, world_size):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
if not gpu_p2p_access_check(rank, i):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
2024-08-26 15:49:42,311 xinference.core.supervisor 1397843 DEBUG Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {'suppress_exception': True}
2024-08-26 15:49:42,311 xinference.core.supervisor 1397843 DEBUG Leave terminate_model, elapsed time: 0 s
2024-08-26 15:49:42,317 xinference.api.restful_api 1397771 ERROR [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
Traceback (most recent call last):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/api/restful_api.py", line 878, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
await _launch_model()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
await _launch_one_model(rep_model_uid)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
await worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
async with lock:
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
result = await result
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/worker.py", line 888, in launch_builtin_model
await model_ref.load()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/model.py", line 303, in load
self._model.load()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 239, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
engine = cls(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
self.model_executor = executor_class(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
super().__init__(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 46, in __init__
self._init_executor()
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
self._run_workers("init_device")
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 175, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
initialize_model_parallel(tensor_model_parallel_size,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
_TP = init_model_parallel_group(group_ranks,
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
return GroupCoordinator(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
self.ca_comm = CustomAllreduce(
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
if not _can_p2p(rank, world_size):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
if not gpu_p2p_access_check(rank, i):
File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
2024-08-26 15:49:42,319 uvicorn.access 1397771 INFO 10.4.134.25:11092 - "POST /v1/models HTTP/1.1" 500
Same error (invalid load key, 'W'.). Strangely, it works on one machine but fails with this error on another.
https://github.com/vllm-project/vllm/issues/7846
I also get this warning:
WARNING 08-27 14:33:56 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
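A possible reading of how this warning relates to the traceback: the failing frame is `pickle.loads(returned.stdout)` in `gpu_p2p_access_check`, so if a child process prints a line starting with "WARNING" to stdout ahead of the pickled result, the first byte handed to `pickle.loads` is `W`. The sketch below only simulates that situation with a hand-built byte string; the `captured_stdout` value is an assumption for illustration, not output captured from vLLM.

```python
import pickle

# Simulated stdout of a child process (an assumption for illustration):
# a warning line printed first, followed by the pickled result.
warning_line = b"WARNING 08-27 14:33:56 cuda.py:22] You are using a deprecated pynvml package.\n"
payload = pickle.dumps(True)
captured_stdout = warning_line + payload

# A clean stream unpickles fine.
clean_result = pickle.loads(payload)

# A stream polluted by the warning fails on its first byte, 'W'.
try:
    pickle.loads(captured_stdout)
    error_text = ""
except pickle.UnpicklingError as exc:
    error_text = str(exc)

print(clean_result)
print(error_text)
```

If this is indeed the cause, it would also be consistent with the error appearing only on machines where the deprecated package is installed.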
I tried pip uninstall pynvml, and that worked.
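Before uninstalling anything, it may help to confirm which of the two packages named in the warning is actually present in the environment that runs xinference. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from the warning text.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the warning message.
for name in ("pynvml", "nvidia-ml-py"):
    version = installed_version(name)
    status = version if version is not None else "not installed"
    print(f"{name}: {status}")
```

Run it with the same interpreter as the server (here, the conda environment shown in the traceback paths) so that it inspects the right site-packages.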
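"Same error (invalid load key, 'W'.). Strangely, it works on one machine but fails with this error on another."
Same for me, it is quite strange. Multi-GPU runs worked before the reboot and stopped working afterward.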
Describe the bug
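The following problem appeared after restarting the computer; multi-GPU runs worked normally before the restart. Problem: launching a model on multiple GPUs fails with an error, while single-GPU runs work normally.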
To Reproduce
To help us to reproduce this bug, please provide information below:
Your Python version. 3.10
The version of xinference you use. 0.12.1