What's the version of vllm?
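For reference, a quick way to print the installed version directly (plain Python, nothing Xinference-specific; vllm exposes a standard `__version__` attribute):

```python
import vllm

print(vllm.__version__)  # prints e.g. 0.5.4 for the environment below
```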
pip list
Package Version
--------------------------------- --------------
absl-py 2.1.0
accelerate 0.33.0
aiobotocore 2.7.0
aiofiles 23.2.1
aiohappyeyeballs 2.3.5
aiohttp 3.10.3
aioitertools 0.11.0
aioprometheus 23.12.0
aiosignal 1.3.1
alembic 1.13.2
aliyun-python-sdk-core 2.15.1
aliyun-python-sdk-kms 2.16.4
altair 5.4.0
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.4.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
async-timeout 4.0.3
attrdict 2.0.1
attrs 24.2.0
audioread 3.0.1
auto_gptq 0.7.1
autoawq 0.2.5
autoawq_kernels 0.0.6
autopage 0.5.2
babel 2.16.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
bibtexparser 2.0.0b7
bitsandbytes 0.43.3
bleach 6.1.0
boto3 1.28.64
botocore 1.31.64
cdifflib 1.2.6
certifi 2024.7.4
cffi 1.17.0
cfgv 3.4.0
charset-normalizer 3.3.2
chattts 0.1.1
click 8.1.7
cliff 4.7.0
clldutils 3.22.2
cloudpickle 3.0.0
cmaes 0.11.1
cmake 3.30.2
cmd2 2.4.3
colorama 0.4.6
coloredlogs 15.0.1
colorlog 6.8.2
comm 0.2.2
conformer 0.3.2
contourpy 1.2.1
controlnet_aux 0.0.7
crcmod 1.7
cryptography 43.0.0
csvw 3.3.0
cycler 0.12.1
Cython 3.0.11
datasets 2.21.0
debugpy 1.8.5
decorator 5.1.1
defusedxml 0.7.1
diffusers 0.25.0
dill 0.3.8
diskcache 5.6.3
distlib 0.3.8
distro 1.9.0
dlinfo 1.2.1
ecdsa 0.19.0
editdistance 0.8.1
einops 0.8.0
einx 0.3.0
encodec 0.1.1
exceptiongroup 1.2.2
executing 2.0.1
fastapi 0.110.3
fastjsonschema 2.20.0
ffmpeg-python 0.2.0
ffmpy 0.4.0
filelock 3.15.4
FlagEmbedding 1.2.11
flatbuffers 24.3.25
fonttools 4.53.1
fqdn 1.5.1
frozendict 2.4.4
frozenlist 1.4.1
fsspec 2023.10.0
funasr 1.1.5
future 1.0.0
gdown 5.2.0
gekko 1.2.1
gradio 4.26.0
gradio_client 0.15.1
greenlet 3.0.3
grpcio 1.65.4
h11 0.14.0
hiredis 3.0.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.24.5
humanfriendly 10.0
hydra-colorlog 1.2.0
hydra-core 1.3.2
hydra-optuna-sweeper 1.2.0
HyperPyYAML 1.2.2
identify 2.6.0
idna 3.7
imageio 2.35.0
imageio-ffmpeg 0.5.1
importlib_metadata 8.2.0
importlib_resources 6.4.0
inflect 7.3.1
iniconfig 2.0.0
interegular 0.3.3
ipykernel 6.29.5
ipython 8.26.0
ipywidgets 8.1.3
isodate 0.6.1
isoduration 20.11.0
jaconv 0.4.0
jamo 0.4.1
jedi 0.19.1
jieba 0.42.1
Jinja2 3.1.4
jmespath 0.10.0
joblib 1.4.2
json5 0.9.25
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
jupyter_client 8.6.2
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.2
jupyter_server_terminals 0.5.3
jupyterlab 4.2.4
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.11
kaldiio 2.18.0
kiwisolver 1.4.5
language-tags 1.2.0
lark 1.2.2
lazy_loader 0.4
libnacl 2.1.0
librosa 0.10.2.post1
lightning 2.4.0
lightning-utilities 0.11.6
llama_cpp_python 0.2.88
llvmlite 0.43.0
lm-format-enforcer 0.10.3
lxml 5.3.0
Mako 1.3.5
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matcha-tts 0.0.5.1
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
mistune 3.0.2
modelscope 1.17.1
more-itertools 10.4.0
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
narwhals 1.4.1
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nemo_text_processing 1.0.2
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
nodeenv 1.9.1
notebook 7.2.1
notebook_shim 0.2.4
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.20
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
onnxruntime 1.16.0
openai 1.39.0
openai-whisper 20230306
opencv-contrib-python 4.10.0.84
opencv-python 4.10.0.84
optimum 1.21.3
optuna 2.10.1
orjson 3.10.7
oss2 2.18.6
outlines 0.0.46
overrides 7.7.0
packaging 24.1
pandas 2.2.2
pandocfilters 1.5.1
parso 0.8.4
passlib 1.7.4
pbr 6.0.0
peft 0.12.0
pexpect 4.9.0
phonemizer 3.3.0
pillow 10.4.0
pip 24.2
piper-phonemize 1.1.0
platformdirs 4.2.2
pluggy 1.5.0
pooch 1.8.2
pre-commit 3.8.0
prettytable 3.11.0
prometheus_client 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit 3.0.47
protobuf 4.25.4
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 17.0.0
pyasn1 0.6.0
pybase16384 0.3.7
pycountry 24.6.1
pycparser 2.22
pycryptodome 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pydub 0.25.1
Pygments 2.18.0
pylatexenc 2.10
pynini 2.1.5
pynndescent 0.5.13
pynvml 11.5.3
pyparsing 3.1.2
pyperclip 1.9.0
PySocks 1.7.1
pytest 8.3.2
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-jose 3.3.0
python-json-logger 2.0.7
python-multipart 0.0.9
pytorch-lightning 2.4.0
pytorch-wpe 0.0.1
pytz 2024.1
PyYAML 6.0.2
pyzmq 26.1.0
quantile-python 1.1
ray 2.34.0
rdflib 7.0.0
redis 5.0.8
referencing 0.35.1
regex 2024.7.24
requests 2.32.3
rfc3339-validator 0.1.4
rfc3986 1.5.0
rfc3986-validator 0.1.1
rich 13.7.1
rootutils 1.0.7
rouge 1.0.1
rpds-py 0.20.0
rsa 4.9
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
ruff 0.5.7
s3fs 2023.10.0
s3transfer 0.7.0
sacremoses 0.1.1
safetensors 0.4.4
scikit-image 0.24.0
scikit-learn 1.5.1
scipy 1.14.0
seaborn 0.13.2
segments 2.2.1
semantic-version 2.10.0
Send2Trash 1.8.3
sentence-transformers 3.0.1
sentencepiece 0.2.0
setuptools 72.2.0
sglang 0.2.12
shellingham 1.5.4
six 1.16.0
sniffio 1.3.1
soundfile 0.12.1
soupsieve 2.6
soxr 0.4.0
SQLAlchemy 2.0.32
sse-starlette 2.1.3
stack-data 0.6.3
starlette 0.37.2
stevedore 5.2.0
sympy 1.13.2
tabulate 0.9.0
tblib 3.0.0
tensorboard 2.17.0
tensorboard-data-server 0.7.2
tensorboardX 2.6.2.2
tensorizer 2.9.0
terminado 0.18.1
threadpoolctl 3.5.0
tifffile 2024.8.10
tiktoken 0.7.0
timm 1.0.8
tinycss2 1.3.0
tokenizers 0.19.1
tomli 2.0.1
tomlkit 0.12.0
torch 2.4.0
torch-complex 0.4.4
torchaudio 2.4.0
torchmetrics 1.4.1
torchvision 0.19.0
tornado 6.4.1
tqdm 4.66.5
traitlets 5.14.3
transformers 4.43.4
transformers-stream-generator 0.0.5
triton 3.0.0
typeguard 4.3.0
typer 0.11.1
types-python-dateutil 2.9.0.20240316
typing_extensions 4.12.2
tzdata 2024.1
umap-learn 0.5.6
Unidecode 1.3.8
uri-template 1.3.0
uritemplate 4.1.1
urllib3 2.0.7
uvicorn 0.30.6
uvloop 0.19.0
vector-quantize-pytorch 1.15.6
virtualenv 20.26.3
vllm 0.5.4
vllm-flash-attn 2.6.1
vocos 0.1.0
watchfiles 0.23.0
wcwidth 0.2.13
webcolors 24.8.0
webencodings 0.5.1
websocket-client 1.8.0
websockets 11.0.3
Werkzeug 3.0.3
WeTextProcessing 1.0.3
wget 3.2
wheel 0.44.0
widgetsnbextension 4.0.11
wrapt 1.16.0
xformers 0.0.27.post2
xinference 0.14.1.post1
xoscar 0.3.3
xxhash 3.4.1
yarl 1.9.4
zipp 3.20.0
zstandard 0.23.0
I switched to the environment posted above and still got the same error.
2024-08-15 10:26:02,268 xinference.model.llm.llm_family 112606 INFO Caching from Hugging Face: ModelCloud/internlm-2.5-7b-chat-gptq-4bit
2024-08-15 10:26:02,277 xinference.model.llm.vllm.core 112668 INFO Loading internlm2.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-15 10:26:02,278 transformers.configuration_utils 112668 INFO loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,279 transformers.configuration_utils 112668 INFO loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,280 transformers.configuration_utils 112668 INFO Model config InternLM2Config {
"_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
"architectures": [
"InternLM2ForCausalLM"
],
"attn_implementation": "eager",
"auto_map": {
"AutoConfig": "configuration_internlm2.InternLM2Config",
"AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
"AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
},
"bias": false,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "internlm2",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pad_token_id": 2,
"pretraining_tp": 1,
"quantization_config": {
"bits": 4,
"checkpoint_format": "gptq",
"damp_percent": 0.01,
"desc_act": true,
"group_size": 128,
"lm_head": false,
"meta": {
"quantizer": "gptqmodel:0.9.5"
},
"model_file_base_name": null,
"model_name_or_path": null,
"quant_method": "gptq",
"static_groups": false,
"sym": true,
"true_sequential": true
},
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 2.0,
"type": "dynamic"
},
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.43.4",
"use_cache": true,
"vocab_size": 92544
}
2024-08-15 10:26:02,280 vllm.model_executor.layers.quantization.gptq_marlin 112668 INFO The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-15 10:26:02,280 vllm.engine.llm_engine 112668 INFO Initializing an LLM engine (v0.5.4) with config: model='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', speculative_config=None, tokenizer='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-15 10:26:02,282 transformers.tokenization_utils_base 112668 INFO loading file ./tokenizer.model
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO loading file added_tokens.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO loading file special_tokens_map.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO loading file tokenizer_config.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO loading file tokenizer.json
2024-08-15 10:26:02,372 transformers.tokenization_utils_base 112668 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-15 10:26:02,376 transformers.configuration_utils 112668 INFO loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,377 transformers.configuration_utils 112668 INFO loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,378 transformers.configuration_utils 112668 INFO Model config InternLM2Config {
"_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
"architectures": [
"InternLM2ForCausalLM"
],
"attn_implementation": "eager",
"auto_map": {
"AutoConfig": "configuration_internlm2.InternLM2Config",
"AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
"AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
},
"bias": false,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "internlm2",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pad_token_id": 2,
"pretraining_tp": 1,
"quantization_config": {
"bits": 4,
"checkpoint_format": "gptq",
"damp_percent": 0.01,
"desc_act": true,
"group_size": 128,
"lm_head": false,
"meta": {
"quantizer": "gptqmodel:0.9.5"
},
"model_file_base_name": null,
"model_name_or_path": null,
"quant_method": "gptq",
"static_groups": false,
"sym": true,
"true_sequential": true
},
"rms_norm_eps": 1e-05,
"rope_scaling": {
"factor": 2.0,
"type": "dynamic"
},
"rope_theta": 1000000,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.43.4",
"use_cache": true,
"vocab_size": 92544
}
2024-08-15 10:26:02,378 transformers.generation.configuration_utils 112668 INFO Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 2
}
2024-08-15 10:26:02,657 vllm.worker.model_runner 112668 INFO Starting to load model /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
2024-08-15 10:26:02,832 xinference.core.worker 112606 ERROR Failed to load model internlm2.5-chat-1-0
Traceback (most recent call last):
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
await model_ref.load()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
self._model.load()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
self.model_executor = executor_class(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
2024-08-15 10:26:02,922 xinference.api.restful_api 112560 ERROR [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
Traceback (most recent call last):
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 878, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
await _launch_model()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
await _launch_one_model(rep_model_uid)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
await worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
async with lock:
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
result = await result
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
await model_ref.load()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
result = await self._run_coro(message.message_id, coro)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
return await coro
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
self._model.load()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
self.model_executor = executor_class(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
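For what it's worth, the failing reshape can be reproduced in isolation. From the config above, 2 + kv_groups = 2 + 32 // 8 = 6 and head_dim = 4096 // 32 = 128, so this load_weights path expects the fused wqkv tensor's element count to be divisible by 6 * 128 * 4096. A GPTQ checkpoint, however, also carries auxiliary tensors with very different shapes; my guess (an assumption, not confirmed by the log) is that one of them, e.g. a g_idx vector of length hidden_size = 4096, is being pushed through the same view:

```python
import torch

# Values inferred from the InternLM2Config above:
# kv_groups = num_attention_heads // num_key_value_heads = 32 // 8 = 4
# head_dim  = hidden_size // num_attention_heads = 4096 // 32 = 128
kv_groups, head_dim, hidden_size = 4, 128, 4096

# A 4096-element tensor (e.g. a GPTQ g_idx vector; an assumption, not
# verified against the checkpoint) cannot be factored into this shape:
t = torch.empty(hidden_size)
t.view(-1, 2 + kv_groups, head_dim, hidden_size)
# RuntimeError: shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
```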
I posted the environment above; please take a look.
Also, can we add the model internlm2_5-7b-chat-4bit as a built-in?
Looks like it's an AWQ-quantized version. Would you be interested in adding it to Xinference and sending a PR?
I'd like to, but I can run neither the built-in internlm2.5-7B-chat in vLLM GPTQ Int4 format nor the custom internlm2_5-7b-chat-4bit in vLLM AWQ Int4 format, so I'd like to see this problem solved first.
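In the meantime, a custom model can be registered through the client before anything lands as a built-in. A rough sketch follows; the field names track Xinference's custom-LLM JSON format, but the concrete values (model ID, abilities, quantization label) are assumptions and would need checking against the custom-model docs:

```python
from xinference.client import Client

# Hypothetical registration of internlm2_5-7b-chat-4bit as a custom LLM.
# The JSON below is a sketch: field names follow Xinference's custom-model
# format, but model_id and the other values are assumptions, not verified.
custom_llm = """{
  "version": 1,
  "model_name": "internlm2_5-7b-chat-4bit",
  "model_lang": ["en", "zh"],
  "model_ability": ["chat"],
  "model_specs": [{
    "model_format": "awq",
    "model_size_in_billions": 7,
    "quantizations": ["Int4"],
    "model_id": "internlm/internlm2_5-7b-chat-4bit"
  }]
}"""

client = Client("http://0.0.0.0:9997")
client.register_model(model_type="LLM", model=custom_llm, persist=True)
```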
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.
System Info
CUDA 12.5, Python 3.9, Ubuntu 22.04
Running Xinference with Docker?
Version info
v0.14.1.post1
The command used to start Xinference
xinference-local --host 0.0.0.0 --port 9997
Reproduction
1. Run internlm2.5-chat or internlm2.5-chat-1M in vLLM GPTQ Int4 format (see the client sketch below).
2. The error shown above appears.
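Equivalently, the launch can be reproduced via the Python client; a sketch, where the parameter names match the client API but the exact quantization label for the GPTQ format ("Int4") is an assumption:

```python
from xinference.client import Client

client = Client("http://0.0.0.0:9997")

# Launch the built-in model through the vLLM engine in GPTQ Int4 format;
# the quantization label and size value are assumed, not taken from the log.
client.launch_model(
    model_name="internlm2.5-chat",
    model_engine="vllm",
    model_format="gptq",
    model_size_in_billions=7,
    quantization="Int4",
)
```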
Expected behavior
The model loads and runs successfully.