xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
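The single-line swap refers to Xinference's OpenAI-compatible API; a minimal sketch using the stock openai client (the endpoint and model name here assume the setup described in this issue, and the dummy API key assumes auth is disabled):

import openai

# Point the standard OpenAI client at the local Xinference endpoint.
client = openai.OpenAI(base_url="http://0.0.0.0:9997/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="internlm2.5-chat",  # the launched model's name/uid
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)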
https://inference.readthedocs.io
Apache License 2.0

internlm2.5-7B-chat & internlm2.5-7B-chat-1M can't run in vLLM GPTQ-Int4 #2089

Closed: soulzzz closed this issue 2 weeks ago

soulzzz commented 1 month ago

System Info

CUDA: 12.5, Python: 3.9, Ubuntu 22.04

Running Xinference with Docker?

Version info

v0.14.1.post1

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

1. Run internlm2.5-chat or internlm2.5-chat-1M in vLLM GPTQ-Int4 format (a launch-command sketch follows this list).
2. The error below appears:
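A sketch of the launch command for step 1, assuming the standard xinference launch flags of this version and the endpoint from the start command above (model name, size, and quantization label must match the built-in registry entry):

xinference launch --endpoint http://0.0.0.0:9997 --model-engine vllm --model-name internlm2.5-chat --size-in-billions 7 --model-format gptq --quantization Int4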

2024-08-14 21:19:07,464 xinference.model.llm.llm_family 20920 INFO     Caching from Hugging Face: ModelCloud/internlm-2.5-7b-chat-gptq-4bit
2024-08-14 21:19:07,477 xinference.model.llm.vllm.core 24351 INFO     Loading internlm2.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-14 21:19:07,478 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,479 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,480 transformers.configuration_utils 24351 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-14 21:19:07,480 vllm.model_executor.layers.quantization.gptq_marlin 24351 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-14 21:19:07,481 vllm.engine.llm_engine 24351 INFO     Initializing an LLM engine (v0.5.4) with config: model='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', speculative_config=None, tokenizer='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file ./tokenizer.model
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file added_tokens.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file special_tokens_map.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file tokenizer_config.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file tokenizer.json
2024-08-14 21:19:07,575 transformers.tokenization_utils_base 24351 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-14 21:19:07,580 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,580 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,581 transformers.configuration_utils 24351 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-14 21:19:07,581 transformers.generation.configuration_utils 24351 INFO     Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

2024-08-14 21:19:07,821 vllm.worker.model_runner 24351 INFO     Starting to load model /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
2024-08-14 21:19:07,976 xinference.core.worker 20920 ERROR    Failed to load model internlm2.5-chat-1-0
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
2024-08-14 21:19:08,068 xinference.api.restful_api 20864 ERROR    [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/api/restful_api.py", line 878, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
    await _launch_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
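For context: with the config above, kv_groups = num_attention_heads // num_key_value_heads = 32 // 8 = 4 and head_dim = hidden_size // num_attention_heads = 4096 // 32 = 128, so vLLM's InternLM2 loader tries to view each stacked wqkv tensor as [-1, 6, 128, 4096]. A GPTQ checkpoint, however, also carries auxiliary tensors with per-channel shapes (given "desc_act": true in the config, the 4096-element g_idx is a likely candidate), and pushing such a tensor through that view fails exactly as logged. A standalone sketch of the arithmetic (not vLLM's actual loader code):

import torch

# Values taken from the model config in the log above.
num_attention_heads, num_key_value_heads, hidden_size = 32, 8, 4096
kv_groups = num_attention_heads // num_key_value_heads  # 4
head_dim = hidden_size // num_attention_heads           # 128

# A per-input-channel tensor (4096 elements) cannot be reshaped into
# blocks of 6 * 128 * 4096 = 3,145,728 elements.
t = torch.empty(hidden_size)
try:
    t.view(-1, 2 + kv_groups, head_dim, hidden_size)
except RuntimeError as e:
    print(e)  # shape '[-1, 6, 128, 4096]' is invalid for input of size 4096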

Expected behavior

The model runs successfully.

qinxuye commented 4 weeks ago

What's the version of vllm?

soulzzz commented 4 weeks ago

pip list
Package                           Version
--------------------------------- --------------
absl-py                           2.1.0
accelerate                        0.33.0
aiobotocore                       2.7.0
aiofiles                          23.2.1
aiohappyeyeballs                  2.3.5
aiohttp                           3.10.3
aioitertools                      0.11.0
aioprometheus                     23.12.0
aiosignal                         1.3.1
alembic                           1.13.2
aliyun-python-sdk-core            2.15.1
aliyun-python-sdk-kms             2.16.4
altair                            5.4.0
annotated-types                   0.7.0
antlr4-python3-runtime            4.9.3
anyio                             4.4.0
argon2-cffi                       23.1.0
argon2-cffi-bindings              21.2.0
arrow                             1.3.0
asttokens                         2.4.1
async-lru                         2.0.4
async-timeout                     4.0.3
attrdict                          2.0.1
attrs                             24.2.0
audioread                         3.0.1
auto_gptq                         0.7.1
autoawq                           0.2.5
autoawq_kernels                   0.0.6
autopage                          0.5.2
babel                             2.16.0
bcrypt                            4.2.0
beautifulsoup4                    4.12.3
bibtexparser                      2.0.0b7
bitsandbytes                      0.43.3
bleach                            6.1.0
boto3                             1.28.64
botocore                          1.31.64
cdifflib                          1.2.6
certifi                           2024.7.4
cffi                              1.17.0
cfgv                              3.4.0
charset-normalizer                3.3.2
chattts                           0.1.1
click                             8.1.7
cliff                             4.7.0
clldutils                         3.22.2
cloudpickle                       3.0.0
cmaes                             0.11.1
cmake                             3.30.2
cmd2                              2.4.3
colorama                          0.4.6
coloredlogs                       15.0.1
colorlog                          6.8.2
comm                              0.2.2
conformer                         0.3.2
contourpy                         1.2.1
controlnet_aux                    0.0.7
crcmod                            1.7
cryptography                      43.0.0
csvw                              3.3.0
cycler                            0.12.1
Cython                            3.0.11
datasets                          2.21.0
debugpy                           1.8.5
decorator                         5.1.1
defusedxml                        0.7.1
diffusers                         0.25.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.8
distro                            1.9.0
dlinfo                            1.2.1
ecdsa                             0.19.0
editdistance                      0.8.1
einops                            0.8.0
einx                              0.3.0
encodec                           0.1.1
exceptiongroup                    1.2.2
executing                         2.0.1
fastapi                           0.110.3
fastjsonschema                    2.20.0
ffmpeg-python                     0.2.0
ffmpy                             0.4.0
filelock                          3.15.4
FlagEmbedding                     1.2.11
flatbuffers                       24.3.25
fonttools                         4.53.1
fqdn                              1.5.1
frozendict                        2.4.4
frozenlist                        1.4.1
fsspec                            2023.10.0
funasr                            1.1.5
future                            1.0.0
gdown                             5.2.0
gekko                             1.2.1
gradio                            4.26.0
gradio_client                     0.15.1
greenlet                          3.0.3
grpcio                            1.65.4
h11                               0.14.0
hiredis                           3.0.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.24.5
humanfriendly                     10.0
hydra-colorlog                    1.2.0
hydra-core                        1.3.2
hydra-optuna-sweeper              1.2.0
HyperPyYAML                       1.2.2
identify                          2.6.0
idna                              3.7
imageio                           2.35.0
imageio-ffmpeg                    0.5.1
importlib_metadata                8.2.0
importlib_resources               6.4.0
inflect                           7.3.1
iniconfig                         2.0.0
interegular                       0.3.3
ipykernel                         6.29.5
ipython                           8.26.0
ipywidgets                        8.1.3
isodate                           0.6.1
isoduration                       20.11.0
jaconv                            0.4.0
jamo                              0.4.1
jedi                              0.19.1
jieba                             0.42.1
Jinja2                            3.1.4
jmespath                          0.10.0
joblib                            1.4.2
json5                             0.9.25
jsonpointer                       3.0.0
jsonschema                        4.23.0
jsonschema-specifications         2023.12.1
jupyter_client                    8.6.2
jupyter_core                      5.7.2
jupyter-events                    0.10.0
jupyter-lsp                       2.2.5
jupyter_server                    2.14.2
jupyter_server_terminals          0.5.3
jupyterlab                        4.2.4
jupyterlab_pygments               0.3.0
jupyterlab_server                 2.27.3
jupyterlab_widgets                3.0.11
kaldiio                           2.18.0
kiwisolver                        1.4.5
language-tags                     1.2.0
lark                              1.2.2
lazy_loader                       0.4
libnacl                           2.1.0
librosa                           0.10.2.post1
lightning                         2.4.0
lightning-utilities               0.11.6
llama_cpp_python                  0.2.88
llvmlite                          0.43.0
lm-format-enforcer                0.10.3
lxml                              5.3.0
Mako                              1.3.5
Markdown                          3.6
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
matcha-tts                        0.0.5.1
matplotlib                        3.9.2
matplotlib-inline                 0.1.7
mdurl                             0.1.2
mistune                           3.0.2
modelscope                        1.17.1
more-itertools                    10.4.0
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
narwhals                          1.4.1
nbclient                          0.10.0
nbconvert                         7.16.4
nbformat                          5.10.4
nemo_text_processing              1.0.2
nest-asyncio                      1.6.0
networkx                          3.3
ninja                             1.11.1.1
nodeenv                           1.9.1
notebook                          7.2.1
notebook_shim                     0.2.4
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.20
nvidia-nvtx-cu12                  12.1.105
omegaconf                         2.3.0
onnxruntime                       1.16.0
openai                            1.39.0
openai-whisper                    20230306
opencv-contrib-python             4.10.0.84
opencv-python                     4.10.0.84
optimum                           1.21.3
optuna                            2.10.1
orjson                            3.10.7
oss2                              2.18.6
outlines                          0.0.46
overrides                         7.7.0
packaging                         24.1
pandas                            2.2.2
pandocfilters                     1.5.1
parso                             0.8.4
passlib                           1.7.4
pbr                               6.0.0
peft                              0.12.0
pexpect                           4.9.0
phonemizer                        3.3.0
pillow                            10.4.0
pip                               24.2
piper-phonemize                   1.1.0
platformdirs                      4.2.2
pluggy                            1.5.0
pooch                             1.8.2
pre-commit                        3.8.0
prettytable                       3.11.0
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit                    3.0.47
protobuf                          4.25.4
psutil                            6.0.0
ptyprocess                        0.7.0
pure_eval                         0.2.3
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           17.0.0
pyasn1                            0.6.0
pybase16384                       0.3.7
pycountry                         24.6.1
pycparser                         2.22
pycryptodome                      3.20.0
pydantic                          2.8.2
pydantic_core                     2.20.1
pydub                             0.25.1
Pygments                          2.18.0
pylatexenc                        2.10
pynini                            2.1.5
pynndescent                       0.5.13
pynvml                            11.5.3
pyparsing                         3.1.2
pyperclip                         1.9.0
PySocks                           1.7.1
pytest                            8.3.2
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-jose                       3.3.0
python-json-logger                2.0.7
python-multipart                  0.0.9
pytorch-lightning                 2.4.0
pytorch-wpe                       0.0.1
pytz                              2024.1
PyYAML                            6.0.2
pyzmq                             26.1.0
quantile-python                   1.1
ray                               2.34.0
rdflib                            7.0.0
redis                             5.0.8
referencing                       0.35.1
regex                             2024.7.24
requests                          2.32.3
rfc3339-validator                 0.1.4
rfc3986                           1.5.0
rfc3986-validator                 0.1.1
rich                              13.7.1
rootutils                         1.0.7
rouge                             1.0.1
rpds-py                           0.20.0
rsa                               4.9
ruamel.yaml                       0.18.6
ruamel.yaml.clib                  0.2.8
ruff                              0.5.7
s3fs                              2023.10.0
s3transfer                        0.7.0
sacremoses                        0.1.1
safetensors                       0.4.4
scikit-image                      0.24.0
scikit-learn                      1.5.1
scipy                             1.14.0
seaborn                           0.13.2
segments                          2.2.1
semantic-version                  2.10.0
Send2Trash                        1.8.3
sentence-transformers             3.0.1
sentencepiece                     0.2.0
setuptools                        72.2.0
sglang                            0.2.12
shellingham                       1.5.4
six                               1.16.0
sniffio                           1.3.1
soundfile                         0.12.1
soupsieve                         2.6
soxr                              0.4.0
SQLAlchemy                        2.0.32
sse-starlette                     2.1.3
stack-data                        0.6.3
starlette                         0.37.2
stevedore                         5.2.0
sympy                             1.13.2
tabulate                          0.9.0
tblib                             3.0.0
tensorboard                       2.17.0
tensorboard-data-server           0.7.2
tensorboardX                      2.6.2.2
tensorizer                        2.9.0
terminado                         0.18.1
threadpoolctl                     3.5.0
tifffile                          2024.8.10
tiktoken                          0.7.0
timm                              1.0.8
tinycss2                          1.3.0
tokenizers                        0.19.1
tomli                             2.0.1
tomlkit                           0.12.0
torch                             2.4.0
torch-complex                     0.4.4
torchaudio                        2.4.0
torchmetrics                      1.4.1
torchvision                       0.19.0
tornado                           6.4.1
tqdm                              4.66.5
traitlets                         5.14.3
transformers                      4.43.4
transformers-stream-generator     0.0.5
triton                            3.0.0
typeguard                         4.3.0
typer                             0.11.1
types-python-dateutil             2.9.0.20240316
typing_extensions                 4.12.2
tzdata                            2024.1
umap-learn                        0.5.6
Unidecode                         1.3.8
uri-template                      1.3.0
uritemplate                       4.1.1
urllib3                           2.0.7
uvicorn                           0.30.6
uvloop                            0.19.0
vector-quantize-pytorch           1.15.6
virtualenv                        20.26.3
vllm                              0.5.4
vllm-flash-attn                   2.6.1
vocos                             0.1.0
watchfiles                        0.23.0
wcwidth                           0.2.13
webcolors                         24.8.0
webencodings                      0.5.1
websocket-client                  1.8.0
websockets                        11.0.3
Werkzeug                          3.0.3
WeTextProcessing                  1.0.3
wget                              3.2
wheel                             0.44.0
widgetsnbextension                4.0.11
wrapt                             1.16.0
xformers                          0.0.27.post2
xinference                        0.14.1.post1
xoscar                            0.3.3
xxhash                            3.4.1
yarl                              1.9.4
zipp                              3.20.0
zstandard                         0.23.0

soulzzz commented 4 weeks ago

I switched to the environment I posted above and still got the same error:

2024-08-15 10:26:02,268 xinference.model.llm.llm_family 112606 INFO     Caching from Hugging Face: ModelCloud/internlm-2.5-7b-chat-gptq-4bit
2024-08-15 10:26:02,277 xinference.model.llm.vllm.core 112668 INFO     Loading internlm2.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-15 10:26:02,278 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,279 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,280 transformers.configuration_utils 112668 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-15 10:26:02,280 vllm.model_executor.layers.quantization.gptq_marlin 112668 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-15 10:26:02,280 vllm.engine.llm_engine 112668 INFO     Initializing an LLM engine (v0.5.4) with config: model='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', speculative_config=None, tokenizer='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-15 10:26:02,282 transformers.tokenization_utils_base 112668 INFO     loading file ./tokenizer.model
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file added_tokens.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file special_tokens_map.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file tokenizer_config.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file tokenizer.json
2024-08-15 10:26:02,372 transformers.tokenization_utils_base 112668 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-15 10:26:02,376 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,377 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,378 transformers.configuration_utils 112668 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-15 10:26:02,378 transformers.generation.configuration_utils 112668 INFO     Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

2024-08-15 10:26:02,657 vllm.worker.model_runner 112668 INFO     Starting to load model /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
2024-08-15 10:26:02,832 xinference.core.worker 112606 ERROR    Failed to load model internlm2.5-chat-1-0
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
2024-08-15 10:26:02,922 xinference.api.restful_api 112560 ERROR    [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 878, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
    await _launch_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
soulzzz commented 4 weeks ago

What's the version of vllm?

I posted the environment above, please have a check.
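For a quick cross-check, the runtime version can also be printed directly (vllm exposes a standard __version__ attribute):

python -c "import vllm; print(vllm.__version__)"

Both the pip list (vllm 0.5.4) and the engine-init log line ("Initializing an LLM engine (v0.5.4)") point to vLLM 0.5.4.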

soulzzz commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

qinxuye commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

It looks like an AWQ-quantized build; are you interested in adding it to Xinference and sending a PR?

soulzzz commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

It looks like an AWQ-quantized build; are you interested in adding it to Xinference and sending a PR?

I want to do this, but I can run neither internlm2.5-7B-chat in vLLM GPTQ-Int4 format nor the custom internlm2_5-7b-chat-4bit in vLLM AWQ-Int4 format, so I'd like to see the problem solved first.
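For reference, registering the AWQ build as a custom model would look roughly like the JSON sketch below. This is only an illustration: the fields follow Xinference's custom LLM family format, the model_id is an assumption, and the chat template / prompt style entry that chat models also need is omitted here.

{
  "version": 1,
  "model_name": "internlm2_5-7b-chat-4bit",
  "model_lang": ["en", "zh"],
  "model_ability": ["chat"],
  "model_specs": [
    {
      "model_format": "awq",
      "model_size_in_billions": 7,
      "quantizations": ["Int4"],
      "model_id": "internlm/internlm2_5-7b-chat-4bit"
    }
  ]
}

It could then be registered with something like:

xinference register --model-type LLM --file internlm2_5-7b-chat-4bit.json --persist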

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.