xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
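The single-line swap refers to Xinference's OpenAI-compatible API; a minimal sketch using the stock openai client (the endpoint and model name here assume the setup described in this issue, and the dummy API key assumes auth is disabled):

import openai

# Point the standard OpenAI client at the local Xinference endpoint.
client = openai.OpenAI(base_url="http://0.0.0.0:9997/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="internlm2.5-chat",  # the launched model's name/uid
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)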
https://inference.readthedocs.io
Apache License 2.0

internlm2.5-7B-chat & internlm2.5-7B-chat-1M can't run in vLLM GPTQ-Int4 #2089

Closed: soulzzz closed this issue 2 weeks ago

soulzzz commented 1 month ago

System Info

CUDA: 12.5, Python: 3.9, Ubuntu 22.04

Running Xinference with Docker?

Version info

v0.14.1.post1

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

1. Run internlm2.5-chat or internlm2.5-chat-1M in vLLM GPTQ-Int4 format (a launch-command sketch follows this list).
2. The error below appears:
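A sketch of the launch command for step 1, assuming the standard xinference launch flags of this version and the endpoint from the start command above (model name, size, and quantization label must match the built-in registry entry):

xinference launch --endpoint http://0.0.0.0:9997 --model-engine vllm --model-name internlm2.5-chat --size-in-billions 7 --model-format gptq --quantization Int4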

2024-08-14 21:19:07,464 xinference.model.llm.llm_family 20920 INFO     Caching from Hugging Face: ModelCloud/internlm-2.5-7b-chat-gptq-4bit
2024-08-14 21:19:07,477 xinference.model.llm.vllm.core 24351 INFO     Loading internlm2.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-14 21:19:07,478 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,479 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,480 transformers.configuration_utils 24351 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-14 21:19:07,480 vllm.model_executor.layers.quantization.gptq_marlin 24351 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-14 21:19:07,481 vllm.engine.llm_engine 24351 INFO     Initializing an LLM engine (v0.5.4) with config: model='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', speculative_config=None, tokenizer='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file ./tokenizer.model
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file added_tokens.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file special_tokens_map.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file tokenizer_config.json
2024-08-14 21:19:07,483 transformers.tokenization_utils_base 24351 INFO     loading file tokenizer.json
2024-08-14 21:19:07,575 transformers.tokenization_utils_base 24351 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-14 21:19:07,580 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,580 transformers.configuration_utils 24351 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-14 21:19:07,581 transformers.configuration_utils 24351 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-14 21:19:07,581 transformers.generation.configuration_utils 24351 INFO     Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

2024-08-14 21:19:07,821 vllm.worker.model_runner 24351 INFO     Starting to load model /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
2024-08-14 21:19:07,976 xinference.core.worker 20920 ERROR    Failed to load model internlm2.5-chat-1-0
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
2024-08-14 21:19:08,068 xinference.api.restful_api 20864 ERROR    [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/api/restful_api.py", line 878, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
    await _launch_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.9/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:45927, pid=24351] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
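For context: with the config above, kv_groups = num_attention_heads // num_key_value_heads = 32 // 8 = 4 and head_dim = hidden_size // num_attention_heads = 4096 // 32 = 128, so vLLM's InternLM2 loader tries to view each stacked wqkv tensor as [-1, 6, 128, 4096]. A GPTQ checkpoint, however, also carries auxiliary tensors with per-channel shapes (given "desc_act": true in the config, the 4096-element g_idx is a likely candidate), and pushing such a tensor through that view fails exactly as logged. A standalone sketch of the arithmetic (not vLLM's actual loader code):

import torch

# Values taken from the model config in the log above.
num_attention_heads, num_key_value_heads, hidden_size = 32, 8, 4096
kv_groups = num_attention_heads // num_key_value_heads  # 4
head_dim = hidden_size // num_attention_heads           # 128

# A per-input-channel tensor (4096 elements) cannot be reshaped into
# blocks of 6 * 128 * 4096 = 3,145,728 elements.
t = torch.empty(hidden_size)
try:
    t.view(-1, 2 + kv_groups, head_dim, hidden_size)
except RuntimeError as e:
    print(e)  # shape '[-1, 6, 128, 4096]' is invalid for input of size 4096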

Expected behavior

The model runs successfully.

qinxuye commented 4 weeks ago

What's the version of vllm?

soulzzz commented 4 weeks ago

pip list
Package                           Version
--------------------------------- --------------
absl-py                           2.1.0
accelerate                        0.33.0
aiobotocore                       2.7.0
aiofiles                          23.2.1
aiohappyeyeballs                  2.3.5
aiohttp                           3.10.3
aioitertools                      0.11.0
aioprometheus                     23.12.0
aiosignal                         1.3.1
alembic                           1.13.2
aliyun-python-sdk-core            2.15.1
aliyun-python-sdk-kms             2.16.4
altair                            5.4.0
annotated-types                   0.7.0
antlr4-python3-runtime            4.9.3
anyio                             4.4.0
argon2-cffi                       23.1.0
argon2-cffi-bindings              21.2.0
arrow                             1.3.0
asttokens                         2.4.1
async-lru                         2.0.4
async-timeout                     4.0.3
attrdict                          2.0.1
attrs                             24.2.0
audioread                         3.0.1
auto_gptq                         0.7.1
autoawq                           0.2.5
autoawq_kernels                   0.0.6
autopage                          0.5.2
babel                             2.16.0
bcrypt                            4.2.0
beautifulsoup4                    4.12.3
bibtexparser                      2.0.0b7
bitsandbytes                      0.43.3
bleach                            6.1.0
boto3                             1.28.64
botocore                          1.31.64
cdifflib                          1.2.6
certifi                           2024.7.4
cffi                              1.17.0
cfgv                              3.4.0
charset-normalizer                3.3.2
chattts                           0.1.1
click                             8.1.7
cliff                             4.7.0
clldutils                         3.22.2
cloudpickle                       3.0.0
cmaes                             0.11.1
cmake                             3.30.2
cmd2                              2.4.3
colorama                          0.4.6
coloredlogs                       15.0.1
colorlog                          6.8.2
comm                              0.2.2
conformer                         0.3.2
contourpy                         1.2.1
controlnet_aux                    0.0.7
crcmod                            1.7
cryptography                      43.0.0
csvw                              3.3.0
cycler                            0.12.1
Cython                            3.0.11
datasets                          2.21.0
debugpy                           1.8.5
decorator                         5.1.1
defusedxml                        0.7.1
diffusers                         0.25.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.8
distro                            1.9.0
dlinfo                            1.2.1
ecdsa                             0.19.0
editdistance                      0.8.1
einops                            0.8.0
einx                              0.3.0
encodec                           0.1.1
exceptiongroup                    1.2.2
executing                         2.0.1
fastapi                           0.110.3
fastjsonschema                    2.20.0
ffmpeg-python                     0.2.0
ffmpy                             0.4.0
filelock                          3.15.4
FlagEmbedding                     1.2.11
flatbuffers                       24.3.25
fonttools                         4.53.1
fqdn                              1.5.1
frozendict                        2.4.4
frozenlist                        1.4.1
fsspec                            2023.10.0
funasr                            1.1.5
future                            1.0.0
gdown                             5.2.0
gekko                             1.2.1
gradio                            4.26.0
gradio_client                     0.15.1
greenlet                          3.0.3
grpcio                            1.65.4
h11                               0.14.0
hiredis                           3.0.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.24.5
humanfriendly                     10.0
hydra-colorlog                    1.2.0
hydra-core                        1.3.2
hydra-optuna-sweeper              1.2.0
HyperPyYAML                       1.2.2
identify                          2.6.0
idna                              3.7
imageio                           2.35.0
imageio-ffmpeg                    0.5.1
importlib_metadata                8.2.0
importlib_resources               6.4.0
inflect                           7.3.1
iniconfig                         2.0.0
interegular                       0.3.3
ipykernel                         6.29.5
ipython                           8.26.0
ipywidgets                        8.1.3
isodate                           0.6.1
isoduration                       20.11.0
jaconv                            0.4.0
jamo                              0.4.1
jedi                              0.19.1
jieba                             0.42.1
Jinja2                            3.1.4
jmespath                          0.10.0
joblib                            1.4.2
json5                             0.9.25
jsonpointer                       3.0.0
jsonschema                        4.23.0
jsonschema-specifications         2023.12.1
jupyter_client                    8.6.2
jupyter_core                      5.7.2
jupyter-events                    0.10.0
jupyter-lsp                       2.2.5
jupyter_server                    2.14.2
jupyter_server_terminals          0.5.3
jupyterlab                        4.2.4
jupyterlab_pygments               0.3.0
jupyterlab_server                 2.27.3
jupyterlab_widgets                3.0.11
kaldiio                           2.18.0
kiwisolver                        1.4.5
language-tags                     1.2.0
lark                              1.2.2
lazy_loader                       0.4
libnacl                           2.1.0
librosa                           0.10.2.post1
lightning                         2.4.0
lightning-utilities               0.11.6
llama_cpp_python                  0.2.88
llvmlite                          0.43.0
lm-format-enforcer                0.10.3
lxml                              5.3.0
Mako                              1.3.5
Markdown                          3.6
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
matcha-tts                        0.0.5.1
matplotlib                        3.9.2
matplotlib-inline                 0.1.7
mdurl                             0.1.2
mistune                           3.0.2
modelscope                        1.17.1
more-itertools                    10.4.0
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
narwhals                          1.4.1
nbclient                          0.10.0
nbconvert                         7.16.4
nbformat                          5.10.4
nemo_text_processing              1.0.2
nest-asyncio                      1.6.0
networkx                          3.3
ninja                             1.11.1.1
nodeenv                           1.9.1
notebook                          7.2.1
notebook_shim                     0.2.4
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.20
nvidia-nvtx-cu12                  12.1.105
omegaconf                         2.3.0
onnxruntime                       1.16.0
openai                            1.39.0
openai-whisper                    20230306
opencv-contrib-python             4.10.0.84
opencv-python                     4.10.0.84
optimum                           1.21.3
optuna                            2.10.1
orjson                            3.10.7
oss2                              2.18.6
outlines                          0.0.46
overrides                         7.7.0
packaging                         24.1
pandas                            2.2.2
pandocfilters                     1.5.1
parso                             0.8.4
passlib                           1.7.4
pbr                               6.0.0
peft                              0.12.0
pexpect                           4.9.0
phonemizer                        3.3.0
pillow                            10.4.0
pip                               24.2
piper-phonemize                   1.1.0
platformdirs                      4.2.2
pluggy                            1.5.0
pooch                             1.8.2
pre-commit                        3.8.0
prettytable                       3.11.0
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit                    3.0.47
protobuf                          4.25.4
psutil                            6.0.0
ptyprocess                        0.7.0
pure_eval                         0.2.3
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           17.0.0
pyasn1                            0.6.0
pybase16384                       0.3.7
pycountry                         24.6.1
pycparser                         2.22
pycryptodome                      3.20.0
pydantic                          2.8.2
pydantic_core                     2.20.1
pydub                             0.25.1
Pygments                          2.18.0
pylatexenc                        2.10
pynini                            2.1.5
pynndescent                       0.5.13
pynvml                            11.5.3
pyparsing                         3.1.2
pyperclip                         1.9.0
PySocks                           1.7.1
pytest                            8.3.2
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-jose                       3.3.0
python-json-logger                2.0.7
python-multipart                  0.0.9
pytorch-lightning                 2.4.0
pytorch-wpe                       0.0.1
pytz                              2024.1
PyYAML                            6.0.2
pyzmq                             26.1.0
quantile-python                   1.1
ray                               2.34.0
rdflib                            7.0.0
redis                             5.0.8
referencing                       0.35.1
regex                             2024.7.24
requests                          2.32.3
rfc3339-validator                 0.1.4
rfc3986                           1.5.0
rfc3986-validator                 0.1.1
rich                              13.7.1
rootutils                         1.0.7
rouge                             1.0.1
rpds-py                           0.20.0
rsa                               4.9
ruamel.yaml                       0.18.6
ruamel.yaml.clib                  0.2.8
ruff                              0.5.7
s3fs                              2023.10.0
s3transfer                        0.7.0
sacremoses                        0.1.1
safetensors                       0.4.4
scikit-image                      0.24.0
scikit-learn                      1.5.1
scipy                             1.14.0
seaborn                           0.13.2
segments                          2.2.1
semantic-version                  2.10.0
Send2Trash                        1.8.3
sentence-transformers             3.0.1
sentencepiece                     0.2.0
setuptools                        72.2.0
sglang                            0.2.12
shellingham                       1.5.4
six                               1.16.0
sniffio                           1.3.1
soundfile                         0.12.1
soupsieve                         2.6
soxr                              0.4.0
SQLAlchemy                        2.0.32
sse-starlette                     2.1.3
stack-data                        0.6.3
starlette                         0.37.2
stevedore                         5.2.0
sympy                             1.13.2
tabulate                          0.9.0
tblib                             3.0.0
tensorboard                       2.17.0
tensorboard-data-server           0.7.2
tensorboardX                      2.6.2.2
tensorizer                        2.9.0
terminado                         0.18.1
threadpoolctl                     3.5.0
tifffile                          2024.8.10
tiktoken                          0.7.0
timm                              1.0.8
tinycss2                          1.3.0
tokenizers                        0.19.1
tomli                             2.0.1
tomlkit                           0.12.0
torch                             2.4.0
torch-complex                     0.4.4
torchaudio                        2.4.0
torchmetrics                      1.4.1
torchvision                       0.19.0
tornado                           6.4.1
tqdm                              4.66.5
traitlets                         5.14.3
transformers                      4.43.4
transformers-stream-generator     0.0.5
triton                            3.0.0
typeguard                         4.3.0
typer                             0.11.1
types-python-dateutil             2.9.0.20240316
typing_extensions                 4.12.2
tzdata                            2024.1
umap-learn                        0.5.6
Unidecode                         1.3.8
uri-template                      1.3.0
uritemplate                       4.1.1
urllib3                           2.0.7
uvicorn                           0.30.6
uvloop                            0.19.0
vector-quantize-pytorch           1.15.6
virtualenv                        20.26.3
vllm                              0.5.4
vllm-flash-attn                   2.6.1
vocos                             0.1.0
watchfiles                        0.23.0
wcwidth                           0.2.13
webcolors                         24.8.0
webencodings                      0.5.1
websocket-client                  1.8.0
websockets                        11.0.3
Werkzeug                          3.0.3
WeTextProcessing                  1.0.3
wget                              3.2
wheel                             0.44.0
widgetsnbextension                4.0.11
wrapt                             1.16.0
xformers                          0.0.27.post2
xinference                        0.14.1.post1
xoscar                            0.3.3
xxhash                            3.4.1
yarl                              1.9.4
zipp                              3.20.0
zstandard                         0.23.0

soulzzz commented 4 weeks ago

I switched to the environment I posted above and still got the same error:

2024-08-15 10:26:02,268 xinference.model.llm.llm_family 112606 INFO     Caching from Hugging Face: ModelCloud/internlm-2.5-7b-chat-gptq-4bit
2024-08-15 10:26:02,277 xinference.model.llm.vllm.core 112668 INFO     Loading internlm2.5-chat with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-15 10:26:02,278 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,279 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,280 transformers.configuration_utils 112668 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-15 10:26:02,280 vllm.model_executor.layers.quantization.gptq_marlin 112668 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-15 10:26:02,280 vllm.engine.llm_engine 112668 INFO     Initializing an LLM engine (v0.5.4) with config: model='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', speculative_config=None, tokenizer='/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-15 10:26:02,282 transformers.tokenization_utils_base 112668 INFO     loading file ./tokenizer.model
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file added_tokens.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file special_tokens_map.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file tokenizer_config.json
2024-08-15 10:26:02,283 transformers.tokenization_utils_base 112668 INFO     loading file tokenizer.json
2024-08-15 10:26:02,372 transformers.tokenization_utils_base 112668 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-15 10:26:02,376 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,377 transformers.configuration_utils 112668 INFO     loading configuration file /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b/config.json
2024-08-15 10:26:02,378 transformers.configuration_utils 112668 INFO     Model config InternLM2Config {
  "_name_or_path": "/home/sky/.xinference/cache/internlm2_5-chat-gptq-7b",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "configuration_internlm2.InternLM2Config",
    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "checkpoint_format": "gptq",
    "damp_percent": 0.01,
    "desc_act": true,
    "group_size": 128,
    "lm_head": false,
    "meta": {
      "quantizer": "gptqmodel:0.9.5"
    },
    "model_file_base_name": null,
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 92544
}

2024-08-15 10:26:02,378 transformers.generation.configuration_utils 112668 INFO     Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 2
}

2024-08-15 10:26:02,657 vllm.worker.model_runner 112668 INFO     Starting to load model /home/sky/.xinference/cache/internlm2_5-chat-gptq-7b...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
2024-08-15 10:26:02,832 xinference.core.worker 112606 ERROR    Failed to load model internlm2.5-chat-1-0
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
2024-08-15 10:26:02,922 xinference.api.restful_api 112560 ERROR    [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
Traceback (most recent call last):
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 878, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
    await _launch_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 882, in launch_builtin_model
    await model_ref.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/core/model.py", line 300, in load
    self._model.load()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 243, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
    self.driver_worker.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/home/sky/anaconda3/envs/Xinference/lib/python3.10/site-packages/vllm/model_executor/models/internlm2.py", line 327, in load_weights
    loaded_weight = loaded_weight.view(-1, 2 + kv_groups,
RuntimeError: [address=0.0.0.0:33277, pid=112668] shape '[-1, 6, 128, 4096]' is invalid for input of size 4096
soulzzz commented 4 weeks ago

What's the version of vllm?

I posted the environment above, please have a check.
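For a quick cross-check, the runtime version can also be printed directly (vllm exposes a standard __version__ attribute):

python -c "import vllm; print(vllm.__version__)"

Both the pip list (vllm 0.5.4) and the engine-init log line ("Initializing an LLM engine (v0.5.4)") point to vLLM 0.5.4.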

soulzzz commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

qinxuye commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

It looks like an AWQ-quantized build; are you interested in adding it to Xinference and sending a PR?

soulzzz commented 4 weeks ago

Also, can we add the model internlm2_5-7b-chat-4bit to the built-in models?

It looks like an AWQ-quantized build; are you interested in adding it to Xinference and sending a PR?

I want to do this, but I can run neither internlm2.5-7B-chat in vLLM GPTQ-Int4 format nor the custom internlm2_5-7b-chat-4bit in vLLM AWQ-Int4 format, so I'd like to see the problem solved first.
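For reference, registering the AWQ build as a custom model would look roughly like the JSON sketch below. This is only an illustration: the fields follow Xinference's custom LLM family format, the model_id is an assumption, and the chat template / prompt style entry that chat models also need is omitted here.

{
  "version": 1,
  "model_name": "internlm2_5-7b-chat-4bit",
  "model_lang": ["en", "zh"],
  "model_ability": ["chat"],
  "model_specs": [
    {
      "model_format": "awq",
      "model_size_in_billions": 7,
      "quantizations": ["Int4"],
      "model_id": "internlm/internlm2_5-7b-chat-4bit"
    }
  ]
}

It could then be registered with something like:

xinference register --model-type LLM --file internlm2_5-7b-chat-4bit.json --persist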

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.