xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Slow inference after loading the peft-model of a qwen1.5 model fine-tuned with SWIFT #1255

Closed. lordk911 closed this 3 weeks ago

lordk911 commented 5 months ago

Describe the bug

A qwen1.5-14b-chat-gptq-int4 model was fine-tuned with LoRA via SWIFT; after loading the resulting peft-model, inference is very slow.

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.10
  2. The version of xinference you use: 0.10.0
  3. Versions of crucial packages.
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

cat swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/default/adapter_config.json

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": null,
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": false,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "lora_dtype": "fp32",
  "lr_ratio": null,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "model_key_mapping": null,
  "modules_to_save": [],
  "peft_type": "LORA",
  "r": 8,
  "rank_pattern": {},
  "revision": null,
  "swift_type": "LORA",
  "target_modules": [
    "q_proj",
    "k_proj",
    "v_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_merged_linear": false,
  "use_qa_lora": false,
  "use_rslora": false
}

cat swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/configuration.json

{
    "framework": "pytorch",
    "task": "text-generation",
    "allow_remote": true,
    "adapter_cfg": {
        "model_id_or_path": "/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4",
        "model_revision": "master",
        "sft_type": "lora",
        "tuner_backend": "swift",
        "template_type": "qwen",
        "dtype": "fp16",
        "system": "You are a helpful assistant and you are very professional in doing data analysis based on SQL."
    }
}
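For context, launching with --peft-model-path sends the model down Xinference's PyTorch (transformers) code path rather than vLLM. A rough sketch of what that path amounts to, assuming the SWIFT checkpoint is peft-compatible and auto-gptq is installed (paths are the ones shown above; this is not the actual Xinference implementation):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4"   # from configuration.json
adapter_path = ("/data/llm-project/swift/output/qwen1half-14b-chat-int4/"
                "v2-20240407-163120/checkpoint-1546/default")     # from the launch command

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto")
# The LoRA weights stay unmerged, so every forward pass pays the extra
# adapter matmuls on top of plain HF generate (no paged attention,
# no continuous batching), which is consistent with ~2 tokens/s.
model = PeftModel.from_pretrained(model, adapter_path)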

script to launch model :

xinference launch -n qwen1.5-chat -u qwen1.5-14B-Chat-SQL -s 14 -f gptq --max_model_len 32000 -e "http://10.9.123.456:9997" --worker-ip 10.9.123.456 --peft-model-path /data/llm-project/swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/default

the log of xinference worker:

2024-04-08 09:50:19,996 xinference.model.llm.pytorch.utils 65345 INFO     Average generation speed: 1.97 tokens/s.
2024-04-08 09:56:56,031 xinference.model.llm.pytorch.utils 65345 INFO     Average generation speed: 1.98 tokens/s.

If I use qwen1.5-chat-gptq-14b-Int4 directly (without the adapter), the xinference worker log shows:

INFO 04-08 09:54:08 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%
INFO 04-08 09:54:13 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
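
To quantify the gap, the same request can be timed through the Python client. A sketch; the endpoint and model UID are the ones from the launch command above, and the OpenAI-style usage block is assumed to be present in the reply:

import time
from xinference.client import Client

client = Client("http://10.9.123.456:9997")       # -e from the launch command
model = client.get_model("qwen1.5-14B-Chat-SQL")  # -u from the launch command

start = time.time()
resp = model.chat(
    "Write a SQL query that counts orders per day.",
    generate_config={"max_tokens": 256},
)
elapsed = time.time() - start
completion_tokens = resp["usage"]["completion_tokens"]  # OpenAI-style usage block
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")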

Expected behavior

Generation speed with the LoRA adapter loaded should be roughly comparable to the base model served directly (~45 tokens/s in the log above), not ~2 tokens/s.

ChengjieLi28 commented 5 months ago

@lordk911 The vLLM backend does not support LoRA yet, which is why a model launched with a peft adapter falls back to the slower PyTorch backend (visible in your logs). Support is planned within the next two releases.
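
For reference, recent vLLM releases expose a native multi-LoRA API, which is what Xinference would need to wire up. A sketch using the paths from this issue; whether a GPTQ-quantized base model works with LoRA depends on the vLLM version in use:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model and adapter paths are the ones from this issue.
llm = LLM(
    model="/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4",
    quantization="gptq",
    enable_lora=True,   # reserve capacity for LoRA adapters
)
lora = LoRARequest(
    "sql-adapter",      # arbitrary adapter name
    1,                  # unique integer id for this adapter
    "/data/llm-project/swift/output/qwen1half-14b-chat-int4/"
    "v2-20240407-163120/checkpoint-1546/default",
)
outputs = llm.generate(
    ["Write a SQL query that counts orders per day."],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)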

lordk911 commented 5 months ago

Thanks for the reply.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.