xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Slow inference after loading the peft-model of a qwen1.5 model fine-tuned with SWIFT #1255

Closed. lordk911 closed this 3 weeks ago

lordk911 commented 5 months ago

Describe the bug

A qwen1.5-14b-chat-gptq-int4 model was fine-tuned with LoRA via SWIFT; after loading the resulting peft-model, inference is very slow.

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.10
  2. The version of xinference you use: 0.10.0
  3. Versions of crucial packages.
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

cat swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/default/adapter_config.json

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": null,
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": false,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "lora_dtype": "fp32",
  "lr_ratio": null,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "model_key_mapping": null,
  "modules_to_save": [],
  "peft_type": "LORA",
  "r": 8,
  "rank_pattern": {},
  "revision": null,
  "swift_type": "LORA",
  "target_modules": [
    "q_proj",
    "k_proj",
    "v_proj"
  ],
  "task_type": null,
  "use_dora": false,
  "use_merged_linear": false,
  "use_qa_lora": false,
  "use_rslora": false
}

cat swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/configuration.json

{
    "framework": "pytorch",
    "task": "text-generation",
    "allow_remote": true,
    "adapter_cfg": {
        "model_id_or_path": "/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4",
        "model_revision": "master",
        "sft_type": "lora",
        "tuner_backend": "swift",
        "template_type": "qwen",
        "dtype": "fp16",
        "system": "You are a helpful assistant and you are very professional in doing data analysis based on SQL."
    }
}
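For context, launching with --peft-model-path sends the model down Xinference's PyTorch (transformers) code path rather than vLLM. A rough sketch of what that path amounts to, assuming the SWIFT checkpoint is peft-compatible and auto-gptq is installed (paths are the ones shown above; this is not the actual Xinference implementation):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4"   # from configuration.json
adapter_path = ("/data/llm-project/swift/output/qwen1half-14b-chat-int4/"
                "v2-20240407-163120/checkpoint-1546/default")     # from the launch command

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto")
# The LoRA weights stay unmerged, so every forward pass pays the extra
# adapter matmuls on top of plain HF generate (no paged attention,
# no continuous batching), which is consistent with ~2 tokens/s.
model = PeftModel.from_pretrained(model, adapter_path)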

script to launch model :

xinference launch -n qwen1.5-chat -u qwen1.5-14B-Chat-SQL -s 14 -f gptq --max_model_len 32000 -e "http://10.9.123.456:9997" --worker-ip 10.9.123.456 --peft-model-path /data/llm-project/swift/output/qwen1half-14b-chat-int4/v2-20240407-163120/checkpoint-1546/default

the log of xinference worker:

2024-04-08 09:50:19,996 xinference.model.llm.pytorch.utils 65345 INFO     Average generation speed: 1.97 tokens/s.
2024-04-08 09:56:56,031 xinference.model.llm.pytorch.utils 65345 INFO     Average generation speed: 1.98 tokens/s.

If I use qwen1.5-chat-gptq-14b-Int4 directly (without the adapter), the xinference worker log shows:

INFO 04-08 09:54:08 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%
INFO 04-08 09:54:13 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.8%, CPU KV cache usage: 0.0%
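
To quantify the gap, the same request can be timed through the Python client. A sketch; the endpoint and model UID are the ones from the launch command above, and the OpenAI-style usage block is assumed to be present in the reply:

import time
from xinference.client import Client

client = Client("http://10.9.123.456:9997")       # -e from the launch command
model = client.get_model("qwen1.5-14B-Chat-SQL")  # -u from the launch command

start = time.time()
resp = model.chat(
    "Write a SQL query that counts orders per day.",
    generate_config={"max_tokens": 256},
)
elapsed = time.time() - start
completion_tokens = resp["usage"]["completion_tokens"]  # OpenAI-style usage block
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")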

Expected behavior

Generation speed with the LoRA adapter loaded should be roughly comparable to the base model served directly (~45 tokens/s in the log above), not ~2 tokens/s.

ChengjieLi28 commented 5 months ago

@lordk911 The vLLM backend does not support LoRA yet, which is why a model launched with a peft adapter falls back to the slower PyTorch backend (visible in your logs). Support is planned within the next two releases.
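
For reference, recent vLLM releases expose a native multi-LoRA API, which is what Xinference would need to wire up. A sketch using the paths from this issue; whether a GPTQ-quantized base model works with LoRA depends on the vLLM version in use:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model and adapter paths are the ones from this issue.
llm = LLM(
    model="/data/xinference/cache/qwen1.5-chat-gptq-14b-Int4",
    quantization="gptq",
    enable_lora=True,   # reserve capacity for LoRA adapters
)
lora = LoRARequest(
    "sql-adapter",      # arbitrary adapter name
    1,                  # unique integer id for this adapter
    "/data/llm-project/swift/output/qwen1half-14b-chat-int4/"
    "v2-20240407-163120/checkpoint-1546/default",
)
outputs = llm.generate(
    ["Write a SQL query that counts orders per day."],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)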

lordk911 commented 5 months ago

Thanks for the reply.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.