OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend interface for open-source large language models.
The following items must be checked before submission
[X] Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
[X] I have read the project documentation and the FAQ section, and I have searched the existing issues / discussions without finding a similar problem or solution.
Type of problem
Model inference and deployment
Operating system
Linux
Detailed description of the problem
Environment file
PORT=8000
# model related
MODEL_NAME=Qwen1.5-72B-Chat-AWQ
MODEL_PATH=/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ
PROMPT_NAME=qwen2
EMBEDDING_NAME=/workspace/share_data/base_llms/m3e-base
CONTEXT_LEN=12000
LOAD_IN_8BIT=false
LOAD_IN_4BIT=true
TASKS=llm,rag
# device related
GPUS=0
NUM_GPUS=1
DTYPE=auto
DEVICE=cuda
DEVICE_MAP=auto
# api related
API_PREFIX=/v1
# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=auto
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.95
# batch size (max number of concurrent sequences)
MAX_NUM_SEQS=256
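For context on how these variables take effect: the server reads the .env file into a settings object at startup (see the SETTINGS dump in the logs below). A minimal sketch of that loading pattern, assuming python-dotenv and field names mirroring the dump; this is illustrative only, not the project's actual api.config module:

```python
# Hypothetical loader sketch: populate os.environ from the .env above and
# coerce a few values the way the SETTINGS dump suggests. The class name
# and defaults are assumptions for illustration only.
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv(".env")

@dataclass
class Settings:
    model_name: str = os.getenv("MODEL_NAME", "")
    model_path: str = os.getenv("MODEL_PATH", "")
    engine: str = os.getenv("ENGINE", "default")
    context_length: int = int(os.getenv("CONTEXT_LEN", "-1"))
    gpu_memory_utilization: float = float(os.getenv("GPU_MEMORY_UTILIZATION", "0.9"))
    max_num_seqs: int = int(os.getenv("MAX_NUM_SEQS", "256"))
    load_in_4bit: bool = os.getenv("LOAD_IN_4BIT", "false").lower() == "true"

settings = Settings()
```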
Dependencies
# Please paste the dependencies here
Runtime logs or screenshots
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 530.30.02 which has support for CUDA 12.1. This container
was built with CUDA 12.2 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
2024-04-23 09:18:34.708 | DEBUG | api.config:<module>:338 - SETTINGS: {
    "embedding_name": "/workspace/share_data/base_llms/m3e-base",
    "rerank_name": null,
    "embedding_size": -1,
    "embedding_device": "cuda:0",
    "rerank_device": "cuda:0",
    "trust_remote_code": true,
    "tokenize_mode": "auto",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.95,
    "max_num_batched_tokens": -1,
    "max_num_seqs": 256,
    "quantization_method": null,
    "enforce_eager": false,
    "max_context_len_to_capture": 8192,
    "max_loras": 1,
    "max_lora_rank": 32,
    "lora_extra_vocab_size": 256,
    "lora_dtype": "auto",
    "max_cpu_loras": -1,
    "lora_modules": "",
    "vllm_disable_log_stats": true,
    "model_name": "Qwen1.5-72B-Chat-AWQ",
    "model_path": "/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ",
    "dtype": "auto",
    "load_in_8bit": false,
    "load_in_4bit": true,
    "context_length": 12000,
    "chat_template": "qwen2",
    "rope_scaling": null,
    "flash_attn": false,
    "use_streamer_v2": true,
    "interrupt_requests": true,
    "host": "0.0.0.0",
    "port": 8000,
    "api_prefix": "/v1",
    "engine": "vllm",
    "tasks": [
        "llm",
        "rag"
    ],
    "device_map": "auto",
    "gpus": "0",
    "num_gpus": 1,
    "activate_inference": true,
    "model_names": [
        "Qwen1.5-72B-Chat-AWQ",
        "m3e-base"
    ],
    "api_keys": null
}
WARNING 04-23 09:18:40 config.py:208] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-23 09:18:40 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-23 09:18:40 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-23 09:18:40 selector.py:21] Using XFormers backend.
INFO 04-23 09:18:51 model_runner.py:104] Loading model weights took 38.4595 GB
INFO 04-23 09:18:58 gpu_executor.py:94] # GPU blocks: 844, # CPU blocks: 102
INFO 04-23 09:19:00 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-23 09:19:00 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-23 09:19:17 model_runner.py:867] Graph capturing finished in 17 secs.
2024-04-23 09:19:17.899 | INFO | api.models:create_vllm_engine:127 - Using vllm engine
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
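As a sanity check on the KV-cache numbers above (my own arithmetic, assuming vLLM's default block size of 16 tokens per block):

```python
# 844 GPU blocks at vLLM's default 16 tokens/block (assumption) gives the
# total number of tokens the KV cache can hold at once.
gpu_blocks = 844
block_size = 16
print(gpu_blocks * block_size)  # 13504 tokens, enough for one 12000-token sequence
```

So the engine came up healthy; the failure below is unrelated to the model or GPU memory.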
Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:
pip install python-multipart
Traceback (most recent call last):
  File "/workspace/api/server.py", line 18, in <module>
    from api.routes.file import file_router
  File "/workspace/api/routes/file.py", line 46, in <module>
    async def upload_file(file: UploadFile):
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 944, in decorator
    self.add_api_route(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 883, in add_api_route
    route = route_class(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 519, in __init__
    self.body_field = get_body_field(dependant=self.dependant, name=self.unique_id)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 817, in get_body_field
    check_file_field(final_field)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 100, in check_file_field
    raise RuntimeError(multipart_not_installed_error) from None
RuntimeError: Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:
pip install python-multipart
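For anyone hitting the same trace: FastAPI verifies at route-registration time that python-multipart is installed whenever a route declares UploadFile or Form parameters, so the server crashes on import, before it can serve anything. A standalone sketch reproducing the failure (not the project's actual file.py):

```python
# With python-multipart missing, registering this route raises
# RuntimeError immediately at import time, exactly as in the traceback.
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/upload")
async def upload_file(file: UploadFile):
    return {"filename": file.filename}
```

Running pip install python-multipart inside the container (or baking it into the image) should get past this error; the vLLM engine itself had already initialized successfully.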