xusenlinzy / api-for-open-llm

OpenAI-style API for open large language models: use LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
Apache License 2.0

vllm container dependency error #268

Closed. Tendo33 closed this issue 2 months ago.

Tendo33 commented 2 months ago

The following items must be checked before submission

Type of problem

Model inference and deployment

Operating system

Linux

Detailed description of the problem

Environment file (.env)

PORT=8000

# model related
MODEL_NAME=Qwen1.5-72B-Chat-AWQ
MODEL_PATH=/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ
PROMPT_NAME=qwen2
EMBEDDING_NAME=/workspace/share_data/base_llms/m3e-base
CONTEXT_LEN=12000
LOAD_IN_8BIT=false
LOAD_IN_4BIT=True

TASKS=llm,rag

# device related
GPUS=0
NUM_GPUs=1
DTYPE=auto
DEVICE=cuda
DEVICE_MAP=auto

# api related
API_PREFIX=/v1

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=auto
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.95
# batch size
MAX_NUM_SEQS=256
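
For context, the SETTINGS dump further down in the log is this .env read into a configuration object, most likely through pydantic-style environment-variable settings (the usual pattern for FastAPI projects). Below is a minimal, hypothetical sketch of that mapping; the class and field names are illustrative, not the project's actual api.config.

# Hypothetical sketch of how a .env like the one above becomes a settings dump.
# Assumes pydantic v1 (v2 moves BaseSettings to the pydantic-settings package)
# and python-dotenv for env_file support; names are illustrative only.
from pydantic import BaseSettings

class DemoSettings(BaseSettings):
    port: int = 8000
    model_name: str = ""
    model_path: str = ""
    context_len: int = 2048               # filled from CONTEXT_LEN
    gpu_memory_utilization: float = 0.95  # filled from GPU_MEMORY_UTILIZATION
    max_num_seqs: int = 256

    class Config:
        env_file = ".env"
        case_sensitive = False             # so NUM_GPUs and NUM_GPUS both match

settings = DemoSettings()
print(settings.dict())                     # comparable to the SETTINGS block below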

Dependencies

# Please paste the dependencies here

Runtime logs or screenshots

WARNING: CUDA Minor Version Compatibility mode ENABLED.
  Using driver version 530.30.02 which has support for CUDA 12.1.  This container
  was built with CUDA 12.2 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

2024-04-23 09:18:34.708 | DEBUG    | api.config:<module>:338 - SETTINGS: {
    "embedding_name": "/workspace/share_data/base_llms/m3e-base",
    "rerank_name": null,
    "embedding_size": -1,
    "embedding_device": "cuda:0",
    "rerank_device": "cuda:0",
    "trust_remote_code": true,
    "tokenize_mode": "auto",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.95,
    "max_num_batched_tokens": -1,
    "max_num_seqs": 256,
    "quantization_method": null,
    "enforce_eager": false,
    "max_context_len_to_capture": 8192,
    "max_loras": 1,
    "max_lora_rank": 32,
    "lora_extra_vocab_size": 256,
    "lora_dtype": "auto",
    "max_cpu_loras": -1,
    "lora_modules": "",
    "vllm_disable_log_stats": true,
    "model_name": "Qwen1.5-72B-Chat-AWQ",
    "model_path": "/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ",
    "dtype": "auto",
    "load_in_8bit": false,
    "load_in_4bit": true,
    "context_length": 12000,
    "chat_template": "qwen2",
    "rope_scaling": null,
    "flash_attn": false,
    "use_streamer_v2": true,
    "interrupt_requests": true,
    "host": "0.0.0.0",
    "port": 8000,
    "api_prefix": "/v1",
    "engine": "vllm",
    "tasks": [
        "llm",
        "rag"
    ],
    "device_map": "auto",
    "gpus": "0",
    "num_gpus": 1,
    "activate_inference": true,
    "model_names": [
        "Qwen1.5-72B-Chat-AWQ",
        "m3e-base"
    ],
    "api_keys": null
}
WARNING 04-23 09:18:40 config.py:208] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-23 09:18:40 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-23 09:18:40 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-23 09:18:40 selector.py:21] Using XFormers backend.
INFO 04-23 09:18:51 model_runner.py:104] Loading model weights took 38.4595 GB
INFO 04-23 09:18:58 gpu_executor.py:94] # GPU blocks: 844, # CPU blocks: 102
INFO 04-23 09:19:00 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-23 09:19:00 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-23 09:19:17 model_runner.py:867] Graph capturing finished in 17 secs.
2024-04-23 09:19:17.899 | INFO     | api.models:create_vllm_engine:127 - Using vllm engine
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Form data requires "python-multipart" to be installed. 
You can install "python-multipart" with: 

pip install python-multipart

Traceback (most recent call last):
  File "/workspace/api/server.py", line 18, in <module>
    from api.routes.file import file_router
  File "/workspace/api/routes/file.py", line 46, in <module>
    async def upload_file(file: UploadFile):
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 944, in decorator
    self.add_api_route(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 883, in add_api_route
    route = route_class(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 519, in __init__
    self.body_field = get_body_field(dependant=self.dependant, name=self.unique_id)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 817, in get_body_field
    check_file_field(final_field)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 100, in check_file_field
    raise RuntimeError(multipart_not_installed_error) from None
RuntimeError: Form data requires "python-multipart" to be installed. 
You can install "python-multipart" with: 

pip install python-multipart
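
The traceback is FastAPI's startup check for form/file handling: api/routes/file.py declares an UploadFile parameter, and FastAPI refuses to register that route unless the python-multipart package is importable. A quick check that can be run inside the container (a small sketch; the import name multipart is the module that the python-multipart package provides):

# Confirms whether the dependency behind the RuntimeError above is present.
try:
    import multipart  # module installed by the python-multipart package
    print("python-multipart is installed:", multipart.__version__)
except ImportError:
    print("python-multipart is missing; install it with: pip install python-multipart")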
xusenlinzy commented 2 months ago

Added it to requirements.txt.
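
Once the image is rebuilt with python-multipart in requirements.txt (or the package is installed manually in the running container), the server should come up. A minimal smoke test against the OpenAI-style endpoint, assuming the port, API prefix, and model name from the .env above and the openai>=1.0 Python client; adjust the base URL to your deployment:

# Smoke test for the OpenAI-compatible endpoint exposed by this project.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # no API key configured above
resp = client.chat.completions.create(
    model="Qwen1.5-72B-Chat-AWQ",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=16,
)
print(resp.choices[0].message.content)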