OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend interface for open-source large language models.
The following items must be checked before submission
[X] Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
[X] I have read the project documentation and the FAQ section, and I have searched the existing issues / discussions without finding a similar problem or solution.
Type of problem
Model inference and deployment
Operating system
Linux
Detailed description of the problem
Environment file
PORT=8000
# model related
MODEL_NAME=Qwen1.5-72B-Chat-AWQ
MODEL_PATH=/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ
PROMPT_NAME=qwen2
EMBEDDING_NAME=/workspace/share_data/base_llms/m3e-base
CONTEXT_LEN=12000
LOAD_IN_8BIT=false
LOAD_IN_4BIT=true
TASKS=llm,rag
# device related
GPUS=0
NUM_GPUS=1
DTYPE=auto
DEVICE=cuda
DEVICE_MAP=auto
# api related
API_PREFIX=/v1
# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=auto
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.95
# batch size (max number of concurrent sequences)
MAX_NUM_SEQS=256
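For context on how these variables take effect: the server reads the .env file into a settings object at startup (see the SETTINGS dump in the logs below). A minimal sketch of that loading pattern, assuming python-dotenv and field names mirroring the dump; this is illustrative only, not the project's actual api.config module:

```python
# Hypothetical loader sketch: populate os.environ from the .env above and
# coerce a few values the way the SETTINGS dump suggests. The class name
# and defaults are assumptions for illustration only.
import os
from dataclasses import dataclass

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv(".env")

@dataclass
class Settings:
    model_name: str = os.getenv("MODEL_NAME", "")
    model_path: str = os.getenv("MODEL_PATH", "")
    engine: str = os.getenv("ENGINE", "default")
    context_length: int = int(os.getenv("CONTEXT_LEN", "-1"))
    gpu_memory_utilization: float = float(os.getenv("GPU_MEMORY_UTILIZATION", "0.9"))
    max_num_seqs: int = int(os.getenv("MAX_NUM_SEQS", "256"))
    load_in_4bit: bool = os.getenv("LOAD_IN_4BIT", "false").lower() == "true"

settings = Settings()
```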
Dependencies
# Please paste the dependencies here
Runtime logs or screenshots
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 530.30.02 which has support for CUDA 12.1. This container
was built with CUDA 12.2 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
2024-04-23 09:18:34.708 | DEBUG | api.config:<module>:338 - SETTINGS: {
    "embedding_name": "/workspace/share_data/base_llms/m3e-base",
    "rerank_name": null,
    "embedding_size": -1,
    "embedding_device": "cuda:0",
    "rerank_device": "cuda:0",
    "trust_remote_code": true,
    "tokenize_mode": "auto",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.95,
    "max_num_batched_tokens": -1,
    "max_num_seqs": 256,
    "quantization_method": null,
    "enforce_eager": false,
    "max_context_len_to_capture": 8192,
    "max_loras": 1,
    "max_lora_rank": 32,
    "lora_extra_vocab_size": 256,
    "lora_dtype": "auto",
    "max_cpu_loras": -1,
    "lora_modules": "",
    "vllm_disable_log_stats": true,
    "model_name": "Qwen1.5-72B-Chat-AWQ",
    "model_path": "/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ",
    "dtype": "auto",
    "load_in_8bit": false,
    "load_in_4bit": true,
    "context_length": 12000,
    "chat_template": "qwen2",
    "rope_scaling": null,
    "flash_attn": false,
    "use_streamer_v2": true,
    "interrupt_requests": true,
    "host": "0.0.0.0",
    "port": 8000,
    "api_prefix": "/v1",
    "engine": "vllm",
    "tasks": [
        "llm",
        "rag"
    ],
    "device_map": "auto",
    "gpus": "0",
    "num_gpus": 1,
    "activate_inference": true,
    "model_names": [
        "Qwen1.5-72B-Chat-AWQ",
        "m3e-base"
    ],
    "api_keys": null
}
WARNING 04-23 09:18:40 config.py:208] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-23 09:18:40 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer='/workspace/share_data/base_llms/Qwen1.5-72B-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-23 09:18:40 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-23 09:18:40 selector.py:21] Using XFormers backend.
INFO 04-23 09:18:51 model_runner.py:104] Loading model weights took 38.4595 GB
INFO 04-23 09:18:58 gpu_executor.py:94] # GPU blocks: 844, # CPU blocks: 102
INFO 04-23 09:19:00 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-23 09:19:00 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-23 09:19:17 model_runner.py:867] Graph capturing finished in 17 secs.
2024-04-23 09:19:17.899 | INFO | api.models:create_vllm_engine:127 - Using vllm engine
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
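As a sanity check on the KV-cache numbers above (my own arithmetic, assuming vLLM's default block size of 16 tokens per block):

```python
# 844 GPU blocks at vLLM's default 16 tokens/block (assumption) gives the
# total number of tokens the KV cache can hold at once.
gpu_blocks = 844
block_size = 16
print(gpu_blocks * block_size)  # 13504 tokens, enough for one 12000-token sequence
```

So the engine came up healthy; the failure below is unrelated to the model or GPU memory.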
Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:
pip install python-multipart
Traceback (most recent call last):
  File "/workspace/api/server.py", line 18, in <module>
    from api.routes.file import file_router
  File "/workspace/api/routes/file.py", line 46, in <module>
    async def upload_file(file: UploadFile):
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 944, in decorator
    self.add_api_route(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 883, in add_api_route
    route = route_class(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 519, in __init__
    self.body_field = get_body_field(dependant=self.dependant, name=self.unique_id)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 817, in get_body_field
    check_file_field(final_field)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/dependencies/utils.py", line 100, in check_file_field
    raise RuntimeError(multipart_not_installed_error) from None
RuntimeError: Form data requires "python-multipart" to be installed.
You can install "python-multipart" with:
pip install python-multipart
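For anyone hitting the same trace: FastAPI verifies at route-registration time that python-multipart is installed whenever a route declares UploadFile or Form parameters, so the server crashes on import, before it can serve anything. A standalone sketch reproducing the failure (not the project's actual file.py):

```python
# With python-multipart missing, registering this route raises
# RuntimeError immediately at import time, exactly as in the traceback.
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/upload")
async def upload_file(file: UploadFile):
    return {"filename": file.filename}
```

Running pip install python-multipart inside the container (or baking it into the image) should get past this error; the vLLM engine itself had already initialized successfully.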