OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
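For reference, a minimal sketch of how a client would typically call this OpenAI-compatible server, assuming the `openai` Python client (v1.x) and the host, port, API prefix, and model name shown in the settings dump below; the API key value is a placeholder, since `api_keys` is null:

```python
from openai import OpenAI

# Hypothetical client-side call; base_url matches host 0.0.0.0 / port 8010 / api_prefix "/v1"
# from the settings dump, and model "qwen" matches model_name. The api_key is a dummy value
# because api_keys is null (no auth configured).
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "你好"}],
    temperature=0.3,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```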
# Startup log:
2024-06-05 10:44:15.472 | DEBUG | api.config:<module>:338 - SETTINGS: {
"trust_remote_code": true,
"tokenize_mode": "auto",
"tensor_parallel_size": 1,
"gpu_memory_utilization": 0.9,
"max_num_batched_tokens": -1,
"max_num_seqs": 256,
"quantization_method": null,
"enforce_eager": false,
"max_seq_len_to_capture": 8192,
"max_loras": 1,
"max_lora_rank": 32,
"lora_extra_vocab_size": 256,
"lora_dtype": "auto",
"max_cpu_loras": -1,
"lora_modules": "",
"vllm_disable_log_stats": true,
"model_name": "qwen",
"model_path": "/opt/ai-models/Qwen-14B-Chat-Int4",
"dtype": "auto",
"load_in_8bit": false,
"load_in_4bit": true,
"context_length": -1,
"chat_template": null,
"rope_scaling": null,
"flash_attn": false,
"use_streamer_v2": false,
"interrupt_requests": true,
"host": "0.0.0.0",
"port": 8010,
"api_prefix": "/v1",
"engine": "vllm",
"tasks": [
"llm"
],
"device_map": "auto",
"gpus": null,
"num_gpus": 1,
"activate_inference": true,
"model_names": [
"qwen"
],
"api_keys": null
}
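Only the engine-related subset of these settings appears to reach vLLM (compare the engine config printed at 10:44:21 below). As a rough illustration, not the project's actual loading code, those settings correspond to something like this plain vLLM 0.4.3 call; the prompt and sampling values are placeholders:

```python
from vllm import LLM, SamplingParams

# Rough vLLM-only equivalent of the engine-related settings above; vLLM auto-detects
# the GPTQ quantization of Qwen-14B-Chat-Int4 and picks the gptq_marlin kernel.
llm = LLM(
    model="/opt/ai-models/Qwen-14B-Chat-Int4",
    trust_remote_code=True,        # trust_remote_code: true
    dtype="auto",                  # dtype: "auto" (cast to float16 for this checkpoint)
    tensor_parallel_size=1,        # tensor_parallel_size: 1
    gpu_memory_utilization=0.9,    # gpu_memory_utilization: 0.9
    enforce_eager=False,           # enforce_eager: false, so CUDA graphs are captured
)
outputs = llm.generate(["你好"], SamplingParams(temperature=0.3, max_tokens=64))
print(outputs[0].outputs[0].text)
```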
INFO 06-05 10:44:19 config.py:1130] Casting torch.float32 to torch.float16.
INFO 06-05 10:44:19 config.py:1151] Downcasting torch.float32 to torch.float16.
INFO 06-05 10:44:21 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-05 10:44:21 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/opt/ai-models/Qwen-14B-Chat-Int4', speculative_config=None, tokenizer='/opt/ai-models/Qwen-14B-Chat-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/opt/ai-models/Qwen-14B-Chat-Int4)
WARNING 06-05 10:44:22 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 06-05 10:44:27 model_runner.py:146] Loading model weights took 9.0253 GB
INFO 06-05 10:44:30 gpu_executor.py:83] # GPU blocks: 2030, # CPU blocks: 327
INFO 06-05 10:44:36 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-05 10:44:36 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-05 10:44:51 model_runner.py:924] Graph capturing finished in 15 secs.
2024-06-05 10:44:51.670 | INFO | api.models:create_vllm_engine:127 - Using vllm engine
WARNING 06-05 10:44:52 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8010 (Press CTRL+C to quit)
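Regarding the repeated slow-tokenizer warning above: Qwen-14B-Chat ships a custom, tiktoken-based tokenizer that transformers loads as a slow tokenizer, so the warning is expected for this checkpoint. A quick standalone check (an assumption about the environment, not part of the server code) of what gets loaded:

```python
from transformers import AutoTokenizer

# Inspect the tokenizer loaded for this checkpoint; Qwen's custom tokenizer is
# tiktoken-based and has no "fast" (Rust) counterpart, hence the warning above.
tok = AutoTokenizer.from_pretrained("/opt/ai-models/Qwen-14B-Chat-Int4", trust_remote_code=True)
print(type(tok).__name__, tok.is_fast)
```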
# Inference request log:
2024-06-05 10:44:54.752 | DEBUG | api.vllm_routes.chat:create_chat_completion:65 - ==== request ====
{'model': 'Qwen', 'frequency_penalty': 0.0, 'function_call': None, 'functions': None, 'logit_bias': None, 'logprobs': False, 'max_tokens': 512, 'n': 1, 'presence_penalty': 0.0, 'response_format': None, 'seed': None, 'stop': ['<|endoftext|>', '<|im_end|>'], 'temperature': 0.3, 'tool_choice': None, 'tools': None, 'top_logprobs': None, 'top_p': 1.0, 'user': None, 'stream': False, 'repetition_penalty': 1.03, 'typical_p': None, 'watermark': False, 'best_of': 1, 'ignore_eos': False, 'use_beam_search': False, 'stop_token_ids': [151643, 151644, 151645], 'skip_special_tokens': True, 'spaces_between_special_tokens': True, 'min_p': 0.0, 'include_stop_str_in_output': False, 'length_penalty': 1.0, 'guided_json': None, 'guided_regex': None, 'guided_choice': None, 'guided_grammar': None, 'guided_decoding_backend': 'lm-format-enforcer', 'prompt_or_messages': [{'content': '根据我的描述帮我写一段文字,字数要求50字以内;例如:节假日帮我写一段节日祝福,人物帮我介绍下人物生平简介,地点帮我介绍下来历和风俗。\n现在输入:${content}\n请按上面的要求回答。', 'role': 'system'}, {'content': '周星驰', 'role': 'user'}], 'echo': False}
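The non-standard fields in this request dump (repetition_penalty, stop_token_ids, guided_decoding_backend, and so on) are extensions beyond the OpenAI schema. With the OpenAI v1.x Python client they would typically be forwarded through `extra_body`; this is an illustrative sketch of such a call, not the exact client that produced the request above:

```python
from openai import OpenAI

# Hypothetical reconstruction of the request above; fields the OpenAI schema does not
# know about are forwarded verbatim via extra_body.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "根据我的描述帮我写一段文字,字数要求50字以内;..."},  # system prompt truncated here
        {"role": "user", "content": "周星驰"},
    ],
    temperature=0.3,
    max_tokens=512,
    stop=["<|endoftext|>", "<|im_end|>"],
    extra_body={
        "repetition_penalty": 1.03,
        "stop_token_ids": [151643, 151644, 151645],
        "guided_decoding_backend": "lm-format-enforcer",
    },
)
print(resp.choices[0].message.content)
```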
2024-06-05 10:44:55.222 | INFO | api.vllm_routes.chat:create_chat_completion:107 - ==== guided_decoding_backend ====
lm-format-enforcer
INFO: 10.49.16.60:58712 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/opt/app/api/vllm_routes/chat.py", line 110, in create_chat_completion
await get_guided_decoding_logits_processor(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/__init__.py", line 20, in get_guided_decoding_logits_processor
return await get_lm_format_enforcer_guided_decoding_logits_processor(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 30, in get_lm_format_enforcer_guided_decoding_logits_processor
tokenizer_data = _cached_build_vllm_token_enforcer_tokenizer_data(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 70, in _cached_build_vllm_token_enforcer_tokenizer_data
return build_vllm_token_enforcer_tokenizer_data(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/vllm.py", line 40, in build_vllm_token_enforcer_tokenizer_data
return build_token_enforcer_tokenizer_data(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py", line 70, in build_token_enforcer_tokenizer_data
regular_tokens = _build_regular_tokens_list(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py", line 58, in _build_regular_tokens_list
for token_idx in range(len(tokenizer)):
TypeError: object of type 'Encoding' has no len()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ops/.local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/ops/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/opt/app/api/vllm_routes/chat.py", line 118, in create_chat_completion
await get_guided_decoding_logits_processor(
TypeError: get_guided_decoding_logits_processor() missing 1 required positional argument: 'tokenizer'
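The root failure is the first TypeError: lm-format-enforcer's transformers integration iterates `range(len(tokenizer))`, and the object it ends up with here is an 'Encoding', most likely the tiktoken Encoding that Qwen's custom tokenizer is built on, which has no `__len__`. A minimal illustration using a stand-in tiktoken encoding (cl100k_base, not Qwen's actual vocabulary):

```python
import tiktoken

# Stand-in for the tiktoken Encoding underlying Qwen's tokenizer (not the real Qwen vocab).
enc = tiktoken.get_encoding("cl100k_base")

try:
    len(enc)  # what lm-format-enforcer effectively does via range(len(tokenizer))
except TypeError as e:
    print(e)  # object of type 'Encoding' has no len()

print(enc.n_vocab)  # the Encoding-native way to get the vocabulary size
```

The second TypeError then comes from the fallback at api/vllm_routes/chat.py line 118, which apparently calls the helper with one argument too few; in vLLM 0.4.3 `get_guided_decoding_logits_processor` takes the backend name, the request, and the tokenizer, as the frame at __init__.py line 20 above shows. A sketch of the expected call shape, where `request`, `tokenizer`, and `sampling_params` are assumptions about the handler's local variables, not the project's actual code:

```python
from vllm.model_executor.guided_decoding import get_guided_decoding_logits_processor

async def apply_guided_decoding(request, tokenizer, sampling_params):
    # vLLM 0.4.3 signature: (guided_decoding_backend, request, tokenizer) -> Optional[LogitsProcessor]
    logits_processor = await get_guided_decoding_logits_processor(
        request.guided_decoding_backend,  # "lm-format-enforcer" in the request dump above
        request,
        tokenizer,
    )
    if logits_processor is not None:
        existing = sampling_params.logits_processors or []
        sampling_params.logits_processors = existing + [logits_processor]
    return sampling_params
```

Note also that the engine was initialized with DecodingConfig(guided_decoding_backend='outlines') while this request asks for 'lm-format-enforcer'; given the first traceback, the lm-format-enforcer path does not seem to cope with this checkpoint's tiktoken-based tokenizer, so the 'outlines' backend may behave differently here.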
问题类型 | Type of problem
模型推理和部署 | Model inference and deployment
操作系统 | Operating system
Linux
详细描述问题 | Detailed description of the problem
Using the latest code from the main branch, deployed with Docker, with dependencies installed. Running inference on Qwen-14B: startup shows no errors, but the error occurs when handling an inference request (see the logs above).
Docker image environment: CUDA 12.2, Python 3.10, vllm 0.4.3
Dependencies
运行日志或截图 | Runtime logs or screenshots
See the startup and inference request logs at the top of this issue.