OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
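For reference, a minimal sketch of how a client would typically call this OpenAI-compatible server, assuming the `openai` Python client (v1.x) and the host, port, API prefix, and model name shown in the settings dump below; the API key value is a placeholder, since `api_keys` is null:

```python
from openai import OpenAI

# Hypothetical client-side call; base_url matches host 0.0.0.0 / port 8010 / api_prefix "/v1"
# from the settings dump, and model "qwen" matches model_name. The api_key is a dummy value
# because api_keys is null (no auth configured).
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "你好"}],
    temperature=0.3,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```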
# Startup log:
2024-06-05 10:44:15.472 | DEBUG | api.config:<module>:338 - SETTINGS: {
"trust_remote_code": true,
"tokenize_mode": "auto",
"tensor_parallel_size": 1,
"gpu_memory_utilization": 0.9,
"max_num_batched_tokens": -1,
"max_num_seqs": 256,
"quantization_method": null,
"enforce_eager": false,
"max_seq_len_to_capture": 8192,
"max_loras": 1,
"max_lora_rank": 32,
"lora_extra_vocab_size": 256,
"lora_dtype": "auto",
"max_cpu_loras": -1,
"lora_modules": "",
"vllm_disable_log_stats": true,
"model_name": "qwen",
"model_path": "/opt/ai-models/Qwen-14B-Chat-Int4",
"dtype": "auto",
"load_in_8bit": false,
"load_in_4bit": true,
"context_length": -1,
"chat_template": null,
"rope_scaling": null,
"flash_attn": false,
"use_streamer_v2": false,
"interrupt_requests": true,
"host": "0.0.0.0",
"port": 8010,
"api_prefix": "/v1",
"engine": "vllm",
"tasks": [
"llm"
],
"device_map": "auto",
"gpus": null,
"num_gpus": 1,
"activate_inference": true,
"model_names": [
"qwen"
],
"api_keys": null
}
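Only the engine-related subset of these settings appears to reach vLLM (compare the engine config printed at 10:44:21 below). As a rough illustration, not the project's actual loading code, those settings correspond to something like this plain vLLM 0.4.3 call; the prompt and sampling values are placeholders:

```python
from vllm import LLM, SamplingParams

# Rough vLLM-only equivalent of the engine-related settings above; vLLM auto-detects
# the GPTQ quantization of Qwen-14B-Chat-Int4 and picks the gptq_marlin kernel.
llm = LLM(
    model="/opt/ai-models/Qwen-14B-Chat-Int4",
    trust_remote_code=True,        # trust_remote_code: true
    dtype="auto",                  # dtype: "auto" (cast to float16 for this checkpoint)
    tensor_parallel_size=1,        # tensor_parallel_size: 1
    gpu_memory_utilization=0.9,    # gpu_memory_utilization: 0.9
    enforce_eager=False,           # enforce_eager: false, so CUDA graphs are captured
)
outputs = llm.generate(["你好"], SamplingParams(temperature=0.3, max_tokens=64))
print(outputs[0].outputs[0].text)
```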
INFO 06-05 10:44:19 config.py:1130] Casting torch.float32 to torch.float16.
INFO 06-05 10:44:19 config.py:1151] Downcasting torch.float32 to torch.float16.
INFO 06-05 10:44:21 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-05 10:44:21 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/opt/ai-models/Qwen-14B-Chat-Int4', speculative_config=None, tokenizer='/opt/ai-models/Qwen-14B-Chat-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/opt/ai-models/Qwen-14B-Chat-Int4)
WARNING 06-05 10:44:22 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 06-05 10:44:27 model_runner.py:146] Loading model weights took 9.0253 GB
INFO 06-05 10:44:30 gpu_executor.py:83] # GPU blocks: 2030, # CPU blocks: 327
INFO 06-05 10:44:36 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-05 10:44:36 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-05 10:44:51 model_runner.py:924] Graph capturing finished in 15 secs.
2024-06-05 10:44:51.670 | INFO | api.models:create_vllm_engine:127 - Using vllm engine
WARNING 06-05 10:44:52 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8010 (Press CTRL+C to quit)
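Regarding the repeated slow-tokenizer warning above: Qwen-14B-Chat ships a custom, tiktoken-based tokenizer that transformers loads as a slow tokenizer, so the warning is expected for this checkpoint. A quick standalone check (an assumption about the environment, not part of the server code) of what gets loaded:

```python
from transformers import AutoTokenizer

# Inspect the tokenizer loaded for this checkpoint; Qwen's custom tokenizer is
# tiktoken-based and has no "fast" (Rust) counterpart, hence the warning above.
tok = AutoTokenizer.from_pretrained("/opt/ai-models/Qwen-14B-Chat-Int4", trust_remote_code=True)
print(type(tok).__name__, tok.is_fast)
```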
# Inference request log:
2024-06-05 10:44:54.752 | DEBUG | api.vllm_routes.chat:create_chat_completion:65 - ==== request ====
{'model': 'Qwen', 'frequency_penalty': 0.0, 'function_call': None, 'functions': None, 'logit_bias': None, 'logprobs': False, 'max_tokens': 512, 'n': 1, 'presence_penalty': 0.0, 'response_format': None, 'seed': None, 'stop': ['<|endoftext|>', '<|im_end|>'], 'temperature': 0.3, 'tool_choice': None, 'tools': None, 'top_logprobs': None, 'top_p': 1.0, 'user': None, 'stream': False, 'repetition_penalty': 1.03, 'typical_p': None, 'watermark': False, 'best_of': 1, 'ignore_eos': False, 'use_beam_search': False, 'stop_token_ids': [151643, 151644, 151645], 'skip_special_tokens': True, 'spaces_between_special_tokens': True, 'min_p': 0.0, 'include_stop_str_in_output': False, 'length_penalty': 1.0, 'guided_json': None, 'guided_regex': None, 'guided_choice': None, 'guided_grammar': None, 'guided_decoding_backend': 'lm-format-enforcer', 'prompt_or_messages': [{'content': '根据我的描述帮我写一段文字,字数要求50字以内;例如:节假日帮我写一段节日祝福,人物帮我介绍下人物生平简介,地点帮我介绍下来历和风俗。\n现在输入:${content}\n请按上面的要求回答。', 'role': 'system'}, {'content': '周星驰', 'role': 'user'}], 'echo': False}
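The non-standard fields in this request dump (repetition_penalty, stop_token_ids, guided_decoding_backend, and so on) are extensions beyond the OpenAI schema. With the OpenAI v1.x Python client they would typically be forwarded through `extra_body`; this is an illustrative sketch of such a call, not the exact client that produced the request above:

```python
from openai import OpenAI

# Hypothetical reconstruction of the request above; fields the OpenAI schema does not
# know about are forwarded verbatim via extra_body.
client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "根据我的描述帮我写一段文字,字数要求50字以内;..."},  # system prompt truncated here
        {"role": "user", "content": "周星驰"},
    ],
    temperature=0.3,
    max_tokens=512,
    stop=["<|endoftext|>", "<|im_end|>"],
    extra_body={
        "repetition_penalty": 1.03,
        "stop_token_ids": [151643, 151644, 151645],
        "guided_decoding_backend": "lm-format-enforcer",
    },
)
print(resp.choices[0].message.content)
```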
2024-06-05 10:44:55.222 | INFO | api.vllm_routes.chat:create_chat_completion:107 - ==== guided_decoding_backend ====
lm-format-enforcer
INFO: 10.49.16.60:58712 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/opt/app/api/vllm_routes/chat.py", line 110, in create_chat_completion
await get_guided_decoding_logits_processor(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/__init__.py", line 20, in get_guided_decoding_logits_processor
return await get_lm_format_enforcer_guided_decoding_logits_processor(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 30, in get_lm_format_enforcer_guided_decoding_logits_processor
tokenizer_data = _cached_build_vllm_token_enforcer_tokenizer_data(
File "/home/ops/.local/lib/python3.10/site-packages/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 70, in _cached_build_vllm_token_enforcer_tokenizer_data
return build_vllm_token_enforcer_tokenizer_data(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/vllm.py", line 40, in build_vllm_token_enforcer_tokenizer_data
return build_token_enforcer_tokenizer_data(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py", line 70, in build_token_enforcer_tokenizer_data
regular_tokens = _build_regular_tokens_list(tokenizer)
File "/home/ops/.local/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py", line 58, in _build_regular_tokens_list
for token_idx in range(len(tokenizer)):
TypeError: object of type 'Encoding' has no len()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ops/.local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/ops/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ops/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ops/.local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/ops/.local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/opt/app/api/vllm_routes/chat.py", line 118, in create_chat_completion
await get_guided_decoding_logits_processor(
TypeError: get_guided_decoding_logits_processor() missing 1 required positional argument: 'tokenizer'
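The root failure is the first TypeError: lm-format-enforcer's transformers integration iterates `range(len(tokenizer))`, and the object it ends up with here is an 'Encoding', most likely the tiktoken Encoding that Qwen's custom tokenizer is built on, which has no `__len__`. A minimal illustration using a stand-in tiktoken encoding (cl100k_base, not Qwen's actual vocabulary):

```python
import tiktoken

# Stand-in for the tiktoken Encoding underlying Qwen's tokenizer (not the real Qwen vocab).
enc = tiktoken.get_encoding("cl100k_base")

try:
    len(enc)  # what lm-format-enforcer effectively does via range(len(tokenizer))
except TypeError as e:
    print(e)  # object of type 'Encoding' has no len()

print(enc.n_vocab)  # the Encoding-native way to get the vocabulary size
```

The second TypeError then comes from the fallback at api/vllm_routes/chat.py line 118, which apparently calls the helper with one argument too few; in vLLM 0.4.3 `get_guided_decoding_logits_processor` takes the backend name, the request, and the tokenizer, as the frame at __init__.py line 20 above shows. A sketch of the expected call shape, where `request`, `tokenizer`, and `sampling_params` are assumptions about the handler's local variables, not the project's actual code:

```python
from vllm.model_executor.guided_decoding import get_guided_decoding_logits_processor

async def apply_guided_decoding(request, tokenizer, sampling_params):
    # vLLM 0.4.3 signature: (guided_decoding_backend, request, tokenizer) -> Optional[LogitsProcessor]
    logits_processor = await get_guided_decoding_logits_processor(
        request.guided_decoding_backend,  # "lm-format-enforcer" in the request dump above
        request,
        tokenizer,
    )
    if logits_processor is not None:
        existing = sampling_params.logits_processors or []
        sampling_params.logits_processors = existing + [logits_processor]
    return sampling_params
```

Note also that the engine was initialized with DecodingConfig(guided_decoding_backend='outlines') while this request asks for 'lm-format-enforcer'; given the first traceback, the lm-format-enforcer path does not seem to cope with this checkpoint's tiktoken-based tokenizer, so the 'outlines' backend may behave differently here.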
问题类型 | Type of problem
模型推理和部署 | Model inference and deployment
操作系统 | Operating system
Linux
详细描述问题 | Detailed description of the problem
Using the latest code from the main branch, deployed with Docker, with dependencies installed. Running inference on Qwen-14B: startup shows no errors, but the error occurs when handling an inference request (see the logs above).
Docker image environment: CUDA 12.2, Python 3.10, vllm 0.4.3
Dependencies
运行日志或截图 | Runtime logs or screenshots
See the startup and inference request logs at the top of this issue.