xusenlinzy / api-for-open-llm

OpenAI-style API for open large language models — use LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
Apache License 2.0
2.16k stars · 252 forks

Docker deployment with vllm returns 404 Not Found #271

Closed · skyliwq closed this 3 weeks ago

skyliwq commented 1 month ago

I deployed the Qwen model with the vllm engine via Docker. The server starts successfully, but every call fails with "POST /v1/chat/completions HTTP/1.1" 404 Not Found. I can't figure out what's wrong — any help would be appreciated.

http://127.0.0.1:7891/docs shows "No operations defined in spec!"

2024-05-09 07:00:03 INFO: Started server process [1]
2024-05-09 07:00:03 INFO: Waiting for application startup.
2024-05-09 07:00:03 INFO: Application startup complete.
2024-05-09 07:00:03 INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2024-05-09 07:00:04 INFO: 172.16.1.1:57384 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found

Tendo33 commented 1 month ago

Please share the script you're using to make the request.

skyliwq commented 1 month ago

> Please share the script you're using to make the request.

I made the request with Postman and got { "detail": "Not Found" }. The official tests/chat.py script fails the same way. I deployed with the official docker-compose file, and the configuration is all correct.

Tendo33 commented 1 month ago

> Please share the script you're using to make the request.
>
> I made the request with Postman and got { "detail": "Not Found" }. The official tests/chat.py script fails the same way. I deployed with the official docker-compose file, and the configuration is all correct.

Does your request URL include the /v1 prefix? If that doesn't help, open /docs on the deployment address to inspect the FastAPI endpoints — you can exercise them directly in the browser, the same as with Postman.
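For example, a minimal sketch of a correctly prefixed request (port 7891 is taken from this thread; the api_key and model name are placeholders — the model must match whatever the server is configured to serve):

```python
# Minimal request sketch; base_url must include the /v1 prefix.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:7891/v1",  # /v1 is part of the base URL
    api_key="xxx",                        # placeholder if no API key is configured
)

resp = client.chat.completions.create(
    model="qwen",  # placeholder: must match the model name the server serves
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

If this still returns { "detail": "Not Found" } with the prefix present, the routes were never registered at all, which is what the empty /docs page suggests.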

skyliwq commented 1 month ago

The request parameters and the deployment parameters are both correct — I've verified them many times. http://127.0.0.1:7891/docs shows "No operations defined in spec!"

skyliwq commented 1 month ago

With ENGINE=vllm configured I get the error; with ENGINE=default everything works.

xusenlinzy commented 1 month ago

Then vllm probably failed to install correctly.
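A quick way to check that from inside the container — a minimal sketch, assuming a broken vllm import is what keeps the /v1 routes from registering:

```python
# Run inside the container: if this raises, the server has no vllm engine
# to register routes for, which would match the empty /docs page.
try:
    import vllm
    print("vllm", vllm.__version__, "imported OK")
except Exception as exc:
    print("vllm import failed:", exc)
```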

skyliwq commented 1 month ago

> Then vllm probably failed to install correctly.

I deployed directly with Docker — how do I reinstall it? Please advise.

root@a73600e73869:/workspace# pip show vllm
Name: vllm
Version: 0.4.0
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: cmake, fastapi, ninja, numpy, outlines, prometheus-client, psutil, py-cpuinfo, pydantic, pynvml, ray, requests, sentencepiece, tiktoken, torch, transformers, triton, uvicorn, xformers
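Since pip show reports vllm 0.4.0 as installed, the failure is more likely at engine construction than at install time. A hypothetical reproduction to run inside the container — the model path is a placeholder for your actual checkpoint:

```python
# Hypothetical repro: build a vllm engine directly, outside the API server,
# so any initialization error is printed instead of being swallowed.
from vllm import LLM

llm = LLM(
    model="/models/Qwen/Qwen1.5-14B-Chat",  # placeholder: your model path
    trust_remote_code=True,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```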

Tendo33 commented 1 month ago

Which Dockerfile did you use when you built the image? Switch to the vllm one.

skyliwq commented 1 month ago

vllm

That's the one I switched to.

JadynWong commented 1 month ago

Same problem here: latest code, deployed with docker-compose and the vllm engine. The GPU shows memory usage only for the embedding model, the logs report no errors, and requests return 404.

LOG

=============
== PyTorch ==
=============
NVIDIA Release 23.10 (build 71422337)
PyTorch Version 2.1.0a0+32f93b1
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
2024-05-21 10:20:09.754 | DEBUG    | api.config:<module>:338 - SETTINGS: {
    "embedding_name": "/models/BAAI/bge-m3",
    "rerank_name": null,
    "embedding_size": -1,
    "embedding_device": "cuda:0",
    "rerank_device": "cuda:0",
    "trust_remote_code": true,
    "tokenize_mode": "slow",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.9,
    "max_num_batched_tokens": -1,
    "max_num_seqs": 256,
    "quantization_method": null,
    "enforce_eager": false,
    "max_context_len_to_capture": 8192,
    "max_loras": 1,
    "max_lora_rank": 32,
    "lora_extra_vocab_size": 256,
    "lora_dtype": "auto",
    "max_cpu_loras": -1,
    "lora_modules": "",
    "vllm_disable_log_stats": true,
    "model_name": "qwen2",
    "model_path": "/models/Qwen/Qwen1.5-14B-Chat",
    "dtype": "bfloat16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "context_length": -1,
    "chat_template": "qwen2",
    "rope_scaling": null,
    "flash_attn": false,
    "use_streamer_v2": true,
    "interrupt_requests": true,
    "host": "0.0.0.0",
    "port": 8000,
    "api_prefix": "/v1",
    "engine": "vllm",
    "tasks": [
        "llm",
        "rag"
    ],
    "device_map": "auto",
    "gpus": null,
    "num_gpus": 1,
    "activate_inference": true,
    "model_names": [
        "qwen2",
        "bge-m3"
    ],
    "api_keys": [
        "xxxxxxxxxxxxxxx"
    ]
}
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     172.18.0.1:51554 - "GET /v1/models HTTP/1.1" 404 Not Found
INFO:     172.18.0.1:45768 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found

JadynWong commented 1 month ago

[screenshot: Clip_2024-05-21_19-28-39]

https://github.com/xusenlinzy/api-for-open-llm/blob/e46e48056a02ffbd90e0dfe4bc2f803df1e7e4e1/api/models.py#L100 — I added a line here to print the exception.
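Roughly the mechanism this points at — a self-contained sketch with hypothetical names, not the project's actual code: if engine creation fails inside a try/except that swallows the error, Uvicorn starts cleanly but the app registers zero routes, which would explain both "No operations defined in spec" and the 404s.

```python
# Illustration only; names are hypothetical, not api/models.py itself.
import traceback
from fastapi import APIRouter, FastAPI

app = FastAPI()
router = APIRouter(prefix="/v1")

@router.post("/chat/completions")
def chat_completions():
    return {"ok": True}

def create_engine():
    # stands in for vllm engine construction failing at startup
    raise RuntimeError("vllm engine failed to initialize")

try:
    engine = create_engine()
    app.include_router(router)  # never reached, so the app has zero routes
except Exception:
    traceback.print_exc()  # the added print: surfaces the real root cause
```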

I pulled the code just this afternoon and rebuilt the image; there were no errors at any point.

docker build -f docker/Dockerfile.vllm -t llm-api:vllm .

Possibly related issue: https://github.com/vllm-project/vllm/issues/3528

liho00 commented 1 month ago

I'm running into the same problem — so it looks like a vllm issue?