vLLM engine fails to start when deploying locally with vLLM

The following items must be checked before submission

[X] Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
[X] I have read the project documentation and the FAQ section and searched the existing issues, and found no similar problem or solution.

Type of problem

Startup command

Operating system

Linux

Detailed description of the problem

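# Commands run
# In a fresh conda environment with Python 3.8.10
pip install torch==2.1.0
pip install vllm==0.4.0                # afterwards, torch 2.1.2 was found to be installed
pip install -r requirements.txt    # afterwards, torch 2.3 was found to be installed

# vllm 0.4.0 requires torch 2.1.2, so torch 2.1.2 was reinstalled here
pip install torch==2.1.2

# Check torch: the following Python commands all run normally, report that the GPU is available, and correctly identify the V100 card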
import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

# Startup fails immediately with an error while initializing the vllm engine
python server.py

# OS: Ubuntu 20.04
# CUDA version: 12.1
# cuDNN version: 8.9.7

#########################################
# .env file:

PORT=8000

#llm related
MODEL_NAME=qwen2
PROMPT_NAME=qwen2
MODEL_PATH=/home/ruibn/safe/Qwen1.5-0.5B-Chat

#rag related
EMBEDDING_NAME=/home/ruibn/safe/bce-embedding-base_v1
RERANK_NAME=/home/ruibn/safe/bce-reranker-base_v1

#vllm related
ENGINE=vllm
TOKENIZE_MODE=auto
GPU_MEMORY_UTILIZATION=0.5
TENSOR_PARALLEL_SIZE=1
DTYPE=half

TASKS=llm,rag
#########################################
# All models are already downloaded locally
# P.S. Without vllm, using the default engine, everything runs normally
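
Since the install steps above changed the torch version twice, a quick check of the installed versions against the expected pins may help confirm the environment before starting the server. The following is a minimal sketch, not part of the project: the `EXPECTED` mapping and both helper functions are hypothetical, and the only pin taken from this report is that vllm 0.4.0 requires torch 2.1.2.

```python
from importlib import metadata  # in the standard library since Python 3.8

# Pins taken from this report: vllm 0.4.0 requires torch 2.1.2.
# EXPECTED and the helpers below are illustrative, not part of the project.
EXPECTED = {"torch": "2.1.2", "vllm": "0.4.0"}


def find_mismatches(expected, installed):
    """Return {package: (expected, installed)} for every pin that does not match."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


def installed_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    versions = {}
    for name in names:
        try:
            # Drop a local build tag such as "+cu121" so "2.1.2+cu121" compares as "2.1.2".
            versions[name] = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, (want, have) in problems.items():
        print(f"{name}: expected {want}, found {have}")
    if not problems:
        print("all pinned versions match")
```

Running this again after `pip install -r requirements.txt` would show whether that step replaced torch once more.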

Dependencies

# Please paste the dependencies here
Package                       Version
----------------------------- ------------  
accelerate                    0.30.1
aiohttp                       3.9.5
aiosignal                     1.3.1
annotated-types               0.6.0
antlr4-python3-runtime        4.9.3
anyio                         4.3.0
async-timeout                 4.0.3
attrs                         23.2.0
backoff                       2.2.1
beautifulsoup4                4.12.3
bitsandbytes                  0.43.1
Brotli                        1.0.9
certifi                       2024.2.2
cffi                          1.16.0
chardet                       5.2.0
charset-normalizer            2.0.4
click                         8.1.7
cloudpickle                   3.0.0
cmake                         3.29.3
coloredlogs                   15.0.1
contourpy                     1.1.1
cpm-kernels                   1.0.11
cryptography                  42.0.7
cycler                        0.12.1
dataclasses-json              0.6.6
deepdiff                      7.0.1
Deprecated                    1.2.14
diskcache                     5.6.3
distro                        1.9.0
dnspython                     2.6.1
effdet                        0.4.1
einops                        0.8.0
email_validator               2.1.1
emoji                         2.11.1
et-xmlfile                    1.1.0
exceptiongroup                1.2.1
fastapi                       0.111.0
fastapi-cli                   0.0.3
filelock                      3.13.1
filetype                      1.2.0
flatbuffers                   24.3.25
fonttools                     4.51.0
frozenlist                    1.4.1
fsspec                        2024.5.0
gmpy2                         2.1.2
greenlet                      3.0.3
h11                           0.14.0
httpcore                      1.0.5
httptools                     0.6.1
httpx                         0.27.0
huggingface-hub               0.23.0
humanfriendly                 10.0
idna                          3.7
importlib_metadata            7.1.0
importlib_resources           6.4.0
interegular                   0.3.3
iopath                        0.1.10
Jinja2                        3.1.3
joblib                        1.4.2
jsonpatch                     1.33
jsonpath-python               1.0.6
jsonpointer                   2.4
jsonschema                    4.22.0
jsonschema-specifications     2023.12.1
kiwisolver                    1.4.5
langchain                     0.1.20
langchain-community           0.0.38
langchain-core                0.1.52
langchain-text-splitters      0.0.2
langdetect                    1.0.9
langsmith                     0.1.59
lark                          1.1.9
layoutparser                  0.3.4
llvmlite                      0.41.1
loguru                        0.7.2
lxml                          5.2.2
Markdown                      3.6
markdown-it-py                3.0.0
MarkupSafe                    2.1.3
marshmallow                   3.21.2
matplotlib                    3.7.5
mdurl                         0.1.2
mkl-fft                       1.3.8
mkl-random                    1.2.4
mkl-service                   2.4.0
mpmath                        1.3.0
msg-parser                    1.2.0
msgpack                       1.0.8
multidict                     6.0.5
mypy-extensions               1.0.0
nest-asyncio                  1.6.0
networkx                      3.1
ninja                         1.11.1.1
nltk                          3.8.1
numba                         0.58.1
numpy                         1.24.3
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-nccl-cu12              2.18.1
nvidia-nvjitlink-cu12         12.4.127
nvidia-nvtx-cu12              12.1.105
olefile                       0.47
omegaconf                     2.3.0
onnx                          1.16.0
onnxruntime                   1.15.1
openai                        1.30.1
opencv-python                 4.9.0.80
openparse                     0.5.6
openpyxl                      3.1.2
ordered-set                   4.1.0
orjson                        3.10.3
outlines                      0.0.34
packaging                     23.2
pandas                        2.0.3
pdf2image                     1.17.0
pdfminer.six                  20231228
pdfplumber                    0.11.0
peft                          0.11.0
pikepdf                       8.15.1
pillow                        10.3.0
pip                           24.0
pkgutil_resolve_name          1.3.10
portalocker                   2.8.2
prometheus_client             0.20.0
protobuf                      5.26.1
psutil                        5.9.8
py-cpuinfo                    9.0.0
pyclipper                     1.3.0.post5   
pycocotools                   2.0.7
pycparser                     2.22
pydantic                      2.7.1
pydantic_core                 2.18.2
Pygments                      2.18.0
PyMuPDF                       1.24.4
PyMuPDFb                      1.24.3
pynvml                        11.5.0
pypandoc                      1.13
pyparsing                     3.1.2
pypdf                         4.2.0
pypdfium2                     4.30.0
PySocks                       1.7.1
pytesseract                   0.3.10
python-dateutil               2.9.0.post0   
python-docx                   1.1.2
python-dotenv                 1.0.0
python-iso639                 2024.4.27
python-magic                  0.4.27
python-multipart              0.0.9
python-pptx                   0.6.23
pytz                          2024.1
PyYAML                        6.0.1
rapidfuzz                     3.9.0
rapidocr-onnxruntime          1.3.19
ray                           2.10.0
referencing                   0.35.1
regex                         2024.5.15
requests                      2.31.0
rich                          13.7.1
rpds-py                       0.18.1
safetensors                   0.4.3
scikit-learn                  1.3.2
scipy                         1.10.1
sentence-transformers         2.7.0
sentencepiece                 0.2.0
setuptools                    69.5.1
shapely                       2.0.4
shellingham                   1.5.4
six                           1.16.0
sniffio                       1.3.1
soupsieve                     2.5
SQLAlchemy                    2.0.30
sse-starlette                 2.1.0
starlette                     0.37.2
starlette-context             0.3.6
sympy                         1.12
tabulate                      0.9.0
tenacity                      8.3.0
threadpoolctl                 3.5.0
tiktoken                      0.6.0
timm                          1.0.3
tokenizers                    0.19.1
torch                         2.1.2
torchaudio                    2.1.2
torchvision                   0.16.2
tqdm                          4.66.4
transformers                  4.40.2
transformers-stream-generator 0.0.5
triton                        2.1.0
typer                         0.12.3
typing_extensions             4.11.0
typing-inspect                0.9.0
tzdata                        2024.1
ujson                         5.10.0
unstructured                  0.11.8
unstructured-client           0.22.0
unstructured-inference        0.7.18
unstructured.pytesseract      0.3.12
urllib3                       2.2.1
uvicorn                       0.29.0
uvloop                        0.19.0
vllm                          0.4.0
watchfiles                    0.21.0
websockets                    12.0
wheel                         0.43.0
wrapt                         1.16.0
xformers                      0.0.23.post1  
xlrd                          2.0.1
XlsxWriter                    3.2.0
yarl                          1.9.4
zipp                          3.18.2
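
The table lists `nvidia-nccl-cu12 2.18.1`, while the run log reports that `libnccl.so.2` cannot be opened and points to the `VLLM_NCCL_SO_PATH` environment variable. A rough sketch for locating the library is shown below. It assumes the pip wheel places the file under `nvidia/nccl/lib/` inside site-packages, which may not hold for every installation, and the helper name is hypothetical.

```python
import os
import sysconfig


def find_nccl_libraries(roots):
    """Return sorted paths of libnccl.so.2 found at <root>/nvidia/nccl/lib/ (assumed layout)."""
    hits = []
    for root in roots:
        candidate = os.path.join(root, "nvidia", "nccl", "lib", "libnccl.so.2")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return sorted(hits)


if __name__ == "__main__":
    paths = sysconfig.get_paths()
    # purelib and platlib are usually the same directory, so de-duplicate them.
    roots = sorted({paths["purelib"], paths["platlib"]})
    found = find_nccl_libraries(roots)
    if found:
        print(f"export VLLM_NCCL_SO_PATH={found[0]}")
    else:
        print("libnccl.so.2 not found under:", roots)
```

With `TENSOR_PARALLEL_SIZE=1` the missing NCCL library may be unrelated to the startup failure, so this only addresses the NCCL message in the log.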

Runtime logs or screenshots

# Please paste the run log here
(srrat) ruibn@llm1:~/safe/api-for-open-llm$ python server.py
2024-05-17 05:26:43.612 | DEBUG    | api.config:<module>:338 - SETTINGS: {
    "embedding_name": "/home/ruibn/safe/bce-embedding-base_v1",
    "rerank_name": "/home/ruibn/safe/bce-reranker-base_v1",
    "embedding_size": -1,
    "embedding_device": "cuda:0",
    "rerank_device": "cuda:0",
    "trust_remote_code": false,
    "tokenize_mode": "auto",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.5,
    "max_num_batched_tokens": -1,
    "max_num_seqs": 256,
    "quantization_method": null,
    "enforce_eager": false,
    "max_context_len_to_capture": 8192,
    "max_loras": 1,
    "max_lora_rank": 32,
    "lora_extra_vocab_size": 256,
    "lora_dtype": "auto",
    "max_cpu_loras": -1,
    "lora_modules": "",
    "vllm_disable_log_stats": true,
    "model_name": "qwen2",
    "model_path": "/home/ruibn/safe/Qwen1.5-0.5B-Chat",
    "dtype": "half",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "context_length": -1,
    "chat_template": "qwen2",
    "rope_scaling": null,
    "flash_attn": false,
    "use_streamer_v2": true,
    "interrupt_requests": true,
    "host": "0.0.0.0",
    "port": 8000,
    "api_prefix": "/v1",
    "engine": "vllm",
    "tasks": [
        "llm",
        "rag"
    ],
    "device_map": "cuda:1",
    "gpus": null,
    "num_gpus": 1,
    "activate_inference": true,
    "model_names": [
        "qwen2",
        "bce-embedding-base_v1",
        "bce-reranker-base_v1"
    ],
    "api_keys": null
}
/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
2024-05-17 05:26:57.800 | INFO     | api.rag.models.rerank:__init__:45 - Loading from `/home/ruibn/safe/bce-reranker-base_v1`.
ERROR 05-17 05:26:59 pynccl.py:53] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
INFO 05-17 05:26:59 pynccl_utils.py:17] Failed to import NCCL library: libnccl.so.2: cannot open shared object file: No such file or directory
INFO 05-17 05:26:59 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
WARNING 05-17 05:26:59 config.py:748] Casting torch.bfloat16 to torch.float16.
INFO 05-17 05:26:59 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/ruibn/safe/Qwen1.5-0.5B-Chat', tokenizer='/home/ruibn/safe/Qwen1.5-0.5B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-17 05:26:59 selector.py:34] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 05-17 05:26:59 selector.py:21] Using XFormers backend.
INFO 05-17 05:27:01 model_runner.py:104] Loading model weights took 0.8865 GB
Traceback (most recent call last):
  File "server.py", line 2, in <module>
    from api.models import (
  File "/home/ruibn/safe/api-for-open-llm/api/models.py", line 199, in <module>
    LLM_ENGINE = create_vllm_engine()
  File "/home/ruibn/safe/api-for-open-llm/api/models.py", line 125, in create_vllm_engine
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 40, in __init__
    self._init_cache()
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 80, in _init_cache
    self.driver_worker.profile_num_available_blocks(
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/worker/worker.py", line 131, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 742, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 663, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs) 
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 317, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs) 
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
    hidden_states, residual = layer(
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs) 
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 207, in forward
    hidden_states = self.self_attn(
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs) 
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/models/qwen2.py", line 150, in forward
    qkv, _ = self.qkv_proj(hidden_states)   
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs) 
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 215, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/home/ruibn/.conda/envs/srrat/lib/python3.8/site-packages/vllm/model_executor/layers/linear.py", line 79, in apply_weights
    return F.linear(x, weight, bias)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
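
The `no kernel image is available for execution on the device` error generally means that some compiled CUDA code was not built for the GPU's architecture. One possible diagnostic, sketched below, compares the device's compute capability (7.0 for a V100) with the architectures the installed PyTorch build reports. The helper names are hypothetical and the exact-match test is a simplification: it ignores PTX entries (`compute_XX`) and same-major binary compatibility. It also covers only PyTorch's own build; vllm's compiled extensions, xformers and the CUDA libraries carry their own kernels, so a positive result does not rule out the error.

```python
def capability_to_arch(capability):
    """Convert a (major, minor) compute capability such as (7, 0) to 'sm_70'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_native_kernels(capability, arch_list):
    """True if arch_list contains binary code built exactly for this capability."""
    return capability_to_arch(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(f"device: {torch.cuda.get_device_name(0)}, capability: {cap}")
        print(f"torch built for: {archs}")
        print(f"native kernels for this GPU: {has_native_kernels(cap, archs)}")
    else:
        print("torch with CUDA is not available; nothing to check")
```

Setting `CUDA_LAUNCH_BLOCKING=1` before `python server.py`, as the error message suggests, should make the traceback point at the call that actually failed.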

xusenlinzy / api-for-open-llm

vLLM engine fails to start when deploying locally with vLLM #274
