xusenlinzy / api-for-open-llm

OpenAI-style API for open large language models — use LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend interface for open-source large models.
Apache License 2.0

Error when adding an embedding model with the vLLM launch method #60

Closed youzhonghui closed 1 year ago

youzhonghui commented 1 year ago

The following items must be checked before submission

Type of problem

Model inference and deployment

Operating system

Linux

Detailed description of the problem

If embedding_name is not set, the service starts normally. After adding it, an error is raised. I tried modifying the source to replace 'device' directly with 'cuda', but that led to other errors.

The docker-compose.yml used to start the service:

version: '3.8'
services:
  qwen:
    image: llm-api:vllm
    container_name: qwen
    command: "python api/vllm_server.py --port 80 --allow-credentials --model_name qwen --model /model/qwen --trust-remote-code --embedding_name /model/m3e-base --tokenizer-mode auto --dtype half"
    ports:
      - "9999:80"
    volumes:
      - /data/AIGC-space/serving/api-for-open-llm:/workspace
      - /data/AIGC-space/models/Qwen-7B-Chat:/model/qwen
      - /data/AIGC-space/models/m3e-base:/model/m3e-base
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Dependencies

No response

Runtime logs or screenshots

qwen | INFO 08-12 13:18:41 llm_engine.py:70] Initializing an LLM engine with config: model='/model/qwen', tokenizer='/model/qwen', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
qwen | WARNING 08-12 13:18:42 tokenizer.py:63] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
qwen | INFO 08-12 13:18:55 llm_engine.py:196] # GPU blocks: 562, # CPU blocks: 512
qwen | WARNING 08-12 13:18:58 tokenizer.py:63] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
qwen | Traceback (most recent call last):
qwen |   File "api/vllm_server.py", line 666, in <module>
qwen |     embed_client = SentenceTransformer(args.embedding_name, device=args.device)
qwen | AttributeError: 'Namespace' object has no attribute 'device'
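The traceback above shows the root cause: the argument parser in vllm_server.py never defines a `--device` flag, so `args.device` raises `AttributeError`. A minimal sketch of two defensive fixes (the flag name and default are assumptions, not taken from the repo): either register the missing argument with a default, or guard the attribute lookup with `getattr`.

```python
import argparse

# Reconstruct a parser like the one implied by the issue's command line
# (only the relevant flags; names of other flags are assumptions).
parser = argparse.ArgumentParser()
parser.add_argument("--embedding_name", type=str, default=None)
# Fix 1: define the missing flag so args.device always exists.
parser.add_argument("--device", type=str, default="cuda")

args = parser.parse_args(["--embedding_name", "/model/m3e-base"])

# Fix 2: even without the flag, fall back safely instead of crashing.
device = getattr(args, "device", "cuda")

print(device)  # "cuda" via either fix
```

With either change, `SentenceTransformer(args.embedding_name, device=...)` would receive a valid device string instead of hitting the `AttributeError`; the maintainer's actual fix landed in the linked commit below.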

xusenlinzy commented 1 year ago

Hello, the latest code has already fixed this issue. Please pull the latest code and start the model again.

https://github.com/xusenlinzy/api-for-open-llm/blob/1e05d931153787e42754634452ac5ceed8186213/docker/Dockerfile.vllm#L12