xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

v0.14.4 Docker fails to start #2202

Closed. zhaolj closed this issue 2 months ago.

zhaolj commented 2 months ago

System Info

Ubuntu 22.04; Docker version 27.2.0, build 3ab4256; CUDA Version: 12.5

Running Xinference with Docker?

Yes.

Version info

v0.14.4

The command used to start Xinference

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v0.14.4 xinference-local -H 0.0.0.0 --log-level debug

Reproduction

  1. Run the command docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v0.14.4 xinference-local -H 0.0.0.0 --log-level debug.
  2. The following error message is printed:
    WARNING 08-30 19:00:12 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
    usage: api_server.py [-h] [--host HOST] [--port PORT]
                     [--uvicorn-log-level {debug,info,warning,error,critical,trace}]
                     [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS]
                     [--allowed-methods ALLOWED_METHODS]
                     [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY]
                     [--lora-modules LORA_MODULES [LORA_MODULES ...]]
                     [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]]
                     [--chat-template CHAT_TEMPLATE]
                     [--response-role RESPONSE_ROLE]
                     [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
                     [--ssl-ca-certs SSL_CA_CERTS]
                     [--ssl-cert-reqs SSL_CERT_REQS] [--root-path ROOT_PATH]
                     [--middleware MIDDLEWARE] [--return-tokens-as-token-ids]
                     [--disable-frontend-multiprocessing] [--model MODEL]
                     [--tokenizer TOKENIZER] [--skip-tokenizer-init]
                     [--revision REVISION] [--code-revision CODE_REVISION]
                     [--tokenizer-revision TOKENIZER_REVISION]
                     [--tokenizer-mode {auto,slow}] [--trust-remote-code]
                     [--download-dir DOWNLOAD_DIR]
                     [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes}]
                     [--dtype {auto,half,float16,bfloat16,float,float32}]
                     [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
                     [--quantization-param-path QUANTIZATION_PARAM_PATH]
                     [--max-model-len MAX_MODEL_LEN]
                     [--guided-decoding-backend {outlines,lm-format-enforcer}]
                     [--distributed-executor-backend {ray,mp}]
                     [--worker-use-ray]
                     [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                     [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                     [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
                     [--ray-workers-use-nsight]
                     [--block-size {8,16,32,128,256,512,1024,2048}]
                     [--enable-prefix-caching] [--disable-sliding-window]
                     [--use-v2-block-manager]
                     [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
                     [--swap-space SWAP_SPACE]
                     [--cpu-offload-gb CPU_OFFLOAD_GB]
                     [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                     [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                     [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                     [--max-num-seqs MAX_NUM_SEQS]
                     [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
                     [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}]
                     [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
                     [--enforce-eager]
                     [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]
                     [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
                     [--disable-custom-all-reduce]
                     [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                     [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                     [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                     [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                     [--enable-lora] [--max-loras MAX_LORAS]
                     [--max-lora-rank MAX_LORA_RANK]
                     [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                     [--lora-dtype {auto,float16,bfloat16,float32}]
                     [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
                     [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                     [--enable-prompt-adapter]
                     [--max-prompt-adapters MAX_PROMPT_ADAPTERS]
                     [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
                     [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu}]
                     [--num-scheduler-steps NUM_SCHEDULER_STEPS]
                     [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                     [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                     [--speculative-model SPECULATIVE_MODEL]
                     [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}]
                     [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                     [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                     [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                     [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                     [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                     [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                     [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                     [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
                     [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                     [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
                     [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                     [--ignore-patterns IGNORE_PATTERNS]
                     [--preemption-mode PREEMPTION_MODE]
                     [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
                     [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
                     [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                     [--collect-detailed-traces COLLECT_DETAILED_TRACES]
                     [--engine-use-ray] [--disable-log-requests]
                     [--max-log-len MAX_LOG_LEN]
    api_server.py: error: unrecognized arguments: xinference-local -H 0.0.0.0 --log-level debug

Expected behavior

It should just run normally.
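
The last line of the log hints at the cause: the arguments xinference-local -H 0.0.0.0 --log-level debug were swallowed by vLLM's api_server.py, which suggests the image's default entrypoint is misconfigured. While waiting for a fixed image, explicitly overriding the entrypoint might work (an untested sketch; it assumes xinference-local is on the container's PATH):

    # Bypass the image's default entrypoint and invoke xinference-local directly.
    # Everything after the image name becomes arguments to the new entrypoint.
    docker run -e XINFERENCE_MODEL_SRC=modelscope \
      -p 9998:9997 --gpus all \
      --entrypoint xinference-local \
      xprobe/xinference:v0.14.4 \
      -H 0.0.0.0 --log-level debug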

Mrluzhe commented 2 months ago

WIN 11 22631.4037, Docker Desktop 4.33.1 (161083), WSL2, CUDA Version: 12.6. Exactly the same problem here.

Minamiyama commented 2 months ago

+1

AAEE86 commented 2 months ago

WARNING 09-02 09:27:38 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
api_server.py: error: unrecognized arguments: bash -c xinference -H 0.0.0.0

RandyChen1985 commented 2 months ago

+1 [screenshot attached]

nikelius commented 2 months ago

> WIN 11 22631.4037, Docker Desktop 4.33.1 (161083), WSL2, CUDA Version: 12.6. Exactly the same problem here.

System environment: WIN 10 22H2 (19045.4780), Docker Desktop 4.33.1 (161083) / WSL2, Driver Version: 551.61, cuda_12.4.r12.4/compiler.33961263_0

Startup command 1 (currently on v0.14.3): G:\>docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
Error message: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: WSL environment detected but no adapters were found: unknown.

Startup command 2 (v0.14.3, with the --gpus flag removed): G:\>docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
Error message: RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory

Question: @Minamiyama @Mrluzhe is my Docker Desktop installation broken? What do I need to do?
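
The "WSL environment detected but no adapters were found" error usually means the NVIDIA container toolkit cannot see the GPU from inside WSL2 at all, independent of Xinference. A quick way to isolate that (a generic diagnostic, assuming a recent Windows NVIDIA driver and GPU support enabled in Docker Desktop):

    # If this also fails, the problem is the Docker Desktop / WSL2 GPU setup,
    # not the xinference image.
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi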

ChengjieLi28 commented 2 months ago

@zhaolj @AAEE86 @Minamiyama @Mrluzhe Please try re-pulling xprobe/xinference:v0.14.4 (the Docker Hub version). I just pushed a fixed build; if it resolves this problem, please reply here. Thanks.

imaben commented 2 months ago

@ChengjieLi28 Same problem here. After pulling the latest image it starts now, but what comes up appears to be just an API server, with no UI.

ChengjieLi28 commented 2 months ago

> Same problem here. After pulling the latest image it starts now, but what comes up appears to be just an API server, with no UI.

What is the full docker command?

imaben commented 2 months ago

> What is the full docker command?

services:
  xinference:
    image: xprobe/xinference:v0.14.4
    ports:
      - "9997:9997"
    volumes:
      - /data/llm/models/.xinference:/root/.xinference
      - /data/llm/models/.cache/huggingface:/root/.cache/huggingface
      - /data/llm/models/.cache/modelscope:/root/.cache/modelscope
    command: 'xinference-local --host 0.0.0.0 --port 9997 --log-level debug'
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              count: all
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - HF_ENDPOINT=https://hf-mirror.com
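
For reference, assuming this file is saved as docker-compose.yml, the stack can be brought up and sanity-checked like this (a hedged sketch; the /v1/models path assumes Xinference's OpenAI-compatible API on the default port):

    # Start the service in the background and follow its logs.
    docker compose up -d
    docker compose logs -f xinference

    # Once it is up, the OpenAI-compatible endpoint should respond:
    curl http://localhost:9997/v1/models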
ChengjieLi28 commented 2 months ago

> [the docker-compose file above]

Access port 9997 on the host. If the container comes up like this, it is working correctly.

imaben commented 2 months ago

> Access port 9997 on the host.

Accessing port 9997 on the host redirects automatically to http://ip:9997/ui, which then returns:

{"detail":"Not Found"}

imaben commented 2 months ago

Rolling back to 0.14.3 fixed it. While I'm here, one more piece of feedback: LLM and embedding models make use of the GPU when loaded, but I have tried several rerank models and none of them load onto the GPU; they all run on the CPU.

ChengjieLi28 commented 2 months ago

> Accessing port 9997 on the host redirects automatically to http://ip:9997/ui, which then returns {"detail":"Not Found"}

The frontend build in the 0.14.4 image is broken; you will need to wait for 0.14.4.post1, which should be released soon.
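
In the meantime, a 404 on /ui does not necessarily mean the server is down; the REST API can be checked directly (a sketch assuming the default port and a machine with the xinference client CLI installed):

    # The web UI 404s in this image, but the server itself may still answer.
    xinference list --endpoint http://localhost:9997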

ChengjieLi28 commented 2 months ago

> Rolling back to 0.14.3 fixed it. While I'm here, one more piece of feedback: LLM and embedding models make use of the GPU when loaded, but the rerank models I tried all run on the CPU.

Please open a separate issue; that is unrelated to this one. Include the relevant launch-command details and the evidence that the model is running on the CPU.
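
One way to gather that evidence (a generic approach, not an official requirement): watch GPU memory and utilization while the rerank model handles a request; if nvidia-smi shows no new process and no memory growth, the model is running on the CPU.

    # Refresh GPU memory/utilization once per second while exercising the model.
    watch -n 1 nvidia-smi

    # In another shell, hit the rerank endpoint (the model UID is illustrative).
    curl http://localhost:9997/v1/rerank \
      -H 'Content-Type: application/json' \
      -d '{"model": "bge-reranker-base", "query": "docker", "documents": ["a", "b"]}'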

yushengliao commented 2 months ago

It feels like nearly every release ships with problems. Could testing be strengthened? We always end up having to wait for the post1/post2 builds.