xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

v0.14.4 Docker fails to start #2202

Closed. zhaolj closed this issue 2 months ago.

zhaolj commented 2 months ago

System Info

Ubuntu 22.04; Docker version 27.2.0, build 3ab4256; CUDA Version: 12.5

Running Xinference with Docker?

Yes.

Version info

v0.14.4

The command used to start Xinference

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v0.14.4 xinference-local -H 0.0.0.0 --log-level debug

Reproduction

  1. Run the command docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v0.14.4 xinference-local -H 0.0.0.0 --log-level debug.
  2. The following error message is printed:
    WARNING 08-30 19:00:12 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
    usage: api_server.py [-h] [--host HOST] [--port PORT]
                     [--uvicorn-log-level {debug,info,warning,error,critical,trace}]
                     [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS]
                     [--allowed-methods ALLOWED_METHODS]
                     [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY]
                     [--lora-modules LORA_MODULES [LORA_MODULES ...]]
                     [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]]
                     [--chat-template CHAT_TEMPLATE]
                     [--response-role RESPONSE_ROLE]
                     [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
                     [--ssl-ca-certs SSL_CA_CERTS]
                     [--ssl-cert-reqs SSL_CERT_REQS] [--root-path ROOT_PATH]
                     [--middleware MIDDLEWARE] [--return-tokens-as-token-ids]
                     [--disable-frontend-multiprocessing] [--model MODEL]
                     [--tokenizer TOKENIZER] [--skip-tokenizer-init]
                     [--revision REVISION] [--code-revision CODE_REVISION]
                     [--tokenizer-revision TOKENIZER_REVISION]
                     [--tokenizer-mode {auto,slow}] [--trust-remote-code]
                     [--download-dir DOWNLOAD_DIR]
                     [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes}]
                     [--dtype {auto,half,float16,bfloat16,float,float32}]
                     [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
                     [--quantization-param-path QUANTIZATION_PARAM_PATH]
                     [--max-model-len MAX_MODEL_LEN]
                     [--guided-decoding-backend {outlines,lm-format-enforcer}]
                     [--distributed-executor-backend {ray,mp}]
                     [--worker-use-ray]
                     [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                     [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                     [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
                     [--ray-workers-use-nsight]
                     [--block-size {8,16,32,128,256,512,1024,2048}]
                     [--enable-prefix-caching] [--disable-sliding-window]
                     [--use-v2-block-manager]
                     [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
                     [--swap-space SWAP_SPACE]
                     [--cpu-offload-gb CPU_OFFLOAD_GB]
                     [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                     [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
                     [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                     [--max-num-seqs MAX_NUM_SEQS]
                     [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
                     [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}]
                     [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
                     [--enforce-eager]
                     [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE]
                     [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
                     [--disable-custom-all-reduce]
                     [--tokenizer-pool-size TOKENIZER_POOL_SIZE]
                     [--tokenizer-pool-type TOKENIZER_POOL_TYPE]
                     [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
                     [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                     [--enable-lora] [--max-loras MAX_LORAS]
                     [--max-lora-rank MAX_LORA_RANK]
                     [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                     [--lora-dtype {auto,float16,bfloat16,float32}]
                     [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
                     [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                     [--enable-prompt-adapter]
                     [--max-prompt-adapters MAX_PROMPT_ADAPTERS]
                     [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
                     [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu}]
                     [--num-scheduler-steps NUM_SCHEDULER_STEPS]
                     [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
                     [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                     [--speculative-model SPECULATIVE_MODEL]
                     [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}]
                     [--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
                     [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                     [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
                     [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
                     [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                     [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
                     [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                     [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
                     [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                     [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
                     [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                     [--ignore-patterns IGNORE_PATTERNS]
                     [--preemption-mode PREEMPTION_MODE]
                     [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
                     [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
                     [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                     [--collect-detailed-traces COLLECT_DETAILED_TRACES]
                     [--engine-use-ray] [--disable-log-requests]
                     [--max-log-len MAX_LOG_LEN]
    api_server.py: error: unrecognized arguments: xinference-local -H 0.0.0.0 --log-level debug

Expected behavior

It should just run normally.
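
The last line of the log hints at the cause: the arguments xinference-local -H 0.0.0.0 --log-level debug were swallowed by vLLM's api_server.py, which suggests the image's default entrypoint is misconfigured. While waiting for a fixed image, explicitly overriding the entrypoint might work (an untested sketch; it assumes xinference-local is on the container's PATH):

    # Bypass the image's default entrypoint and invoke xinference-local directly.
    # Everything after the image name becomes arguments to the new entrypoint.
    docker run -e XINFERENCE_MODEL_SRC=modelscope \
      -p 9998:9997 --gpus all \
      --entrypoint xinference-local \
      xprobe/xinference:v0.14.4 \
      -H 0.0.0.0 --log-level debug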

Mrluzhe commented 2 months ago

WIN 11 22631.4037, Docker Desktop 4.33.1 (161083), WSL2, CUDA Version: 12.6. Exactly the same problem here.

Minamiyama commented 2 months ago

+1

AAEE86 commented 2 months ago

WARNING 09-02 09:27:38 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.
api_server.py: error: unrecognized arguments: bash -c xinference -H 0.0.0.0

RandyChen1985 commented 2 months ago

+1 [screenshot attached]

nikelius commented 2 months ago

> WIN 11 22631.4037, Docker Desktop 4.33.1 (161083), WSL2, CUDA Version: 12.6. Exactly the same problem here.

System environment: WIN 10 22H2 (19045.4780), Docker Desktop 4.33.1 (161083) / WSL2, Driver Version: 551.61, cuda_12.4.r12.4/compiler.33961263_0

Startup command 1 (currently on v0.14.3): G:\>docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
Error message: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: WSL environment detected but no adapters were found: unknown.

Startup command 2 (v0.14.3, with the --gpus flag removed): G:\>docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
Error message: RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory

Question: @Minamiyama @Mrluzhe is my Docker Desktop installation broken? What do I need to do?
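
The "WSL environment detected but no adapters were found" error usually means the NVIDIA container toolkit cannot see the GPU from inside WSL2 at all, independent of Xinference. A quick way to isolate that (a generic diagnostic, assuming a recent Windows NVIDIA driver and GPU support enabled in Docker Desktop):

    # If this also fails, the problem is the Docker Desktop / WSL2 GPU setup,
    # not the xinference image.
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi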

ChengjieLi28 commented 2 months ago

@zhaolj @AAEE86 @Minamiyama @Mrluzhe Please try re-pulling xprobe/xinference:v0.14.4 (the Docker Hub version). I just pushed a fixed build; if it resolves this problem, please reply here. Thanks.

imaben commented 2 months ago

@ChengjieLi28 Same problem here. After pulling the latest image it starts now, but what comes up appears to be just an API server, with no UI.

ChengjieLi28 commented 2 months ago

> Same problem here. After pulling the latest image it starts now, but what comes up appears to be just an API server, with no UI.

What is the full docker command?

imaben commented 2 months ago

> What is the full docker command?

services:
  xinference:
    image: xprobe/xinference:v0.14.4
    ports:
      - "9997:9997"
    volumes:
      - /data/llm/models/.xinference:/root/.xinference
      - /data/llm/models/.cache/huggingface:/root/.cache/huggingface
      - /data/llm/models/.cache/modelscope:/root/.cache/modelscope
    command: 'xinference-local --host 0.0.0.0 --port 9997 --log-level debug'
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              count: all
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - HF_ENDPOINT=https://hf-mirror.com
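
For reference, assuming this file is saved as docker-compose.yml, the stack can be brought up and sanity-checked like this (a hedged sketch; the /v1/models path assumes Xinference's OpenAI-compatible API on the default port):

    # Start the service in the background and follow its logs.
    docker compose up -d
    docker compose logs -f xinference

    # Once it is up, the OpenAI-compatible endpoint should respond:
    curl http://localhost:9997/v1/models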
ChengjieLi28 commented 2 months ago

> [the docker-compose file above]

Access port 9997 on the host. If the container comes up like this, it is working correctly.

imaben commented 2 months ago

> Access port 9997 on the host.

Accessing port 9997 on the host redirects automatically to http://ip:9997/ui, which then returns:

{"detail":"Not Found"}

imaben commented 2 months ago

Rolling back to 0.14.3 fixed it. While I'm here, one more piece of feedback: LLM and embedding models make use of the GPU when loaded, but I have tried several rerank models and none of them load onto the GPU; they all run on the CPU.

ChengjieLi28 commented 2 months ago

> Accessing port 9997 on the host redirects automatically to http://ip:9997/ui, which then returns {"detail":"Not Found"}

The frontend build in the 0.14.4 image is broken; you will need to wait for 0.14.4.post1, which should be released soon.
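
In the meantime, a 404 on /ui does not necessarily mean the server is down; the REST API can be checked directly (a sketch assuming the default port and a machine with the xinference client CLI installed):

    # The web UI 404s in this image, but the server itself may still answer.
    xinference list --endpoint http://localhost:9997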

ChengjieLi28 commented 2 months ago

> Rolling back to 0.14.3 fixed it. While I'm here, one more piece of feedback: LLM and embedding models make use of the GPU when loaded, but the rerank models I tried all run on the CPU.

Please open a separate issue; that is unrelated to this one. Include the relevant launch-command details and the evidence that the model is running on the CPU.
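
One way to gather that evidence (a generic approach, not an official requirement): watch GPU memory and utilization while the rerank model handles a request; if nvidia-smi shows no new process and no memory growth, the model is running on the CPU.

    # Refresh GPU memory/utilization once per second while exercising the model.
    watch -n 1 nvidia-smi

    # In another shell, hit the rerank endpoint (the model UID is illustrative).
    curl http://localhost:9997/v1/rerank \
      -H 'Content-Type: application/json' \
      -d '{"model": "bge-reranker-base", "query": "docker", "documents": ["a", "b"]}'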

yushengliao commented 2 months ago

It feels like nearly every release ships with problems. Could testing be strengthened? We always end up having to wait for the post1/post2 builds.