Additional note: llama_cpp_python==0.3.1 is installed, and the container that runs successfully on the other server uses the same versions of Python, llama_cpp_python, conda, and torch.
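A quick way to confirm the two environments really match is to print the relevant versions on both servers; for example (just a sketch, not output from the report):

```python
# Print the versions that should match between the working server and the failing one.
import sys

import torch
import llama_cpp  # provided by the llama_cpp_python package

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("llama_cpp_python:", llama_cpp.__version__)
```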
System Info / 系統信息
SERVER: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
PRETTY_NAME: "Debian GNU/Linux 11 (bullseye)"
python: 3.11.5
conda: 23.10.0
torch: 2.4.1+cpu
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
Yes, Xinference runs in Docker (see the command below).
Version info / 版本信息
xinference: 0.15.4
The command used to start Xinference / 用以启动 xinference 的命令
docker run -it -e XINFERENCE_MODEL_SRC=modelscope -p 9996:9997 -v ./xinference:/root/.xinference --name xinference-cpu image_name xinference-local -H 0.0.0.0 --log-level debug
Reproduction / 复现过程
Set in the web UI:
launch_model: qwen2.5-instruct
model_format: ggufv2
model_size: 7
quantization: q4_k_m
N GPU layers: 1
replica: 1
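Roughly the equivalent launch through the Python client, if I understand the Client API in 0.15.x correctly (parameter names, especially model_engine and n_gpu_layers, may differ by version; this is a sketch, not the exact call the web UI issues):

```python
from xinference.client import Client

# Host port 9996 is mapped to the container's 9997 in the docker command above.
client = Client("http://localhost:9996")

model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_engine="llama.cpp",        # gguf models run on the llama.cpp engine
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="q4_k_m",
    replica=1,
    n_gpu_layers=1,                  # forwarded to llama-cpp-python as an extra kwarg
)
print(model_uid)
```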
Expected behavior / 期待表现
What I'd like to know is whether this is a tool configuration issue, an instruction set support issue, or something else. When I configure the same quantized model with llama.cpp on a different server, I don't get this error. If needed, I can provide the configuration that works correctly. Thank you.
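If it helps with the instruction-set question: the E5-2620 v4 (Broadwell) supports AVX2 but not AVX-512, so one thing worth comparing across the two servers is the SIMD flags each CPU reports. A simple check like this (just a sketch, not something from the logs) could be run on both machines:

```python
# Print the SIMD-related CPU flags so the two servers can be compared;
# a llama-cpp-python build compiled for instructions the CPU lacks is a
# common cause of illegal-instruction crashes.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

print(sorted(f for f in flags if f.startswith(("sse", "avx", "fma", "f16c"))))
```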