xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Deployed Qwen1.5_14B, replies are nonsensical #1921

Closed zifeiyu-tan closed 1 month ago

zifeiyu-tan commented 1 month ago

System Info / 系統信息

accelerate 0.31.0 addict 2.4.0 Agently 3.3.1.4 aiobotocore 2.7.0 aiofiles 23.2.1 aiohttp 3.9.5 aioitertools 0.11.0 aioprometheus 23.12.0 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.1 aliyun-python-sdk-kms 2.16.3 altair 5.3.0 annotated-types 0.7.0 anthropic 0.28.0 antlr4-python3-runtime 4.9.3 anyio 4.4.0 asgiref 3.8.1 asttokens 2.4.1 async-timeout 4.0.3 attrdict 2.0.1 attrs 23.2.0 audioread 3.0.1 auto_gptq 0.7.1 autoawq 0.2.5 autoawq_kernels 0.0.6 backoff 2.2.1 bce-python-sdk 0.9.14 bcrypt 4.1.3 beautifulsoup4 4.12.3 bitsandbytes 0.43.1 botocore 1.31.64 build 1.2.1 cachetools 5.3.3 cdifflib 1.2.6 certifi 2024.6.2 cffi 1.16.0 charset-normalizer 3.3.2 chatglm-cpp 0.3.3 chroma-hnswlib 0.7.3 chromadb 0.5.3 click 8.1.7 cloudpickle 3.0.0 cmake 3.29.5.1 colorama 0.4.6 coloredlogs 15.0.1 colorlog 6.8.2 comm 0.2.2 contourpy 1.2.1 controlnet_aux 0.0.7 crcmod 1.7 cryptography 42.0.8 cycler 0.12.1 Cython 3.0.10 dataclasses-json 0.6.7 datasets 2.18.0 debugpy 1.6.7 decorator 5.1.1 Deprecated 1.2.14 diffusers 0.29.0 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 duckduckgo_search 6.1.6 ecdsa 0.19.0 editdistance 0.8.1 einops 0.8.0 einx 0.3.0 encodec 0.1.1 entrypoints 0.4 erniebot 0.5.6 exceptiongroup 1.2.0 executing 2.0.1 fastapi 0.110.3 ffmpy 0.3.2 filelock 3.15.1 FlagEmbedding 1.2.10 flatbuffers 24.3.25 fonttools 4.53.0 frozendict 2.4.4 frozenlist 1.4.1 fsspec 2023.10.0 future 1.0.0 gast 0.5.4 gekko 1.1.1 gitdb 4.0.11 GitPython 3.1.43 google-auth 2.31.0 googleapis-common-protos 1.63.2 gradio 4.26.0 gradio_client 0.15.1 greenlet 3.0.3 grpcio 1.64.1 h11 0.14.0 hf_transfer 0.1.6 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.3 humanfriendly 10.0 idna 3.7 imageio 2.34.1 importlib_metadata 7.1.0 importlib_resources 6.4.0 inflect 7.2.1 interegular 0.3.3 ipykernel 6.29.4 ipython 8.25.0 jedi 0.19.1 Jinja2 3.1.4 jiter 0.4.2 jmespath 0.10.0 joblib 1.4.2 json5 0.9.25 jsonpatch 1.33 jsonpointer 3.0.0 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 jupyter-client 7.3.4 
jupyter_core 5.7.2 kiwisolver 1.4.5 kubernetes 30.1.0 langchain 0.2.6 langchain-community 0.2.6 langchain-core 0.2.11 langchain-text-splitters 0.2.1 langsmith 0.1.77 lark 1.1.9 lazy_loader 0.4 librosa 0.10.2.post1 litellm 1.40.12 llama_cpp_python 0.2.78 llvmlite 0.43.0 lm-format-enforcer 0.10.1 markdown-it-py 3.0.0 MarkupSafe 2.1.5 marshmallow 3.21.3 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdurl 0.1.2 mmh3 4.1.0 modelscope 1.15.0 monotonic 1.6 more-itertools 10.3.0 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 mypy-extensions 1.0.0 nemo_text_processing 1.0.2 nest_asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 omegaconf 2.3.0 onnxruntime 1.18.1 openai 1.34.0 opencv-contrib-python 4.10.0.82 opencv-python 4.10.0.82 opentelemetry-api 1.25.0 opentelemetry-exporter-otlp-proto-common 1.25.0 opentelemetry-exporter-otlp-proto-grpc 1.25.0 opentelemetry-instrumentation 0.46b0 opentelemetry-instrumentation-asgi 0.46b0 opentelemetry-instrumentation-fastapi 0.46b0 opentelemetry-proto 1.25.0 opentelemetry-sdk 1.25.0 opentelemetry-semantic-conventions 0.46b0 opentelemetry-util-http 0.46b0 optimum 1.20.0 orjson 3.10.5 oss2 2.18.5 outlines 0.0.34 overrides 7.7.0 packaging 24.1 pandas 2.2.2 parso 0.8.4 passlib 1.7.4 peft 0.11.1 pexpect 4.9.0 pickleshare 0.7.5 pillow 10.3.0 pip 24.0 platformdirs 4.2.2 plumbum 1.8.3 pooch 1.8.2 posthog 3.5.0 prometheus_client 0.20.0 prometheus-fastapi-instrumentator 7.0.0 prompt_toolkit 3.0.47 protobuf 4.25.3 psutil 5.9.0 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pyasn1 0.6.0 
pyasn1_modules 0.4.0 pycparser 2.22 pycryptodome 3.20.0 pydantic 2.7.4 pydantic_core 2.18.4 pydub 0.25.1 Pygments 2.18.0 PyJWT 2.8.0 pynini 2.1.5 pynvml 11.5.0 pyparsing 3.1.2 PyPika 0.48.9 pyproject_hooks 1.1.0 pyreqwest_impersonate 0.4.7 python-dateutil 2.9.0 python-dotenv 1.0.1 python-jose 3.3.0 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 pyzmq 25.1.2 quantile-python 1.1 ray 2.24.0 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.7.1 rouge 1.0.1 rpds-py 0.18.1 rpyc 6.0.0 rsa 4.9 ruff 0.4.8 s3fs 2023.10.0 sacremoses 0.1.1 safetensors 0.4.3 scikit-image 0.23.2 scikit-learn 1.5.0 scipy 1.13.1 seaborn 0.13.2 semantic-version 2.10.0 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 69.5.1 sglang 0.1.17 shellingham 1.5.4 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 sortedcontainers 2.4.0 soundfile 0.12.1 soupsieve 2.5 soxr 0.3.7 SQLAlchemy 2.0.30 sse-starlette 2.1.0 stack-data 0.6.2 starlette 0.37.2 sympy 1.12.1 tabulate 0.9.0 tblib 3.0.0 tenacity 8.3.0 threadpoolctl 3.5.0 tifffile 2024.5.22 tiktoken 0.7.0 timm 1.0.3 tokenizers 0.19.1 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.0 torchaudio 2.3.0 torchvision 0.18.0 tornado 6.1 tqdm 4.66.4 traitlets 5.14.3 transformers 4.41.2 transformers-stream-generator 0.0.5 triton 2.3.0 typeguard 4.3.0 typer 0.11.1 typing_extensions 4.12.2 typing-inspect 0.9.0 tzdata 2024.1 ultralytics 8.2.45 ultralytics-thop 2.0.0 urllib3 2.0.7 uvicorn 0.30.1 uvloop 0.19.0 vector-quantize-pytorch 1.14.24 vllm 0.4.3 vllm-flash-attn 2.5.8.post2 vocos 0.1.0 watchfiles 0.22.0 wcwidth 0.2.13 websocket-client 1.8.0 websockets 11.0.3 WeTextProcessing 1.0.1 wget 3.2 wheel 0.43.0 wrapt 1.16.0 xformers 0.0.26.post1 xinference 0.12.2.post1 xoscar 0.3.0 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zhipuai 2.1.0.20240521 zipp 3.19.2 zmq 0.0.0 zstandard 0.22.0

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

xinference 0.12.2.post1

The command used to start Xinference / 用以启动 xinference 的命令

xinference-local --host 0.0.0.0 --port 9997

Reproduction / 复现过程

Step 1: After deployment, run the following code:

```python
from langchain_community.llms import Xinference
from langchain.prompts import PromptTemplate

llm = Xinference(server_url="localhost", model_uid="Qwen-14B-Chat")

test = llm.invoke('你好')
print(test)

template = """使用下面的上下文来回答问题。 如果你不知道答案,就说你不知道,不要编造答案。
{context}
问题: {question}
回答:"""

prompt = PromptTemplate.from_template(template)

query_llm = prompt | llm
context = "小明有8块饼干"
query = "我吃掉了小明4块饼干,小明还有几块饼干"
response = query_llm.invoke({"context": context, "question": query})
print(response)
```

Step 2: When I run `test = llm.invoke('你好')`, it replies with something completely unrelated (a rambling passage about a labor dispute and the statute of limitations for labor arbitration), like this:

“,我有一个朋友在一家公司工作。后来他觉得累了就辞职了。老板说等找到人就放他走。后来,他的妻子帮他找到了一个替代者。老板让他留下一个月,他说不。他已经走了好几天了,现在工资还没发。我想问一下这个劳动纠

劳动者离职后,对工资有争议,劳动仲裁时效为一年。 根据《劳动仲裁法》 第二十七条 劳动争议申请仲裁的时效期间为一年。仲裁时效期间从当事人知道或者应当知道其权利被侵害之日起计算。 前款规定的仲裁时效,因当事人一方向对方当事人主张权利,或者向有关部门请求权利救济,或者对方当事人同意履行义务而中断。从中断时起,仲裁时效期间重新计算。”

Step 3: With `response = query_llm.invoke({"context": context, "question": query})`, the model usually answers correctly, but it too occasionally produces strange replies.
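For reference, the prompt the chain actually sends is just the template with the two placeholders filled in. A plain-Python sketch of that rendering step (the chain performs the equivalent substitution internally):

```python
# Plain-Python sketch of the string that the PromptTemplate renders and
# passes to the model; the chain substitutes {context} and {question}.
template = """使用下面的上下文来回答问题。 如果你不知道答案,就说你不知道,不要编造答案。
{context}
问题: {question}
回答:"""

rendered = template.format(
    context="小明有8块饼干",
    question="我吃掉了小明4块饼干,小明还有几块饼干",
)
print(rendered)
```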

Expected behavior / 期待表现

I would like the results of calling the model from code to be consistent with the results from the gradio UI, without these strange outputs.

zifeiyu-tan commented 1 month ago

Some of the strange replies produced by the `response = query_llm.invoke({"context": context, "question": query})` call above:

小明还有____块饼干。

小明还有4块饼干。

<no output>

小明还剩4块饼干。

qinxuye commented 1 month ago

Can you try to use OpenAI in langchain? You can connect to xinference endpoint.

zifeiyu-tan commented 1 month ago

> Can you try to use OpenAI in langchain? You can connect to xinference endpoint.

Thank you for your reply. Do you mean the OpenAI API or ChatGPT? I don't quite follow.
In addition, regarding the issue above: when I use the gradio UI provided by Xinference, Qwen1.5 performs perfectly; the strange replies only appear when I call it through the API as described above.
qinxuye commented 1 month ago

I mean you can try to use https://python.langchain.com/v0.2/docs/integrations/chat/openai/ with `base_url='http://xxx:9997/v1'` and `model='qwen1.5-chat'`, because xinf is compatible with the OpenAI interface. The reason is that we have tried to contribute some changes to langchain, but they do not pay much attention to the community now.

zifeiyu-tan commented 1 month ago

```python
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(
    model_name="Qwen-14B-Chat",
    openai_api_key="Qwen-14B-Chat",
    openai_api_base="http://127.0.0.1:9997/v1",
)

template = """使用下面的上下文来回答问题。 如果你不知道答案,就说你不知道,不要编造答案。
{context}
问题: {question}
回答:"""

prompt = PromptTemplate.from_template(template)

query_llm = prompt | chat_model
context = "小明有8块饼干"
query = "我吃掉了小明4块饼干,小明还有几块饼干"
response = query_llm.invoke({"context": context, "question": query})
print(response)
```

When I changed the code as above, the problem seems to be solved.
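One side note (my observation, not from the thread): a chat model in LangChain returns a message object rather than a plain string, so `print(response)` shows the whole object; the reply text itself is in the message's `content` attribute. A minimal sketch using a stand-in dataclass in place of langchain-core's real `AIMessage`:

```python
from dataclasses import dataclass

# Stand-in for langchain-core's AIMessage, used here only to illustrate
# that the reply text lives in the message's .content attribute.
@dataclass
class AIMessage:
    content: str

response = AIMessage(content="小明还有4块饼干。")
print(response.content)  # prints just the reply text
```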