zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

LLM Chat only returns "#" characters #1514

Closed tomroh closed 10 months ago

tomroh commented 10 months ago

No matter the prompt, privateGPT only returns '#' characters as the response. This does not occur when CUBLAS is disabled.


Setup info:

NVIDIA GeForce RTX 4080, Windows 11


accelerate==0.25.0 aiofiles==23.2.1 aiohttp==3.9.1 aiosignal==1.3.1 aiostream==0.5.2 altair==5.2.0 annotated-types==0.6.0 anyio==3.7.1 attrs==23.1.0 beautifulsoup4==4.12.2 black==22.12.0 boto3==1.34.2 botocore==1.34.2 build==1.0.3 CacheControl==0.13.1 certifi==2023.11.17 cfgv==3.4.0 charset-normalizer==3.3.2 cleo==2.1.0 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 contourpy==1.2.0 coverage==7.3.3 crashtest==0.4.1 cycler==0.12.1 dataclasses-json==0.5.14 datasets==2.14.4 Deprecated==1.2.14 dill==0.3.7 diskcache==5.6.3 distlib==0.3.8 distro==1.8.0 dnspython==2.4.2 dulwich==0.21.7 email-validator==2.1.0.post1 evaluate==0.4.1 fastapi==0.103.2 fastjsonschema==2.19.1 ffmpy==0.3.1 filelock==3.13.1 flatbuffers==23.5.26 fonttools==4.46.0 frozenlist==1.4.1 fsspec==2023.12.2 gradio==4.10.0 gradio_client==0.7.3 greenlet==3.0.2 grpcio==1.60.0 grpcio-tools==1.60.0 h11==0.14.0 h2==4.1.0 hpack==4.0.0 httpcore==1.0.2 httptools==0.6.1 httpx==0.25.2 huggingface-hub==0.19.4 humanfriendly==10.0 hyperframe==6.0.1 identify==2.5.33 idna==3.6 importlib-resources==6.1.1 iniconfig==2.0.0 injector==0.21.0 installer==0.7.0 itsdangerous==2.1.2 jaraco.classes==3.3.0 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 jsonschema==4.20.0 jsonschema-specifications==2023.11.2 keyring==24.3.0 kiwisolver==1.4.5 llama-index==0.9.3 llama_cpp_python==0.2.29 markdown-it-py==3.0.0 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.2 mdurl==0.1.2 more-itertools==10.2.0 mpmath==1.3.0 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 mypy==1.7.1 mypy-extensions==1.0.0 nest-asyncio==1.5.8 networkx==3.2.1 nltk==3.8.1 nodeenv==1.8.0 numpy==1.26.3 onnx==1.15.0 onnxruntime==1.16.3 openai==1.5.0 optimum==1.16.1 orjson==3.9.10 packaging==23.2 pandas==2.1.4 pathspec==0.12.1 pexpect==4.9.0 Pillow==10.1.0 pkginfo==1.9.6 platformdirs==4.1.0 pluggy==1.3.0 poetry==1.7.1 poetry-core==1.8.1 poetry-plugin-export==1.6.0 portalocker==2.8.2 pre-commit==2.21.0 -e git+https://github.com/imartinez/privateGPT@d3acd85fe34030f8cfd7daf50b30c534087bdf2b#egg=private_gpt protobuf==4.25.1 psutil==5.9.6 ptyprocess==0.7.0 pyarrow==14.0.1 pydantic==2.5.2 pydantic-extra-types==2.2.0 pydantic-settings==2.1.0 pydantic_core==2.14.5 pydub==0.25.1 Pygments==2.17.2 pyparsing==3.1.1 pypdf==3.17.2 pyproject_hooks==1.0.0 pyreadline3==3.4.1 pytest==7.4.3 pytest-asyncio==0.21.1 pytest-cov==3.0.0 python-dateutil==2.8.2 python-dotenv==1.0.0 python-multipart==0.0.6 pytz==2023.3.post1 pywin32==306 pywin32-ctypes==0.2.2 PyYAML==6.0.1 qdrant-client==1.7.0 rapidfuzz==3.6.1 referencing==0.32.0 regex==2023.10.3 requests==2.31.0 requests-toolbelt==1.0.0 responses==0.18.0 rich==13.7.0 rpds-py==0.14.1 ruff==0.1.8 s3transfer==0.9.0 safetensors==0.4.1 scikit-learn==1.3.2 scipy==1.11.4 semantic-version==2.10.0 sentence-transformers==2.2.2 sentencepiece==0.1.99 shellingham==1.5.4 six==1.16.0 sniffio==1.3.0 soupsieve==2.5 SQLAlchemy==2.0.23 starlette==0.27.0 sympy==1.12 tenacity==8.2.3 threadpoolctl==3.2.0 tiktoken==0.5.2 tokenizers==0.15.0 tomlkit==0.12.0 toolz==0.12.0 torch==2.1.2+cu121 torchaudio==2.1.2+cu121 torchvision==0.16.2+cu121 tqdm==4.66.1 transformers==4.36.1 trove-classifiers==2024.1.8 typer==0.9.0 types-PyYAML==6.0.12.12 typing-inspect==0.9.0 typing_extensions==4.9.0 tzdata==2023.3 ujson==5.9.0 urllib3==1.26.18 uvicorn==0.24.0.post1 virtualenv==20.25.0 watchdog==3.0.0 watchfiles==0.21.0 websockets==11.0.3 wrapt==1.16.0 xxhash==3.4.1 yarl==1.9.4

sr3dna commented 10 months ago

I also encountered the same issue. I am setting up on an Ubuntu 22.04 based OS (PopOS) with an RTX 4070. LM Studio and Stable Diffusion have been running fine on this setup.

aruncserecs commented 10 months ago

I am also facing the same issue.

user: hi


Response: assistant: ################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################


imashoksundar commented 10 months ago

+1. Facing the same issue when running it in Docker. The chat response only returns '#' characters.

LaurentEsingle commented 10 months ago

+1. Same issue on Windows WSL2

denniszander commented 10 months ago

+1 Facing the same issue on Ubuntu 22.04 with RTX 2060.

jamador47 commented 10 months ago

+1. Facing the same issue on an RTX 3080 with Windows 11.

sslovelady commented 10 months ago

Facing the same issue on Windows 11 with an RTX 3060 Ti. It works with CPU, but not with CUDA.


FIXED IT: it seems the latest version of llama-cpp-python (0.2.29) is incompatible. Downgrading to 0.2.23 resolves it:

$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.23
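
The same pin should also work for the Ubuntu and WSL setups reported above; the bash equivalent (untested on my side) would be:

CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.23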

jamador47 commented 10 months ago

Thank you @sslovelady, I can confirm the model is now working as expected. You are a life saver :) BTW, I can confirm that downgrading CUDA does nothing; the fix is pinning llama-cpp-python to version 0.2.23.

shepard153 commented 10 months ago

Yep. Llama-cpp-python is kinda broken. See this thread: https://github.com/abetlen/llama-cpp-python/issues/1089

naveenk2022 commented 10 months ago

You don't need to downgrade llama-cpp-python! Make the following edit to /private_gpt/components/llm/llm_component.py:


        logger.info("Initializing the LLM in mode=%s", llm_mode)
        match settings.llm.mode:
            case "local":
                from llama_index.llms import LlamaCPP

                prompt_style = get_prompt_style(settings.local.prompt_style)

                self.llm = LlamaCPP(
                    model_path=str(models_path / settings.local.llm_hf_model_file),
                    temperature=0.1,
                    max_new_tokens=settings.llm.max_new_tokens,
                    context_window=settings.llm.context_window,
                    generate_kwargs={},
                    # All to GPU
                    # Adding "offload_kqv":True fixes the broken generator 
                    model_kwargs={"n_gpu_layers": -1, "offload_kqv": True},
                    # transform inputs into Llama2 format
                    messages_to_prompt=prompt_style.messages_to_prompt,
                    completion_to_prompt=prompt_style.completion_to_prompt,
                    verbose=True,
                )
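
For anyone who wants to verify the flag outside privateGPT, here is a minimal sketch that calls llama-cpp-python directly (the model path below is just a placeholder, not something from this repo):

    # Minimal sketch, not from the privateGPT codebase: exercises the same
    # offload_kqv flag directly on llama-cpp-python (cuBLAS build, 0.2.29).
    # The model path is a placeholder -- point it at the GGUF file you already use.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,    # offload all layers to the GPU
        offload_kqv=True,   # keep the KV cache on the GPU; leaving this off is what seems to produce the '#' output on 0.2.29
        verbose=False,
    )

    # Ask for a short completion; with offload_kqv=True the text should be normal, not '#'.
    print(llm("Q: Say hello. A:", max_tokens=32)["choices"][0]["text"])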

tomroh commented 10 months ago

I can confirm just downgrading llama-cpp-python works for me as well. Thanks @sslovelady

Koesters commented 10 months ago

> You don't need to downgrade llama-cpp-python! Make the following edit to /private_gpt/components/llm/llm_component.py: […]

What you haven't spelled out is what actually needs to be changed: add "offload_kqv": True to model_kwargs={"n_gpu_layers": -1}, so it becomes model_kwargs={"n_gpu_layers": -1, "offload_kqv": True}.

Tested on two Ubuntu 22.04 machines with CUDA 12.3 using partial layer offloading: 5 of 41 layers on TheBloke/Llama-2-13B-chat-GGUF and 30 of 81 layers on TheBloke/GodziLLa2-70B-GGUF offloaded to the GPU.

sr3dna commented 10 months ago

The fix proposed by @Koesters works for me as well! I didn't downgrade anything. Thank you Koesters!