Closed: tomroh closed this issue 10 months ago.
I also encountered the same issue. I am setting up on an Ubuntu 22.04-based OS (Pop!_OS) with an RTX 4070. LM Studio and Stable Diffusion have been running fine on this setup.
I am also facing the same issue.
user: hi
Response: assistant: ################################ … (the reply is nothing but a long run of '#' characters)
+1. Facing the same issue when running it in Docker; the chat response only returns '#' characters.
+1. Same issue on Windows WSL2
+1 Facing the same issue on Ubuntu 22.04 with RTX 2060.
+1. Facing the same issue with an RTX 3080 on Windows 11.
Facing the same issue on Windows 11 with an RTX 3060 Ti - works with CPU, not with CUDA.
FIXED IT - it seems the latest version of llama-cpp-python (0.2.29) is incompatible:
1. Downgraded CUDA to 11.7.1 (not certain this is necessary, but I did it).
2. Ran the following command from the install guide, but pinned llama-cpp-python to version 0.2.23:
$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.23
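After the reinstall, a quick sanity check from inside the poetry environment (just a convenience snippet, not part of the original instructions) confirms which build is actually active:

```python
# Sanity check: confirm the pinned llama-cpp-python build is the one in use.
# Run inside the project's poetry environment, e.g. `poetry run python check_version.py`.
import llama_cpp

print(llama_cpp.__version__)  # should print "0.2.23" after the pinned reinstall
```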
Thank you @sslovelady - confirming the model is now working as expected. You are a life saver :) BTW, I can confirm that downgrading CUDA does nothing; the fix is pinning llama-cpp-python to 0.2.23.
Yep. Llama-cpp-python is kinda broken. See this thread: https://github.com/abetlen/llama-cpp-python/issues/1089
You don't need to downgrade llama-cpp-python! Make the following edit to /private_gpt/components/llm/llm_component.py:
logger.info("Initializing the LLM in mode=%s", llm_mode)
match settings.llm.mode:
case "local":
from llama_index.llms import LlamaCPP
prompt_style = get_prompt_style(settings.local.prompt_style)
self.llm = LlamaCPP(
model_path=str(models_path / settings.local.llm_hf_model_file),
temperature=0.1,
max_new_tokens=settings.llm.max_new_tokens,
context_window=settings.llm.context_window,
generate_kwargs={},
# All to GPU
# Adding "offload_kqv":True fixes the broken generator
model_kwargs={"n_gpu_layers": -1, "offload_kqv": True},
# transform inputs into Llama2 format
messages_to_prompt=prompt_style.messages_to_prompt,
completion_to_prompt=prompt_style.completion_to_prompt,
verbose=True,
)
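If you want to verify that offload_kqv on its own is what restores sane output, here is a minimal standalone sketch against llama_cpp directly (the model path and prompt are placeholders, and it assumes a CUDA build of llama-cpp-python recent enough to accept the offload_kqv flag):

```python
# Standalone reproduction outside privateGPT: with offload_kqv=True the model
# answers normally; with the flag omitted on a CUDA build of 0.2.29 the output
# degenerates into '#' characters.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder: any local GGUF file
    n_gpu_layers=-1,    # offload all layers to the GPU
    offload_kqv=True,   # keep the KV cache on the GPU as well
    verbose=False,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])  # expect a normal answer, not a string of '#'
```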
I can confirm just downgrading llama-cpp-python works for me as well. Thanks @sslovelady
(Quoting the comment above: "You don't need to downgrade llama-cpp-python! Make the following edit to /private_gpt/components/llm/llm_component.py:" followed by the same code.)
What you haven't spelled out is what actually needs to change: add "offload_kqv": True to model_kwargs={"n_gpu_layers": -1}.
Tested on two Ubuntu 22.04 machines with CUDA 12.3 and partial layer offload: 5 of 41 layers offloaded to the GPU on TheBloke/Llama-2-13B-chat-GGUF, and 30 of 81 on TheBloke/GodziLLa2-70B-GGUF.
The fix proposed by @Koesters works for me as well! I didn't downgrade anything. Thank you Koesters!
No matter the prompt, privateGPT only returns hashes as the response. This doesn't occur when not using CUBLAS.
Setup info:
NVIDIA GeForce RTX 4080, Windows 11
accelerate==0.25.0 aiofiles==23.2.1 aiohttp==3.9.1 aiosignal==1.3.1 aiostream==0.5.2 altair==5.2.0 annotated-types==0.6.0 anyio==3.7.1 attrs==23.1.0 beautifulsoup4==4.12.2 black==22.12.0 boto3==1.34.2 botocore==1.34.2 build==1.0.3 CacheControl==0.13.1 certifi==2023.11.17 cfgv==3.4.0 charset-normalizer==3.3.2 cleo==2.1.0 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 contourpy==1.2.0 coverage==7.3.3 crashtest==0.4.1 cycler==0.12.1 dataclasses-json==0.5.14 datasets==2.14.4 Deprecated==1.2.14 dill==0.3.7 diskcache==5.6.3 distlib==0.3.8 distro==1.8.0 dnspython==2.4.2 dulwich==0.21.7 email-validator==2.1.0.post1 evaluate==0.4.1 fastapi==0.103.2 fastjsonschema==2.19.1 ffmpy==0.3.1 filelock==3.13.1 flatbuffers==23.5.26 fonttools==4.46.0 frozenlist==1.4.1 fsspec==2023.12.2 gradio==4.10.0 gradio_client==0.7.3 greenlet==3.0.2 grpcio==1.60.0 grpcio-tools==1.60.0 h11==0.14.0 h2==4.1.0 hpack==4.0.0 httpcore==1.0.2 httptools==0.6.1 httpx==0.25.2 huggingface-hub==0.19.4 humanfriendly==10.0 hyperframe==6.0.1 identify==2.5.33 idna==3.6 importlib-resources==6.1.1 iniconfig==2.0.0 injector==0.21.0 installer==0.7.0 itsdangerous==2.1.2 jaraco.classes==3.3.0 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 jsonschema==4.20.0 jsonschema-specifications==2023.11.2 keyring==24.3.0 kiwisolver==1.4.5 llama-index==0.9.3 llama_cpp_python==0.2.29 markdown-it-py==3.0.0 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.2 mdurl==0.1.2 more-itertools==10.2.0 mpmath==1.3.0 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 mypy==1.7.1 mypy-extensions==1.0.0 nest-asyncio==1.5.8 networkx==3.2.1 nltk==3.8.1 nodeenv==1.8.0 numpy==1.26.3 onnx==1.15.0 onnxruntime==1.16.3 openai==1.5.0 optimum==1.16.1 orjson==3.9.10 packaging==23.2 pandas==2.1.4 pathspec==0.12.1 pexpect==4.9.0 Pillow==10.1.0 pkginfo==1.9.6 platformdirs==4.1.0 pluggy==1.3.0 poetry==1.7.1 poetry-core==1.8.1 poetry-plugin-export==1.6.0 portalocker==2.8.2 pre-commit==2.21.0 -e git+https://github.com/imartinez/privateGPT@d3acd85fe34030f8cfd7daf50b30c534087bdf2b#egg=private_gpt protobuf==4.25.1 psutil==5.9.6 ptyprocess==0.7.0 pyarrow==14.0.1 pydantic==2.5.2 pydantic-extra-types==2.2.0 pydantic-settings==2.1.0 pydantic_core==2.14.5 pydub==0.25.1 Pygments==2.17.2 pyparsing==3.1.1 pypdf==3.17.2 pyproject_hooks==1.0.0 pyreadline3==3.4.1 pytest==7.4.3 pytest-asyncio==0.21.1 pytest-cov==3.0.0 python-dateutil==2.8.2 python-dotenv==1.0.0 python-multipart==0.0.6 pytz==2023.3.post1 pywin32==306 pywin32-ctypes==0.2.2 PyYAML==6.0.1 qdrant-client==1.7.0 rapidfuzz==3.6.1 referencing==0.32.0 regex==2023.10.3 requests==2.31.0 requests-toolbelt==1.0.0 responses==0.18.0 rich==13.7.0 rpds-py==0.14.1 ruff==0.1.8 s3transfer==0.9.0 safetensors==0.4.1 scikit-learn==1.3.2 scipy==1.11.4 semantic-version==2.10.0 sentence-transformers==2.2.2 sentencepiece==0.1.99 shellingham==1.5.4 six==1.16.0 sniffio==1.3.0 soupsieve==2.5 SQLAlchemy==2.0.23 starlette==0.27.0 sympy==1.12 tenacity==8.2.3 threadpoolctl==3.2.0 tiktoken==0.5.2 tokenizers==0.15.0 tomlkit==0.12.0 toolz==0.12.0 torch==2.1.2+cu121 torchaudio==2.1.2+cu121 torchvision==0.16.2+cu121 tqdm==4.66.1 transformers==4.36.1 trove-classifiers==2024.1.8 typer==0.9.0 types-PyYAML==6.0.12.12 typing-inspect==0.9.0 typing_extensions==4.9.0 tzdata==2023.3 ujson==5.9.0 urllib3==1.26.18 uvicorn==0.24.0.post1 virtualenv==20.25.0 watchdog==3.0.0 watchfiles==0.21.0 websockets==11.0.3 wrapt==1.16.0 xxhash==3.4.1 yarl==1.9.4
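For anyone comparing environments, a short script like this (not part of the original report, just a convenience) prints the handful of versions that seem to matter here:

```python
# Collect the version info relevant to this issue: llama-cpp-python, llama-index,
# torch and its CUDA build, and whether a GPU is visible at all.
import llama_cpp
import llama_index
import torch

print("llama-cpp-python:", llama_cpp.__version__)
print("llama-index:     ", llama_index.__version__)
print("torch:           ", torch.__version__, "(CUDA", torch.version.cuda, ")")
print("GPU available:   ", torch.cuda.is_available())
```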