zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://docs.privategpt.dev
Apache License 2.0

NVIDIA GPU and numpy #1979

Open · MarioRossiGithub opened this issue 1 week ago

MarioRossiGithub commented 1 week ago

Hi, I'm trying to set up Private GPT on Windows WSL. I followed the instructions here and here, but I'm not able to run PGPT correctly. If I follow these instructions:

poetry install --extras "ui llms-llama-cpp embeddings-huggingface vector-stores-qdrant"

I'm able to run PGPT with numpy 1.26.4, but with BLAS=0 (CPU only).

If I run this instead:

CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python

I get BLAS=1 (GPU), but it automatically upgrades numpy to a 2.x version, and then PGPT doesn't work, failing with an error like "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 as it may crash".

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

packagex requires numpy x.y.z but you have numpy 2.0.0 which is incompatible.
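For reference, the versions the environment actually resolved can be checked like this (a minimal sanity check, assuming the default Poetry virtualenv; llama_print_system_info is llama-cpp-python's low-level binding to llama.cpp's system info, which reports BLAS = 1 when a GPU backend was compiled in):

poetry run python -c "import numpy; print(numpy.__version__)"
poetry run python -c "import llama_cpp; print(llama_cpp.llama_print_system_info())"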

Is there a way I can downgrade numpy AND use GPU (BLAS=1)?

MarioRossiGithub commented 1 week ago

After several hours of troubleshooting I finally managed to solve the issue.

Install

First of all, you have to install llama-cpp-python while forcing a numpy version below 2:

CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python numpy==1.26.0
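A possible way to keep the pin from being undone by a later install (an untested sketch, not part of the original fix) is to record the constraint in pyproject.toml as well:

poetry add "numpy<2"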

Ensure to:

  • Update your Windows drivers to the latest version (I'm not really sure if this helped solve the issue, but I did it anyway).
  • Reboot your system.

Run Private GPT:

PGPT_PROFILES=local make run
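To double-check that the GPU is really being used, it can help to watch VRAM while PGPT loads the model (nvidia-smi works inside WSL with recent NVIDIA drivers; this check is an addition, not part of the original steps):

watch -n 1 nvidia-smi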


If this solves your problem, good, you're done.


If you instead stumble upon another error mentioning "CUDA error: out of memory" and "TOKENIZERS_PARALLELISM=(true | false)", make sure to set that variable to true and export it, so that make actually sees it:

export TOKENIZERS_PARALLELISM=true

Then rerun Private GPT as always:

PGPT_PROFILES=local make run
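The variable can also be set inline for a single run (same effect, just scoped to that one command):

TOKENIZERS_PARALLELISM=true PGPT_PROFILES=local make run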

This solved the issue for me. Now Private GPT uses my NVIDIA GPU, is super fast and replies in 2-3 seconds.

I also think the first command in the official documentation should be updated accordingly.



On a side note: I get this warning at the end of the run that I don't quite understand and haven't been able to resolve. If someone has a suggestion, thanks in advance.

py.warnings - /home/<user>/.cache/pypoetry/virtualenvs/private-gpt-ta_62_V8-py3.11/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
theodufort commented 1 week ago

[quotes @MarioRossiGithub's solution above in full]

Hey, thank you for that numpy part! Like you, I am also having a GPU memory problem: it seems that 7 of 8 GB fill up as soon as I start the UI, and then sometimes, when the file is too big, I see 8 GB in NVTOP and get a memory error:

CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.92 GiB of which 4.62 MiB is free. Including non-PyTorch memory, this process has 7.91 GiB memory in use. Of the allocated memory 2.99 GiB is allocated by PyTorch, and 41.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
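Following the hint at the end of that message, one thing worth trying (a sketch; the 128 MiB split size is an arbitrary starting value, not something verified in this thread) is to cap PyTorch's allocator block size before starting PGPT:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
PGPT_PROFILES=local make run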