zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://docs.privategpt.dev
Apache License 2.0

Llama-CPP NVIDIA GPU support problem #1937

Open VILLO88 opened 1 month ago

VILLO88 commented 1 month ago

Hello, I'm trying to add GPU support to my private-gpt setup to speed it up. Everything seems to work (details below), but when I ask a question about an attached document the program crashes with the errors below:

13:28:31.657 [INFO    ]             uvicorn.error - Waiting for application startup.
13:28:31.657 [INFO    ]             uvicorn.error - Application startup complete.
13:28:31.658 [INFO    ]             uvicorn.error - Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
13:28:38.580 [INFO    ]            uvicorn.access - 127.0.0.1:40540 - "GET / HTTP/1.1" 200
13:28:38.744 [INFO    ]            uvicorn.access - 127.0.0.1:40540 - "GET /info HTTP/1.1" 200
13:28:38.757 [INFO    ]            uvicorn.access - 127.0.0.1:40540 - "GET /theme.css HTTP/1.1" 200
13:28:38.976 [INFO    ]            uvicorn.access - 127.0.0.1:40546 - "POST /run/predict HTTP/1.1" 200
13:28:42.643 [INFO    ]            uvicorn.access - 127.0.0.1:40546 - "POST /run/predict HTTP/1.1" 200
13:28:42.647 [INFO    ]            uvicorn.access - 127.0.0.1:40540 - "POST /run/predict HTTP/1.1" 200
13:28:42.657 [INFO    ]            uvicorn.access - 127.0.0.1:40546 - "POST /run/predict HTTP/1.1" 200
13:28:42.668 [INFO    ]            uvicorn.access - 127.0.0.1:40546 - "POST /queue/join HTTP/1.1" 200
13:28:42.675 [INFO    ]            uvicorn.access - 127.0.0.1:40546 - "GET /queue/data?session_hash=1n227fvma74 HTTP/1.1" 200
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at /tmp/pip-install-ddhqme6y/llama-cpp-python_faf2d72bbfd246b8b4278a72b85fcccd/vendor/llama.cpp/ggml-cuda.cu:2304
  err
GGML_ASSERT: /tmp/pip-install-ddhqme6y/llama-cpp-python_faf2d72bbfd246b8b4278a72b85fcccd/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
make: *** [Makefile:36: run] Aborted

My setup:

- OS: Debian 12
- Python 3.11.2
- CUDA compilation tools, release 12.4, V12.4.131
- GPU: NVIDIA Quadro K2200, 4 GB
- NVIDIA driver (latest): 550.78

At server startup the reported build flags show BLAS = 1:

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
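As a sanity check that the installed llama-cpp-python build really has the CUDA backend compiled in, I believe something like this should work (a sketch, assuming a recent llama-cpp-python that exposes `llama_supports_gpu_offload`):

```python
# Reports whether this llama-cpp-python build can offload layers to the GPU.
# llama_supports_gpu_offload is assumed to be available in recent versions.
from llama_cpp import llama_supports_gpu_offload

print(llama_supports_gpu_offload())  # expected: True for a CUDA-enabled build
```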

The only thing I've changed to make it run is `n_gpu_layers`:

```python
settings_kwargs = {
    "tfs_z": settings.llamacpp.tfs_z,  # ollama and llama-cpp
    "top_k": settings.llamacpp.top_k,  # ollama and llama-cpp
    "top_p": settings.llamacpp.top_p,  # ollama and llama-cpp
    "repeat_penalty": settings.llamacpp.repeat_penalty,  # ollama and llama-cpp
    "n_gpu_layers": 15,    # number of transformer layers offloaded to the GPU
    "offload_kqv": True,   # keep the KV cache on the GPU as well
}
```
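As I understand it, these kwargs are eventually forwarded to the `llama_cpp.Llama` constructor, so the change amounts to roughly the following (the model path here is just a placeholder):

```python
from llama_cpp import Llama

# Rough equivalent of the edited settings: offload 15 transformer layers and
# the KV cache to the GPU. The model path is a placeholder, not my real path.
llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_gpu_layers=15,
    offload_kqv=True,
)
```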

Recap

In summary, private-gpt starts correctly and the GPU is recognized, but when I try to ask a question it crashes with the error above. Is this a fixable bug, or is it related to the fact that my card has "only" 4 GB?

If the cause is my card, can you suggest any cards that would definitely work? Ideally something cheap. Thanks.
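(For a rough sense of scale, and assuming the default Mistral 7B model at roughly 4 GB of Q4 weights spread over 32 layers: offloading 15 layers should put on the order of 15/32 × 4 GB ≈ 1.9 GB of weights on the card, plus the KV cache since offload_kqv is enabled, on top of the ~400 MiB the desktop is already using.)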

alxspiker commented 1 month ago

Hi @VILLO88,

To resolve the NVIDIA GPU support issue with Llama-CPP and address the errors you are seeing, follow these steps:

1. **Verify CUDA Installation**:
   Ensure that CUDA is installed and properly configured. Check the version with:
   ```shell
   nvcc --version
   ```

2. **Check NVIDIA Driver**:
   Make sure your NVIDIA driver is up to date. Check the driver version with:
   ```shell
   nvidia-smi
   ```

3. **Ensure Compatibility**:
   Confirm that the installed CUDA version is compatible with your NVIDIA driver and the libraries you are using (a quick compute-capability check is sketched after these steps).

4. **Install CUDA Toolkit**:
   Follow the instructions from the [NVIDIA CUDA Toolkit website](https://developer.nvidia.com/cuda-downloads) to install CUDA.

5. **Install NVIDIA cuDNN**:
   Follow the instructions from the [NVIDIA cuDNN website](https://developer.nvidia.com/cudnn) to install cuDNN.

6. **Verify GPU Availability in Python**:
   Use this script to check if the GPU is available:
   ```python
   import torch

   if torch.cuda.is_available():
       print(f"CUDA is available. Device count: {torch.cuda.device_count()}")
       for i in range(torch.cuda.device_count()):
           print(f"Device {i}: {torch.cuda.get_device_name(i)}")
   else:
       print("CUDA is not available.")
   ```

7. **Run a Simple Llama-CPP Model on GPU**:
   Create a script to test running a Llama-CPP model on the GPU (the model path is a placeholder):
   ```python
   from llama_cpp import Llama

   # Load a local GGUF model and offload all layers to the GPU (-1 = all layers).
   llm = Llama(model_path="path/to/your_model.gguf", n_gpu_layers=-1)

   # Run a short completion to confirm the GPU is actually used.
   output = llm("Translate English to French: 'Hello, how are you?'", max_tokens=64)
   print(output["choices"][0]["text"])
   ```
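For step 3, a quick way to see which compute capability the card reports (reusing PyTorch from step 6; the CUDA kernels in the llama.cpp build must target that architecture) is a sketch like this:

```python
import torch

# Print each visible CUDA device and the compute capability it reports; the
# llama.cpp CUDA build must include kernels for this architecture.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"Device {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
```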

Following these steps should help resolve the GPU support issue. If you continue to face problems, please provide the exact error messages and additional details about your setup for further assistance.

Best regards, alxspiker

VILLO88 commented 1 month ago


Thanks for the quick answer. In reference to what you told me to try:

1) `nvcc --version`:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
```

2) `nvidia-smi`:

```
Thu May 23 17:19:27 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro K2200                   Off |   00000000:02:00.0  On |                  N/A |
| 42%   38C    P0              1W /   39W |     416MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1380      G   /usr/lib/xorg/Xorg                             80MiB |
|    0   N/A  N/A      1544      G   /usr/bin/gnome-shell                          103MiB |
|    0   N/A  N/A      2511      G   /usr/lib/firefox-esr/firefox-esr              173MiB |
|    0   N/A  N/A      8365      G   /usr/bin/nautilus                              50MiB |
|    0   N/A  N/A     11352      G   /usr/bin/nvidia-settings                        0MiB |
+-----------------------------------------------------------------------------------------+
```

3) The NVIDIA driver is up to date and compatible with the CUDA version.

4) and 5) Already satisfied.

6) Running the script: CUDA is available. Device count: 1, Device 0: Quadro K2200.

7) I had some problems running the script you gave me (I'm not a programmer), but I made another small script that runs llama-cpp using the Mistral model from private-gpt, and it does seem to use the GPU (normally the GPU load is 0%). Screenshot attached: Screenshot from 2024-05-23 17-01-23.

But when I try to query files in the private-gpt session, I get the same error:

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at /tmp/pip-install-ddhqme6y/llama-cpp-python_faf2d72bbfd246b8b4278a72b85fcccd/vendor/llama.cpp/ggml-cuda.cu:2304
  err
GGML_ASSERT: /tmp/pip-install-ddhqme6y/llama-cpp-python_faf2d72bbfd246b8b4278a72b85fcccd/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
make: *** [Makefile:36: run] Aborted

Any ideas? Thank you.

VILLO88 commented 1 month ago

Screenshot attached: Screenshot from 2024-05-23 17-01-23