[Open] swvajanyatek opened this issue 1 year ago
If you run
$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
does it fail?
What model are you trying to load? Also, is CUDA on your PATH?
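A quick way to check the CUDA side (assuming the toolkit was installed under the usual /usr/local/cuda prefix) is:
which nvcc && nvcc --version              # toolkit compiler visible and its version
nvidia-smi                                # driver actually sees the GPU
echo $PATH | tr ':' '\n' | grep -i cuda   # CUDA bin directory really on PATH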
myuser@mymachine:/mnt/c/dev/git/github/privateGPT$ export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
myuser@mymachine:/mnt/c/dev/git/github/privateGPT$ poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
Collecting llama-cpp-python
Downloading llama_cpp_python-0.2.13.tar.gz (7.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 25.8 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
Downloading typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
Downloading numpy-1.26.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 202.5 MB/s eta 0:00:00
Collecting diskcache>=5.6.1 (from llama-cpp-python)
Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 178.4 MB/s eta 0:00:00
Downloading numpy-1.26.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 46.4 MB/s eta 0:00:00
Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... done
Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.13-cp311-cp311-manylinux_2_35_x86_64.whl size=4096130 sha256=d4d3be49e622524654d9c3fdc9fc429c97596934db8c05d36882742f532dea33
Stored in directory: /tmp/pip-ephem-wheel-cache-8ywzpy4i/wheels/bb/fc/2d/b62eb092d886ada0b78d62c7d84ade2b1b688f9613584bc93b
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, numpy, diskcache, llama-cpp-python
Attempting uninstall: typing-extensions
Found existing installation: typing_extensions 4.8.0
Uninstalling typing_extensions-4.8.0:
Successfully uninstalled typing_extensions-4.8.0
Attempting uninstall: numpy
Found existing installation: numpy 1.26.1
Uninstalling numpy-1.26.1:
Successfully uninstalled numpy-1.26.1
Attempting uninstall: diskcache
Found existing installation: diskcache 5.6.3
Uninstalling diskcache-5.6.3:
Successfully uninstalled diskcache-5.6.3
Attempting uninstall: llama-cpp-python
Found existing installation: llama_cpp_python 0.2.11
Uninstalling llama_cpp_python-0.2.11:
Successfully uninstalled llama_cpp_python-0.2.11
Successfully installed diskcache-5.6.3 llama-cpp-python-0.2.13 numpy-1.26.1 typing-extensions-4.8.0
myuser@mymachine:/mnt/c/dev/git/github/privateGPT$
myuser@mymachine:/mnt/c/dev/git/github/privateGPT$ PGPT_PROFILES=local poetry run python -m private_gpt
17:09:08.736 [INFO ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'local']
CUDA error 100 at /tmp/pip-install-pr8zzwn4/llama-cpp-python_8a4cf88dbf754a3eb9cea7b61f302bed/vendor/llama.cpp/ggml-cuda.cu:5823: no CUDA-capable device is detected
current device: 0
It's an old laptop, maybe 7 years old. Is it just too underpowered to run this?
I had the same issue with nvcc release 11.5 and CUDA version 12.3.
The following steps helped:
apt-get purge nvidia-cuda-toolkit
Then add /usr/local/cuda-12.3/bin to the PATH environment variable:
echo 'export PATH="/usr/local/cuda-12.3/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
CMAKE_ARGS='-DLLAMA_CUBLAS=on' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
PGPT_PROFILES=local make run
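If it's unclear whether the rebuilt wheel actually picked up cuBLAS, a verbose reinstall prints the CMake output so you can watch the CUDA detection happen (a sketch only; FORCE_CMAKE=1 may be redundant on recent llama-cpp-python versions):
export PATH="/usr/local/cuda-12.3/bin:$PATH"   # same PATH change as above, applied to the current shell
CMAKE_ARGS='-DLLAMA_CUBLAS=on' FORCE_CMAKE=1 poetry run pip install --force-reinstall --no-cache-dir --verbose llama-cpp-python
# on a successful CUDA build the app logs "ggml_init_cublas: found 1 CUDA devices" at startup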
@AHPyXA - awesome sauce, that did the trick, thank you!
I am now able to launch the app, but I'm seeing some errors at the very end of the startup, though the UI seems to work:
myuser@mymachine:/mnt/c/dev/git/github/privateGPT$ PGPT_PROFILES=local make run
poetry run python -m private_gpt
10:34:26.703 [INFO ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'local']
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: Quadro M1000M, compute capability 5.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /mnt/c/dev/git/github/privateGPT/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
...
llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
...
llama_model_loader: - kv 19: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
...
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 70.42 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4095.05 MB
...............................................................................................
llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 487.50 MB
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 282.00 MB
llama_new_context_with_model: VRAM scratch buffer: 275.37 MB
llama_new_context_with_model: total VRAM used: 4857.93 MB (model: 4095.05 MB, context: 762.87 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
10:38:33.366 [INFO ] chromadb.telemetry.product.posthog - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
10:38:48.559 [INFO ] uvicorn.error - Started server process [3783]
10:38:48.559 [INFO ] uvicorn.error - Waiting for application startup.
10:38:48.560 [INFO ] uvicorn.error - Application startup complete.
10:38:48.560 [INFO ] uvicorn.error - Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
It looks like I can upload a pdf file, but when I ask a question, it crashes:
...
11:33:15.645 [INFO ] uvicorn.access - 127.0.0.1:40524 - "GET /assets/logo-0a070fcf.svg HTTP/1.1" 200
11:33:14.488 [INFO ] uvicorn.access - 127.0.0.1:40476 - "GET / HTTP/1.1" 200
11:40:48.762 [INFO ] uvicorn.access - 127.0.0.1:39886 - "POST /upload HTTP/1.1" 200
11:40:49.254 [INFO ] uvicorn.error - ('127.0.0.1', 39906) - "WebSocket /queue/join" [accepted]
11:40:49.255 [INFO ] uvicorn.error - connection open
Parsing documents into nodes: 100%|██████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 165.11it/s]
Generating embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:26<00:00, 1.52it/s]
11:41:17.413 [INFO ] uvicorn.error - connection closed
11:43:43.561 [INFO ] uvicorn.access - 127.0.0.1:47306 - "POST /run/predict HTTP/1.1" 200
11:43:43.566 [INFO ] uvicorn.access - 127.0.0.1:47322 - "POST /run/predict HTTP/1.1" 200
11:43:43.585 [INFO ] uvicorn.access - 127.0.0.1:47306 - "POST /run/predict HTTP/1.1" 200
11:43:43.941 [INFO ] uvicorn.error - ('127.0.0.1', 47344) - "WebSocket /queue/join" [accepted]
11:43:43.942 [INFO ] uvicorn.error - connection open
CUDA error 209 at /tmp/pip-install-xr5k_0nd/llama-cpp-python_ca66f5e6557a4e9f8ac49a1fb528206d/vendor/llama.cpp/ggml-cuda.cu:6768: no kernel image is available for execution on the device
current device: 0
make: *** [Makefile:36: run] Error 1
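Error 209 ("no kernel image is available for execution on the device") generally means the compiled CUDA kernels don't cover the GPU's compute capability; the Quadro M1000M above reports compute capability 5.0. A speculative rebuild targeting that architecture, assuming llama.cpp's build respects the standard CMAKE_CUDA_ARCHITECTURES variable, would be:
CMAKE_ARGS='-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=50' poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
# 50 corresponds to compute capability 5.0 (Maxwell); adjust to your own GPU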
Stale issue
I was having a similar issue with the same default config, just running the commands. I followed @AHPyXA's steps and got mine working.
Perhaps @swvajanyatek should update their repository and give it another swing.
And yes, I'm able to run queries and it works fine, with no errors after following the steps.
Stale issue
I followed the directions for the "Linux NVIDIA GPU support and Windows-WSL" section, and below is what my WSL now shows, but I'm still getting "no CUDA-capable device is detected". What am I missing?
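A couple of things worth checking from inside WSL before rebuilding again (assuming a WSL2 setup where the NVIDIA driver is installed on the Windows side and only the CUDA toolkit lives inside the distro):
nvidia-smi                        # must work inside WSL; if it fails, the problem is the driver / WSL GPU passthrough, not llama.cpp
ls /usr/lib/wsl/lib/libcuda.so*   # WSL normally mounts the GPU driver libraries here
which nvcc && nvcc --version      # should point at the toolkit you added to PATH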