Is privateGPT based on CPU or GPU? Why in my case it's unbelievably slow?

Does it have something to do with tensorflow? And it's weird that from the following console messages,

It took PrivateGPT 51 seconds to answer 1 single question ?????
Unable to register cuDNN/cuFFT/cuBLAS factory
This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Does that mean, I'm NOT using tensorflow-gpu? But ONLY tensorflow-CPU ???

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-03 15:30:51.990327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-03 15:30:51.990368: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-03 15:30:51.990374: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-03 15:30:51.995080: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
Found model file at  ./models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: How are you man?
 I'm doing well, thank you for asking!

> Question:
How are you man?

> Answer (took 51.14 s.):
 I'm doing well, thank you for asking!

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

Enter a query:

@Komal-99 oh I see now you did specify the version sorry. I would reach out on the llama-cpp-python repo to get help with that then, definitely outside the scope of this repo.

Do let me if their anything you get to resolve this

https://github.com/PromtEngineer/localGPT trying this now as it seems to be pre-built for GPU use.

@bioshazard I got you... Thank you...

with modeltype, I got the following ERRORs:

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 09:52:37.920830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 09:52:37.920871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 09:52:37.920880: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 09:52:37.926065: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from ./models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  =  454.09 MB (+  400.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11001 MB
llama_new_context_with_model: kv self size  =  400.00 MB

Enter a query: Hi, how are you?
llama_tokenize_with_model: too many tokens
Traceback (most recent call last):
  File "....../privateGPT.py", line 97, in <module>
    main()
  File "....../privateGPT.py", line 68, in main
    res = qa(query)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 133, in _call
    answer = self.combine_documents_chain.run(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 456, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 106, in _call
    output, extra_return_dict = self.combine_docs(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 165, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 252, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 102, in generate
    return self.llm.generate_prompt(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 451, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 582, in generate
    output = self._generate_helper(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 488, in _generate_helper
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 475, in _generate_helper
    self._generate(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 961, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 238, in _call
    for chunk in self._stream(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 288, in _stream
    for part in result:
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 855, in _create_completion
    raise ValueError(
ValueError: Requested tokens (558) exceed context window of 512
Exception ignored in: <function Llama.__del__ at 0x7fad80434e50>
Traceback (most recent call last):
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1508, in __del__
TypeError: 'NoneType' object is not callable

hi, i noticed that your CUDA detected two GPUs. does it mean that your privateGPT uses both of them for inference? if so, do you mind showing how you did that? thanks in advance.

Apologies but it seemed to just do it by itself. I have no idea how. It seems to use the internal graphics card by default and then fall back to the nvidia, which is annoying, because the internal only has like 200MB graphics memory as opposed to the nvidia which has 4 GB. It's something I am still trying to figure out. Also, I am trying to use something other than LLama because that seems to be censored. Pretty much every topic I give it under my research areas it goes "Sorry I can't answer that because I am censored" or words to that effect.

Re: "two GPUS", I see this in the quoted output:

ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device

OK but that's strange because it seems to want to use 200MB ... and the Nvidia has 4GB

llama_model_load_internal: total VRAM used: 11001 MB

Also the error at the end shows you have the context set to 512, try override to 4096 see my earlier comments about that too.

Hey. Sorry to be a pain. I get the following error output. No idea what it means.

john@john-GF63-Thin-11SC:~/ai/git/llama-cpp-python$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir Defaulting to user installation because normal site-packages is not writeable Collecting llama-cpp-python Downloading llama_cpp_python-0.2.6.tar.gz (1.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 3.3 MB/s eta 0:00:00 Installing build dependencies ... done Getting requirements to build wheel ... done Installing backend dependencies ... done Preparing metadata (pyproject.toml) ... done Collecting typing-extensions>=4.5.0 (from llama-cpp-python) Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata Downloading typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB) Collecting numpy>=1.20.0 (from llama-cpp-python) Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/71/3c/3b1981c6a1986adc9ee7db760c0c34ea5b14ac3da9ecfcf1ea2a4ec6c398/numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB) Collecting diskcache>=5.6.1 (from llama-cpp-python) Obtaining dependency information for diskcache>=5.6.1 from https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl.metadata Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB) Downloading diskcache-5.6.3-py3-none-any.whl (45 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.6 MB/s eta 0:00:00 Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 9.7 MB/s eta 0:00:00 Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB) Building wheels for collected packages: llama-cpp-python Building wheel for llama-cpp-python (pyproject.toml) ... error error: subprocess-exited-with-error

× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully. │ exit code: 1 ╰─> [43 lines of output] scikit-build-core 0.5.0 using CMake 3.27.4 (wheel) Configuring CMake... loading initial cache file /tmp/tmp5uzeaf90/build/CMakeInit.txt -- The C compiler identification is GNU 11.4.0 -- The CXX compiler identification is GNU 11.4.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Found Git: /usr/bin/git (found version "2.34.1") fatal: not a git repository (or any of the parent directories): .git fatal: not a git repository (or any of the parent directories): .git CMake Warning at vendor/llama.cpp/CMakeLists.txt:125 (message): Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.
  **-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
  -- Found Threads: TRUE
  -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
  -- cuBLAS found
  -- The CUDA compiler identification is unknown
  CMake Error at /tmp/pip-build-env-7gf9iivy/normal/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
    Failed to detect a default CUDA architecture.

    Compiler output:

  Call Stack (most recent call first):
    vendor/llama.cpp/CMakeLists.txt:286 (enable_language)

  -- Configuring incomplete, errors occurred!

  *** CMake configuration failed
  [end of output]
note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for llama-cpp-python Failed to build llama-cpp-python ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects**

Try this

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python

zylon-ai / private-gpt

Is privateGPT based on CPU or GPU? Why in my case it's unbelievably slow? #931