jiapei100 commented 11 months ago

Does this have something to do with TensorFlow? The following console messages look odd to me:

It took PrivateGPT 51 seconds to answer a single question.
Unable to register cuDNN/cuFFT/cuBLAS factory
This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

Does that mean I'm NOT using tensorflow-gpu, but ONLY the CPU build of TensorFlow?

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-03 15:30:51.990327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-03 15:30:51.990368: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-03 15:30:51.990374: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-03 15:30:51.995080: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
Found model file at  ./models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285

Enter a query: How are you man?
 I'm doing well, thank you for asking!

> Question:
How are you man?

> Answer (took 51.14 s.):
 I'm doing well, thank you for asking!

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
For more than two years, COVID-19 has impacted every decision in our lives and the life of the nation. 

And I know you’re tired, frustrated, and exhausted. 

But I also know this. 

Because of the progress we’ve made, because of your resilience and the tools we have, tonight I can say  
we are moving forward safely, back to more normal routines.  

We’ve reached a new moment in the fight against COVID-19, with severe cases down to a level not seen since last July.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

Enter a query:

bioshazard commented 11 months ago

You need to install llama-cpp-python with GPU support

https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

then add n_gpu_layers=X to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36

eg,

            llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)

I am surprised there is no env var in the Python script to set the GPU layer count dynamically, but these were the steps I took to get my GPU used. YMMV on how many layers you can get away with offloading, but I offload the full 43 of Llama 2 Hermes 13B because I have a 3090 with 24 GB of VRAM. Here is my output with all the above applied:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from REDACTED/models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2136.07 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 12209 MB
llama_new_context_with_model: kv self size  =  400.00 MB

Enter a query:
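
On the missing env var: a minimal sketch of how the layer count could be read from the environment instead of being hard-coded. `MODEL_N_GPU_LAYERS` is a made-up variable name (nothing in this thread shows privateGPT defining it), and the helpers only build keyword arguments; handing them to `LlamaCpp(...)` is left as a comment so the snippet runs without the library installed.

```python
import os


def gpu_layers_from_env(default: int = 0, env=os.environ) -> int:
    """Read the hypothetical MODEL_N_GPU_LAYERS variable; fall back to `default`."""
    raw = env.get("MODEL_N_GPU_LAYERS", "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        # Not a number: ignore it rather than crash at start-up.
        return default


def llamacpp_kwargs(model_path: str, n_ctx: int, n_batch: int, env=os.environ) -> dict:
    """Collect the keyword arguments used in the LlamaCpp line quoted above."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,
        "n_batch": n_batch,
        "n_gpu_layers": gpu_layers_from_env(env=env),
    }


# In privateGPT.py this could then become something like:
# llm = LlamaCpp(**llamacpp_kwargs(model_path, 4096, model_n_batch),
#                callbacks=callbacks, verbose=False)
```

With that in place, `MODEL_N_GPU_LAYERS=43 python privateGPT.py` would offload 43 layers, and leaving the variable unset would keep everything on the CPU.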

johndev8964 commented 11 months ago

So, how much did the speed improve after enabling the GPU? @bioshazard Can you show me the query result?

jiapei100 commented 11 months ago

@bioshazard I got you... Thank you...

with modeltype, I got the following ERRORs:

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 09:52:37.920830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 09:52:37.920871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 09:52:37.920880: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 09:52:37.926065: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from ./models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  =  454.09 MB (+  400.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11001 MB
llama_new_context_with_model: kv self size  =  400.00 MB

Enter a query: Hi, how are you?
llama_tokenize_with_model: too many tokens
Traceback (most recent call last):
  File "....../privateGPT.py", line 97, in <module>
    main()
  File "....../privateGPT.py", line 68, in main
    res = qa(query)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 133, in _call
    answer = self.combine_documents_chain.run(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 456, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 106, in _call
    output, extra_return_dict = self.combine_docs(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 165, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 252, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 102, in generate
    return self.llm.generate_prompt(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 451, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 582, in generate
    output = self._generate_helper(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 488, in _generate_helper
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 475, in _generate_helper
    self._generate(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 961, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 238, in _call
    for chunk in self._stream(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 288, in _stream
    for part in result:
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 855, in _create_completion
    raise ValueError(
ValueError: Requested tokens (558) exceed context window of 512
Exception ignored in: <function Llama.__del__ at 0x7fad80434e50>
Traceback (most recent call last):
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1508, in __del__
TypeError: 'NoneType' object is not callable

jiapei100 commented 11 months ago

@bioshazard

By the way, do you mean that ONLY llama-cpp-python has GPU support, while GPT4All does NOT??

Unbelievable....

bioshazard commented 11 months ago

Couple things:

GPT4All is, I think, CPU only. At the top of their repo (https://github.com/nomic-ai/gpt4all) they say "Open-source assistant-style large language models that run locally on your CPU", which is great for enabling literally anyone to get in on it, but not for GPU people. I could be wrong though; maybe there is some GPU support.
If you do use a GPU, you can use ggml models with llama-cpp-python in the way I described.

Also, if you are running into TensorFlow issues, or really any Python issues... imo start with a fresh venv (https://docs.python.org/3/library/venv.html):

# Init
cd privateGPT/
python3 -m venv venv
source venv/bin/activate
# ... this is for if you have CUDA hardware, look up llama-cpp-python readme for the many ways to compile
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt

# Run (notice `python` not `python3` now, venv introduces a new `python` command to PATH from `venv/bin`)
python privateGPT.py

# Exit venv when you are done
deactivate

# Re-activate as needed
cd privateGPT/
source venv/bin/activate
python privateGPT.py

Sorry if I created any confusion; hopefully the above is useful, at least for people on Linux. Let me know if this works or fails. Seriously though, if you have any Python issues, imo it's always best to start fresh rather than try to fix anything. venv ftw!
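
A quick way to confirm that the `python` you are running really is the one from the venv: inside a virtual environment `sys.prefix` differs from `sys.base_prefix`. A small sketch (the function name is just for illustration):

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this reports no active venv after `source venv/bin/activate`, the packages you just installed are probably not the ones being imported.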

bioshazard commented 11 months ago

@jiapei100, it looks like you have n_ctx set to 512, which is way too small a context. Try n_ctx=4096 in the LlamaCpp initialization step for that specific model, and set max_tokens to something like 512. Here is my line under model_type in privateGPT.py. I think I set my batch to 512 for that Hermes model, but YMMV.

llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
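
Why that helps, roughly: the prompt (the question plus the retrieved source chunks) and the generated answer have to share one context window. A back-of-the-envelope sketch of that budget, using the 558 prompt tokens from the error above. The helper names are made up for illustration, and llama-cpp-python's own internal check may differ in detail:

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, n_ctx: int) -> bool:
    """Prompt and generated tokens must fit in the context window together."""
    return prompt_tokens + max_new_tokens <= n_ctx


def room_for_answer(prompt_tokens: int, n_ctx: int) -> int:
    """Tokens left for the answer once the prompt is in the window."""
    return max(n_ctx - prompt_tokens, 0)


print(fits_context(558, 512, 512))    # False: the prompt alone is over 512
print(fits_context(558, 512, 4096))   # True
print(room_for_answer(558, 4096))     # 3538
```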

bioshazard commented 11 months ago

@johndev8964 About 2.6 s after the chroma db warms up! Again, though, this is with nous-hermes-llama2-13b.ggmlv3.q6_K.bin, so YMMV based on the model/GPU you choose.

> Question:
what is capital

> Answer (took 2.64 s.):
 In economics, capital refers to any man-made resource used in production or investment to create further goods or services. It can include physical assets like machinery or buildings as well as financial assets such as stocks and bonds. In the context of this passage, it appears that the author is specifically discussing "capital goods," which are durable items used in production processes, such as machines, tools, and equipment.

> source_documents/Man_Economy_and_State_with_Power_and_Market_Rothbard.epub:
There is another consideration that reinforces our conclusion. Professor Lachmann has been diligently reminding us of what economists generally forget: that “capital” is not just a homogeneous blob that can be added to or subtracted from. Capital is an intricate, delicate, interweaving structure of capital goods. All of the delicate strands of this structure have to fit, and fit precisely, or else malinvestment occurs. The free market is almost an automatic mechanism for such fitting; and we

jiapei100 commented 11 months ago

@bioshazard

Thank you soooooo much... I got it... Yeah, privateGPT.py doesn't set that particular parameter, so I modified line 50 as follows:

case "LlamaCpp":
            llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)

The speed is like 3 times faster...

➜  privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 12:32:16.404648: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 12:32:16.404688: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 12:32:16.404702: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 12:32:16.409845: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /home/lvision/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/lvision/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from /opt/AIModels/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  =  753.09 MB (+ 3200.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 640 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 14081 MB
llama_new_context_with_model: kv self size  = 3200.00 MB

Enter a query: Hi, how are you today?
 I am doing well, thank you for asking! How about you?

> Question:
Hi, how are you today?

> Answer (took 14.75 s.):
 I am doing well, thank you for asking! How about you?

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.

Enter a query:

Thank you so much...
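
Side note on the two logs: the jump in `kv self size` (400 MB to 3200 MB) and in the scratch buffer (360 MB to 640 MB) follows from raising n_ctx from 512 to 4096. A quick sketch of the arithmetic, assuming 2 bytes per cached value and a batch size of 512 (both are assumptions; they happen to reproduce the logged figures). The scratch formula is copied from the log line itself, and the function names are only for illustration.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_layer: int, n_ctx: int, n_embd: int, bytes_per_value: int = 2) -> float:
    # Keys and values each hold n_layer * n_ctx * n_embd entries.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value / MIB


def scratch_mib(n_ctx: int, batch_size: int = 512) -> float:
    # From the log: batch_size x (640 kB + n_ctx x 160 B)
    return batch_size * (640 * 1024 + n_ctx * 160) / MIB


# n_layer = 40 and n_embd = 5120 are taken from the model load output.
for n_ctx in (512, 4096):
    print(n_ctx, kv_cache_mib(40, n_ctx, 5120), scratch_mib(n_ctx))
# 512 400.0 360.0
# 4096 3200.0 640.0
```

So a larger context is not free: on this model it costs roughly 3 GB of extra VRAM, which matches the total going from 11001 MB to 14081 MB.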

jiapei100 commented 11 months ago

@bioshazard However, you are also using a 3090, right? So how come, with the same model, your privateGPT is about 6 times faster than mine?

jiapei100 commented 11 months ago

@bioshazard BTW, can you help take a look at my other 2 issues over at localGPT?

Thank you ....

bioshazard commented 11 months ago

@jiapei100, I think that response time includes the chroma db retrieval delay; try subsequent calls. My first query was slower. Our speeds might also vary based on CPU/memory; otherwise idk.
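
To compare like with like, time the first call separately from the following ones. A minimal sketch; `timed` and `cold_vs_warm` are illustrative helpers, and `fn` stands for whatever callable answers a query (in privateGPT.py that would be `qa`):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def cold_vs_warm(fn, query, repeats=3):
    """Time the first (cold) call, then `repeats` further (warm) calls."""
    _, cold = timed(fn, query)
    warm = [timed(fn, query)[1] for _ in range(repeats)]
    return cold, warm


# Usage, assuming `qa` is the chain built in privateGPT.py:
# cold, warm = cold_vs_warm(qa, "what is capital")
# print(f"cold: {cold:.2f} s, warm: {min(warm):.2f} s")
```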

Sure, I'll take a peek at those, but no promises, and don't hold your breath on my follow-through there. Did we solve this one? Maybe close it? Glad it worked!

johndev8964 commented 11 months ago

l__
    if self.ctx is not None:
       ^^^^^^^^
AttributeError: 'Llama' object has no attribute 'ctx'

@bioshazard Can you show me your env file? I am getting the error above. Thanks.

johndev8964 commented 11 months ago

PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-model-q4_0.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=4

This is my env.
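
Note that MODEL_TYPE appears twice in the env above. A small sketch for spotting that kind of thing; `parse_env` is a deliberately simple hand-rolled parser (not the loader privateGPT uses), and in this sketch the last assignment wins, which is worth checking against whatever loader you actually run:

```python
def parse_env(text: str):
    """Parse KEY=VALUE lines; return (values, keys that were assigned more than once)."""
    values, duplicates = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not an assignment
        key, _, value = line.partition("=")
        key = key.strip()
        if key in values:
            duplicates.append(key)
        values[key] = value.strip()  # last assignment wins in this sketch
    return values, duplicates


sample = """\
PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_TYPE=LlamaCpp
MODEL_N_CTX=1000
"""
values, duplicates = parse_env(sample)
print(values["MODEL_TYPE"], duplicates)  # LlamaCpp ['MODEL_TYPE']
```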

bioshazard commented 11 months ago

@johndev8964 here is my .env, though I am using batch 512 with this model in other contexts. I stopped using this repo for a sec while I build out my Slack bot.

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=4096
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=8

Though keep in mind I changed some stuff in the code too:

diff --git a/privateGPT.py b/privateGPT.py
index a11fe24..eb43d86 100755
--- a/privateGPT.py
+++ b/privateGPT.py
@@ -33,7 +33,7 @@ def main():
     # Prepare the LLM
     match model_type:
         case "LlamaCpp":
-            llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False)
+            llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
         case "GPT4All":
             llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
         case _default:

johndev8964 commented 11 months ago

@bioshazard Thanks for your kind answer.

The problem is fixed. I changed model to koala. It works now.

bxdoan commented 10 months ago

Hi @johndev8964, I have the same problem as you. Which koala model do you use? Can you share the link?

johndev8964 commented 10 months ago

@bxdoan https://huggingface.co/TheBloke/koala-7B-GGML/tree/main

bxdoan commented 10 months ago

@johndev8964 nice, thanks bro. But it seems it only answers fast when the same question is asked a second time, right? Or do you have any configuration for a fast answer on the first ask?

johndev8964 commented 10 months ago

It is super slow on my side too.

bioshazard commented 10 months ago

First answer will always be slow I suspect because it is initializing the chroma DB in memory. To make it faster you'd need a warm DB source, probably outside the scope of this issue.

douglasg14b commented 10 months ago

Can the DB be on a remote host (on LAN) that can cache the entire thing in memory? Might that provide performance improvements?

bioshazard commented 10 months ago

You have two options IMO:

Modify this repo into an API with some consuming service so the DB stays warm
Modify this repo to use something like weaviate on the network or same host to keep a warm DB
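
Both options come down to the same idea: load the vector store once in a long-lived process and reuse it for every query, instead of paying the start-up cost on each run of the script. A toy sketch of that pattern; `load_vector_store` is a stand-in with fake data, not Chroma's or Weaviate's API:

```python
import functools

load_count = 0  # only here to show how often the expensive load runs


def load_vector_store() -> dict:
    """Stand-in for the slow start-up work (opening the DB, loading embeddings)."""
    global load_count
    load_count += 1
    return {
        "chunk-1": "freedom will always triumph over tyranny",
        "chunk-2": "a new moment in the fight against COVID-19",
    }


@functools.lru_cache(maxsize=1)
def get_store() -> dict:
    # Cached: the loader runs on the first call only.
    return load_vector_store()


def handle_query(query: str) -> list:
    """Naive substring lookup against the already-loaded store."""
    store = get_store()
    return [name for name, text in store.items() if query.lower() in text.lower()]


print(handle_query("freedom"))  # ['chunk-1']
print(handle_query("covid"))    # ['chunk-2']
print(load_count)               # 1
```

In a real service, `handle_query` would sit behind an HTTP endpoint or a bot handler so the process, and therefore the loaded store, stays alive between questions.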

toshanmugaraj commented 10 months ago

@bxdoan https://huggingface.co/TheBloke/koala-7B-GGML/tree/main

Which file at this link should I use?

mcchung52 commented 10 months ago

sorry for a newb question... @bioshazard or anyone really.. i did "pip install llama-cpp-python" but where can i find the .bin file? thanks~

JohnOstrowick commented 9 months ago

Hi guys. Sorry to come in at a different angle here. I've tried @jit(target_backend='cuda') at various points in the code, but it barfs up a lot of errors. Is it not feasible to use JIT to force it to use CUDA (my GPU is obviously Nvidia)? In a few test scripts I literally just had to add that decorator to the function definition to make it use the GPU.

Also. It seems to use a very low "temperature" and merely quote from the source documents, instead of actually doing summaries. Is there a way to up the temperature?

Also, sorry for my ignorance. Where does it store the stuff it ingests? In the LLM file, or where? Or is it just in RAM? Reason is I am worried it ingests stuff and then loses recollection of it after reboot...

bioshazard commented 9 months ago

@JohnOstrowick did you attempt the edits I offered? It should work on GPU with llama if you compile the library correctly and update the one line to specify the layers to offload.

Is the code you are trying to add for llama?

You can increase the temperature in the same line.

It's stored in a simple file in the repository. That's just how chroma does it.
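
For the temperature part, a sketch of what "the same line" could look like. LangChain's `LlamaCpp` wrapper accepts a `temperature` keyword; the helper below only builds the keyword arguments so it runs without the library, and both the helper name and the value 0.8 are just for illustration:

```python
def with_temperature(base_kwargs: dict, temperature: float) -> dict:
    """Return a copy of the LlamaCpp keyword arguments with a temperature added."""
    if temperature < 0:
        raise ValueError("temperature must not be negative")
    return {**base_kwargs, "temperature": temperature}


base = {"n_ctx": 4096, "max_tokens": 512, "n_gpu_layers": 43}
kwargs = with_temperature(base, 0.8)
print(kwargs["temperature"])  # 0.8

# In privateGPT.py:
# llm = LlamaCpp(model_path=model_path, callbacks=callbacks, verbose=False, **kwargs)
```

Higher values make sampling more random; whether that produces better summaries rather than quotes also depends on the prompt and the model.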

JohnOstrowick commented 9 months ago

OK thanks let me try your solution, but before I go ahead: does this force you to use the llama LLM file or can I still use falcon and others?

I see what you mean re storage; it goes into the db/ directory.

bioshazard commented 9 months ago

@JohnOstrowick no, I think you need to use a llama model for LlamaCpp, and GPT4All is CPU only.

So for falcon you would need to extend this repository to add that third type of llm. The addition of falcon support is surely going to need its own issue apart from this one.

JohnOstrowick commented 9 months ago

Hi there. So I am trying Llama, but it now says too many tokens. What do I edit to fix this?

llama_tokenize: too many tokens
Traceback (most recent call last):
  File "/home/john/ai/git/privateGPT/privateGPT.py", line 83, in <module>
    main()
  File "/home/john/ai/git/privateGPT/privateGPT.py", line 54, in main
    res = qa(query)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 243, in __call__
    raise e
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 131, in _call
    answer = self.combine_documents_chain.run(
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 445, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 243, in __call__
    raise e
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 106, in _call
    output, extra_return_dict = self.combine_docs(
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 165, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 252, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 243, in __call__
    raise e
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 237, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 102, in generate
    return self.llm.generate_prompt(
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 186, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 279, in generate
    output = self._generate_helper(
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 223, in _generate_helper
    raise e
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 210, in _generate_helper
    self._generate(
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 602, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 230, in _call
    for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
  File "/home/john/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 280, in stream
    for chunk in result:
  File "/home/john/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 822, in _create_completion
    raise ValueError(
ValueError: Requested tokens (623) exceed context window of 512

bioshazard commented 9 months ago

I might have mentioned it in an earlier reply. But per the output you provided it seems you are only using a 512 context and should override it to use 4096. Refer to my earlier reply or the llama CPP docs to see how you can set the context window.
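To make the numbers in that error concrete, here is a minimal sketch of the token budget, assuming the check is simply "prompt tokens plus tokens to generate must fit in the context window". The function name `fits_context` and its parameters are hypothetical, and the real check inside llama-cpp-python may differ between versions.

```python
def fits_context(prompt_tokens: int, n_ctx: int, max_new_tokens: int = 0) -> bool:
    """Return True if the prompt plus the requested completion fits in n_ctx.

    Hypothetical helper for illustration only; it is not part of
    llama-cpp-python or langchain.
    """
    return prompt_tokens + max_new_tokens <= n_ctx


# The traceback above reports 623 requested tokens against a 512 window.
print(fits_context(623, n_ctx=512))
# With the window raised to 4096 there is room for a 512-token answer too.
print(fits_context(623, n_ctx=4096, max_new_tokens=512))
```

If raising the window is not an option, lowering TARGET_SOURCE_CHUNKS should also shrink the prompt, since fewer retrieved chunks get stuffed into it.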

JohnOstrowick commented 9 months ago

Thanks, after applying the patch it does this:

john@john-GF63-Thin-11SC:~/ai/git/privateGPT$ python3.10 privateGPT_llama.py
  File "/home/john/ai/git/privateGPT/privateGPT_llama.py", line 37
    case "GPT4All":
         ^^^^^^^^^
SyntaxError: invalid syntax
john@john-GF63-Thin-11SC:~/ai/git/privateGPT$

Code edited as follows:


    # Prepare the LLM
    # Every `case` sits one level under `match`, and each body one level under its `case`.
    match model_type:
        case "LlamaCpp":
            llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
        case "GPT4All":
            llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
        case _default:
            # raise exception if model_type is not supported
            raise Exception(f"Model type {model_type} is not supported. Please choose one of the following: LlamaCpp, GPT4All")

I see it was spacing that was the issue; fixed in PyCharm.

bioshazard commented 9 months ago

It may be that pasting my exact text in will not do what you need. I expect that if you gave the resulting code to ChatGPT, it could guide you through what is wrong with the syntax. Or if you paste the surrounding context here, I can try to take a look and determine where the syntax error is. It might be a tab or a space or a missing colon or something.

JohnOstrowick commented 9 months ago

OK, so the next question: it doesn't seem to have improved performance; it still takes 1 minute to respond. I did see your comment "First answer will always be slow I suspect because it is initializing the chroma DB in memory. To make it faster you'd need a warm DB source, probably outside the scope of this issue." But it is equally slow on the 2nd question. I think it may have to do with the .bin LLM file I am using? Edit: I tried nous-hermes-llama2-13b.ggmlv3.q6_K.bin and it is significantly slower: 2 minutes instead of 1 with Llama 2 7B. As always I am grateful for your time.

bioshazard commented 9 months ago

@JohnOstrowick what does the output look like at the start? Does yours look like my earliest post, where it shows that the CUDA device is detected and the layers get loaded? What is your CPU/GPU?

JohnOstrowick commented 9 months ago

cuda is installed...

Device-1: Intel TigerLake-H GT1 [UHD Graphics] driver: i915 v: kernel
Device-2: NVIDIA TU117M [GeForce GTX 1650 Mobile / Max-Q] driver: nvidia v: 535.104.05
Device-3: Acer HD Webcam type: USB driver: uvcvideo
Display: x11 server: X.Org v: 1.21.1.4 driver: X: loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa gpu: i915 resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa Intel UHD Graphics (TGL GT1) v: 4.6 Mesa 23.0.4-0ubuntu1~22.04.1

to your other question, the output at the start;

python3.10 privateGPT_llama_pycharmedit.py
llama.cpp: loading model from models/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 2048.00 MB

bioshazard commented 9 months ago

@JohnOstrowick looks like your llama-cpp-python was not compiled with GPU support (see the difference between my output and yours). Review my instructions for how to force it to install with cuBLAS. Further, you might need to offload fewer layers than my 43/43 example, as you only have 4G VRAM. I have 24G, so I had room for all those layers. You will need to find the sweet spot. Right now your completions are being done on CPU.
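One way to spot that difference without eyeballing the logs is to scan the startup output for the GPU lines. The sketch below is illustrative only: `gpu_offload_summary` is a hypothetical helper, and the marker strings are taken from the logs pasted in this thread, so other llama.cpp versions may print something different.

```python
import re


def gpu_offload_summary(log_text: str) -> dict:
    """Report whether startup output mentions CUDA and how many layers were offloaded."""
    # Marker strings copied from the logs in this thread; treat them as
    # an assumption for any other llama.cpp build.
    cuda_seen = (
        "ggml_init_cublas" in log_text
        or "using CUDA for GPU acceleration" in log_text
    )
    match = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    offloaded = (int(match.group(1)), int(match.group(2))) if match else None
    return {"cuda_seen": cuda_seen, "offloaded": offloaded}


cpu_only_log = "llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)"
gpu_log = (
    "ggml_init_cublas: found 1 CUDA devices:\n"
    "llama_model_load_internal: offloaded 43/43 layers to GPU\n"
)
print(gpu_offload_summary(cpu_only_log))
print(gpu_offload_summary(gpu_log))
```

A result with `cuda_seen` false and `offloaded` empty matches the CPU-only situation described above.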

JohnOstrowick commented 9 months ago

Hey. Sorry to be a pain. I get the following error output. No idea what it means.

john@john-GF63-Thin-11SC:~/ai/git/llama-cpp-python$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir Defaulting to user installation because normal site-packages is not writeable Collecting llama-cpp-python Downloading llama_cpp_python-0.2.6.tar.gz (1.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 3.3 MB/s eta 0:00:00 Installing build dependencies ... done Getting requirements to build wheel ... done Installing backend dependencies ... done Preparing metadata (pyproject.toml) ... done Collecting typing-extensions>=4.5.0 (from llama-cpp-python) Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata Downloading typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB) Collecting numpy>=1.20.0 (from llama-cpp-python) Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/71/3c/3b1981c6a1986adc9ee7db760c0c34ea5b14ac3da9ecfcf1ea2a4ec6c398/numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB) Collecting diskcache>=5.6.1 (from llama-cpp-python) Obtaining dependency information for diskcache>=5.6.1 from https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl.metadata Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB) Downloading diskcache-5.6.3-py3-none-any.whl (45 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.6 MB/s eta 0:00:00 Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 9.7 MB/s eta 0:00:00 Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB) 
Building wheels for collected packages: llama-cpp-python Building wheel for llama-cpp-python (pyproject.toml) ... error error: subprocess-exited-with-error

× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully. │ exit code: 1 ╰─> [43 lines of output] scikit-build-core 0.5.0 using CMake 3.27.4 (wheel) Configuring CMake... loading initial cache file /tmp/tmp5uzeaf90/build/CMakeInit.txt -- The C compiler identification is GNU 11.4.0 -- The CXX compiler identification is GNU 11.4.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Found Git: /usr/bin/git (found version "2.34.1") fatal: not a git repository (or any of the parent directories): .git fatal: not a git repository (or any of the parent directories): .git CMake Warning at vendor/llama.cpp/CMakeLists.txt:125 (message): Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.

  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
  -- Found Threads: TRUE
  -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
  -- cuBLAS found
  -- The CUDA compiler identification is unknown
  CMake Error at /tmp/pip-build-env-7gf9iivy/normal/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
    Failed to detect a default CUDA architecture.

    Compiler output:

  Call Stack (most recent call first):
    vendor/llama.cpp/CMakeLists.txt:286 (enable_language)

  -- Configuring incomplete, errors occurred!

  *** CMake configuration failed
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

bioshazard commented 9 months ago

Maybe try making a new venv? Are you on WSL? This failure is certainly outside the scope of this repository and rather an issue in the llama CPP python solution itself. I think I ran into this error on WSL when I tried it. If I remember later today I'll link my Nvidia setup instructions in case there are any steps you might need to take beyond the initial driver install.

JohnOstrowick commented 9 months ago

OK sorry I am not clear on that. This is the settings file I think you posted...?

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=4096
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=8

Is this still OK?

The code patch I used didn't say anything about 43 layers?

    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False)
    case "GPT4All":
        llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
    case _default:

JohnOstrowick commented 9 months ago

Maybe try making a new venv? Are you on WSL? This failure is certainly outside the scope of this repository and rather an issue in the llama CPP python solution itself. I think I ran into this error on WSL when I tried it. If I remember later today I'll link my Nvidia setup instructions in case there are any steps you might need to take beyond the initial driver install.

--No, I'm on ubuntu 22

JohnOstrowick commented 9 months ago

Hi there. OK I managed to get everything done including downloading the LLM file, but now I get this error when I turn on verbose errors:

warnings.warn(errors.NumbaDeprecationWarning(msg,
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin

llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/john/ai/development/privateGPT/privateGPT_cuda_jmo.py", line 89, in main()
  File "/home/john/.local/lib/python3.10/site-packages/langchain/load/serializable.py", line 74, in init
    super().init(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.init
pydantic.error_wrappers.ValidationError: 1 validation error for LlamaCpp
root
  Could not load Llama model from path: models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin. Received error (type=value_error)

Komal-99 commented 9 months ago

Hi, @bioshazard
I read the above comments and tried to apply them in my code to fix the inference time issue in LlamaCpp, using an A30 GPU with 24 GB. But as soon as I ran
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
and then added the n_gpu_layers parameter, privategpt.py gave me "1 validation error" from the pydantic package. I think this may be because my model was previously working on llama-cpp-python==0.1.57, and the force-reinstall command updated it to the latest version, which supports .gguf models but not .ggml. So I tried running the command with a pinned version: pip install llama-cpp-python==0.1.57 --no-cache-dir. With that, the privategpt.py file runs successfully but does not show any GPU configuration. Please help, I am really stuck on this issue: the model llama-2-7b-chat.ggmlv3.q4_0.bin is taking 50-60 seconds per query, which is not good for my use case.

hyperp0ppy commented 9 months ago

@johndev8964 2.4s after chroma db warms up! And again tho this is with nous-hermes-llama2-13b.ggmlv3.q6_K.bin so YMMV based on the model/GPU you choose.

> Question:
what is capital

> Answer (took 2.64 s.):
 In economics, capital refers to any man-made resource used in production or investment to create further goods or services. It can include physical assets like machinery or buildings as well as financial assets such as stocks and bonds. In the context of this passage, it appears that the author is specifically discussing "capital goods," which are durable items used in production processes, such as machines, tools, and equipment.

> source_documents/Man_Economy_and_State_with_Power_and_Market_Rothbard.epub:
There is another consideration that reinforces our conclusion. Professor Lachmann has been diligently reminding us of what economists generally forget: that “capital” is not just a homogeneous blob that can be added to or subtracted from. Capital is an intricate, delicate, interweaving structure of capital goods. All of the delicate strands of this structure have to fit, and fit precisely, or else malinvestment occurs. The free market is almost an automatic mechanism for such fitting; and we

Hi,

Can you show a step-by-step process to get this response time? I'm a novice and would appreciate your help very much. Thank you in advance.

Sync-Z1 commented 9 months ago

You need to install llama-cpp-python with GPU support

https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

then add n_gpu_layers=X to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36

eg,

            llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)

I am surprised there is not an env var in the python script to dynamically set GPU layers, but these were the steps I took to get my GPU using it. YMMV on the GPU layer count you can get away with offloading but I do the full 43 of llama 2 hermes 13b cuz I have a 3090 with 24G vram. Here is my output with all the above applied:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from REDCATED/models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32032
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2136.07 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 12209 MB
llama_new_context_with_model: kv self size  =  400.00 MB

Enter a query:

How about M2 Macbook Air tho

bioshazard commented 9 months ago

Hi there. OK I managed to get everything done including downloading the LLM file, but now I get this error when I turn on verbose errors:

warnings.warn(errors.NumbaDeprecationWarning(msg,
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin

llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/john/ai/development/privateGPT/privateGPT_cuda_jmo.py", line 89, in main()
  File "/home/john/.local/lib/python3.10/site-packages/langchain/load/serializable.py", line 74, in init
    super().init(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.init
pydantic.error_wrappers.ValidationError: 1 validation error for LlamaCpp
root
  Could not load Llama model from path: models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin. Received error (type=value_error)

@JohnOstrowick I suspect you don't have that model at that path. Try replacing the relative model path with the absolute path, like /home/.../...K.bin, to be sure there is no problem with the path it is attempting to reach. Also verify that your permissions allow reading the file as the user you are executing with.
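Those two checks (path and permissions) can be scripted. This is a minimal sketch using only the standard library; `check_model_path` is a hypothetical helper name, and it only tells you whether the file is reachable and readable, not whether the loader accepts its format.

```python
import os
from pathlib import Path


def check_model_path(path_str: str) -> list[str]:
    """Return a list of human-readable problems with a model path (empty if none)."""
    problems = []
    path = Path(path_str).expanduser()
    if not path.is_absolute():
        # Relative paths depend on the directory the script is started from.
        problems.append(f"relative path, resolves to {path.resolve()}")
    if not path.is_file():
        problems.append("no file at this path")
    elif not os.access(path, os.R_OK):
        problems.append("file exists but is not readable by the current user")
    return problems


# Example: an absolute path that does not exist.
print(check_model_path("/nonexistent/dir/model.bin"))
```

An empty list means the path itself is fine, and the load failure should be looked for elsewhere.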

bioshazard commented 9 months ago

so, I tried running command with version. pip install llama-cpp-python==0.1.57 --no-cache-dir privategpt.py file runs successfully but not showing GPU configuration.

@Komal-99 seems like you are super close. You did the pip install correctly the first time, but since you did not use the version the repo expects, it failed. Try to create a new venv and run CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt --force-reinstall --upgrade --no-cache-dir in the repo so you get both the right llama-cpp-python version AND that it gets installed with the right env vars set.

bioshazard commented 9 months ago

How about M2 Macbook Air tho

@Sync-Z1 I haven't tried this myself, but maybe you can refer to the official llama-cpp-python docs https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

Also see my last reply about installing the right version at the same time, as those official instructions on their own will yield a version incompatible with this repo.

bioshazard commented 9 months ago

Also idk how I got roped into generally explaining how to use llama-cpp-python lol but I believe at this point I have covered every possible situation:

Only llama-cpp-python (as opposed to GPT4All) supports GPU acceleration
This repo requires a specific version of llama-cpp-python (just use a dedicated venv or conda env)
llama-cpp-python must be installed with env vars to instruct it to compile the GPU support
You need to alter this repo to update the LlamaCpp line to offload the GPU layers and fix the context window.
- (someone should do a PR, I won't cuz I don't even use this repo)
Any questions about "what about XYZ hardware" just go to the official llama-cpp-python docs (RTFM anyway imo)
Be sure to use absolute path /home/... to your model in case you have some weird relative path issues.
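For the "alter this repo" step, a rough sketch of how the settings could come from the environment instead of being hard-coded. MODEL_N_CTX and MODEL_N_BATCH are the variables already shown in the settings file in this thread; MODEL_N_GPU_LAYERS is a hypothetical name invented here, and `llama_kwargs_from_env` is likewise just an illustration, not code from the repo.

```python
import os


def llama_kwargs_from_env(env=os.environ) -> dict:
    """Collect LlamaCpp keyword arguments from environment-style settings."""
    return {
        "n_ctx": int(env.get("MODEL_N_CTX", "4096")),
        "n_batch": int(env.get("MODEL_N_BATCH", "8")),
        # MODEL_N_GPU_LAYERS is an assumed variable name; 0 means CPU only.
        "n_gpu_layers": int(env.get("MODEL_N_GPU_LAYERS", "0")),
    }


settings = {"MODEL_N_CTX": "4096", "MODEL_N_BATCH": "8", "MODEL_N_GPU_LAYERS": "20"}
kwargs = llama_kwargs_from_env(settings)
print(kwargs)
# Intended use in privateGPT.py (not executed here):
# llm = LlamaCpp(model_path=model_path, callbacks=callbacks, verbose=False, **kwargs)
```

The layer count of 20 is only a placeholder; the right value depends on the model and on how much VRAM the card has.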

So read back through the thread for each insight and good luck

Komal-99 commented 9 months ago

so, I tried running command with version. pip install llama-cpp-python==0.1.57 --no-cache-dir privategpt.py file runs successfully but not showing GPU configuration.

@Komal-99 seems like you are super close. You did the pip install correctly the first time, but since you did not use the version the repo expects, it failed. Try to create a new venv and run CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt --force-reinstall --upgrade --no-cache-dir in the repo so you get both the right llama-cpp-python version AND that it gets installed with the right env vars set.

I am getting the right version, but the model is not compatible: it shows "Model not found" after updating llama-cpp-python to the latest version, because the ggml model format is no longer supported and has been replaced by gguf. As mentioned above, the quantized model I am using is in ggml format.
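A quick way to tell which format a model file is in, sketched under the assumption that the first four bytes are a little-endian magic number: GGUF files start with the ASCII bytes "GGUF", and 0x67676a74 is the value printed in the "invalid magic number" error earlier in this thread for a ggmlv3 file. The helper name `model_format` is hypothetical.

```python
import struct

GGUF_MAGIC = 0x46554747  # the bytes b"GGUF" read as a little-endian uint32
GGJT_MAGIC = 0x67676A74  # value from the "invalid magic number 67676a74" error


def model_format(path: str) -> str:
    """Guess the container format of a model file from its first four bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return "unknown"
    (magic,) = struct.unpack("<I", head)
    if magic == GGUF_MAGIC:
        return "gguf"
    if magic == GGJT_MAGIC:
        return "ggml (ggjt)"
    return "unknown"
```

If this reports "ggml (ggjt)" while the installed llama-cpp-python only reads gguf, the options discussed in this thread are to pin the older llama-cpp-python version or to get a gguf build of the model.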

bioshazard commented 9 months ago

@Komal-99 oh I see now you did specify the version sorry. I would reach out on the llama-cpp-python repo to get help with that then, definitely outside the scope of this repo.

zylon-ai / private-gpt

Is privateGPT based on CPU or GPU? Why in my case it's unbelievably slow? #931

MODEL_TYPE=GPT4All