Closed jiapei100 closed 5 months ago
You need to install llama-cpp-python
with GPU support
https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
then add n_gpu_layers=X
to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36
eg,
llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
I am surprised there is not an env var in the python script to dynamically set GPU layers, but these were the steps I took to get my GPU using it. YMMV on the GPU layer count you can get away with offloading but I do the full 43 of llama 2 hermes 13b cuz I have a 3090 with 24G vram. Here is my output with all the above applied:
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from REDCATED/models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32032
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2136.07 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 12209 MB
llama_new_context_with_model: kv self size = 400.00 MB
Enter a query:
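On the missing environment variable mentioned above: a minimal sketch of how the layer count could be read from the environment instead of being hard-coded. The variable name `MODEL_N_GPU_LAYERS` and the helper `gpu_layers_from_env` are hypothetical and not part of privateGPT; the returned value would be passed as `n_gpu_layers` in the `LlamaCpp(...)` call.

```python
import os


def gpu_layers_from_env(default: int = 0) -> int:
    """Read the number of layers to offload from the environment.

    MODEL_N_GPU_LAYERS is a hypothetical variable name, not one that
    privateGPT defines. Unset, empty, or non-integer values fall back
    to `default`.
    """
    raw = os.environ.get("MODEL_N_GPU_LAYERS", "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```

With something like this in place, the call could use `n_gpu_layers=gpu_layers_from_env()` and the count could be tuned per machine without editing the script.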
So, How much is the speed updated after implementing the GPU? @bioshazard Can you show me the query result?
@bioshazard I got you... Thank you...
with modeltype, I got the following ERRORs:
➜ privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 09:52:37.920830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 09:52:37.920871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 09:52:37.920880: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 09:52:37.926065: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from ./models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32032
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required = 454.09 MB (+ 400.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11001 MB
llama_new_context_with_model: kv self size = 400.00 MB
Enter a query: Hi, how are you?
llama_tokenize_with_model: too many tokens
Traceback (most recent call last):
File "....../privateGPT.py", line 97, in <module>
main()
File "....../privateGPT.py", line 68, in main
res = qa(query)
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
raise e
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
self._call(inputs, run_manager=run_manager)
File "~/.local/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 133, in _call
answer = self.combine_documents_chain.run(
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 456, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
raise e
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
self._call(inputs, run_manager=run_manager)
File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 106, in _call
output, extra_return_dict = self.combine_docs(
File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 165, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 252, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
raise e
File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
self._call(inputs, run_manager=run_manager)
File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 92, in _call
response = self.generate([inputs], run_manager=run_manager)
File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 102, in generate
return self.llm.generate_prompt(
File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 451, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 582, in generate
output = self._generate_helper(
File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 488, in _generate_helper
raise e
File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 475, in _generate_helper
self._generate(
File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 961, in _generate
self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 238, in _call
for chunk in self._stream(
File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 288, in _stream
for part in result:
File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 855, in _create_completion
raise ValueError(
ValueError: Requested tokens (558) exceed context window of 512
Exception ignored in: <function Llama.__del__ at 0x7fad80434e50>
Traceback (most recent call last):
File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1508, in __del__
TypeError: 'NoneType' object is not callable
@bioshazard
By the way, do you mean that ONLY llama-cpp-python has GPU support, while GPT4All does NOT??
Unbelievable....
Couple things:
llama-cpp-python
in the way I offer. Also, if you are running into tensorflow, or really any python issues... imo start with a fresh venv
(https://docs.python.org/3/library/venv.html):
# Init
cd privateGPT/
python3 -m venv venv
source venv/bin/activate
# ... this is for if you have CUDA hardware, look up llama-cpp-python readme for the many ways to compile
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt
# Run (notice `python` not `python3` now, venv introduces a new `python` command to PATH from `venv/bin`)
python privateGPT.py
# Exit venv when you are done
deactivate
# Re-activate as needed
cd privateGPT/
source venv/bin/activate
python privateGPT.py
Sorry if I created any confusion, hopefully the above is useful at least for people on Linux. lmk if this works or fails. Seriously tho, if you have any Python issues, imo it's always better to start fresh than to fix anything. venv ftw!
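To confirm that the `python` on PATH really is the one from the venv, a small check like the following can help. It relies only on the standard library; the function name `in_virtualenv` is just an illustrative choice.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```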
@jiapei100, looks like you have n_ctx set to 512, so that's way too small of a context. Try n_ctx=4096 in the LlamaCpp initialization step for that specific model, and set max_tokens to something like 512. Here is my line under model_type in privateGPT.py. I think I set my batch to 512 for that hermes model, but YMMV:
llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
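The reasoning behind those two numbers can be sketched as simple budget arithmetic: the prompt (question plus retrieved chunks) and the requested completion both have to fit inside `n_ctx`. The helpers below are hypothetical and do not reproduce the exact check inside llama-cpp-python; treating the 558 tokens from the traceback as the prompt size is also an assumption.

```python
def fits_context(prompt_tokens: int, max_tokens: int, n_ctx: int) -> bool:
    """True if the prompt plus the requested completion fits in the window."""
    return prompt_tokens + max_tokens <= n_ctx


def completion_budget(prompt_tokens: int, n_ctx: int) -> int:
    """Tokens left for the answer once the prompt is accounted for."""
    return max(n_ctx - prompt_tokens, 0)


if __name__ == "__main__":
    # 558 tokens cannot fit in a 512-token window, whatever max_tokens is.
    print(fits_context(558, 0, 512))
    # With n_ctx=4096 and max_tokens=512 the same prompt fits comfortably.
    print(fits_context(558, 512, 4096))
    print(completion_budget(558, 4096))
```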
@johndev8964 2.4s after chroma db warms up! And again tho, this is with nous-hermes-llama2-13b.ggmlv3.q6_K.bin, so YMMV based on the model/GPU you choose.
> Question:
what is capital
> Answer (took 2.64 s.):
In economics, capital refers to any man-made resource used in production or investment to create further goods or services. It can include physical assets like machinery or buildings as well as financial assets such as stocks and bonds. In the context of this passage, it appears that the author is specifically discussing "capital goods," which are durable items used in production processes, such as machines, tools, and equipment.
> source_documents/Man_Economy_and_State_with_Power_and_Market_Rothbard.epub:
There is another consideration that reinforces our conclusion. Professor Lachmann has been diligently reminding us of what economists generally forget: that “capital” is not just a homogeneous blob that can be added to or subtracted from. Capital is an intricate, delicate, interweaving structure of capital goods. All of the delicate strands of this structure have to fit, and fit precisely, or else malinvestment occurs. The free market is almost an automatic mechanism for such fitting; and we
@bioshazard
Thank you soooooo much... I got it... Yeah, privateGPT.py doesn't use that particular parameter, line 50 is modified as:
case "LlamaCpp":
llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
The speed is like 3 times faster...
➜ privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 12:32:16.404648: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 12:32:16.404688: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 12:32:16.404702: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 12:32:16.409845: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /home/lvision/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/lvision/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from /opt/AIModels/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32032
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required = 753.09 MB (+ 3200.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 640 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 14081 MB
llama_new_context_with_model: kv self size = 3200.00 MB
Enter a query: Hi, how are you today?
I am doing well, thank you for asking! How about you?
> Question:
Hi, how are you today?
> Answer (took 14.75 s.):
I am doing well, thank you for asking! How about you?
> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
> source_documents/state_of_the_union.txt:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Enter a query:
Thank you so much...
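For anyone wondering why VRAM use went from 11001 MB to 14081 MB after raising the context: the logged figures are consistent with the arithmetic below. This is a sketch under stated assumptions, not llama.cpp's actual allocator: the KV cache is assumed to hold one 16-bit value per element with no grouped-query attention (the log shows n_gqa = 1), and a batch size of 512 is inferred because it reproduces the logged scratch-buffer sizes.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_layer: int, n_ctx: int, n_embd: int, bytes_per_value: int = 2) -> float:
    """Size of the K and V caches in MiB, assuming 2 bytes (f16) per value."""
    # Factor 2: one cache for keys and one for values.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value / MIB


def scratch_buffer_mib(batch_size: int, n_ctx: int) -> float:
    """Logged formula: batch_size x (640 kB + n_ctx x 160 B), in MiB."""
    return batch_size * (640 * 1024 + n_ctx * 160) / MIB


if __name__ == "__main__":
    for n_ctx in (512, 4096):
        print(n_ctx, kv_cache_mib(40, n_ctx, 5120), scratch_buffer_mib(512, n_ctx))
```

For the 13B model (n_layer = 40, n_embd = 5120) this gives 400 MiB and 3200 MiB of KV cache at n_ctx 512 and 4096, matching the two "kv self size" lines, and 360 MiB and 640 MiB of scratch buffer. The combined increase, 2800 + 280 = 3080, equals the difference between the two "total VRAM used" lines.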
@bioshazard However, you are also using a 3090 right? So, how come, with the same model, privateGPT's speed is about 6 times faster than mine?
@bioshazard BTW, can you help to take a look at my other 2 issues at localGPT ?
Thank you ....
@jiapei100 , that response time I think includes the chroma db retrieval delay, try subsequent calls. my first query was slower. also our speeds might vary based on CPU/memory idk otherwise.
sure, I'll take a peek at those, but no promises and don't hold your breath on my follow-through there. Did we solve this one? Maybe close it? Glad it worked!
l__ if self.ctx is not None: ^^^^^^^^ AttributeError: 'Llama' object has no attribute 'ctx'
@bioshazard Can you show me your env file? I am getting this above error. Thanks.
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-model-q4_0.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=4
This is my env.
@johndev8964 here is my .env tho I am using batch 512 with this model in other contexts. Stopped using this repo for a sec while I build out my slack bot.
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=4096
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=8
Tho keep in mind I changed some stuff in the code too:
diff --git a/privateGPT.py b/privateGPT.py
index a11fe24..eb43d86 100755
--- a/privateGPT.py
+++ b/privateGPT.py
@@ -33,7 +33,7 @@ def main():
     # Prepare the LLM
     match model_type:
         case "LlamaCpp":
-            llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False)
+            llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
         case "GPT4All":
             llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
         case _default:
@bioshazard Thanks for your kind answer.
The problem is fixed. I changed model to koala. It works now.
Hi @johndev8964 , I have the same problem as you, what model koala do you use? can you share me the link?
@johndev8964 nice, thanks bro. But it seems that it only answers fast when the same question is asked a second time, right? Or do you have any configuration for a fast answer on the first ask?
it is super slow on my side too.
First answer will always be slow I suspect because it is initializing the chroma DB in memory. To make it faster you'd need a warm DB source, probably outside the scope of this issue.
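One way to check whether only the first answer is slow is to time several queries in a row. The helper below is a hypothetical sketch; `fn` stands in for whatever callable answers a query, such as the `qa` chain that privateGPT.py calls as `qa(query)` in the traceback earlier in this thread.

```python
import time
from typing import Any, Callable, List, Tuple


def time_calls(fn: Callable[[str], Any], queries: List[str]) -> List[Tuple[str, float]]:
    """Call fn on each query in order and record wall-clock seconds."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        fn(query)
        timings.append((query, time.perf_counter() - start))
    return timings
```

If the first entry is much larger than the rest, warm-up is the likely cause; if every entry is equally slow, the model itself (for example CPU-only inference) is the more likely bottleneck.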
Can the DB be on a remote host (on LAN) that can cache the entire thing in memory? Might that provide performance improvements?
You have two options IMO:
@bxdoan https://huggingface.co/TheBloke/koala-7B-GGML/tree/main
which file in this link
sorry for a newb question... @bioshazard or anyone really.. i did "pip install llama-cpp-python" but where can i find the .bin file? thanks~
Hi guys. Sorry to come in at a different angle here. I've tried @jit(target_backend='cuda') at various points in the code, but it barfs up a lot of errors. Is it not feasible to use JIT to force it to use CUDA (my GPU is obviously Nvidia)? I did a few test scripts and I literally just had to add that decorator to the def to make it use the GPU.
Also. It seems to use a very low "temperature" and merely quote from the source documents, instead of actually doing summaries. Is there a way to up the temperature?
Also, sorry for my ignorance. Where does it store the stuff it ingests? In the LLM file, or where? Or is it just in RAM? Reason is I am worried it ingests stuff and then loses recollection of it after reboot...
@JohnOstrowick did you attempt the edits I offered? It should work on GPU with llama if you compile the library correctly and update the one line to specify the layers to offload.
Is the code you are trying to add for llama?
You can increase the temperature in the same line.
It's stored in a simple file in the repository. That's just how chroma does it.
OK thanks let me try your solution, but before I go ahead: does this force you to use the llama LLM file or can I still use falcon and others?
I see what you mean re storage; it goes into the db/ directory.
@JohnOstrowick no I think you need to use a llama model for llamacpp and GPT4all is CPU only.
So for falcon you would need to extend this repository to add that third type of llm. The addition of falcon support is surely going to need its own issue apart from this one.
Hi there. So I am trying LLama but it now says too many tokens. What do I do to edit this?
llama_tokenize: too many tokens
Traceback (most recent call last):
File "/home/john/ai/git/privateGPT/privateGPT.py", line 83, in
I might have mentioned it in an earlier reply. But per the output you provided it seems you are only using a 512 context and should override it to use 4096. Refer to my earlier reply or the llama CPP docs to see how you can set the context window.
Thanks, after applying the patch it does this:
john@john-GF63-Thin-11SC:~/ai/git/privateGPT$ python3.10 privateGPT_llama.py
File "/home/john/ai/git/privateGPT/privateGPT_llama.py", line 37
case "GPT4All":
^^^^^^^^^
SyntaxError: invalid syntax
john@john-GF63-Thin-11SC:~/ai/git/privateGPT$
Code edited as follows:
# Prepare the LLM
match model_type:
case "LlamaCpp":
llm = LlamaCpp(model_path=model_path, n_ctx=4096, max_tokens=512, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
case "GPT4All":
llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
case _default:
# raise exception if model_type is not supported
raise Exception(f"Model type {model_type} is not supported. Please choose one of the following: LlamaCpp, GPT4All")
I see it was spacing that was the issue, fixed in pycharm.
It may be that if you paste my exact text in it will not do what you need. I expect if you provided the resulting context to chat GPT that it could guide you through what is wrong with the syntax of your result. Or if you paste the surrounding context here I can try to take a look at it to determine where the syntax error is. It might be a tab or a space or a missing colon or something.
OK so next question is it doesn't seem to have improved performance, it still takes 1 minute to respond. I did see your comment "First answer will always be slow I suspect because it is initializing the chroma DB in memory. To make it faster you'd need a warm DB source, probably outside the scope of this issue." But it is equally slow on the 2nd question. I think it may have to do with the .bin LLM file I am using? Edit: I tried nous-hermes-llama2-13b.ggmlv3.q6_K.bin and it is significantly slower: 2 minutes instead of 1 with llama 2 7B. As always I am grateful for your time.
@JohnOstrowick what does the output look like at the start? Does yours look like my earliest post where it shows that the cuda device is detected and the layers get loaded? What is your CPU/GPU
cuda is installed...
Device-1: Intel TigerLake-H GT1 [UHD Graphics] driver: i915 v: kernel
Device-2: NVIDIA TU117M [GeForce GTX 1650 Mobile / Max-Q] driver: nvidia v: 535.104.05
Device-3: Acer HD Webcam type: USB driver: uvcvideo
Display: x11 server: X.Org v: 1.21.1.4 driver: X: loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa gpu: i915 resolution: 1920x1080~60Hz
OpenGL: renderer: Mesa Intel UHD Graphics (TGL GT1) v: 4.6 Mesa 23.0.4-0ubuntu1~22.04.1
to your other question, the output at the start;
python3.10 privateGPT_llama_pycharmedit.py
llama.cpp: loading model from models/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 2048.00 MB
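A quick way to tell whether a startup log like the one above used the GPU is to look for the two lines that appear in the CUDA-enabled logs earlier in this thread. This is a minimal sketch; the function names are illustrative, and the matched strings are copied from those logs, so they may differ in other llama.cpp versions.

```python
import re
from typing import Optional, Tuple

_OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from a llama.cpp load log, or None if absent."""
    match = _OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def uses_cuda(log_text: str) -> bool:
    """True if the log contains the CUDA acceleration line seen in this thread."""
    return "using CUDA for GPU acceleration" in log_text
```

A log with neither line, like the one above, suggests the completions are running on the CPU.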
@JohnOstrowick looks like your llama-cpp-python was not compiled with GPU support (see the difference between my output and yours). Review my instructions for how to force it to install with cuBLAS. Further, you might need to offload fewer layers than my 43/43 example, as you only have 4G of VRAM. I have 24G, so I had room for all those layers. You will need to find the sweet spot. Right now your completions are being done on the CPU.
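As a starting point for finding that sweet spot, a rough estimate can be sketched as follows. This is a hypothetical heuristic, not something llama.cpp provides: it assumes every layer is the same size (model file size divided by layer count) and that a fixed reserve is held back for the scratch buffer, KV cache, and the desktop. The result is only a first guess for `n_gpu_layers`, to be adjusted while watching VRAM use.

```python
def estimate_gpu_layers(model_bytes: int, n_layer: int, vram_bytes: int, reserve_bytes: int) -> int:
    """Rough first guess for n_gpu_layers under the assumptions above."""
    if model_bytes <= 0 or n_layer <= 0:
        raise ValueError("model_bytes and n_layer must be positive")
    per_layer = model_bytes / n_layer          # assumed uniform layer size
    usable = max(vram_bytes - reserve_bytes, 0)  # VRAM left after the reserve
    return min(int(usable // per_layer), n_layer)
```

If loading fails or VRAM fills up, lower the number; if there is headroom, raise it.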
Hey. Sorry to be a pain. I get the following error output. No idea what it means.
john@john-GF63-Thin-11SC:~/ai/git/llama-cpp-python$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
Downloading llama_cpp_python-0.2.6.tar.gz (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 3.3 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata
Downloading typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/71/3c/3b1981c6a1986adc9ee7db760c0c34ea5b14ac3da9ecfcf1ea2a4ec6c398/numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting diskcache>=5.6.1 (from llama-cpp-python)
Obtaining dependency information for diskcache>=5.6.1 from https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl.metadata
Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.6 MB/s eta 0:00:00
Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 9.7 MB/s eta 0:00:00
Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: llama-cpp-python
Building wheel for llama-cpp-python (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [43 lines of output]
scikit-build-core 0.5.0 using CMake 3.27.4 (wheel)
Configuring CMake...
loading initial cache file /tmp/tmp5uzeaf90/build/CMakeInit.txt
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
CMake Warning at vendor/llama.cpp/CMakeLists.txt:125 (message):
Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
-- cuBLAS found
-- The CUDA compiler identification is unknown
CMake Error at /tmp/pip-build-env-7gf9iivy/normal/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
Failed to detect a default CUDA architecture.
Compiler output:
Call Stack (most recent call first):
vendor/llama.cpp/CMakeLists.txt:286 (enable_language)
-- Configuring incomplete, errors occurred!
*** CMake configuration failed
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
Maybe try making a new venv? Are you on WSL? This failure is certainly outside the scope of this repository and rather an issue in the llama CPP python solution itself. I think I ran into this error on WSL when I tried it. If I remember later today I'll link my Nvidia setup instructions in case there are any steps you might need to take beyond the initial driver install.
OK sorry I am not clear on that. This is the settings file I think you posted...?
PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=4096
MODEL_N_BATCH=8
TARGET_SOURCE_CHUNKS=8
Is this still OK?
The code patch I used didn't say anything about 43 layers?
case "LlamaCpp":
    llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False)
case "GPT4All":
    llm = GPT4All(model=model_path, max_tokens=model_n_ctx, backend='gptj', n_batch=model_n_batch, callbacks=callbacks, verbose=False)
case _default:
Maybe try making a new venv? Are you on WSL? This failure is certainly outside the scope of this repository and rather an issue in the llama CPP python solution itself. I think I ran into this error on WSL when I tried it. If I remember later today I'll link my Nvidia setup instructions in case there are any steps you might need to take beyond the initial driver install.
No, I'm on Ubuntu 22.
Hi there. OK I managed to get everything done including downloading the LLM file, but now I get this error when I turn on verbose errors:
warnings.warn(errors.NumbaDeprecationWarning(msg,
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
File "/home/john/ai/development/privateGPT/privateGPT_cuda_jmo.py", line 89, in
Hi, @bioshazard
I read the comments above and tried to apply them in my code to fix the slow inference time in LlamaCpp, using an A30 GPU with 24 GB.
But as soon as I ran
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
and then added the n_gpu_layers parameter,
privategpt.py gave me "1 validation error" from the pydantic package.
I thought this might be because my model was working earlier on llama-cpp-python==0.1.57, but running the force-reinstall command updated it to the latest version, which supports .gguf models and not .ggml.
so, I tried running command with version.
pip install llama-cpp-python==0.1.57 --no-cache-dir
privategpt.py file runs successfully but not showing GPU configuration.
Please help, I am really stuck on this issue. The model
llama-2-7b-chat.ggmlv3.q4_0.bin
is taking 50-60 seconds per query, which is too slow for my use case.
@johndev8964 2.4s after chroma db warms up! And again tho this is with
nous-hermes-llama2-13b.ggmlv3.q6_K.bin
so YMMV based on the model/GPU you choose.
> Question: what is capital
> Answer (took 2.64 s.): In economics, capital refers to any man-made resource used in production or investment to create further goods or services. It can include physical assets like machinery or buildings as well as financial assets such as stocks and bonds. In the context of this passage, it appears that the author is specifically discussing "capital goods," which are durable items used in production processes, such as machines, tools, and equipment.
> source_documents/Man_Economy_and_State_with_Power_and_Market_Rothbard.epub: There is another consideration that reinforces our conclusion. Professor Lachmann has been diligently reminding us of what economists generally forget: that “capital” is not just a homogeneous blob that can be added to or subtracted from. Capital is an intricate, delicate, interweaving structure of capital goods. All of the delicate strands of this structure have to fit, and fit precisely, or else malinvestment occurs. The free market is almost an automatic mechanism for such fitting; and we
Hi,
Can you show a step-by-step process to get this response time? I'm a novice and would appreciate your help very much. Thank you in advance.
You need to install llama-cpp-python with GPU support:
https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
then add n_gpu_layers=X to https://github.com/imartinez/privateGPT/blob/main/privateGPT.py#L36, e.g.,
llm = LlamaCpp(model_path=model_path, max_tokens=model_n_ctx, n_batch=model_n_batch, callbacks=callbacks, verbose=False, n_gpu_layers=43)
I am surprised there is not an env var in the python script to dynamically set GPU layers, but these were the steps I took to get my GPU using it. YMMV on the GPU layer count you can get away with offloading but I do the full 43 of llama 2 hermes 13b cuz I have a 3090 with 24G vram. Here is my output with all the above applied:
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from REDCATED/models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32032
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2136.07 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 12209 MB
llama_new_context_with_model: kv self size = 400.00 MB
Enter a query:
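Since the script has no env var for the GPU layer count, one option is to read it the same way the other settings are read. The sketch below is an assumption, not the repo's code: MODEL_N_GPU_LAYERS is a hypothetical variable name and llama_kwargs_from_env is a helper invented for illustration. It also passes n_ctx, the LlamaCpp parameter for the context window, because the log above reports n_ctx = 512 while the patched line only sets max_tokens.

```python
import os


def llama_kwargs_from_env(env=None):
    """Build keyword arguments for LlamaCpp from environment settings."""
    env = os.environ if env is None else env
    n_ctx = int(env.get("MODEL_N_CTX", "512"))
    return {
        "model_path": env["MODEL_PATH"],
        "n_ctx": n_ctx,          # context window size
        "max_tokens": n_ctx,     # mirrors the patched line in the thread
        "n_batch": int(env.get("MODEL_N_BATCH", "8")),
        # Hypothetical variable; 0 keeps every layer on the CPU.
        "n_gpu_layers": int(env.get("MODEL_N_GPU_LAYERS", "0")),
        "verbose": False,
    }


# Usage inside privateGPT.py (needs langchain and a GPU-enabled
# llama-cpp-python build, so it is left commented out here):
# llm = LlamaCpp(callbacks=callbacks, **llama_kwargs_from_env())
```

With MODEL_N_GPU_LAYERS=43 in the settings file this would reproduce the n_gpu_layers=43 call above; a smaller value offloads fewer layers for cards with less VRAM.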
How about M2 Macbook Air tho
Hi there. OK I managed to get everything done including downloading the LLM file, but now I get this error when I turn on verbose errors:
warnings.warn(errors.NumbaDeprecationWarning(msg,
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
File "/home/john/ai/development/privateGPT/privateGPT_cuda_jmo.py", line 89, in
main()
File "/home/john/.local/lib/python3.10/site-packages/langchain/load/serializable.py", line 74, in __init__
super().__init__(**kwargs)
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LlamaCpp
__root__
Could not load Llama model from path: models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin. Received error (type=value_error)
@JohnOstrowick I suspect you don't have that model at that path. Try replacing the relative model path with the absolute path, like /home/.../...K.bin, to be sure there is no problem with the path it is attempting to reach. Also verify that your permissions allow reading the file as the user you are executing with.
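Besides the path and permissions, the magic number in the log is worth a look: 67676a74 is the ASCII for "ggjt", the header of older GGML files, while the loader that printed the error (gguf_init_from_file) expects a GGUF header. A sketch of a check covering all three points follows. The function name inspect_model_file is made up for illustration, and the magic values are the ones I know from llama.cpp, so treat them as assumptions rather than a complete list.

```python
import os
import struct

# Legacy llama.cpp magics, stored in the file as a little-endian uint32.
LEGACY_MAGICS = {
    0x67676D6C: "ggml (unversioned)",
    0x67676D66: "ggmf",
    0x67676A74: "ggjt",
}


def inspect_model_file(path):
    """Report whether a model file exists, is readable, and which header it has."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        return f"missing: {path}"
    if not os.access(path, os.R_OK):
        return f"not readable: {path}"

    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return "too short to contain a header"

    # GGUF files start with the literal bytes "GGUF".
    if head == b"GGUF":
        return "gguf"

    (magic,) = struct.unpack("<I", head)
    if magic in LEGACY_MAGICS:
        return f"legacy {LEGACY_MAGICS[magic]}"
    return f"unknown magic {magic:08x}"
```

If this reports a legacy format for a file the installed llama-cpp-python refuses to load, the file is present and the mismatch is between the model format and the library version, not the path.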
so, I tried running command with version.
pip install llama-cpp-python==0.1.57 --no-cache-dir
privategpt.py file runs successfully but not showing GPU configuration.
@Komal-99 seems like you are super close. You did the pip install correctly the first time, but since you did not use the version the repo expects, it failed. Try to create a new venv and run
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install -r requirements.txt --force-reinstall --upgrade --no-cache-dir
in the repo so you get both the right llama-cpp-python version and a build with the right env vars set.
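To confirm the venv ended up with the version the repo pins, something like the sketch below can compare requirements.txt against what is installed. The helper names are invented for illustration, and it assumes the pin is written as llama-cpp-python==<version> at the start of a line. It does not tell you whether the build has GPU support; the ggml_init_cublas line in the startup log is the sign to look for there.

```python
import re
from importlib import metadata


def pinned_version(requirements_text, package="llama-cpp-python"):
    """Return the ==pinned version of a package from requirements text, or None."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*([^\s;#]+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None


def installed_version(package="llama-cpp-python"):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Usage from the repo root, inside the venv:
# with open("requirements.txt", encoding="utf-8") as f:
#     want = pinned_version(f.read())
# print(want, installed_version(), want == installed_version())
```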
How about M2 Macbook Air tho
@Sync-Z1 I haven't tried this myself, but maybe you can refer to the official llama-cpp-python docs: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
Also, see my last reply about installing the right version at the same time, as those official instructions will yield an incompatible version for this repo.
Also idk how I got roped into generally explaining how to use llama-cpp-python lol, but I believe at this point I have covered every possible situation:
- llama-cpp-python (as opposed to GPT4All) supports GPU acceleration
- llama-cpp-python (just use a dedicated venv or conda env)
- llama-cpp-python must be installed with env vars to instruct it to compile the GPU support
- LlamaCpp line to offload the GPU layers and fix the context window.
- llama-cpp-python docs (RTFM anyway imo)
- /home/... to your model in case you have some weird relative path issues.
So read back through the thread for each insight and good luck
But I am getting the right version; the problem is that the model is not compatible. It shows "Model not found" after updating llama-cpp-python to the latest version, since the ggml format is no longer supported and has been replaced by gguf. As mentioned above, the quantized model I am using is in ggml format.
@Komal-99 oh, I see now that you did specify the version, sorry. I would reach out on the llama-cpp-python repo to get help with that then; it's definitely outside the scope of this repo.
Does it have something to do with tensorflow? And it's weird that from the following console messages,
Does that mean, I'm NOT using tensorflow-gpu? But ONLY tensorflow-CPU ???