Closed: DenizK7 closed this issue 9 months ago
I'm afraid I've never got offloading to work, so I can't help much here, but with a graphics card that size you could easily run the model as a GPTQ. Is there a specific reason you want to use ggml and offload layers rather than just running it on the GPU?
Actually, I don't know what GPTQ is. I just want to run the workload on the GPU instead of the CPU. I saw https://github.com/imartinez/privateGPT/issues/885 in the issues and then tried it. I'll look into what GPTQ is. Could you point me to a source?
Sorry, I just realised this project doesn't have support for GPTQ models. I use chatdocs, which is a fork of this (I think). If you search huggingface.com for the GPTQ version of your model, something should come up. It should run lightning fast on your GPU, but you will need to make sure you've got the CUDA packages installed; it looks like you probably already have that.
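(For reference, a minimal sketch of what loading a GPTQ model from Hugging Face generally looks like with the AutoGPTQ library; the repo id below is only a placeholder for illustration, not a recommendation from this thread:)

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/koala-7B-GPTQ"  # placeholder repo id, an assumption for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Load the already-quantized weights directly onto the GPU.
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tokenizer("What is GPTQ?", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0]))

A GPTQ model runs entirely in VRAM, which is why it tends to be much faster than partially offloading ggml layers.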
Thanks. Do you have any idea why this pool error occurs?
Afraid not :/ It's probably a good idea to try it with a different model, just in case the error is specific to that one. If that doesn't work, maybe try reinstalling in a clean virtual env?
When I run the code with the GPU, I get this error (full output and traceback below):

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 13340448, available 10485760)
My CUDA version is 11.8.
I tried this code to fix the error, but nothing changed:
import torch

# Try to let this process use up to 100% of GPU 0's memory.
device = torch.device("cuda:0")
memory_fraction = 1.0
torch.cuda.set_per_process_memory_fraction(memory_fraction, device=device)
Hardware: Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, 16 GB VRAM
Output of the code:

Using embedded DuckDB with persistence: data will be stored in: db
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
llama.cpp: loading model from models/koala-7B.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 530.90 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8104 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
ERROR:

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 13340448, available 10485760)
Traceback (most recent call last):
File "C:\Users\xx\Documents\GitHub\privateGPT\privateGPT.py", line 91, in <module>
main()
File "C:\Users\xx\Documents\GitHub\privateGPT\privateGPT.py", line 62, in main
res = qa(query)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in call
raise e
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in call
self._call(inputs, run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\retrieval_qa\base.py", line 120, in _call
answer = self.combine_documents_chain.run(
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 239, in run
return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in call
raise e
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in call
self._call(inputs, run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\combine_documents\base.py", line 84, in _call
output, extra_return_dict = self.combine_docs(
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\combine_documents\stuff.py", line 87, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 213, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in call
raise e
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in call
self._call(inputs, run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 69, in _call
response = self.generate([inputs], run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 79, in generate
return self.llm.generate_prompt(
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 134, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 191, in generate
raise e
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 185, in generate
self._generate(prompts, stop=stop, run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 436, in _generate
self._call(prompt, stop=stop, run_manager=run_manager)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\llamacpp.py", line 225, in _call
for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\llamacpp.py", line 274, in stream
for chunk in result:
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 899, in _create_completion
for token in self.generate(
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 721, in generate
self.eval(tokens)
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 461, in eval
return_code = llama_cpp.llama_eval(
File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama_cpp.py", line 678, in llama_eval
return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation writing 0x0000000000000050
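(For anyone hitting the same pool error, a minimal sketch of the llama.cpp-side knobs that seem worth experimenting with, using the LangChain LlamaCpp wrapper that privateGPT builds on; the parameter values are illustrative guesses, not a confirmed fix from this thread:)

from langchain.llms import LlamaCpp

# Sketch only: offload fewer layers and shrink the batch to reduce the VRAM /
# scratch-buffer pressure shown in the log above. Values are assumptions, not a verified fix.
llm = LlamaCpp(
    model_path="models/koala-7B.ggmlv3.q8_0.bin",
    n_ctx=2048,       # same context size as in the log
    n_batch=256,      # smaller batch -> smaller scratch buffer (see the 384 MB line in the log)
    n_gpu_layers=24,  # fewer than the 35/35 layers offloaded in the log
    verbose=True,
)
print(llm("Hello"))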