zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

When compiling with GPU not enough space error #921

Closed. DenizK7 closed this issue 9 months ago.

DenizK7 commented 1 year ago

When I run the code with the GPU, this error occurs:

```
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 13340448, available 10485760)
Traceback (most recent call last):
```

The error doesn't always happen, but when the question involves a word that appears very frequently in the documents, the application usually throws this error.

My CUDA version is 11.8.

I tried this code to fix the error, but nothing changed:

```python
import torch

device = torch.device("cuda:0")
memory_fraction = 1.0
torch.cuda.set_per_process_memory_fraction(memory_fraction, device=device)
```
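(Note for anyone hitting the same wall: `torch.cuda.set_per_process_memory_fraction` only limits PyTorch's own CUDA allocator, while the "context's memory pool" in the message is ggml's internal arena, so the snippet above most likely has no effect on llama.cpp. A knob that might help instead, assuming privateGPT builds its LLM through LangChain's `LlamaCpp` wrapper as in `privateGPT.py`, is to shrink `n_ctx` / `n_batch` so the scratch allocations shown in the load log are smaller. The values below are illustrative, not tested.)

```python
from langchain.llms import LlamaCpp

# Illustrative settings only: smaller n_ctx and n_batch shrink the scratch
# buffers llama.cpp allocates (the load log sizes them as
# batch_size x (512 kB + n_ctx x 128 B)), which may relieve the memory pool error.
llm = LlamaCpp(
    model_path="models/koala-7B.ggmlv3.q8_0.bin",
    n_ctx=1024,       # down from the 2048 shown in the load log
    n_batch=256,      # smaller evaluation batches
    n_gpu_layers=32,  # keep offloading the repeating layers to the GPU
)
```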

Hardware: Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, 16 GB VRAM

Output of the code:

```
Using embedded DuckDB with persistence: data will be stored in: db
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6
llama.cpp: loading model from models/koala-7B.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 530.90 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8104 MB
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
```

Error:

```
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 13340448, available 10485760)
Traceback (most recent call last):
  File "C:\Users\xx\Documents\GitHub\privateGPT\privateGPT.py", line 91, in <module>
    main()
  File "C:\Users\xx\Documents\GitHub\privateGPT\privateGPT.py", line 62, in main
    res = qa(query)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\retrieval_qa\base.py", line 120, in _call
    answer = self.combine_documents_chain.run(
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 239, in run
    return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\combine_documents\base.py", line 84, in _call
    output, extra_return_dict = self.combine_docs(
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\combine_documents\stuff.py", line 87, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 213, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 69, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm.py", line 79, in generate
    return self.llm.generate_prompt(
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 134, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 191, in generate
    raise e
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 185, in generate
    self._generate(prompts, stop=stop, run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\base.py", line 436, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\llamacpp.py", line 225, in _call
    for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\langchain\llms\llamacpp.py", line 274, in stream
    for chunk in result:
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 899, in _create_completion
    for token in self.generate(
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 721, in generate
    self.eval(tokens)
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama.py", line 461, in eval
    return_code = llama_cpp.llama_eval(
  File "C:\Users\xx\AppData\Roaming\Python\Python310\site-packages\llama_cpp\llama_cpp.py", line 678, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation writing 0x0000000000000050
```

Ciaranwuk commented 1 year ago

I'm afraid I've never got offloading to work, so I can't help much here, but with a graphics card that size you could easily run the model as a GPTQ. Is there a specific reason you want to use ggml and offload layers rather than just running it on the GPU?

DenizK7 commented 1 year ago

Actually, I don't know what GPTQ is. I just want to run the workload on the GPU instead of the CPU. I came across https://github.com/imartinez/privateGPT/issues/885 in the issues and tried it. I'll look into what GPTQ is. Could you point me to a source?

Ciaranwuk commented 1 year ago

Sorry, I realised this project doesn't support GPTQ models. I use chatdocs, which is (I think) a fork of this project. If you search huggingface.com for the GPTQ version of your model, something should come up. It should run lightning fast on your GPU; you'll need to make sure the CUDA packages are installed, but it looks like you probably already have that covered.
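To make that concrete, running a GPTQ model looks roughly like the sketch below. It assumes the `auto-gptq` and `transformers` packages and a GPTQ export of the model on the Hugging Face hub; the repo id is illustrative, and this is separate from privateGPT itself.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Confirm the CUDA packages are actually visible before loading anything.
assert torch.cuda.is_available(), "CUDA is not available to PyTorch"

model_id = "TheBloke/koala-7B-GPTQ"  # illustrative; search the hub for a GPTQ build of your model

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=True)

inputs = tokenizer("What do my documents say about the budget?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```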

DenizK7 commented 1 year ago

Thanks. Do you have any idea why this memory pool error occurs?

Ciaranwuk commented 1 year ago

Afraid not :/ It's probably a good idea to try a different model, just in case the problem is specific to this one. If that doesn't work, maybe try reinstalling in a clean virtual env?
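If you do rebuild in a clean venv, a quick check like this (the package names are just the ones that appear in the traceback above) confirms what the fresh environment actually installed before you retry:

```python
# Minimal environment sanity check; adjust the package list as needed.
import importlib.metadata as metadata

for pkg in ("llama-cpp-python", "langchain", "torch"):
    try:
        print(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```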