Closed jiapei100 closed 8 months ago
@Komal-99 oh, I see now that you did specify the version, sorry. I would reach out on the llama-cpp-python repo to get help with that then; it's definitely outside the scope of this repo.
Do let me know if there's anything you find that resolves this.
https://github.com/PromtEngineer/localGPT trying this now as it seems to be pre-built for GPU use.
@bioshazard I got you... Thank you...
with modeltype, I got the following ERRORs:
➜ privateGPT git:(main) ✗ python privateGPT.py
2023-08-07 09:52:37.920830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9346] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-08-07 09:52:37.920871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-08-07 09:52:37.920880: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-08-07 09:52:37.926065: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From ~/.local/lib/python3.10/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
llama.cpp: loading model from ./models/nous-hermes-llama2-13b.ggmlv3.q6_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32032
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 18 (mostly Q6_K)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required = 454.09 MB (+ 400.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11001 MB
llama_new_context_with_model: kv self size = 400.00 MB
Enter a query: Hi, how are you?
llama_tokenize_with_model: too many tokens
Traceback (most recent call last):
  File "....../privateGPT.py", line 97, in <module>
    main()
  File "....../privateGPT.py", line 68, in main
    res = qa(query)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/retrieval_qa/base.py", line 133, in _call
    answer = self.combine_documents_chain.run(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 456, in run
    return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 106, in _call
    output, extra_return_dict = self.combine_docs(
  File "~/.local/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 165, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 252, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 258, in __call__
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/chains/base.py", line 252, in __call__
    self._call(inputs, run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 92, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "~/.local/lib/python3.10/site-packages/langchain/chains/llm.py", line 102, in generate
    return self.llm.generate_prompt(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 451, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 582, in generate
    output = self._generate_helper(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 488, in _generate_helper
    raise e
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 475, in _generate_helper
    self._generate(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/base.py", line 961, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 238, in _call
    for chunk in self._stream(
  File "~/.local/lib/python3.10/site-packages/langchain/llms/llamacpp.py", line 288, in _stream
    for part in result:
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 855, in _create_completion
    raise ValueError(
ValueError: Requested tokens (558) exceed context window of 512
Exception ignored in: <function Llama.__del__ at 0x7fad80434e50>
Traceback (most recent call last):
  File "~/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 1508, in __del__
TypeError: 'NoneType' object is not callable
Hi, I noticed that CUDA detected two GPUs on your machine. Does that mean your privateGPT uses both of them for inference? If so, do you mind showing how you did that? Thanks in advance.
Apologies, but it seemed to just do it by itself. I have no idea how. It seems to use the internal graphics card by default and then fall back to the NVIDIA card, which is annoying, because the internal one only has about 200 MB of graphics memory as opposed to the NVIDIA, which has 4 GB. It's something I am still trying to figure out. Also, I am trying to use something other than Llama because that seems to be censored. For pretty much every topic I give it in my research areas, it says "Sorry I can't answer that because I am censored" or words to that effect.
Re: "two GPUS", I see this in the quoted output:
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
OK but that's strange because it seems to want to use 200MB ... and the Nvidia has 4GB
llama_model_load_internal: total VRAM used: 11001 MB
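If anyone wants to check which GPU llama.cpp picked without reading the whole log by eye, the two lines quoted above can be parsed directly. This is only a rough sketch: it assumes the log lines have the same wording as the ones in this thread (other llama.cpp versions may print them differently), and `summarize_gpu_log` is a made-up helper name, not part of any library.

```python
import re


def summarize_gpu_log(log_text: str) -> dict:
    """Pull the main CUDA device and total VRAM out of llama.cpp load output.

    Assumes the line formats quoted in this thread; returns None for any
    field whose line is missing or worded differently.
    """
    main = re.search(r"using device (\d+) \((.+?)\) as main device", log_text)
    vram = re.search(r"total VRAM used: (\d+) MB", log_text)
    return {
        "main_device_index": int(main.group(1)) if main else None,
        "main_device_name": main.group(2) if main else None,
        "total_vram_mb": int(vram.group(1)) if vram else None,
    }


# The two lines quoted above, copied from the posted output.
sample = (
    "ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device\n"
    "llama_model_load_internal: total VRAM used: 11001 MB\n"
)
print(summarize_gpu_log(sample))
```

For the posted run this reports device 0 (the RTX 3090) as the main device with 11001 MB of VRAM in use. These two lines alone don't say whether the second card shares any of the load.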
Also, the error at the end shows that your context is set to 512. Try overriding it to 4096; see my earlier comments about that too.
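To make the numbers in that ValueError concrete, here is a small sketch of the comparison involved. It is a simplified stand-in, not the actual llama-cpp-python code, and the environment variable name `MODEL_N_CTX` is an assumption on my part; check your own `.env` for whatever name your checkout uses. Both helper names are hypothetical.

```python
import os


def read_n_ctx(default: int = 4096) -> int:
    """Read the context window size from the environment.

    MODEL_N_CTX is an assumed variable name, used here for illustration only.
    """
    value = int(os.environ.get("MODEL_N_CTX", default))
    if value <= 0:
        raise ValueError(f"context size must be positive, got {value}")
    return value


def fits_context(requested_tokens: int, n_ctx: int) -> bool:
    """Return True when the requested token count fits in the context window."""
    return requested_tokens <= n_ctx


# Numbers taken from the error message in the traceback.
print(fits_context(558, 512))   # the failing case from the log
print(fits_context(558, 4096))  # the same request with the larger window
```

With a window of 512 the 558-token request does not fit, which matches the error; with 4096 it does.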
Hey. Sorry to be a pain. I get the following error output. No idea what it means.
john@john-GF63-Thin-11SC:~/ai/git/llama-cpp-python$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
Defaulting to user installation because normal site-packages is not writeable
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.6.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 3.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python)
  Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata
  Downloading typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python)
  Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/71/3c/3b1981c6a1986adc9ee7db760c0c34ea5b14ac3da9ecfcf1ea2a4ec6c398/numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Obtaining dependency information for diskcache>=5.6.1 from https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl.metadata
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.6 MB/s eta 0:00:00
Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 9.7 MB/s eta 0:00:00
Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error
  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [43 lines of output]
      scikit-build-core 0.5.0 using CMake 3.27.4 (wheel)
      Configuring CMake...
      loading initial cache file /tmp/tmp5uzeaf90/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.34.1")
      fatal: not a git repository (or any of the parent directories): .git
      fatal: not a git repository (or any of the parent directories): .git
      CMake Warning at vendor/llama.cpp/CMakeLists.txt:125 (message):
        Git repository not found; to enable automatic generation of build info,
        make sure Git is installed and the project is a Git repository.
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-7gf9iivy/normal/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.
        Compiler output:
      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:286 (enable_language)
      -- Configuring incomplete, errors occurred!
      *** CMake configuration failed
      [end of output]
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
Try this
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
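Before rerunning the build, it may also be worth checking whether the CUDA compiler is visible to the build at all. The log above says "The CUDA compiler identification is unknown", and my guess (not confirmed anywhere in this thread) is that `nvcc` is not on `PATH` or not otherwise discoverable. A minimal sketch of that check, with `cuda_build_env` being a made-up helper name:

```python
import os
import shutil


def cuda_build_env() -> dict:
    """Collect the settings a CMake CUDA build commonly relies on.

    Only reports what is visible to this Python process; it does not
    verify that the compiler actually works.
    """
    return {
        "nvcc_on_path": shutil.which("nvcc"),     # full path to nvcc, or None
        "CUDACXX": os.environ.get("CUDACXX"),     # CMake's CUDA compiler override
        "CUDA_HOME": os.environ.get("CUDA_HOME"),  # conventional toolkit root
    }


for key, value in cuda_build_env().items():
    print(f"{key}: {value or 'not set / not found'}")
```

If `nvcc_on_path` comes back empty, that would be consistent with the CMake error, though other causes (for example a compiler and toolkit version mismatch) are possible too.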
Does it have something to do with tensorflow? And it's weird that from the following console messages,
Does that mean I'm NOT using tensorflow-gpu, but ONLY tensorflow-cpu?
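One way to answer that directly is to ask TensorFlow which GPUs it can see, instead of inferring it from the startup messages. As far as I know, the `cpu_feature_guard` line in the earlier log is only about CPU instruction sets and does not by itself mean the install is CPU-only. A small sketch, assuming TensorFlow 2.x; `tensorflow_gpus` is a hypothetical helper name, and it returns an empty list when TensorFlow is not installed:

```python
import importlib.util


def tensorflow_gpus() -> list:
    """Return the names of GPUs TensorFlow can see, or [] if none.

    Also returns [] when TensorFlow is not installed in this environment.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return []
    import tensorflow as tf  # imported lazily so the check above can run first

    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(tensorflow_gpus())
```

An empty list would suggest TensorFlow itself is running on CPU. That is separate from llama.cpp, which reported offloading layers to the GPU in the log earlier in this thread.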