zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

Wrong part of document chosen as context, yields incorrect answer to simple question #5

Closed MartyLake closed 1 year ago

MartyLake commented 1 year ago

Hi, thanks for providing such a tool!

Using this diff:

```diff
diff --git a/ingest.py b/ingest.py
index e5d27b3..acae658 100644
--- a/ingest.py
+++ b/ingest.py
@@ -2,20 +2,25 @@ from langchain.document_loaders import TextLoader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.vectorstores import Chroma
 from langchain.embeddings import LlamaCppEmbeddings
+from sys import argv
 
 def main():
     # Load document and split in chunks
-    loader = TextLoader('./source_documents/state_of_the_union.txt', encoding='utf8')
+    print(f"ingest.py: ingesting {argv[1]}")
+    loader = TextLoader(argv[1], encoding="utf8")
     documents = loader.load()
     text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
     texts = text_splitter.split_documents(documents)
```
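For reference, a minimal sketch of what the patched entry point looks like after this diff. The usage guard is a hypothetical addition on my side, so a missing command-line argument fails with a clear message instead of an `IndexError`; the Chroma/embedding steps that follow in the real script are omitted:

```python
# Sketch of the patched ingest.py entry point. The usage guard is a
# hypothetical hardening, not part of the diff above; the persistence
# steps of the real script are omitted here.
from sys import argv

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def main():
    if len(argv) < 2:
        raise SystemExit("usage: python3 ingest.py <path-to-document>")
    print(f"ingest.py: ingesting {argv[1]}")
    loader = TextLoader(argv[1], encoding="utf8")
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    print(f"ingest.py: split into {len(texts)} chunks")


if __name__ == "__main__":
    main()
```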

I tried to feed a private git wiki, consisting of many .md files, with:

```sh
find source_documents | grep "md$" | parallel -j1 python3 ingest.py {}
```
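For anyone who prefers to stay in Python, here is a rough equivalent of that pipeline (a sketch; it assumes ingest.py lives in the current directory and is invoked once per file, exactly like the `parallel -j1` command above):

```python
# Sketch: walk source_documents, pick up every Markdown file, and ingest
# them one at a time by shelling out to ingest.py (mirrors `parallel -j1`).
from pathlib import Path
import subprocess

for md in sorted(Path("source_documents").rglob("*.md")):
    subprocess.run(["python3", "ingest.py", str(md)], check=True)
```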

One of those .md files contains a section called "Setting up a nightly build".

Then, when querying the model with a simple question like "How to set up a nightly build?", I see that the retrieval step (the one that fills the prompt "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.") picks a document that has nothing to do with the requested topic, and the model then yields a hallucinated, wrong answer.
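When debugging this kind of retrieval failure, it can help to query the vector store directly and inspect which chunks come back, before any LLM is involved. A debugging sketch, assuming the persisted Chroma index lives in `db` and the embeddings come from the `ggml-model-q4_0.bin` file referenced later in this thread:

```python
# Debugging sketch: run the similarity search on its own and print which
# source chunks come back for the question. Paths are assumptions based on
# defaults that appear elsewhere in this thread.
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma

embeddings = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")
db = Chroma(persist_directory="db", embedding_function=embeddings)

for doc in db.similarity_search("How to set up a nightly build?", k=4):
    print(doc.metadata.get("source"), "->", doc.page_content[:120])
```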

imartinez commented 1 year ago

Hey @MartyLake! Yeah, I'm experiencing the same issue. Looking at it, it is clearly a problem with the embeddings process (that's why the context passed to the LLM contains no relevant text). The issue could lie with the embeddings model used, or with the way we generate the embeddings (the text splitter selected and its configuration).

Before switching to a different embeddings model, I'll try some different text splitter configurations and see if results improve.
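As a starting point, a sweep like the following makes configurations easy to compare (a sketch; the chunk sizes are illustrative, not the values the repo ended up using):

```python
# Sketch: re-chunk the same document under a few chunk_size/chunk_overlap
# combinations and compare how many chunks each produces and where the
# chunk boundaries fall. Values are illustrative only.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

documents = TextLoader("./source_documents/state_of_the_union.txt", encoding="utf8").load()

for chunk_size, overlap in [(1000, 0), (500, 50), (256, 32)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    texts = splitter.split_documents(documents)
    print(f"chunk_size={chunk_size}, overlap={overlap}: {len(texts)} chunks")
    print("  first chunk ends with:", repr(texts[0].page_content[-60:]))
```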

Feel free to do the same and share your results!

imartinez commented 1 year ago

I just updated master with a change to the splitter (by the way, I also made ingest take an argument, thanks for that):

https://github.com/imartinez/privateGPT/commit/92244a90b4ef094481ffcefb2e8a1ebdbbcd110d

The result looks better for me in many cases. Let me know if it improved for you. Here is an example:

```
Enter a query: What was the NATO Alliance created for?

llama_print_timings:        load time =   980.41 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  2438.41 ms /    10 tokens (  243.84 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  2453.60 ms
gptj_generate: seed = 1683300608
gpt_tokenize: unknown token '�'   [repeated 12 times]
gptj_generate: number of tokens in prompt = 474

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people.

Throughout our history weve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.

They keep moving.

And the costs and the threats to America and the world keep rising.

Thats why the NATO Alliance was created to secure peace and stability in Europe after World War 2.

We prepared extensively and carefully.

We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin.

I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression.

We countered Russias lies with truth.

And now that he has acted the free world is holding him accountable.

And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them.

Putin has unleashed violence and chaos. But while he may make gains on the battlefield – he will pay a continuing high price over the long run.

And a proud Ukrainian people, who have known 30 years of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards.

Our forces are not going to Europe to fight in Ukraine, but to defend our NATO Allies – in the event that Putin decides to keep moving west.

For that purpose weve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania, and Estonia.

As I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power.

Question: What was the NATO Alliance created for?
Helpful Answer: The NATO Alliance was created to secure peace and stability in Europe after World War 2, and to counter the aggression of dictators such as Hitler and Stalin. The alliance is made up of countries from Europe, Asia, Africa, and the Americas, and is responsible for defending NATO countries, including Poland, Romania, Latvia, Lithuania, and Estonia.<|endoftext|>
```

MartyLake commented 1 year ago

@imartinez Thanks! I'll run the ingest again and try the new version.

imartinez commented 1 year ago

Pull the latest changes and re-run the ingestion. Results will be much better.
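One caveat worth noting (my assumption, not something the repo documents): the persisted Chroma index in `db` may still contain chunks produced by the old splitter, so wiping it before re-ingesting guarantees a clean rebuild. A sketch:

```python
# Sketch: remove the persisted Chroma index so stale chunks from the old
# splitter configuration can't survive the re-ingestion. "db" is the
# persist directory shown in the logs below; adjust if yours differs.
import shutil
from pathlib import Path

persist_dir = Path("db")
if persist_dir.exists():
    shutil.rmtree(persist_dir)
    print(f"removed stale index at {persist_dir}")
```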

MartyLake commented 1 year ago

```
$ git pull
Already up to date.

$ find source_documents | grep "md$" | parallel -j1 python3 ingest.py {}
# here was tons of output; this ran all night

$ pip3 install pygpt4all
Requirement already satisfied: pygpt4all in /opt/homebrew/lib/python3.11/site-packages (1.1.0)
Requirement already satisfied: pyllamacpp in /opt/homebrew/lib/python3.11/site-packages (from pygpt4all) (1.0.7)
Requirement already satisfied: pygptj in /opt/homebrew/lib/python3.11/site-packages (from pygpt4all) (1.0.10)

[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: python3.11 -m pip install --upgrade pip

$ python3 privateGPT.py
llama.cpp: loading model from ./models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/Users/mtt/Documents/personal/privateGPT/privateGPT.py", line 39, in <module>
    main()
  File "/Users/mtt/Documents/personal/privateGPT/privateGPT.py", line 15, in main
    llm = GPT4All(model='./models/ggml-gpt4all-j-v1.3-groovy.bin', backend='gptj', callbacks=callbacks, verbose=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for GPT4All
__root__
  Could not import pygpt4all python package. Please install it with `pip install pygpt4all`. (type=value_error)
[2023-05-10 12:13:57,709] {duckdb.py:414} INFO - Persisting DB to disk, putting it in the save folder: db
```

I am a bit puzzled now.
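One possible explanation, offered as a guess rather than a diagnosis: the `pip3` output above shows pygpt4all installed under Homebrew's Python 3.11 site-packages, while the `python3` that ran privateGPT.py may be a different interpreter. A quick check:

```python
# Diagnostic sketch: confirm that the interpreter running privateGPT.py is
# the one pygpt4all was installed into. Run with the same `python3` as above.
import importlib.util
import sys

print("interpreter:", sys.executable)
print("pygpt4all importable:", importlib.util.find_spec("pygpt4all") is not None)
```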