zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0
54.21k stars 7.29k forks

Change chunking/splitting method from SentenceWindowNodeParser to JSONNodeParser #2072

Open Beet-Farms opened 2 months ago

Beet-Farms commented 2 months ago

Question

I’m currently using PrivateGPT v0.6.1 with Llama-CPP support on a Windows machine with a Qdrant DB. The LLM is Mistral-7B-Instruct-v0.3 and the embedding model is BAAI/bge-m3.

I have a situation where I need to ingest a large JSON file - say a telephone directory - where each record should remain intact as a single node. When using the SentenceWindowNodeParser, the records are often split at improper places, leading to jumbled responses when querying the LLM, especially when matching users to their telephone numbers.
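To illustrate the goal (this is a standalone sketch with a hypothetical `records_to_chunks` helper, not PrivateGPT code): record-level chunking means each top-level JSON record becomes exactly one chunk, so a name is never separated from its phone number.

```python
import json

def records_to_chunks(raw_json: str) -> list[str]:
    """Split a JSON array of records into one text chunk per record.

    Each record stays intact, so a name is never separated from
    its telephone number.
    """
    records = json.loads(raw_json)
    return [json.dumps(record, ensure_ascii=False) for record in records]

directory = '[{"name": "Alice", "phone": "555-0100"}, {"name": "Bob", "phone": "555-0199"}]'
chunks = records_to_chunks(directory)
# Each chunk is one complete record, e.g. '{"name": "Alice", "phone": "555-0100"}'
```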

I made the following changes to `ingest_service.py`:

  1. Replaced the import statement `from llama_index.core.node_parser import SentenceWindowNodeParser` with `from llama_index.core.node_parser import JSONNodeParser`
  2. Replaced `node_parser = SentenceWindowNodeParser.from_defaults()` with `node_parser = JSONNodeParser.from_defaults()`

After making these changes, I tried ingesting the JSON file again. It didn’t throw any errors, but the console showed that the file was converted into 1 document, with a message saying: `private_gpt.components.ingest.ingest_component - Inserting count=0 nodes in the index`. As expected, I don't see any nodes in Qdrant.
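One plausible explanation for `count=0` (an assumption based on reading llama-index's source, not confirmed here): JSONNodeParser runs each document's text through `json.loads` and silently skips documents whose text is not valid JSON - which is exactly what happens if the file reader has already flattened the JSON into plain `key value` lines before the parser sees it. A stdlib-only sketch of that skip-on-decode-error behavior:

```python
import json

def nodes_from_text(text: str) -> list[str]:
    """Mimic a JSON-aware parser that silently skips non-JSON input."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # invalid JSON -> zero nodes, but no error raised
    items = data if isinstance(data, list) else [data]
    return [json.dumps(item) for item in items]

# A reader may hand the parser flattened "key value" lines, not raw JSON:
flattened = "name Alice\nphone 555-0100"
print(len(nodes_from_text(flattened)))                # 0 -> "Inserting count=0 nodes"
print(len(nodes_from_text('[{"a": 1}, {"b": 2}]')))   # 2
```

If this is the cause, checking what text the reader actually produces for the `.json` file (before node parsing) would confirm it.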

What am I missing? Your advice would be greatly appreciated!

jaluma commented 2 months ago

Can you check whether there are nodes with the original logic? That would help rule out whether the error is in the NodeParser or in reading the JSON file.

Beet-Farms commented 2 months ago

Thanks for your response. Yes, I can see nodes in Qdrant when using the default SentenceWindowNodeParser. I also checked MarkdownNodeParser, which works fine. So the problem appears only when using JSONNodeParser.

To enable MarkdownNodeParser, I followed the same steps I attempted for `JSONNodeParser`.

In `ingest_service.py`:

  1. Replaced the import statement `from llama_index.core.node_parser import SentenceWindowNodeParser` with `from llama_index.core.node_parser import MarkdownNodeParser`
  2. Replaced `node_parser = SentenceWindowNodeParser.from_defaults()` with `node_parser = MarkdownNodeParser.from_defaults()`

Would love to know if anyone has succeeded in using JSONNodeParser.
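Until that's resolved, one workaround I'd try (untested against PrivateGPT, stdlib only, with a hypothetical `presplit_json` helper): pre-split the directory file into one small file per record before ingesting. Then each document already contains exactly one self-contained record, and even the default SentenceWindowNodeParser has nothing to split incorrectly.

```python
import json
from pathlib import Path

def presplit_json(src: Path, out_dir: Path) -> int:
    """Write each top-level JSON record to its own file so any node
    parser sees one self-contained record per document."""
    out_dir.mkdir(parents=True, exist_ok=True)
    records = json.loads(src.read_text(encoding="utf-8"))
    for i, record in enumerate(records):
        (out_dir / f"record_{i:05d}.json").write_text(
            json.dumps(record, ensure_ascii=False), encoding="utf-8"
        )
    return len(records)
```

The resulting `record_*.json` files can then be ingested as usual (e.g. by pointing the ingestion folder at `out_dir`).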