weaviate / Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
BSD 3-Clause "New" or "Revised" License

Document Issues #115

Closed: Moshie1112 closed this issue 1 month ago

Moshie1112 commented 7 months ago

My documents are .txt files. When I load them, I get one of three results: documents load with no chunks; 0 documents load with no chunks; or the error Chunk mismatch for 1fa2a323-d32c-4a87-89fc-4566c56d30fd 0 != 37

I do not know what to do. I am trying to load YouTube transcripts, if that matters... help

cam-barts commented 7 months ago

Hi @Moshie1112, would you be able to share the document that you are trying to upload, as well as the chunking settings you are using, so that I can try to replicate the issue? I have tried with this YouTube video transcript: How_to_grow_your_SRE_practice.txt. I've verified that this configuration works, at least on my machine: the SimpleReader (since it's a text file), the TokenChunker set to 750 units with a 250 overlap, and the MiniLmEmbedder.
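
For readers unfamiliar with the chunker settings mentioned above, here is a minimal sketch of what "750 units with a 250 overlap" means for a sliding-window chunker. It is an illustration only, not Verba's actual TokenChunker: the function name chunk_tokens is hypothetical, and it assumes the text has already been split into a list of tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], units: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `units` tokens, where consecutive
    windows share `overlap` tokens. Hypothetical helper, not Verba's code."""
    if units <= 0:
        raise ValueError("units must be positive")
    if not 0 <= overlap < units:
        raise ValueError("overlap must be >= 0 and smaller than units")

    step = units - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + units])
        # Stop once the current window already reaches the end of the input.
        if start + units >= len(tokens):
            break
    return chunks


# Stand-in tokens; a real tokenizer would produce these from the document.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, units=750, overlap=250)
print(len(chunks))  # 2 windows: tokens 0-749 and tokens 500-999
```

With these settings the window advances 500 tokens at a time, so a 1000-token transcript yields two chunks. An empty token list yields zero chunks.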

doyled-it commented 5 months ago

I get the same issue, but a different chunk mismatch. This is the document I was using with the PDFReader.

My TokenChunker is set to 250 with a 50 overlap and I'm using the ADAEmbedder.

Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence The White House.pdf

I'm running Verba using Docker Compose.

Edit: @cam-barts, it turns out that if I use the .txt file from your message, I get the same error that @Moshie1112 and I have been getting.

sbhadana commented 4 months ago

I am facing exactly the same issue, "Chunk mismatch for 1f6e1308-08a3-4f98-b52c-424fe71a39c0 0 != 2", with the ADAEmbedder. Any help would be appreciated.

Thanks

thomashacker commented 4 months ago

We improved the Reader functionality in the newest release; it should now support all basic file types! Let me know if the error still persists.

qlmeng86 commented 4 months ago

I'm facing the same issue in the latest release, v1.0.2. The error message is "Chunk mismatch for e1831290-33d6-4724-9661-64245306bf53 0 != 168" when I upload the file README.md. My TokenChunker is set to 50 with a 20 overlap and I'm using the ADAEmbedder.

thomashacker commented 4 months ago

Are you encountering any errors in the CLI? Did you verify that your OpenAI key is working?

sbhadana commented 4 months ago

Yes, the Azure OpenAI key is correct. The error occurred while executing verba start in a Python venv.

dnbeze commented 3 months ago

Just in case anyone else has this issue - I had the same problem, but found it was because I ran out of credits on my OpenAI account :)

If you check the console output, you may see:

✘ {'errors': {'error': [{'message': 'update vector: connection to: OpenAI API failed with status: 429 error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs:
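
If you have saved the console output to a file or a string, a small script can pick out this kind of failure. The following is a minimal sketch that assumes the message format shown above ("failed with status: 429 ... You exceeded your current quota"). The function names find_api_status and is_quota_error are hypothetical helpers, not part of Verba or the OpenAI client.

```python
import re
from typing import Optional

# Matches the "failed with status: NNN" fragment seen in the console message.
_STATUS_RE = re.compile(r"failed with status:\s*(\d{3})")


def find_api_status(console_line: str) -> Optional[int]:
    """Return the HTTP status code reported in a console line, if any."""
    match = _STATUS_RE.search(console_line)
    return int(match.group(1)) if match else None


def is_quota_error(console_line: str) -> bool:
    """True when the line reports status 429 together with the quota message."""
    return (
        find_api_status(console_line) == 429
        and "exceeded your current quota" in console_line
    )


# Sample text copied from the (truncated) console message in this thread.
sample = (
    "update vector: connection to: OpenAI API failed with status: 429 "
    "error: You exceeded your current quota, please check your plan and "
    "billing details."
)
print(find_api_status(sample), is_quota_error(sample))
```

Checking for the quota text as well as the status code matters because a 429 status on its own can also indicate ordinary rate limiting.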