weaviate / Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
BSD 3-Clause "New" or "Revised" License

Verba fails to identify / classify data correctly #220

Open Network-Sec opened 1 month ago

Network-Sec commented 1 month ago

Description

Using a pure-local config with Ollama and Unstructured in Docker, I can import CSV data and interact with it (win!):

```
OLLAMA_URL="http://192.168.2.204:9350"
OLLAMA_MODEL="llama3"
OLLAMA_EMBED_MODEL="mxbai-embed-large"
UNSTRUCTURED_API_URL="http://192.168.2.216:9360/general/v0/general"
UNSTRUCTURED_API_KEY="pseudokey"
```

Problem: When importing a CSV with columns like "First Name, Last Name, DoB, Phone" and asking for the phone number of "Steven", Verba does find the relevant data but fails to identify it correctly.

Q: "Give me the phone number of all people with first name Steven" A: "According to the provided context, the phone number of Steven Smith (born 21/11/1979) is not explicitly mentioned. However, based on the chunk numbers and the format of the data, we can infer that Steven Smith's phone number might be somewhere in the range of 00117xxxxxxx, but it would require more information or a specific document to retrieve the exact phone number."
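One plausible cause (my assumption, not confirmed by the maintainers) is that a generic text chunker splits the CSV mid-row, so a person's name and phone number land in different chunks and the retriever never hands the LLM a complete record. A possible workaround is to pre-process each row into a self-contained sentence before import; the sketch below uses the column names from the issue and entirely hypothetical sample data:

```python
import csv
import io

def rows_to_sentences(csv_text: str) -> list[str]:
    """Turn each CSV row into one self-contained sentence so that a
    chunker can never separate a name from its phone number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sentences = []
    for row in reader:
        sentences.append(
            f"{row['First Name']} {row['Last Name']}, "
            f"born {row['DoB']}, has the phone number {row['Phone']}."
        )
    return sentences

# Hypothetical sample data in the same shape as the issue's CSV
sample = (
    "First Name,Last Name,DoB,Phone\n"
    "Steven,Smith,21/11/1979,0011-555-0101\n"
    "Maria,Jones,03/04/1985,0011-555-0102\n"
)
for line in rows_to_sentences(sample):
    print(line)
```

Feeding these sentences to Verba as a plain-text document (instead of the raw CSV) keeps every fact about a person inside a single chunk, which should make "phone number of Steven" answerable by retrieval alone.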

Is this a bug or a feature?

Steps to Reproduce

  1. Import a small CSV table with columns like "First Name, Last Name, DoB, Phone", using the provided local configuration.
  2. Ask for the phone number of one of the people contained in the data.
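To make the reproduction concrete, a minimal test fixture matching the columns in step 1 could be generated like this (the names, dates, and numbers are made up for illustration):

```python
import csv

# Hypothetical fixture with the column layout described in the issue
rows = [
    ["First Name", "Last Name", "DoB", "Phone"],
    ["Steven", "Smith", "21/11/1979", "0011-555-0101"],
    ["Anna", "Lee", "05/06/1990", "0011-555-0102"],
]
with open("people.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Importing `people.csv` through the Unstructured pipeline and then asking "Give me the phone number of all people with first name Steven" should reproduce the behavior.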

Additional context

Fully aware this isn't exactly a bug, but at this point I'm completely blind to possible causes or solutions. The data is well-formed, and the data points should be easy for the model to identify. I don't know how to improve the situation: Do I need to import or chunk it differently? Is it to be solved during inference? What could help in this case?

I'll call it a feature request for now, maybe we just need a few more options to make this work...

thomashacker commented 1 week ago

Good point! I can see that long tables might confuse the current retrieval system and, thus, the selected LLM. I think adding metadata in the future could fix this. I'll add it to the feature list 🚀
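One shape the metadata idea could take (a sketch of the concept only, not Verba's actual data model) is carrying the source row's fields alongside each chunk, so the retriever can filter on structured values instead of hoping they survive free-text chunking. All field names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Structured metadata carried alongside the chunk text, so a
    # retriever or LLM does not have to parse values out of prose.
    metadata: dict[str, str] = field(default_factory=dict)

def chunk_from_row(row: dict[str, str]) -> Chunk:
    """Build one chunk per CSV row, keeping the full row as metadata."""
    text = f"{row['First Name']} {row['Last Name']} ({row['DoB']})"
    return Chunk(text=text, metadata=dict(row))

row = {"First Name": "Steven", "Last Name": "Smith",
       "DoB": "21/11/1979", "Phone": "0011-555-0101"}
c = chunk_from_row(row)
print(c.metadata["Phone"])  # → 0011-555-0101
```

With per-chunk metadata like this, a query for "phone number of Steven" could be answered by a filtered lookup rather than relying on the embedding similarity of an arbitrary table fragment.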