weaviate / Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
BSD 3-Clause "New" or "Revised" License
6.05k stars 642 forks

Mass Import Files? #239

Open Raichuu41 opened 2 months ago

Raichuu41 commented 2 months ago

Description

Question/Discussion: What is the best way to mass-import many files? I need to import about 200,000 text files. Currently, my only working solution is to upload the files in batches of 500 into GitHub folders and then import those folders one by one via the GitHub reader, manually, whenever the current import has completed. Is there an easier way to do this, ideally by sending the file bytes directly to an API endpoint?
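The batch-of-500 workaround above can at least be scripted. Here is a minimal sketch of the batching step; the file names are placeholders, and nothing here uses a real Verba API:

```python
from typing import Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the real list of ~200,000 file paths.
files = [f"doc_{i}.txt" for i in range(1200)]

batches = list(batched(files, 500))
print(len(batches))       # 3 batches: 500 + 500 + 200
print(len(batches[-1]))   # 200
```

Each batch could then be committed to its own GitHub folder (or, once an API exists, posted directly) in a loop instead of by hand.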

Is this a bug or a feature?

Steps to Reproduce

[see above]

Additional context

[None]

thomashacker commented 2 months ago

Good point. There is currently no feature for mass importing files, but we'll add it to the feature list.

Raichuu41 commented 1 month ago

> Good point. There is currently no feature for mass importing files, but we'll add it to the feature list.

@thomashacker Would it be possible for me to support you in implementing mass file import? I really need the functionality and have the skills to contribute it. If possible, could we have a short online meeting of 30–60 minutes where you introduce the current import mechanism and explain how it can be extended?

thomashacker commented 1 month ago

We're currently implementing mass file import in the upcoming v2 version, which should be released in a couple of weeks. If you need the functionality now, you can add it yourself; the source code of the frontend and backend is all available here 😄

thomashacker commented 2 weeks ago

Implemented the mass import functionality in the newest release

Raichuu41 commented 1 week ago

> Implemented the mass import functionality in the newest release

Where is this implemented? I see no documentation for it. I found one backend endpoint, @app.websocket("/ws/import_files"); is it this one? And if so, how do I make it work? I have tried to understand the code, but it keeps failing. I send data to this endpoint that passes validation as a DataBatchPayload, but it then fails in add_batch() when self.check_batch() tries to generate the fileConfig: the resulting value does not validate as a FileConfig. Tracing through the code, the value in question is built from the chunks field, as shown here (goldenverba/server/helpers.py):

```python
chunks = self.batches[fileID]["chunks"]
data = "".join([chunks[chunk] for chunk in chunks])
```

So I assume the joined chunks of a DataBatchPayload need to decode to a FileConfig? If so, why is the chunks field typed as a string and not a FileConfig? Some documentation would be nice; maybe this isn't even the intended mechanism. Generally, it is good practice to mention the fixed issue in the commit that resolves it. I am also confused that there is no technical documentation left in the repository: the hyperlink still exists in the README, but it points to nothing, and the technical markdown file has been deleted with no replacement.
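If the commenter's reading is right, the protocol would look roughly like this: a serialized FileConfig-like payload is split into string chunks, one chunk is sent per DataBatchPayload message, and the server reassembles them by joining the chunks (mirroring the `"".join(...)` in helpers.py). This is a hedged sketch of that assumption, not Verba's actual code; the payload fields are invented for illustration:

```python
import json

# Hypothetical FileConfig-like payload (field names are made up).
file_config = {"fileID": "doc_1", "filename": "doc_1.txt", "extension": "txt"}
serialized = json.dumps(file_config)

# Client side: split the serialized payload into small string chunks,
# keyed by chunk index (deliberately tiny chunk size to show several chunks).
CHUNK_SIZE = 8
chunks = {
    i: serialized[start:start + CHUNK_SIZE]
    for i, start in enumerate(range(0, len(serialized), CHUNK_SIZE))
}

# Server side: join the chunks in key order, then parse the result as JSON,
# analogous to the `"".join([chunks[chunk] for chunk in chunks])` line above.
data = "".join(chunks[i] for i in sorted(chunks))
assert json.loads(data) == file_config
```

Under this reading, the chunks field is a string because each message carries only a fragment of the serialized config, and only the joined result is expected to validate as a FileConfig.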

thomashacker commented 1 week ago

Good point! We added the mass import functionality via the frontend; the FastAPI endpoints are currently only optimized to communicate with the frontend. Can you share more information on what functionality you need? We're working on a user API to make it easier to use programmatically in the future.

And I agree, we're currently reworking the technical documentation, will be re-added soon 🚀