zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://docs.privategpt.dev
Apache License 2.0

Update ingest.py to seamlessly import windows ascii (ISO-8859-1) text files or unix unicode files irrespective of whether they originated in *nix or windoze #561

Closed tkiker closed 5 months ago

tkiker commented 1 year ago

Over the course of my 25-year IT career I've had to switch back and forth, by necessity, between *nix and windows. Consequently, as I try to ingest all my data locally, I'm getting the error below. I can get around it by running my windows text files through a command available on macOS called 'iconv'. It automagically strips out the rogue windows EOL character, and then the file ingests just fine and I can successfully run queries against it in 'privateGPT'.

bryonykiker@Bryonys-MacBook-Air privateGPT-main % rm -rf db
bryonykiker@Bryonys-MacBook-Air privateGPT-main % python3 ingest.py
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 55%|██████████▉ | 12/22 [00:01<00:01, 6.67it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/document_loaders/text.py", line 41, in load
    text = f.read()
           ^^^^^^^^
  File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 22: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/bryonykiker/Downloads/privateGPT-main/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/document_loaders/text.py", line 54, in load
    raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading source_documents/geneology.txt
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/bryonykiker/Downloads/privateGPT-main/ingest.py", line 167, in main()
  File "/Users/bryonykiker/Downloads/privateGPT-main/ingest.py", line 157, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/bryonykiker/Downloads/privateGPT-main/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bryonykiker/Downloads/privateGPT-main/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
RuntimeError: Error loading source_documents/geneology.txt
bryonykiker@Bryonys-MacBook-Air privateGPT-main %
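
For readers trying to interpret the traceback above: byte 0xfc can never start a UTF-8 sequence, but in ISO-8859-1 it is the single character "ü". The snippet below is a minimal illustration of that mismatch. The sample bytes are made up for the example and are not taken from geneology.txt.

```python
# Hypothetical sample: "Müller" as it would be stored in an ISO-8859-1 file.
raw = b"M\xfcller"

# Decoding as UTF-8 fails, because 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding the same bytes as ISO-8859-1 succeeds.
latin = raw.decode("iso-8859-1")

print(message)  # reports byte 0xfc at position 1
print(latin)
```

This suggests the failure comes from a non-ASCII character in the file rather than from its line endings, although only the file itself can confirm that.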

Describe the solution you'd like

Update ingest.py to seamlessly import windows ascii (ISO-8859-1) text files or unix unicode files irrespective of whether the file originated in *nix or windoze.
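
One way the requested behaviour could look, as a rough sketch only: try UTF-8 first and fall back to ISO-8859-1 when decoding fails. The names `decode_with_fallback`, `read_text_file`, and `FALLBACK_ENCODINGS` are hypothetical and are not part of ingest.py or langchain. The sketch also assumes that any file which is not valid UTF-8 is ISO-8859-1, which will not hold for every Windows file.

```python
from pathlib import Path

# Assumption: encodings to try, in order. ISO-8859-1 accepts every byte,
# so it acts as a last resort and must come last.
FALLBACK_ENCODINGS = ("utf-8", "iso-8859-1")


def decode_with_fallback(raw, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Normalize Windows line endings so later steps see "\n" only.
        return text.replace("\r\n", "\n"), encoding
    raise ValueError("none of the encodings could decode the data")


def read_text_file(path, encodings=FALLBACK_ENCODINGS):
    """Read a file as bytes and decode it with the fallback chain."""
    return decode_with_fallback(Path(path).read_bytes(), encodings)
```

Returning the encoding that was used lets the caller log which files needed the fallback, instead of silently guessing.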

Describe alternatives you've considered

Currently I just manually run perl scripts to convert the files with 'iconv', but that is irritating and other folks won't want to do that. THIS tool will be more successful if the ingest.py script can flexibly handle more data.
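
For anyone who wants the manual workaround without perl or iconv, here is a hedged sketch of a batch conversion in plain Python. The function name `convert_to_utf8` and its parameters are made up for illustration, and it assumes every .txt file that is not already valid UTF-8 is ISO-8859-1.

```python
from pathlib import Path


def convert_to_utf8(src_dir, dst_dir, source_encoding="iso-8859-1"):
    """Copy each .txt file from src_dir to dst_dir as UTF-8.

    Returns the names of the files that had to be re-encoded.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob("*.txt")):
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")  # already UTF-8, copy unchanged
        except UnicodeDecodeError:
            text = raw.decode(source_encoding)  # assumed legacy encoding
            converted.append(path.name)
        # Write bytes directly so line endings are not altered.
        (dst / path.name).write_bytes(text.encode("utf-8"))
    return converted
```

Writing to a separate directory keeps the original files untouched, so a wrong guess about the source encoding can be undone.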

tkiker commented 1 year ago

Also, regarding the "how" of how the problem gets solved, I wanted to point out that I really am not educated enough on any of the tools like python or c++ et al. to say "how" to fix the problem.

Regarding the "what", however, I am exceptionally opinionated. One of the reasons I've parked on this particular local LLM is that its design 1) imports/ingests discrete data that I and only I can provide and I and only I can access after the ingest, and 2) logs the success and failure BEFORE I ever try to run an interpretive CLI with the wifi turned off.