zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

Error running python ingest.py #787

Closed kraj1001 closed 7 months ago

kraj1001 commented 1 year ago

I get the error below when I run python ingest.py:

Traceback (most recent call last):
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 89, in load_single_document
    return loader.load()
           ^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\unstructured.py", line 71, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\word_document.py", line 100, in _get_elements
    from unstructured.partition.docx import partition_docx
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\partition\docx.py", line 25, in <module>
    from unstructured.partition.text_type import (
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\partition\text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\nlp\tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\nlp\tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 555, in find
    return find(modified_name, paths)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\zipfile.py", line 1301, in __init__
    self._RealGetContents()
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\zipfile.py", line 1368, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\privateGPT-main\ingest.py", line 166, in <module>
    main()
  File "C:\privateGPT-main\ingest.py", line 156, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 118, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 107, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\pool.py", line 873, in next
    raise value
zipfile.BadZipFile: File is not a zip file

aboutte commented 1 year ago

I just started getting this same error message. It worked for me a few days ago. I am on commit b1057af [Iván Martínez] 2023-06-16 19:29:18 +0200

aboutte commented 1 year ago

I was able to manually download nltk to work around the issue. The steps I took were:

  1. Delete the existing nltk data directory (not sure if this is required; on a Mac mine was located at ~/nltk_data).
  2. Run python from the terminal.
  3. Run import nltk.
  4. Run nltk.download().
  5. A window opens; I opted to download "all" because I do not know what this project actually requires.

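
As a possibly less drastic alternative to deleting the whole directory, the corrupt archive behind the BadZipFile error can be located first. The sketch below is only an illustration under assumptions: the helper name find_bad_zips is hypothetical, and the default ~/nltk_data location is the Mac path mentioned above (on other systems, nltk.data.path lists the directories NLTK searches). It uses only the standard library.

```python
import zipfile
from pathlib import Path
from typing import List


def find_bad_zips(root: Path) -> List[Path]:
    """Return every *.zip under root that is not a readable zip archive."""
    return sorted(
        path
        for path in root.rglob("*.zip")
        if not zipfile.is_zipfile(path)
    )


# Assumed default location; adjust to wherever your NLTK data actually lives.
DEFAULT_NLTK_DATA = Path.home() / "nltk_data"

# Example (left commented out so nothing is touched by accident):
# for bad in find_bad_zips(DEFAULT_NLTK_DATA):
#     print("corrupt archive:", bad)
#     bad.unlink()  # remove it, then download the package again
```

Any file this reports is a candidate for deletion and re-download; an interrupted download is one plausible cause, though the thread does not confirm it.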
kraj1001 commented 1 year ago

Thanks Andy, that did work.

fdoumet commented 1 year ago

According to #598, it looks like you only need to run nltk.download("averaged_perceptron_tagger").
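
A non-interactive version of that targeted download might look like the sketch below. The helper name redownload and its downloader parameter are hypothetical (the parameter only makes the function testable without network access). It assumes that forcing a fresh download is enough to replace a corrupt archive, and that averaged_perceptron_tagger is the only package needed, which is what #598 suggests but this thread does not verify.

```python
from typing import Callable, Dict, Iterable, Optional


def redownload(
    packages: Iterable[str],
    downloader: Optional[Callable[..., bool]] = None,
) -> Dict[str, bool]:
    """Force a fresh download of each named NLTK package.

    Returns a mapping of package name to whether the download reported success.
    """
    if downloader is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK

        downloader = nltk.download
    # force=True asks NLTK to fetch the package even if a copy already exists
    return {name: bool(downloader(name, force=True)) for name in packages}


# Example (requires NLTK and network access):
# print(redownload(["averaged_perceptron_tagger"]))
```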

uright commented 1 year ago

This works! Thank you @aboutte!