zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0
53.74k stars · 7.22k forks

ingest.py on eml throws zipfile.BadZipFile: File is not a zip file #345

Closed slavag closed 7 months ago

slavag commented 1 year ago


Describe the bug and how to reproduce it
Running ingest.py on a source_documents folder containing many eml files throws zipfile.BadZipFile: File is not a zip file.

Expected behavior
The eml files should be loaded.


Additional context

Loading new documents:   0%| | 0/75093 [00:08<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 49, in load
    doc = UnstructuredEmailLoader.load(self)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/email.py", line 22, in _get_elements
    from unstructured.partition.email import partition_email
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/email.py", line 41, in <module>
    from unstructured.partition.html import partition_html
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 6, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 25, in <module>
    from unstructured.partition.text_type import (
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1301, in __init__
    self._RealGetContents()
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1368, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 59, in load
    raise type(e)(f"{self.file_path}: {e}") from e
zipfile.BadZipFile: source_documents/2013-01-03 095102 dea8d7fd13.eml: File is not a zip file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/slava/Documents/Development/private/AI/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/slava/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
zipfile.BadZipFile: source_documents/2013-01-03 095102 dea8d7fd13.eml: File is not a zip file

AntouanK commented 1 year ago

Same here. I only have txt, html and pdf files. And I get that zip error.

ElementalWarrior commented 1 year ago

It's happening at import time. This is the cause: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/nlp/tokenize.py#LL31C39-L31C52. When I try to unzip the cached archive myself, it fails. I found in the nltk_data repo that you can do:

>>> import nltk
>>> nltk.download()

That led me to https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. I downloaded averaged_perceptron_tagger and unzipped it into ~/nltk_data/taggers/. But then I get new errors:

Traceback (most recent call last):
  File "/home/james/projects/privateGPT/ingest.py", line 50, in load
    doc = UnstructuredEmailLoader.load(self)
  File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/home/james/.local/lib/python3.10/site-packages/langchain/document_loaders/email.py", line 24, in _get_elements
    return partition_email(filename=self.file_path, **self.unstructured_kwargs)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/partition/email.py", line 265, in partition_email
    element.apply(_replace_mime_encodings)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 154, in apply
    cleaned_text = cleaner(cleaned_text)
  File "/home/james/.local/lib/python3.10/site-packages/unstructured/cleaners/core.py", line 197, in replace_mime_encodings
    return quopri.decodestring(text.encode()).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 2195: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/james/projects/privateGPT/ingest.py", line 90, in load_single_document
    return loader.load()[0]
  File "/home/james/projects/privateGPT/ingest.py", line 60, in load
    raise type(e)(f"{self.file_path}: {e}") from e
TypeError: function takes exactly 5 arguments (1 given)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/james/projects/privateGPT/ingest.py", line 168, in <module>
    main()
  File "/home/james/projects/privateGPT/ingest.py", line 158, in main
    texts = process_documents()
  File "/home/james/projects/privateGPT/ingest.py", line 120, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/home/james/projects/privateGPT/ingest.py", line 109, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
TypeError: function takes exactly 5 arguments (1 given)
slavag commented 1 year ago

Well, thanks, that helped me understand that I needed to remove the nltk_data folder, which solved the BadZipFile issue. But now I have exactly the same TypeError: function takes exactly 5 arguments (1 given).

kulnor commented 1 year ago

Just to report I'm seeing the same issue. Thanks for looking into this.

ericflecher commented 1 year ago

Getting the same issue with any file that is not a .pdf

BacKinnn commented 1 year ago

I linked my folder to source_documents on my Linux machine and got the same issue as well

conradolandia commented 1 year ago

Having the same issue. I downloaded nltk_data again and now I have this error:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|                            | 0/603 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andi/.local/lib/python3.11/site-packages/pypandoc/__init__.py", line 164, in convert_file
    format = _identify_format_from_path(discovered_source_files[0], format)
                                        ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/run/media/andi/EXTERNAL/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/home/andi/miniconda3/envs/roger-bacon/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range 
kulnor commented 1 year ago

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell, run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....
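If the Tk dialog is a hassle, nltk.download also accepts a package identifier directly, so the same fix can be done non-interactively. A minimal sketch (assuming nltk is installed and network access is available; these are the two packages this thread keeps running into):

```python
import nltk

# Fetch just the packages mentioned in this thread, without
# opening the interactive downloader dialog.
for package in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(package)
```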

adi commented 1 year ago

This is thrown for ~/nltk_data/taggers/averaged_perceptron_tagger.zip which is really a bad zip. Deleting ~/nltk_data and restarting ingesting downloaded a correct version of this file and now ingestion works for me.
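This diagnosis can be verified before deleting anything: the standard-library zipfile module can report whether the cached archive is actually a valid zip. A stdlib-only sketch (the path below is NLTK's default data location; the NLTK_DATA environment variable can point elsewhere):

```python
import zipfile
from pathlib import Path

# Default NLTK data location for the tagger this thread identifies.
tagger_zip = Path.home() / "nltk_data" / "taggers" / "averaged_perceptron_tagger.zip"

if tagger_zip.exists() and not zipfile.is_zipfile(tagger_zip):
    # A truncated or partial download reproduces the BadZipFile error above.
    print(f"{tagger_zip} is corrupted; delete it and re-run nltk.download()")
```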

BacKinnn commented 1 year ago

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell, run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....

Thank you, it worked. Although I didn't need to install punkt, I needed to update a different package/model (I don't remember what it was called), and then the command ran.

Currently ingesting my data; it's stuck on the line "Using embedded DuckDB with persistence: data will be stored in: db" (but my CPU is still busy, so I guess it's fine?). Will update with the result if everything goes smoothly.

conradolandia commented 1 year ago

Solved it with the suggestions offered here. Also found out that if you try to ingest too many documents at once, it chokes: feeding around 20-30 documents works fine; around 50 still works but gets very slow. Also, some PDFs may fail to be ingested.

tfyt2023 commented 1 year ago

Can someone please translate this to idiot-speak for me?

As a workaround for my flavour of this issue, I used the interactive NLTK installer to install punkt on my machine. See instructions at https://www.nltk.org/data.html

In a nutshell, run Python in interactive mode and call:

import nltk
nltk.download()

This opens the install dialog. Go to Models tab and install punkt.

Hope this helps and works for others....

ThomasFeher commented 1 year ago

This is thrown for ~/nltk_data/taggers/averaged_perceptron_tagger.zip which is really a bad zip. Deleting ~/nltk_data and restarting ingesting downloaded a correct version of this file and now ingestion works for me.

That worked for me. Thank you @adi!

cutd commented 1 year ago

First, delete ~/nltk_data/taggers/averaged_perceptron_tagger.zip. Second, run:

import nltk
nltk.download()

and choose averaged_perceptron_tagger to download.

ElementalWarrior commented 1 year ago

I am able to get mine to run, but when processing emails exported from Thunderbird there are tonnes of issues with Unicode, date parsing, etc.

DaniruKun commented 1 year ago

If like me you have a broken TK install, you can also force the NLTK download from the CLI:

python -m nltk.downloader all
slavag commented 1 year ago

@ElementalWarrior same here: Unicode issues with emails exported from Gmail, and eventually the process fails. I opened another bug, but there's no answer there: https://github.com/imartinez/privateGPT/issues/378. I also opened a ticket in unstructured, but no one answers there either: https://github.com/Unstructured-IO/unstructured/issues/635

DukeOfEtiquette commented 1 year ago

If like me you have a broken TK install, you can also force the NLTK download from the CLI:

python -m nltk.downloader all

This fixed it for me, thanks!

Jolg42 commented 1 year ago

Same issue on my M1 laptop, this did it for me.

python3
import nltk
nltk.download()

# Select Download menu
d
# Enter identifier
averaged_perceptron_tagger

# Select Download menu
d
# Enter identifier
punkt

I guess the download-all approach is easier and works too, but it's unnecessary.

Friedrich-hue commented 1 year ago

How did you solve it? Can you package it into a Python script to run directly?
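A rough script version of the fixes discussed above (my own sketch, not from the repo; it assumes nltk is installed, network access, and the default ~/nltk_data location — the ensure_package helper is hypothetical):

```python
import zipfile
from pathlib import Path

import nltk

# Packages this thread found missing or corrupted, by NLTK data category.
PACKAGES = [("taggers", "averaged_perceptron_tagger"), ("tokenizers", "punkt")]


def ensure_package(category: str, name: str) -> None:
    """Delete a corrupted cached zip, then (re-)download the package."""
    cached = Path.home() / "nltk_data" / category / f"{name}.zip"
    if cached.exists() and not zipfile.is_zipfile(cached):
        cached.unlink()  # drop the bad archive so NLTK fetches a fresh copy
    nltk.download(name)


if __name__ == "__main__":
    for category, name in PACKAGES:
        ensure_package(category, name)
```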

williamblair333 commented 1 year ago

First, delete ~/nltk_data/taggers/averaged_perceptron_tagger.zip. Second, run:

import nltk
nltk.download()

and choose averaged_perceptron_tagger to download.

This solved my error on chatdocs as well. Thanks!!

mweth commented 1 year ago

Type 'python' to get to the interactive prompt, then run:

import nltk
nltk.download()