zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0
53.79k stars 7.22k forks

Cannot ingest markdown files #358

Closed vicdotdevelop closed 8 months ago

vicdotdevelop commented 1 year ago

Describe the bug and how to reproduce it

I was trying to ingest markdown files from one of my documentation sets. I have also tried different markdown files, and they all fail with the same error. I am using the latest commit from the main branch.

This is what my .env looks like:

PERSIST_DIRECTORY=/Users/victor/pyprojects/privateGPT/db
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
MODEL_TYPE=GPT4All
MODEL_PATH=/Users/victor/local_llms/ggml-gpt4all-j-v1.3-groovy.bin
MODEL_N_CTX=1000
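For context, scripts of this era read such .env values from the environment at startup; a minimal sketch of how the settings above are typically consumed (the fallback defaults here are illustrative assumptions, not privateGPT's actual defaults):

```python
import os

# Read the settings shown in the .env above. The fallback values are
# illustrative assumptions, not privateGPT's real defaults.
persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-mpnet-base-v2")
model_type = os.environ.get("MODEL_TYPE", "GPT4All")
model_path = os.environ.get("MODEL_PATH", "")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", "1000"))  # context window size
```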

Expected behavior

Environment (please complete the following information):

Additional context

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   0%|          | 0/1 [00:05<?, ?it/s]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/victor/pyprojects/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 12, in _get_elements
    from unstructured.partition.md import partition_md
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 9, in <module>
    from unstructured.partition.html import partition_html
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 6, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 25, in <module>
    from unstructured.partition.text_type import (
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1301, in __init__
    self._RealGetContents()
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1368, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/victor/pyprojects/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/victor/pyprojects/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/victor/pyprojects/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/victor/pyprojects/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
zipfile.BadZipFile: File is not a zip file
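A BadZipFile at this point in NLTK usually means a data package under the nltk_data directory was only partially downloaded, leaving a truncated .zip behind. A stdlib-only sketch for locating such archives (the function name is ours, not NLTK's; the usual fix is to delete any corrupt archive it finds and re-run nltk.download() for that package):

```python
import os
import zipfile

def find_corrupt_archives(nltk_data_dir: str) -> list:
    """Return paths of .zip files under nltk_data_dir that fail the zip check."""
    corrupt = []
    for root, _dirs, files in os.walk(nltk_data_dir):
        for name in files:
            if name.endswith(".zip"):
                path = os.path.join(root, name)
                if not zipfile.is_zipfile(path):  # truncated or garbage download
                    corrupt.append(path)
    return corrupt
```

Any path this returns can be deleted and re-fetched, e.g. with nltk.download('punkt') or nltk.download('averaged_perceptron_tagger').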

zswanson commented 1 year ago

Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.

bradlindblad commented 1 year ago

Same issue on Linux Mint 21.1, Python 3.10.

abhishekbhakat commented 1 year ago

I think this error occurs with both UnstructuredHTMLLoader and UnstructuredMarkdownLoader.

abhishekbhakat commented 1 year ago

https://github.com/hwchase17/langchain/issues/5264

abhishekbhakat commented 1 year ago

I created a Markdown loader by copying TextLoader. It uses marko to convert the Markdown to HTML and then BeautifulSoup to extract the text. It seems to be working for me.

https://github.com/abhishekbhakat/privateGPT/tree/main
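The text-extraction half of that approach can be sketched with the standard library alone (html.parser standing in for BeautifulSoup; the marko Markdown-to-HTML step is assumed to have already produced the HTML string):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html: str) -> str:
    """Strip tags from an HTML string, joining the remaining text with spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(c.strip() for c in parser.chunks if c.strip())
```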

nleroy917 commented 1 year ago

> Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.

You can also just replace ".md": (UnstructuredMarkdownLoader, {}), with ".md": (TextLoader, {}), inside ingest.py, which is effectively the same thing without renaming your files.

That worked for me
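The swap above amounts to changing one entry of the loader-mapping dict in ingest.py; a self-contained sketch of its shape (the stub classes below stand in for langchain's real loaders so the sketch runs on its own):

```python
# Stub stand-ins for langchain's TextLoader / UnstructuredMarkdownLoader so
# this sketch is self-contained; ingest.py imports the real classes.
class TextLoader:
    def __init__(self, path, encoding=None):
        self.path, self.encoding = path, encoding

class UnstructuredMarkdownLoader:
    def __init__(self, path):
        self.path = path

LOADER_MAPPING = {
    # ".md": (UnstructuredMarkdownLoader, {}),  # original entry; pulls in unstructured + NLTK
    ".md": (TextLoader, {}),                    # workaround: load Markdown as plain text
    ".txt": (TextLoader, {"encoding": "utf8"}),
}

def make_loader(path):
    """Pick a loader class for a file based on its extension."""
    ext = "." + path.rsplit(".", 1)[-1]
    loader_cls, kwargs = LOADER_MAPPING[ext]
    return loader_cls(path, **kwargs)
```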

joshrouwhorst commented 1 year ago

I'm actually getting a different error, but renaming to .txt seems to fix it as well. I copied the project's current README.md into the source_documents folder to test this. I just cloned the project this morning and am running Python 3.11.3 on an M1 Mac.

Here is the error I'm getting:

Creating new vectorstore
Loading documents from source_documents
Loading new documents:  67%|██████████████▋       | 2/3 [00:02<00:01,  1.40s/it]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
           ^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 25, in _get_elements
    return partition_md(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 52, in partition_md
    return partition_html(
           ^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 91, in partition_html
    layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/common.py", line 73, in document_to_element_list
    num_pages = len(document.pages)
                    ^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
    self._pages = self._read()
                  ^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 116, in _read
    element = _parse_tag(tag_elem)
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 222, in _parse_tag
    return _text_to_element(text, tag_elem.tag, ancestortags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 237, in _text_to_element
    elif is_narrative_tag(text, tag):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 265, in is_narrative_tag
    return tag not in HEADING_TAGS and is_possible_narrative_text(text)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 86, in is_possible_narrative_text
    if (sentence_count(text, 3) < 2) and (not contains_verb(text)) and language == "en":
                                              ^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 189, in contains_verb
    pos_tags = pos_tag(text)
               ^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 57, in pos_tag
    parts_of_speech.extend(_pos_tag(tokens))
                           ^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 165, in pos_tag
    tagger = _get_tagger(lang)
             ^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 107, in _get_tagger
    tagger = PerceptronTagger()
             ^^^^^^^^^^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 169, in __init__
    self.load(AP_MODEL_LOC)
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 252, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
                                                     ^^^^^^^^^
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 755, in load
    resource_val = pickle.load(opened_resource)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
_pickle.UnpicklingError: pickle data was truncated

andrewchch commented 1 year ago

I'm getting the same error as @joshrouwhorst when I try to ingest .html files. A single PDF ingests fine, but adding these additional files doesn't. Renaming the file to .txt also fixes the problem.

imonroe commented 1 year ago

I'm getting a similar result with EPUB files.

StCross commented 1 year ago

Hi @joshrouwhorst and @andrewchch, I found a Stack Overflow question that solves your problem: https://stackoverflow.com/questions/56049033/what-can-be-the-reasons-of-having-an-unpicklingerror-while-running-pos-tag-fro. Just run

import nltk
nltk.download('averaged_perceptron_tagger')

in a Python shell. It works well for me on Ubuntu 20.04 when ingesting MS Word documents. I'm not sure if it also fixes the zip-file error, since my data is in .docx format. You can try it, @vicdotdevelop.

ozanweb commented 10 months ago

Cannot upload EPUB files to work with.

Error: Please install extra dependencies that are required for the EpubReader: pip install EbookLib html2text

Even after I install the required dependencies mentioned above, I still get the same error on every attempt.