Closed vicdotdevelop closed 9 months ago
Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.
Same issue linux mint 21.1, python 3.10.
I think this error exists with both UnstructuredHTMLLoader and UnstructuredMarkdownLoader.
I created a markdownloader copying TextLoader. It uses marko for converting into html and then BeautifulSoup to extract text. Seems to be working for me.
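For anyone who wants to try the same idea, here is a minimal sketch of that approach, not the actual loader described above. It assumes marko is installed (marko.convert turns Markdown into HTML) and, to keep the example dependency-light, it swaps BeautifulSoup for the standard library's html.parser. The names html_to_text and load_markdown_as_text are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of `html`, one non-empty line per text line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def load_markdown_as_text(path, to_html=None) -> str:
    """Read a Markdown file, convert it to HTML, then extract plain text.

    `to_html` is any callable mapping Markdown text to an HTML string.
    By default it is marko.convert (third-party, assumed to be installed).
    """
    if to_html is None:
        import marko  # assumption: pip install marko

        to_html = marko.convert
    markdown = Path(path).read_text(encoding="utf-8")
    return html_to_text(to_html(markdown))
```

Because this path never goes through unstructured, it does not touch the NLTK tagger that shows up in the tracebacks in this thread. Wrapping the returned string in a langchain Document is left out here.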
Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.
You can also just replace ".md": (UnstructuredMarkdownLoader, {}),
with ".md": (TextLoader, {}),
inside ingest.py,
which has effectively the same result as renaming, and you don't have to rename your files.
That worked for me
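If you would rather keep the structured loader and only drop to plain text when it fails, something like the following could work. This is a rough sketch, not code from privateGPT: load_with_fallback and load_plain_text are hypothetical helpers, and the "loaders" are plain callables that take a path and return text, standing in for the langchain loader classes.

```python
from pathlib import Path


def load_plain_text(path):
    """Read the file as plain text, roughly what a TextLoader-style loader does."""
    return Path(path).read_text(encoding="utf-8", errors="ignore")


def load_with_fallback(path, primary_loader, fallback_loader=load_plain_text):
    """Try `primary_loader` first; if it raises, use `fallback_loader` instead."""
    try:
        return primary_loader(path)
    except Exception as exc:  # deliberately broad: any parser failure triggers the fallback
        print(f"{path}: primary loader failed ({exc!r}); falling back to plain text")
        return fallback_loader(path)
```

The trade-off is that a fallback hides the underlying problem (here, broken NLTK data), so it is worth keeping the printed message or logging it somewhere.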
I'm actually getting a different error, but renaming to .txt seems to fix this as well. I copied the current README.md for the project into the source_documents folder to test this. I just cloned the project this morning and I am running python 3.11.3 with an M1 Mac.
Here is the error I'm getting:
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 67%|██████████████▋ | 2/3 [00:02<00:01, 1.40s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 25, in _get_elements
return partition_md(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 52, in partition_md
return partition_html(
^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 91, in partition_html
layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/common.py", line 73, in document_to_element_list
num_pages = len(document.pages)
^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
self._pages = self._read()
^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 116, in _read
element = _parse_tag(tag_elem)
^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 222, in _parse_tag
return _text_to_element(text, tag_elem.tag, ancestortags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 237, in _text_to_element
elif is_narrative_tag(text, tag):
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 265, in is_narrative_tag
return tag not in HEADING_TAGS and is_possible_narrative_text(text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 86, in is_possible_narrative_text
if (sentence_count(text, 3) < 2) and (not contains_verb(text)) and language == "en":
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 189, in contains_verb
pos_tags = pos_tag(text)
^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 57, in pos_tag
parts_of_speech.extend(_pos_tag(tokens))
^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 165, in pos_tag
tagger = _get_tagger(lang)
^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 107, in _get_tagger
tagger = PerceptronTagger()
^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 169, in __init__
self.load(AP_MODEL_LOC)
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 252, in load
self.model.weights, self.tagdict, self.classes = load(loc)
^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 755, in load
resource_val = pickle.load(opened_resource)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 167, in <module>
main()
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 157, in main
texts = process_documents()
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 119, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 108, in load_documents
for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
_pickle.UnpicklingError: pickle data was truncated
I'm getting the same error as @joshrouwhorst when I try to ingest .html files. A single PDF ingests fine, but ingestion fails once I add these additional files. Renaming the files to .txt also fixes the problem.
I'm getting a similar result with EPUB files.
Hi @joshrouwhorst and @andrewchch, I found a Stack Overflow question that solves your problem: https://stackoverflow.com/questions/56049033/what-can-be-the-reasons-of-having-an-unpicklingerror-while-running-pos-tag-fro Just run
import nltk
nltk.download('averaged_perceptron_tagger')
in a Python interpreter. It works well for me on Ubuntu 20.04 when ingesting MS Word files. I'm not sure whether it also fixes the zip file error, since my data is in .docx format. You can try it, @vicdotdevelop.
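Before re-downloading, it may help to confirm that the tagger file really is truncated. Below is a small sketch using only the standard library. It assumes the NLTK data lives in a directory you pass in (often ~/nltk_data, but that depends on your setup), and find_unreadable_pickles is a made-up helper name. Only run it on files you trust, since unpickling executes whatever the file contains.

```python
import pickle
from pathlib import Path


def find_unreadable_pickles(data_dir):
    """Return the .pickle files under `data_dir` that cannot be fully unpickled.

    A truncated file typically raises pickle.UnpicklingError
    ("pickle data was truncated") or EOFError ("Ran out of input").
    """
    broken = []
    for path in sorted(Path(data_dir).rglob("*.pickle")):
        try:
            with path.open("rb") as fh:
                pickle.load(fh)  # only safe for trusted files such as NLTK's own data
        except (pickle.UnpicklingError, EOFError):
            broken.append(path)
    return broken
```

Anything this reports can be deleted and fetched again with the nltk.download call above.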
Cannot upload epub files to work with.
Error: Please install extra dependencies that are required for the EpubReader:
pip install EbookLib html2text
Even after installing the required dependencies mentioned above, I still get the same error on every attempt.
Describe the bug and how to reproduce it
I was trying to ingest markdown files from one of my documentation sets. I have also tried different markdown files, and they all end with the same error. I am using the latest commit from the main branch.
This is what my .env looks like:
Expected behavior
Environment (please complete the following information):
Additional context
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 0%| | 0/1 [00:05<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 12, in _get_elements
from unstructured.partition.md import partition_md
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 9, in <module>
from unstructured.partition.html import partition_html
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 6, in <module>
from unstructured.documents.html import HTMLDocument
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 25, in <module>
from unstructured.partition.text_type import (
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 21, in <module>
from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
_download_nltk_package_if_not_present(package_name, package_category)
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
nltk.find(f"{package_category}/{package_name}")
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 555, in find
return find(modified_name, paths)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 542, in find
return ZipFilePathPointer(p, zipentry)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 394, in __init__
zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 935, in __init__
zipfile.ZipFile.__init__(self, filename)
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1301, in __init__
self._RealGetContents()
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/zipfile.py", line 1368, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 167, in <module>
main()
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 157, in main
texts = process_documents()
^^^^^^^^^^^^^^^^^^^
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 119, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 108, in load_documents
for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
zipfile.BadZipFile: File is not a zip file
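The traceback above fails inside nltk.find while opening a .zip on the NLTK data path, which suggests a corrupt or partially downloaded archive. Here is a minimal sketch for locating such files with the standard library. The data directory is an assumption (often ~/nltk_data, but check your own setup), and find_bad_zips is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def find_bad_zips(data_dir):
    """Return the .zip files under `data_dir` that are not valid zip archives."""
    return [
        path
        for path in sorted(Path(data_dir).rglob("*.zip"))
        if not zipfile.is_zipfile(path)
    ]
```

Deleting whatever this reports and downloading the corresponding NLTK package again should give a clean copy.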