Closed kraj1001 closed 7 months ago
I just started getting this same error message. Worked for me a few days ago. I am on commit b1057af [Iván Martínez] 2023-06-16 19:29:18 +0200
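I was able to manually download nltk to work around the issue. The steps I took were:
- Delete the existing nltk directory (not sure if this is required); on a Mac mine was located at ~/nltk_data
- Run python from the terminal
- Run import nltk
- Run nltk.download()
- A window opens, and I opted to download "all" because I do not know what is actually required by this project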
Thanks Andy, that did work.
Looks like you only have to download with nltk.download("averaged_perceptron_tagger"), according to #598.
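For anyone who would rather not pull the whole "all" collection, here is a minimal sketch of the targeted download. The helper name download_packages and its download parameter are my own (hypothetical, only there so the function can be exercised without a network connection); the only real NLTK call used is nltk.download(). Whether the tagger alone is enough for every document type is an assumption based on the comment above, not something I have verified.

```python
from typing import Callable, Dict, Iterable, Optional


def download_packages(
    packages: Iterable[str],
    download: Optional[Callable[[str], bool]] = None,
) -> Dict[str, bool]:
    """Download each named NLTK package and report success per package.

    `download` is a hypothetical injection point for testing; when omitted,
    the real nltk.download is used.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tested without nltk

        download = nltk.download
    # nltk.download returns True on success and False on failure.
    return {name: bool(download(name)) for name in packages}


# Typical use (requires nltk and network access):
# download_packages(["averaged_perceptron_tagger"])
```

If a package comes back False, falling back to the full manual download described in this thread is probably the safer route.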
I was able to manually download nltk to work around the issue. The steps I took were:
- Delete the existing nltk directory (not sure if this is required); on a Mac mine was located at ~/nltk_data
- Run python from the terminal
- Run import nltk
- Run nltk.download()
- A window opens, and I opted to download "all" because I do not know what is actually required by this project
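The steps above can also be scripted instead of going through the interactive window. This is only a sketch under a few assumptions: the function name reset_nltk_data and its download flag are hypothetical, the data directory is assumed to be ~/nltk_data as on the Mac mentioned above (on Windows it may live elsewhere), and passing "all" mirrors the choice made in the download window.

```python
import shutil
from pathlib import Path


def reset_nltk_data(data_dir: Path, download: bool = True) -> bool:
    """Remove data_dir if present, then optionally re-download NLTK data.

    Returns True if an existing directory was deleted, False otherwise.
    """
    removed = data_dir.is_dir()
    if removed:
        # Drop the old (possibly corrupted) downloads entirely.
        shutil.rmtree(data_dir)
    if download:
        import nltk  # imported lazily; only needed for the download step

        # "all" matches the option picked in the interactive window.
        nltk.download("all", download_dir=str(data_dir))
    return removed


# Typical use (deletes the directory, then downloads everything again):
# reset_nltk_data(Path.home() / "nltk_data")
```

Note that this deletes the whole directory, so anything else stored there will need to be downloaded again.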
This works! Thank you @aboutte!
I get the error below when I run python ingest.py:
Traceback (most recent call last):
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 89, in load_single_document
    return loader.load()
           ^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\unstructured.py", line 71, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\document_loaders\word_document.py", line 100, in _get_elements
    from unstructured.partition.docx import partition_docx
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\partition\docx.py", line 25, in <module>
    from unstructured.partition.text_type import (
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\partition\text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\nlp\tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\unstructured\nlp\tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 555, in find
    return find(modified_name, paths)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\site-packages\nltk\data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\zipfile.py", line 1301, in __init__
    self._RealGetContents()
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\zipfile.py", line 1368, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\privateGPT-main\ingest.py", line 166, in <module>
    main()
  File "C:\privateGPT-main\ingest.py", line 156, in main
    texts = process_documents()
            ^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 118, in process_documents
    documents = load_documents(source_directory, ignored_files)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\privateGPT-main\ingest.py", line 107, in load_documents
    for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "C:\Users\rkambhat\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\pool.py", line 873, in next
    raise value
zipfile.BadZipFile: File is not a zip file
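Since the traceback ends inside nltk's ZipFilePathPointer with "File is not a zip file", one plausible reading is that an archive in the NLTK data directory is truncated or corrupted, which would also explain why deleting the directory and downloading again helped. Here is a rough sketch for checking that before wiping everything. The function name find_bad_zips is hypothetical; it relies only on the standard library's zipfile.is_zipfile, and the nltk.data.path list mentioned in the usage comment is where NLTK looks for its data.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def find_bad_zips(search_dirs: Iterable[Path]) -> List[Path]:
    """Return every *.zip under the given directories that is not a valid zip."""
    bad: List[Path] = []
    for root in search_dirs:
        if not root.is_dir():
            continue  # skip search paths that do not exist on this machine
        for candidate in sorted(root.rglob("*.zip")):
            # is_zipfile returns False for truncated or non-zip files.
            if not zipfile.is_zipfile(candidate):
                bad.append(candidate)
    return bad


# Typical use (requires nltk to be installed):
# import nltk
# print(find_bad_zips(Path(p) for p in nltk.data.path))
```

Any file it reports can be deleted and downloaded again on its own, which is lighter than removing the whole directory.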