zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

python ingest.py error:install pypandoc wheels with included pandoc. #353

Closed hktalent closed 1 year ago

hktalent commented 1 year ago

macOS 13.4 / Intel i7

python -V Python 3.10.11

$ pip list

Package                 Version
----------------------- -----------
aiohttp                 3.8.4
aiosignal               1.3.1
anyio                   3.6.2
argilla                 1.7.0
async-timeout           4.0.2
attrs                   23.1.0
backoff                 2.2.1
beautifulsoup4          4.12.2
certifi                 2023.5.7
cffi                    1.15.1
chardet                 5.1.0
charset-normalizer      3.1.0
chromadb                0.3.22
click                   8.1.3
clickhouse-connect      0.5.24
colorclass              2.2.2
commonmark              0.9.1
compressed-rtf          1.0.6
cryptography            40.0.2
dataclasses-json        0.5.7
Deprecated              1.2.13
duckdb                  0.7.1
easygui                 0.98.3
ebcdic                  1.1.1
et-xmlfile              1.1.0
extract-msg             0.41.1
fastapi                 0.95.1
filelock                3.12.0
frozenlist              1.3.3
fsspec                  2023.5.0
greenlet                2.0.2
h11                     0.14.0
hnswlib                 0.7.0
httpcore                0.16.3
httptools               0.5.0
httpx                   0.23.3
huggingface-hub         0.14.1
idna                    3.4
IMAPClient              2.3.1
Jinja2                  3.1.2
joblib                  1.2.0
langchain               0.0.166
lark-parser             0.12.0
llama-cpp-python        0.1.48
lxml                    4.9.2
lz4                     4.3.2
Markdown                3.4.3
MarkupSafe              2.1.2
marshmallow             3.19.0
marshmallow-enum        1.5.1
monotonic               1.6
mpmath                  1.3.0
msg-parser              1.2.0
msoffcrypto-tool        5.0.1
multidict               6.0.4
mypy-extensions         1.0.0
networkx                3.1
nltk                    3.8.1
numexpr                 2.8.4
numpy                   1.23.5
olefile                 0.46
oletools                0.60.1
openapi-schema-pydantic 1.2.4
openpyxl                3.1.2
packaging               23.1
pandas                  1.5.3
pandoc                  2.3
pcodedmp                1.2.6
pdfminer.six            20221105
Pillow                  9.5.0
pip                     23.1.2
plumbum                 1.8.1
ply                     3.11
posthog                 3.0.1
pycparser               2.21
pydantic                1.10.7
Pygments                2.15.1
pygpt4all               1.1.0
pygptj                  2.0.3
pyllamacpp              2.1.3
pypandoc                1.11
pyparsing               2.4.7
python-dateutil         2.8.2
python-docx             0.8.11
python-dotenv           1.0.0
python-magic            0.4.27
python-pptx             0.6.21
pytz                    2023.3
pytz-deprecation-shim   0.1.0.post0
PyYAML                  6.0
red-black-tree-mod      1.20
regex                   2023.5.5
requests                2.30.0
rfc3986                 1.5.0
rich                    13.0.1
RTFDE                   0.0.2
scikit-learn            1.2.2
scipy                   1.10.1
sentence-transformers   2.2.2
sentencepiece           0.1.99
setuptools              67.7.2
six                     1.16.0
sniffio                 1.3.0
soupsieve               2.4.1
SQLAlchemy              2.0.13
starlette               0.26.1
sympy                   1.12
tabulate                0.9.0
tenacity                8.2.2
threadpoolctl           3.1.0
tokenizers              0.13.3
torch                   2.0.1
torchvision             0.15.2
tqdm                    4.65.0
transformers            4.29.1
typer                   0.9.0
typing_extensions       4.5.0
typing-inspect          0.8.0
tzdata                  2023.3
tzlocal                 4.2
unstructured            0.6.5
urllib3                 2.0.2
uvicorn                 0.22.0
uvloop                  0.17.0
watchfiles              0.19.0
websockets              11.0.3
wheel                   0.40.0
wrapt                   1.14.1
XlsxWriter              3.1.0
yarl                    1.9.2
zstandard               0.21.0

$ python ingest.py

Creating new vectorstore
Loading documents from source_documents
Loading new documents:   2%|▎                   | 2/131 [00:03<03:12,  1.50s/it][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Loading new documents:   2%|▎                   | 2/131 [00:04<05:13,  2.43s/it]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 168, in convert_file
    return _convert_input(discovered_source_files, format, 'path', to, extra_args=extra_args,
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 324, in _convert_input
    _ensure_pandoc_path()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 750, in _ensure_pandoc_path
    raise OSError("No pandoc was found: either install pandoc and add it\n"
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.
hktalent commented 1 year ago
pip install pandoc
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pandoc in /Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages (2.3)
Requirement already satisfied: plumbum in /Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages (from pandoc) (1.8.1)
Requirement already satisfied: ply in /Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages (from pandoc) (3.11)
(privateGPT) 51pwn@123-2 privateGPT $ pip install pypandoc
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pypandoc in /Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages (1.11)
yousifalyousifi commented 1 year ago

I ran into the same issue. You have to install pandoc, and add it to your PATH.

I'm on Win10. I ran the following Python script, which downloads and runs the pandoc installer. Then add "C:\Users\Username\AppData\Local\Pandoc" to your PATH. That's where mine was installed; yours might be different.

from pypandoc.pandoc_download import download_pandoc
# see the documentation how to customize the installation path
# but be aware that you then need to include it in the `PATH`
download_pandoc()
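
If the install directory is not yet on PATH in the current session, a check along these lines may help. This is a minimal sketch using only the standard library: the helper name prepend_to_path is hypothetical, and the directory is just the example location mentioned above, which may differ on your machine.

```python
import os
import shutil


def prepend_to_path(directory: str) -> None:
    """Put `directory` at the front of PATH for the current process only."""
    current = os.environ.get("PATH", "")
    parts = [p for p in current.split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)


# Example location from the comment above; adjust to wherever pandoc was installed.
pandoc_dir = os.path.expandvars(r"%LOCALAPPDATA%\Pandoc")
prepend_to_path(pandoc_dir)

# shutil.which returns the full path of the executable, or None if it is not found.
print("pandoc found at:", shutil.which("pandoc"))
```

This only changes PATH for the running Python process; a permanent change still has to be made in the system or shell settings.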
hktalent commented 1 year ago

@yousifalyousifi thanks

hktalent commented 1 year ago
$ python ingest.py
Creating new vectorstore
Loading documents from source_documents
Loading new documents:   3%|▌                   | 4/131 [00:06<02:43,  1.29s/it][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/51pwn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Loading new documents:   3%|▌                   | 4/131 [00:09<04:57,  2.34s/it]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 89, in load_single_document
    return loader.load()[0]
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/langchain/document_loaders/epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/epub.py", line 24, in partition_epub
    return convert_and_partition_html(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/partition/html.py", line 124, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/unstructured/file_utils/file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/site-packages/pypandoc/__init__.py", line 164, in convert_file
    format = _identify_format_from_path(discovered_source_files[0], format)
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 167, in <module>
    main()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 157, in main
    texts = process_documents()
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 119, in process_documents
    documents = load_documents(source_directory, ignored_files)
  File "/Users/51pwn/MyWork/privateGPT/ingest.py", line 108, in load_documents
    for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
  File "/Users/51pwn/anaconda3/envs/privateGPT/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
IndexError: list index out of range
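
One possible cause of this IndexError, offered as an assumption rather than something confirmed in this thread: the traceback shows pypandoc indexing into discovered_source_files, and if that list is built by glob-expanding the file path, a filename containing glob metacharacters such as [ or ] would match nothing and leave the list empty. The sketch below only demonstrates the standard-library glob behavior; the filename is made up.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical filename containing glob metacharacters.
    path = os.path.join(tmp, "book [v2].epub")
    open(path, "w").close()

    # "[v2]" is read as a character class, so the literal filename is not matched.
    raw_matches = glob.glob(path)

    # glob.escape neutralises the metacharacters so the path matches itself.
    escaped_matches = glob.glob(glob.escape(path))

print(len(raw_matches), len(escaped_matches))
```

If this turns out to be the cause, renaming the file to remove the brackets would probably be the simplest workaround.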
hktalent commented 1 year ago

Can someone help me?

I started it in tmux:

python privateGPT.py

Control characters then appear in the terminal, and I don't know how to get rid of them.

When I start Python on its own, the interactive session behaves normally, so the environment itself should be fine.
hktalent commented 1 year ago

It doesn't seem to support Chinese?

I also found its answers very confusing; it seems to just pick the best of a bad lot.

Is there a solution?
Lestibournes commented 1 year ago

I got the same exception on Pop!_OS 22.04 (which is based on Ubuntu, which is based on Debian, so this should work on all systems based on those). I solved it by installing pandoc on the system: sudo apt install pandoc

yousifalyousifi commented 1 year ago

It doesn't seem to support Chinese? I also found its answers very confusing; it seems to just pick the best of a bad lot.

Is there a solution?

This is not a pandoc issue anymore. I would advise closing this issue and opening a new one.

dinhhuydh commented 1 year ago

I installed it yesterday and still get the error. My computer: Mac Pro M2, Python 3.10.8.

OSError: No pandoc was found: either install pandoc and add it
to your PATH or or call pypandoc.download_pandoc(...) or
install pypandoc wheels with included pandoc.

I have already installed pandoc and pypandoc. It works with the test data, but not with an EPUB file in source_documents.

dinhhuydh commented 1 year ago

I've found the solution. I needed to install pandoc with brew first: brew install pandoc. More details: https://pandoc.org/installing.html

windmaple commented 1 year ago

I got the same exception on Pop!_OS 22.04 (which is based on Ubuntu, which is based on Debian, so this should work on all systems based on those). I solved it by installing pandoc on the system: sudo apt install pandoc

This is the right approach on Linux. 'pip install pandoc' isn't sufficient.
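
The pandoc package on PyPI is a Python library, not the pandoc executable, which is why pip alone does not satisfy pypandoc. A pre-flight check like the following could fail early with a clearer hint. It is a sketch under assumptions: the function names are hypothetical, and the suggested commands are simply the ones reported to work in this thread.

```python
import platform
import shutil


def install_hint(system: str) -> str:
    """Return the install command reported in this thread for a given OS name."""
    hints = {
        "Darwin": "brew install pandoc",
        "Linux": "sudo apt install pandoc",  # Debian/Ubuntu-based systems
    }
    return hints.get(system, "see https://pandoc.org/installing.html")


def require_pandoc() -> str:
    """Return the path to the pandoc executable or raise with an install hint."""
    path = shutil.which("pandoc")
    if path is None:
        raise RuntimeError(
            "pandoc executable not found on PATH; try: "
            + install_hint(platform.system())
        )
    return path
```

Calling require_pandoc() before ingesting would surface the missing binary immediately instead of partway through a multiprocessing run.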

ManuLinares commented 1 year ago

How is this issue closed if it is still an issue on Linux?

Lestibournes commented 1 year ago

How is this issue closed if it is still an issue on Linux?

Perhaps because the errors are caused by pandoc not being installed on the system, which makes it a user error, rather than a privateGPT error.

ManuLinares commented 1 year ago

Oops my bad, I borked the venv. Sorry.

The installation instructions should add:

python3 -m venv ./venv
source ./venv/bin/activate

before pip install -r requirements.txt. This would save a lot of hassle.
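
To confirm that the interpreter is really running inside the virtual environment before installing requirements, a check like this can be used. It relies on the standard-library convention that sys.prefix differs from sys.base_prefix inside a venv; the function name is hypothetical, and conda environments are not detected this way.

```python
import sys


def in_virtualenv() -> bool:
    """True when running inside a venv created by `python3 -m venv`."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```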

hktalent commented 1 year ago

It now runs normally for me on macOS (Intel i7). Here are the steps I used, shared for everyone.

However, I found that using it to build an AI search engine still falls far short of expectations.

# Recreate a clean conda environment
conda remove --name privateGPT --all -y
conda create -n privateGPT -y python=3.10
conda activate privateGPT
conda init zsh
export PATH="$HOME/anaconda3/envs/privateGPT/bin:$PATH"
which pip python
python -V
# Install the requirements one package at a time
cat requirements.txt | xargs -I % pip install "%" -i https://mirror.baidu.com/pypi/simple
# ARCHFLAGS must be set on the same command line (or exported) to take effect
ARCHFLAGS="-arch x86_64" pip install langchain llama-cpp-python chromadb unstructured -i https://mirror.baidu.com/pypi/simple
# pypandoc is a Python wrapper around the pandoc binary; install both
conda install -c conda-forge pypandoc
brew install pandoc

stackcverflow commented 1 year ago

Problem solved with: pip install pypandoc-binary

(no need to worry about the PATH with this command)
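
To see which pandoc pypandoc actually resolves after installing, the following sketch queries pypandoc directly. It uses pypandoc's get_pandoc_version() and get_pandoc_path() helpers and falls back gracefully when pypandoc or pandoc is missing; describe_pandoc is a hypothetical name.

```python
def describe_pandoc() -> str:
    """Report which pandoc binary pypandoc resolves, if any."""
    try:
        import pypandoc
    except ImportError:
        return "pypandoc is not installed"
    try:
        # Both calls raise OSError when no pandoc binary can be located.
        return f"pandoc {pypandoc.get_pandoc_version()} at {pypandoc.get_pandoc_path()}"
    except OSError as exc:
        return f"pypandoc is installed but found no pandoc: {exc}"


print(describe_pandoc())
```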

baisechundu commented 5 months ago

pip install pypandoc-binary

Great! It works for me! Thanks

Neyromancer commented 4 months ago

pip install pypandoc-binary

did not work for me. I ran into this problem in a Debian-based Docker container.