Ingest pdfs with PyMuPDF instead of pypdf

zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks

https://privategpt.dev

Apache License 2.0


Ingest pdfs with PyMuPDF instead of pypdf #1114

Closed: namp closed this issue 11 months ago

namp commented 11 months ago

I'm getting a lot of errors when parsing PDF files with the new version of privateGPT, for example:

pypdf/_cmap.py", line 369, in parse_bfrange
    ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
        ^^^^^^^^^^^^^^^^^^^
binascii.Error: Odd-length string

The primordial version used PyMuPDF, which parsed my PDF files without issues.

Is there any way to set PyMuPDFLoader as the default loader for ingesting PDF files?

Thanks

namp commented 11 months ago

I (kind of) answered my own question with a quick and dirty code injection in ingest_service.py.

If anyone's interested, feel free to contact me.

Thanks

jmellano commented 8 months ago

Hello

I could be interested! I would like to change the PDF conversion from pypdf to something like ChatDOC PDF Parser. It would give me more context from unstructured data such as PDFs.

xXPhenomXx commented 8 months ago

@namp Would love any insight on how you were able to switch the PDF loader. I'm also encountering the same issue with bulk PDF imports.

Thanks
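For anyone landing here with the same question: the change namp made in ingest_service.py was never posted, so the following is only a rough sketch of the general idea, not privateGPT's actual code. It tries PyMuPDF first and falls back to pypdf if the first parser raises. The names extract_text, extract_with_pymupdf, and extract_with_pypdf are hypothetical helpers made up for this example, and wiring the result into privateGPT's ingestion pipeline is left out because that depends on the version you run.

```python
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical type alias for this sketch: an extractor maps a PDF path to text.
Extractor = Callable[[Path], str]


def extract_with_pymupdf(path: Path) -> str:
    """Extract plain text with PyMuPDF (assumes `pip install pymupdf`)."""
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pypdf(path: Path) -> str:
    """Extract plain text with pypdf (assumes `pip install pypdf`)."""
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() can return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(path: Path, extractors: Sequence[Extractor]) -> str:
    """Return the text from the first extractor that does not raise."""
    errors = []
    for extractor in extractors:
        try:
            return extractor(path)
        except Exception as exc:  # parsers raise many different error types
            name = getattr(extractor, "__name__", repr(extractor))
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"all extractors failed for {path}: " + "; ".join(errors))
```

Calling extract_text(Path("report.pdf"), [extract_with_pymupdf, extract_with_pypdf]) would prefer PyMuPDF and only fall back to pypdf on failure. Collecting the per-parser errors makes it easier to see which files break which parser during a bulk import.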