Scanned PDFs are not loaded, and no error is raised

zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks

Apache License 2.0

53.57k stars, 7.2k forks

I noticed that scanned PDFs are not imported when loaded with the SDK or the GUI. To work around this, someone implemented an OCR layer (#1610). You can reproduce this behavior with any scanned PDF, such as https://jeroen.github.io/images/ocrscan.pdf.

More importantly, no error is raised, which makes using PrivateGPT in non-UI mode problematic. I'm planning to implement an OCR layer to address this. Does anyone know whether such a feature already exists in the current stack, or would I need to add extra dependencies such as https://github.com/tesseract-ocr/tesseract?
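To illustrate the "fail loudly" part, here is a minimal sketch of a pre-ingestion check that raises when a PDF has no usable text layer. This is not PrivateGPT's ingestion code. `NoExtractableTextError`, `looks_scanned`, `load_pdf_or_fail`, and the `min_chars` threshold are hypothetical names and values I made up, and I'm assuming `pypdf` is available for the text extraction step.

```python
from typing import Iterable, List


class NoExtractableTextError(ValueError):
    """Hypothetical error: the PDF yielded no usable text layer."""


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True when all pages together hold fewer than `min_chars`
    non-whitespace characters (the threshold is an arbitrary assumption)."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total < min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text layer of each page (assumes pypdf is installed)."""
    # Imported lazily so looks_scanned() stays dependency-free.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def load_pdf_or_fail(pdf_path: str) -> List[str]:
    """Return per-page text, or raise instead of silently ingesting nothing."""
    texts = extract_page_texts(pdf_path)
    if looks_scanned(texts):
        raise NoExtractableTextError(
            f"{pdf_path}: no text layer found; OCR (for example Tesseract) "
            "would be needed to ingest this file"
        )
    return texts
```

Calling `load_pdf_or_fail` on a scanned file like the one linked above should then surface an exception that a non-UI caller can catch, rather than an empty import. An OCR fallback could later replace the `raise` branch.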

Scanned PDFs are not loaded, and no error is raised #1960