zylon-ai / private-gpt

Interact with your documents using the power of GPT, 100% privately, no data leaks
https://privategpt.dev
Apache License 2.0

Scanned PDFs are not loaded, and no error is raised #1960

Open mrepetto-certx opened 3 months ago

mrepetto-certx commented 3 months ago

I noticed that scanned PDFs are not imported when loaded with the SDK or the GUI. To cope with that, someone implemented an OCR layer (#1610). You can reproduce this behavior with any scanned PDF, such as https://jeroen.github.io/images/ocrscan.pdf.

More importantly, no error is thrown, which makes using PrivateGPT in non-UI mode problematic: the caller has no way to tell that nothing was ingested. I'm planning to implement an OCR layer to cope with that. Does anyone know whether such a feature is already implemented in the current stack, or do I need to add extra dependencies such as https://github.com/tesseract-ocr/tesseract?
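To make the "no error" part concrete, here is a minimal sketch of the kind of check that could surface the problem. It is not PrivateGPT's actual ingestion code: `NoExtractableTextError` and `ensure_extractable_text` are hypothetical names, and the check assumes the per-page text has already been extracted by whatever PDF reader is in use (for example, `page.extract_text()` from pypdf).

```python
from typing import Iterable, Optional


class NoExtractableTextError(ValueError):
    """Hypothetical error: the document yielded no text (e.g. an image-only PDF)."""


def ensure_extractable_text(
    page_texts: Iterable[Optional[str]], source: str, min_chars: int = 1
) -> int:
    """Return the total number of stripped characters across pages.

    Raises NoExtractableTextError if fewer than `min_chars` characters
    were found, instead of silently ingesting an empty document.
    """
    # Pages with no text layer typically come back as "" or None.
    total = sum(len(text.strip()) for text in page_texts if text)
    if total < min_chars:
        raise NoExtractableTextError(
            f"{source}: no extractable text found; "
            "the file may be a scanned PDF that needs OCR"
        )
    return total
```

Called with the page texts of a scanned file (all empty), this raises instead of returning quietly, so a script using the SDK can catch the error and decide whether to skip the file or send it through OCR.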

jaluma commented 2 months ago

We are discussing how to implement OCR in PrivateGPT. Bundling OCR would significantly increase the package size (many dependencies, etc.). Anyway, feel free to open a new PR showing how you'd implement this feature :)
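One common way to keep the base package small is to treat the OCR backend as an optional dependency that is only imported when it is actually needed. The sketch below illustrates that pattern under stated assumptions; it is not something the project has adopted. `load_optional_backend` is a hypothetical helper, and the backend module name (for instance `pytesseract`, a Python wrapper around Tesseract) is just an example.

```python
import importlib
import importlib.util
from types import ModuleType


def load_optional_backend(module_name: str) -> ModuleType:
    """Import an optional top-level module lazily, with a clear error if absent.

    `module_name` is expected to be a top-level package name such as
    "pytesseract"; nothing is imported until this function is called.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Optional OCR backend '{module_name}' is not installed; "
            "install it to enable OCR for scanned PDFs"
        )
    return importlib.import_module(module_name)
```

With this approach, users who never ingest scanned documents do not pay for the extra dependencies, and those who do get an explicit message telling them what to install.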