webwhiz-ai / webwhiz

WebWhiz allows you to create an AI chatbot that knows everything about your product and can instantly respond to your customer's queries.
https://www.webwhiz.ai/
GNU Affero General Public License v3.0

Text inside PDF fails to be crawled #204

Open MohamedAmineDHIAB opened 1 month ago

MohamedAmineDHIAB commented 1 month ago

Dear WebWhiz Team,

When I try to create a chatbot and upload the following PDF on its own, I get a 500 error.

If I add other data files, the chatbot gets created, but it fails to answer my questions about that PDF, saying:

I don't know the answer to that

One possible cause is that the Data Crawler does not support OCR and only retrieves text from PDF files that contain embedded text. For PDF files that appear at first glance to contain text but have no embedded text (for example, scanned documents), the Crawler fails to extract any data, which leads to the issues above.
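
To illustrate the idea, here is a minimal sketch of how a crawler could detect such image-only PDFs before indexing them. It is not WebWhiz code: `check_text_layer`, `TextLayerReport`, and the `min_chars` threshold are hypothetical names I made up, and the function only inspects the per-page strings that some PDF text extractor has already returned.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TextLayerReport:
    """Summary of how many pages yielded no usable text (hypothetical helper)."""
    total_pages: int
    empty_pages: int

    @property
    def needs_ocr(self) -> bool:
        # Treat the document as image-only when it has pages
        # and every one of them came back (almost) empty.
        return self.total_pages > 0 and self.empty_pages == self.total_pages


def check_text_layer(page_texts: Sequence[str], min_chars: int = 20) -> TextLayerReport:
    """Count pages whose extracted text is shorter than min_chars.

    page_texts holds one string per page, as returned by whatever
    PDF text extractor is in use. min_chars is an assumed threshold.
    """
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return TextLayerReport(total_pages=len(page_texts), empty_pages=empty)


# Example: a two-page scan where the extractor found nothing.
report = check_text_layer(["", "  \n"])
if report.needs_ocr:
    print("No embedded text found; this PDF probably needs OCR.")
```

With a library such as pypdf, the input could be built as `[page.extract_text() or "" for page in PdfReader(path).pages]`. When `needs_ocr` is true, the upload could return a clear error message (or be routed to an OCR step) instead of a 500 or an empty knowledge base.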

I hope this can be helpful for debugging this issue.

I also saw that a similar issue has been reported here: #107