Open Dipankar1997161 opened 1 year ago
Hello @Dipankar1997161,
Firstly, I apologize for the late reply. Thank you for your kind words and interest in my project.
To address your question: I recommend converting .pdf files to .txt format before passing them into the model mainly because of the complex formatting common in scientific papers. These papers often use a two-column or otherwise multi-part layout (columns, figures, tables, etc.), which makes automated text extraction, including OCR (Optical Character Recognition), quite challenging.
When using extraction tools like PyPDF or similar, the extracted text can be misaligned or include unexpected characters. This can lead to undesired results when performing the forward operations of the model.
By manually converting the PDF to a .txt file, you have more control over the text format, and you can ensure that only the relevant text is inputted into the model. This is why I suggest taking the extra step to convert the file before proceeding.
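For anyone who still wants to automate the first pass, here is a minimal sketch of extracting text to a .txt file and tidying it up. It assumes the third-party `pypdf` package is installed; the function names `clean_extracted_text` and `pdf_to_txt` are hypothetical and not part of this project. The cleanup rules are only illustrative, so the resulting file should still be reviewed by hand, especially for multi-column papers.

```python
import re
from pathlib import Path


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled from a PDF (hypothetical helper, not from this repo)."""
    # Rejoin words split by a hyphen at a line break: "hy-\ndrates" -> "hydrates".
    # Caveat: this also merges genuine hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop control characters (form feeds, carriage returns, ...) but keep \n and \t.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then squeeze three or more newlines into one blank line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract text from every page with pypdf and write a cleaned .txt file."""
    from pypdf import PdfReader  # assumption: installed via `pip install pypdf`

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty result for image-only pages.
    raw = "\n".join(page.extract_text() or "" for page in reader.pages)
    Path(txt_path).write_text(clean_extracted_text(raw), encoding="utf-8")
```

Calling `pdf_to_txt("paper.pdf", "paper.txt")` (both file names are placeholders) would produce a draft text file that you can then edit manually before passing it to the model.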
I hope this clarifies your question. If you have any more questions or need further clarification, feel free to ask.
Best regards, @wjgoarxiv
Thanks for the work. Just wondering: is there a way to extract the text from a PDF and then perform the forward operations of this model directly, or do we need to convert it to .txt before passing it in?
I saw you recommended converting it to .txt first. Is there a reason behind that? Kindly let me know.
Thank you, @wjgoarxiv