General question on the working

Hello @Dipankar1997161,

Firstly, I apologize for the late reply. Thank you for your kind words and interest in my project.

To address your question: I recommend converting .pdf files to .txt before passing them to the model mainly because of the complex formatting commonly found in scientific papers. These papers often use a multi-column layout with figures, tables, and so on, which makes automated text extraction (including OCR, Optical Character Recognition) quite challenging.

With text-extraction tools like PyPDF, the extracted text can be misaligned (for example, lines from adjacent columns interleaved) or include unexpected characters. This can lead to poor results when the text is fed to the model.

By manually converting the PDF to a .txt file, you have more control over the text format, and you can ensure that only the relevant text is passed to the model. This is why I suggest taking the extra step of converting the file before proceeding.
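
If you would like a starting point for that conversion, here is a minimal sketch. It is not part of PaperSumGPT; the helper names `clean_extracted_text` and `pdf_to_txt` are hypothetical, and the second one assumes the third-party `pypdf` package is installed. It only tidies common extraction debris, so the resulting .txt should still be reviewed by hand.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled from a PDF before saving it as .txt (hypothetical helper)."""
    # Normalize compatibility characters, e.g. the ligature "ﬁ" becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    # Limitation: a genuine compound split at its hyphen loses the hyphen too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control characters (form feeds, etc.), keeping newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract text with pypdf, clean it, and write it to a .txt file."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(pdf_path)
    raw = "\n".join(page.extract_text() or "" for page in reader.pages)
    with open(txt_path, "w", encoding="utf-8") as fh:
        fh.write(clean_extracted_text(raw))
```

Calling `pdf_to_txt("paper.pdf", "paper.txt")` (file names are placeholders) would give you a draft .txt file. Multi-column pages can still come out in the wrong reading order, so please check the file and remove irrelevant parts before using it.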

I hope this clarifies your question. If you have any more questions or need further clarification, feel free to ask.

Best regards, @wjgoarxiv

wjgoarxiv / PaperSumGPT

General question on the working #2