wjgoarxiv / PaperSumGPT

::A tool to abbreviate scientific paper contents using ChatGPT::
MIT License

General question on the working #2

Open Dipankar1997161 opened 1 year ago

Dipankar1997161 commented 1 year ago

Thanks for the work. Just wondering: is there a way to extract the text directly from a PDF and then run the model on it, or do we need to convert the file to .txt first before passing it in?

I saw you recommended converting it to .txt first. Is there a reason behind that? Kindly let me know.

Thank you, @wjgoarxiv

wjgoarxiv commented 1 year ago

Hello @Dipankar1997161,

Firstly, I apologize for the late reply. Thank you for your kind words and interest in my project.

To address your question: the reason I recommend converting .pdf files to .txt format before passing them into the model is primarily the complex formatting commonly found in scientific papers. These papers often use a two-column, multi-element layout (columns, figures, tables, etc.), which makes automated text extraction, and OCR (Optical Character Recognition) for scanned PDFs, quite challenging.

When using text-extraction tools like PyPDF or similar, the extracted text can be misaligned (for example, lines from adjacent columns interleaved) or include unexpected characters. This can lead to undesired results when the text is passed to the model.

By manually converting the PDF to a .txt file, you have more control over the text format, and you can ensure that only the relevant text is inputted into the model. This is why I suggest taking the extra step to convert the file before proceeding.
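If you would still like to automate the conversion, here is a minimal sketch of one way to do it. It is not part of PaperSumGPT. The function names `extract_pdf_text` and `clean_extracted_text` are hypothetical, and the extraction step assumes the third-party `pypdf` package is installed. The cleanup rules are only a starting point and will not repair text whose columns were interleaved during extraction.

```python
import re
import unicodedata


def extract_pdf_text(pdf_path):
    """Pull raw text from every page (assumes the third-party pypdf package)."""
    # Imported lazily so the cleanup helper below works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def clean_extracted_text(raw):
    """Tidy some common extraction artifacts before saving the text as .txt."""
    # NFKC expands ligatures such as U+FB01 into plain "fi"; note that it also
    # rewrites other compatibility characters (e.g. superscript digits).
    text = unicodedata.normalize("NFKC", raw)
    # Drop control characters (form feeds, NULs, ...) but keep newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Rejoin words hyphenated across a line break. Caveat: this also merges
    # genuine hyphenated compounds that happen to break at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Limit consecutive blank lines to one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

You could then write `clean_extracted_text(extract_pdf_text("paper.pdf"))` to a .txt file (the file name here is just a placeholder) and review it by hand before passing it to the model, since figures, tables, and headers usually still need manual trimming.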

I hope this clarifies your question. If you have any more questions or need further clarification, feel free to ask.

Best regards, @wjgoarxiv