Getting unreadable data (UTF-8 squares 50% of the time)

ScottOster commented 3 years ago

Hi,

I am trying to use pdf-reader to extract text from PDFs and then perform some operations on it.

The problem is that my application is designed to work for anyone, and so far roughly 50% of the sample PDFs do not return any readable data at all, just squares.

My question is: is this expected? Is there a fundamental reason why it is not possible to extract data from the majority of PDFs?

All samples used have been "openable" with Adobe and were generated by common means such as print to PDF.

Thanks in advance for any feedback. I don't mind contributing for the time :-)

yob commented 3 years ago

In my experience pdf-reader does a reasonable (but not perfect) text extraction from the majority of PDFs, but it does depend on the source files.

For the 50% where it doesn't work, are you able to copy and paste the text from another PDF tool (Acrobat, Evince, Preview, Firefox, etc.) into Notepad? If you can, I'd consider the pdf-reader behaviour a bug, but if you can't, then maybe it's an issue with the source PDFs.

As for what the bug is... I think I'd need to see a sample file. Are any of the files online and public?

ScottOster commented 3 years ago

Hi James,

Thank you so much for the prompt response.

I have just sampled 3 of the files that are returning squares, and the text in all three could be copied and pasted from the Adobe, Microsoft and Evince readers!

I have put this question to the developers. The code is not public but I'd be more than happy to share it with you privately.

Thanks again

ScottOster commented 3 years ago

659900.pdf 23781.pdf 4500067854.pdf

Here are some of the sample files used, the ultimate goal being to extract the delivery date.

Any insight greatly appreciated.

yob commented 3 years ago

Hi @scottybigo.

I downloaded all three files and tested text extraction with pdf-reader like this:

$ ruby -Ilib bin/pdf_text ~/downloads/4500067854.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/23781.pdf
$ ruby -Ilib bin/pdf_text ~/downloads/659900.pdf

In all three cases text was printed to my terminal, so I don't think there's a fundamental incompatibility between these particular files and pdf-reader.

pdf_text looks something like this:

require 'pdf/reader'

pdf = PDF::Reader.new("file.pdf")
pdf.pages.each do |page|
  puts page.text
end

Does your code look similar? Could you post a reproduction script that results in little squares?
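For what it's worth, a reproduction script could be as small as the sketch below. It reports, per page, the string encoding and how much of the extracted text consists of "square" characters, which might help tell an extraction problem apart from a terminal or font rendering problem. The names `square_ratio` and `SQUARE_CHARS` are made up for illustration and are not part of pdf-reader, and the sketch assumes the squares are U+25AF or U+FFFD; adjust the list if your output contains something else.

```ruby
# Sketch only: `square_ratio` and SQUARE_CHARS are hypothetical names,
# not part of pdf-reader.
# Assumption: the "squares" are U+25AF or U+FFFD in the extracted string.
SQUARE_CHARS = ["\u25AF", "\uFFFD"].freeze

# Fraction of characters in `text` that are placeholder squares or
# non-whitespace control characters (0.0 for an empty string).
def square_ratio(text)
  # Replace any invalid byte sequences so the string can be split safely.
  chars = text.scrub("\uFFFD").chars
  return 0.0 if chars.empty?

  bad = chars.count do |c|
    SQUARE_CHARS.include?(c) || (c.match?(/[[:cntrl:]]/) && !c.match?(/\s/))
  end
  bad.to_f / chars.size
end

# Usage: ruby repro.rb path/to/file.pdf
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  require 'pdf/reader'

  PDF::Reader.new(ARGV[0]).pages.each_with_index do |page, i|
    text = page.text
    puts format("page %d: encoding=%s squares=%.1f%%",
                i + 1, text.encoding, square_ratio(text) * 100)
  end
end
```

If the percentage is close to zero but the output still shows squares on screen, the problem is more likely in how the text is displayed or passed along afterwards than in the extraction itself.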

ScottOster commented 3 years ago

Thanks again James, will have a look into it.

Much appreciated.

yob / pdf-reader

Getting unreadable data (UTF-8 squares 50% of the time) #345