yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Page.text fails when font size changes on a single line #371

Open coezbek opened 3 years ago

coezbek commented 3 years ago

When reading text from a document that uses different font sizes on the same line, I have seen Page.text fail in two ways: extra spaces are inserted, and characters are overwritten. I am wondering is this something that pdf-reader is intended to do accurately?

Example file: "hello_world_caps.pdf"


Example spec (fails):


  describe "#text" do
    ...

    it "can deal with different height characters on the same line" do
      @browser = PDF::Reader.new(pdf_spec_file("hello_world_caps"))
      @page    = @browser.page(1)

      expect(@page.text).to eql("HELLO WORLD") # Returns "HELLWORLD"
    end

  end

yob commented 3 years ago

Thanks for a great sample file that demonstrates the issue.

> I am wondering is this something that pdf-reader is intended to do accurately?

I would classify it as a known issue that I'd like to handle better than we currently do. Probably the algorithm in PageLayout needs a significant overhaul, which is a bummer.
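
To illustrate how a layout step could produce both symptoms from the report, here is a toy model in plain Ruby. It is not pdf-reader's PageLayout code. It assumes, purely for illustration, that each text run is written onto a character grid at the column `x / col_width`, with one fixed column width for the whole line. `ToyRun`, `grid_layout`, the x positions, and the column widths are all hypothetical.

```ruby
# Toy model only: this is NOT pdf-reader's PageLayout implementation.
# A run is a piece of text plus the x position (in points) where it starts.
ToyRun = Struct.new(:text, :x)

# Write each run onto a character grid. The column is the run's x position
# divided by a single, line-wide column width (an assumption of this sketch).
def grid_layout(runs, col_width)
  line = +""
  runs.each do |run|
    col = (run.x / col_width).round
    # Pad with spaces if the run starts beyond the current end of the line.
    line << " " * (col - line.length) if col > line.length
    # Write the run; this overwrites anything already at those columns.
    line[col, run.text.length] = run.text
  end
  line
end

# "HELLO WORLD" set as small caps: large glyphs are assumed to be 8pt wide,
# small glyphs 5pt wide, and the word gap 3pt wide (no space glyph).
runs = [
  ToyRun.new("H",    0.0),
  ToyRun.new("ELLO", 8.0),
  ToyRun.new("W",    31.0),
  ToyRun.new("ORLD", 39.0)
]

puts grid_layout(runs, 8.0) # column width of a large glyph => "HELLWORLD"
puts grid_layout(runs, 5.0) # column width of a small glyph => "H ELLOW ORLD"
```

With the wider column, the small glyphs take up less real width than the grid reserves for them, so the next run lands on an occupied column and overwrites a character. With the narrower column, the large glyphs take up more than one column, so padding spaces appear where none exist on the page. Any single column width trades one failure for the other, which fits the suggestion above that a better result may need a different approach rather than a tweak.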