yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Add PDF::Reader::Page#runs method, to extract text from a page with positioning info #411

Closed yob closed 2 years ago

yob commented 2 years ago

We've had PDF::Reader::Page#text for years. It returns a plain text representation of the page, with the text laid out and positioned. It works reasonably well, but the positioning algorithm has bugs and we get occasional reports of pages where the layout comes out wrong.

Page#runs is an escape hatch. Users can request the character info for the page as an Array of PDF::Reader::TextRun objects. Each TextRun has one or more characters, the x,y co-ordinates of the origin (bottom left corner), the width, and the font size. Users can use this data in any way they need, possibly including building a custom layout algorithm.

Use it like this:

PDF::Reader.open("somefile.pdf") do |reader|
  reader.page(1).runs.each do |run|
    puts "#{run.text} (#{run.x},#{run.y}) width: #{run.width} font_size: #{run.font_size}"
  end
end
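
For anyone wondering what a custom layout algorithm on top of this data could look like, here's a minimal sketch that groups runs into lines by their y co-ordinate and reads each line left to right. It is not what Page#text does internally. FakeRun is a hypothetical stand-in for PDF::Reader::TextRun so the snippet runs without a PDF, and runs_to_lines and y_tolerance are names made up for illustration.

```ruby
# Hypothetical stand-in for PDF::Reader::TextRun, with the attributes described above.
FakeRun = Struct.new(:text, :x, :y, :width, :font_size)

# Group runs into lines by y co-ordinate, then order each line left to right.
# y_tolerance is an assumed fudge factor, not something the library defines.
def runs_to_lines(runs, y_tolerance: 2.0)
  lines = []
  # The origin is the bottom left corner, so larger y values are higher on the page.
  runs.sort_by { |run| -run.y }.each do |run|
    line = lines.last
    if line && (line.first.y - run.y).abs <= y_tolerance
      line << run      # close enough to the current line's baseline
    else
      lines << [run]   # start a new line
    end
  end
  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

sample = [
  FakeRun.new("world", 60.0, 700.0, 30.0, 12.0),
  FakeRun.new("Hello", 20.0, 700.5, 30.0, 12.0),
  FakeRun.new("Second line", 20.0, 680.0, 70.0, 12.0),
]
puts runs_to_lines(sample)
```

With a real document, the array returned by page.runs could be passed in place of sample, since the sketch only relies on text, x and y.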

The runs method supports some options, including merge: and rect:, which the examples below use.

For example, here's extracting the runs for page 1 without merging:

PDF::Reader.open("somefile.pdf") do |reader|
  reader.page(1).runs(merge: false).each do |run|
    puts run.inspect
  end
end

A side effect of this change is that Page#text also accepts the above options. This might be useful for folks who want to extract text from a part of the page without implementing their own algorithm. That would look something like this:

PDF::Reader.open("somefile.pdf") do |reader|
  puts reader.page(1).text(rect: PDF::Reader::Rectangle.new(0, 0, 100, 100))
end
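
If the rect option doesn't select quite what you need, the same kind of filtering can be done by hand on the runs. The sketch below keeps only runs whose origin falls inside a box. StubRun is a hypothetical stand-in for PDF::Reader::TextRun, runs_in_box is a made-up helper, and testing the origin point alone is an assumption for the sake of the example; the library's own rect handling may decide membership differently.

```ruby
# Hypothetical stand-in for PDF::Reader::TextRun.
StubRun = Struct.new(:text, :x, :y, :width, :font_size)

# Keep runs whose origin (bottom left corner) lies inside the given box.
# Checking only the origin is a simplification chosen for this sketch.
def runs_in_box(runs, left:, bottom:, right:, top:)
  runs.select do |run|
    run.x.between?(left, right) && run.y.between?(bottom, top)
  end
end

runs = [
  StubRun.new("footer", 50.0, 20.0, 40.0, 8.0),
  StubRun.new("body", 50.0, 400.0, 30.0, 12.0),
]
inside = runs_in_box(runs, left: 0, bottom: 0, right: 100, top: 100)
puts inside.map(&:text).inspect
```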