We've had PDF::Reader::Page#text for years. It returns a plain text representation of the page, with the text laid out and positioned. It works reasonably well, but the positioning algorithm has bugs and we get occasional reports of pages where the layout comes out wrong.
Page#runs is an escape hatch. Users can request the character info for the page as an Array of PDF::Reader::TextRun objects. Each TextRun has one or more characters, the x,y co-ordinates of the origin (bottom left corner), the width, and the font size. Users can use this data in any way they need, possibly including building a custom layout algorithm.
Use it like this:
PDF::Reader.open("somefile.pdf") do |reader|
  reader.page(1).runs.each do |run|
    puts "#{run.text} (#{run.x},#{run.y}) width: #{run.width} font_size: #{run.font_size}"
  end
end
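As a rough illustration of the "custom layout" idea, here is a minimal sketch that groups runs into lines by their y co-ordinate and orders each line left to right. It is not how Page#text works internally. `Run` is a stand-in Struct (not PDF::Reader::TextRun) so the snippet runs without a PDF, and the `group_into_lines` helper and its `tolerance` parameter are hypothetical names invented for this example.

```ruby
# Stand-in for PDF::Reader::TextRun so the sketch runs without a PDF file.
# Only the text, x and y readers used in the example above are assumed.
Run = Struct.new(:text, :x, :y)

# Hypothetical helper: bucket runs whose origins are within `tolerance`
# points vertically of the first run on the line, then emit the lines
# top to bottom with each line ordered left to right.
def group_into_lines(runs, tolerance: 2.0)
  lines = []
  # PDF y grows upwards, so descending y walks the page top to bottom.
  runs.sort_by { |run| -run.y }.each do |run|
    current = lines.last
    if current && (current.first.y - run.y).abs <= tolerance
      current << run
    else
      lines << [run]
    end
  end
  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

sample = [
  Run.new("world", 60.0, 700.0),
  Run.new("Hello", 10.0, 700.5),
  Run.new("Second line", 10.0, 680.0),
]
puts group_into_lines(sample)
```

With a real document, the Array returned by `reader.page(1).runs` could be passed in place of `sample`. A sensible tolerance probably depends on the font sizes in the document.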
The runs method supports some options:
rect - A PDF::Reader::Rectangle - filter out any characters that aren't inside this rectangle. Defaults to the page MediaBox
skip_zero_width - skip text that renders with a width of 0. Defaults to true
skip_overlapping - skip duplicate text that renders on top of identical text. Defaults to true
merge - merge characters that render in sequence. Defaults to true. If this is false, the result will be an Array of single character TextRuns.
For example, here's extracting the runs for page 1 without merging:
PDF::Reader.open("somefile.pdf") do |reader|
  reader.page(1).runs(merge: false).each do |run|
    puts run.inspect
  end
end
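One thing unmerged runs make possible is deciding word breaks yourself. The sketch below joins single-character runs from one line and inserts a space wherever the horizontal gap between neighbours exceeds a threshold. It assumes every glyph sits on the same line. `Glyph` is a stand-in Struct (not PDF::Reader::TextRun), and `join_with_gaps` and its `gap` parameter are hypothetical names for this example.

```ruby
# Stand-in for a single-character TextRun; only text, x and width are assumed.
Glyph = Struct.new(:text, :x, :width)

# Hypothetical helper: join glyphs left to right, adding a space when the
# distance from the previous glyph's right edge to this glyph's origin
# is larger than `gap` points.
def join_with_gaps(glyphs, gap: 1.5)
  sorted = glyphs.sort_by(&:x)
  out = +""
  sorted.each_with_index do |glyph, index|
    if index > 0
      previous = sorted[index - 1]
      out << " " if glyph.x - (previous.x + previous.width) > gap
    end
    out << glyph.text
  end
  out
end

glyphs = [
  Glyph.new("H", 10.0, 6.0),
  Glyph.new("i", 16.0, 3.0),
  Glyph.new("y", 24.0, 5.0),
  Glyph.new("o", 29.0, 5.0),
]
puts join_with_gaps(glyphs)
```

A fixed threshold is the simplest choice. Scaling it by `font_size` may work better for documents that mix text sizes.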
A side effect of this change is that Page#text also accepts the above options. This might be useful for folks who want to extract text from a part of the page without implementing their own algorithm. That would look something like this:
PDF::Reader.open("somefile.pdf") do |reader|
  puts reader.page(1).text(rect: PDF::Reader::Rectangle.new(0, 0, 100, 100))
end