yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License
1.79k stars 269 forks

Does pdf-reader support tagged PDFs? #322

Open Noctambul opened 4 years ago

Noctambul commented 4 years ago

Hi,

I'm working with some tagged PDFs and I need to extract tables from them. The tables are tagged, and I think reading the tags is the only way to parse them properly: the rows have different cell sizes, and a table can span several pages.

So I'm wondering whether the pdf-reader API can handle tagged PDFs like these.

Thank you for your attention.

MonsieurDart commented 4 years ago

Yes, and this could also help with accessible PDFs.

yob commented 4 years ago

I believe pdf-reader will provide access to the tagged data, but it's pretty low level. For example, the high-ish level Page#text method ignores tags, but the low-level Page#walk_contents method should generate callbacks for tags.

Unfortunately I haven't worked with tagged PDFs myself, so I'm not super familiar with how to extract the data.
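
For anyone exploring this, here is a rough, untested-against-real-files sketch of the low-level approach described above. It assumes a receiver object is passed to Page#walk (the public content-stream walker in the pdf-reader releases I'm aware of; adjust if your version names it differently) and that the marked-content operators BMC, BDC and EMC arrive as the callbacks begin_marked_content, begin_marked_content_with_pl and end_marked_content. The class name MarkedContentCollector and the helper marked_content_runs are hypothetical, not part of pdf-reader.

```ruby
# Hypothetical receiver (not part of pdf-reader): collects the raw text shown
# inside each marked-content sequence, together with its tag and properties.
class MarkedContentCollector
  Run = Struct.new(:tag, :properties, :text)

  attr_reader :runs

  def initialize
    @runs = []   # finished sequences, in the order they were closed
    @stack = []  # currently open sequences (they can nest)
  end

  # BMC operator: a tag with no property list.
  def begin_marked_content(tag)
    open_run(tag, nil)
  end

  # BDC operator: a tag plus properties. The properties are usually a hash
  # (often holding :MCID) but may be a name pointing into the page resources.
  def begin_marked_content_with_pl(tag, properties)
    open_run(tag, properties)
  end

  # EMC operator: closes the innermost open sequence, if there is one.
  def end_marked_content
    run = @stack.pop
    @runs << run if run
  end

  # Tj operator: a single string.
  def show_text(string)
    append(string)
  end

  # TJ operator: an array mixing strings and numeric spacing adjustments.
  def show_text_with_positioning(items)
    items.each { |item| append(item) if item.is_a?(String) }
  end

  private

  def open_run(tag, properties)
    @stack.push(Run.new(tag, properties, String.new))
  end

  # Text is kept as raw bytes. Decoding to UTF-8 depends on the current font
  # and is deliberately left out of this sketch.
  def append(string)
    @stack.last.text << string.b unless @stack.empty?
  end
end

# Hypothetical helper: walks every page and returns all collected runs.
# The gem is required lazily so the collector can be tried without it.
def marked_content_runs(path)
  require "pdf/reader"
  PDF::Reader.new(path).pages.flat_map do |page|
    collector = MarkedContentCollector.new
    page.walk(collector)
    collector.runs
  end
end
```

Calling marked_content_runs("some.pdf") should then give one Run per marked-content sequence. Note that this only recovers the tagged pieces of the content stream: in a tagged PDF the table layout itself (Table, TR, TD elements) lives in the document's structure tree, which points back at these sequences through their MCID values, so rebuilding rows and cells would still require walking that tree separately.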

Noctambul commented 4 years ago

Thank you for your answer and for the details. We will look into your suggestion carefully :)