yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License
1.82k stars 271 forks

Can I extract data/text specifying coordinates? w, h, x, y? #327

Closed: djalmaaraujo closed this issue 4 years ago

djalmaaraujo commented 4 years ago

Hello @yob,

Imagine that I want to extract text from specific areas of a page, where each area is defined by a width, height, x, and y. Is there a method or an easy solution for this?

Thanks

yob commented 4 years ago

pdf-reader doesn't have support for this out of the box, although I can see it'd be a useful feature.

The quickest way to get this working for now is something based on the JSON extraction we discussed last year: https://gist.github.com/yob/d9e28e39943aec251cb570bf2879bda4. In theory, JsonTextReceiver could accept some co-ordinates for a box and throw away any characters outside that box.
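A minimal sketch of the "throw away any characters outside that box" idea, in plain Ruby. It does not use pdf-reader or the gist's JsonTextReceiver. It assumes you already have one record per character with an x/y position in PDF points, measured from the bottom-left corner of the page. `Char`, `Box`, and `text_in_box` are hypothetical names invented for this illustration, and the real receiver output may be shaped differently.

```ruby
# Hypothetical per-character record: glyph origin (in points) plus its text.
Char = Struct.new(:x, :y, :text)

# Hypothetical axis-aligned box; (x, y) is the bottom-left corner,
# assuming y grows upward as in default PDF user space.
Box = Struct.new(:x, :y, :width, :height) do
  # True when the character's origin lies inside the box (edges included).
  def contains?(char)
    char.x >= x && char.x <= x + width &&
      char.y >= y && char.y <= y + height
  end
end

# Keep only the characters inside the box, then order them top to bottom
# and left to right before joining them into a single string.
def text_in_box(chars, box)
  chars.select { |c| box.contains?(c) }
       .sort_by { |c| [-c.y, c.x] }
       .map(&:text)
       .join
end

chars = [
  Char.new(10, 700, "H"),
  Char.new(16, 700, "i"),
  Char.new(300, 700, "X"), # too far right
  Char.new(10, 100, "Z")   # too far down
]

puts text_in_box(chars, Box.new(0, 650, 100, 100))
```

This only tests each character's origin point, so a glyph that straddles the edge of the box is kept or dropped based on where it starts. It also joins everything into one string without inserting line breaks or spaces.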

djalmaaraujo commented 4 years ago

@yob Thanks, I forgot about this, thanks for reminding me. :)