yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Cropped text in 'Tj' PagesStrategy::OPERATORS #398

Open msk-yv opened 2 years ago

msk-yv commented 2 years ago

What I see in the PDF: [image]

The text I get when I call page.text: [image]

However, in page.raw_content I can see the full date text: [image]

Can I be sure this is just the date format being cropped? Or is it a systematic problem, so that where the PDF contains '22.12.2019' I'll get '22.12.20' instead of '22.12.19'?

yob commented 2 years ago

This is likely to be the fault of the primitive algorithm in PageLayout. I'd love to find time to improve it!

The algorithm sometimes positions characters so that they overlap, in which case some of the characters are left out of the output.
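To illustrate how overlap can make characters disappear, here is a toy sketch. It is not PageLayout's actual code: the `Run` struct, the `layout_line` method, the fixed column width, and the rule that a later run overwrites cells already in use are all assumptions made for the example. It only shows that when text runs are snapped to a fixed character grid, a run that is narrower on the page than its character count will collide with the next run and lose its tail.

```ruby
# Toy illustration only. NOT PDF::Reader's PageLayout implementation.
# Run is a hypothetical (x position in points, text) pair on a single line.
Run = Struct.new(:x, :text)

# Snap each run to a fixed-width character grid and write its characters
# one per cell. Assumption: a later run overwrites cells that are already
# occupied, so the overlapped characters of the earlier run are lost.
def layout_line(runs, col_width:, columns:)
  row = Array.new(columns, " ")
  runs.each do |run|
    start_col = (run.x / col_width).floor
    run.text.each_char.with_index do |char, i|
      col = start_col + i
      next if col >= columns # characters past the grid edge are dropped
      row[col] = char
    end
  end
  row.join.rstrip
end

# The date needs 10 cells, but the next run starts at x = 48.0,
# which maps to column 8 when each column is 6.0 points wide.
runs = [Run.new(0.0, "22.12.2019"), Run.new(48.0, "Total")]
puts layout_line(runs, col_width: 6.0, columns: 20)
# prints "22.12.20Total"
```

In this sketch the last two digits of the year are overwritten, which matches the symptom in the report: a full year in the PDF comes out truncated rather than reformatted.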