yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Use median instead of mean when determining page layout size #355

Closed yob closed 3 years ago

yob commented 3 years ago

When extracting text from a page we're taking all the characters from that page (which can be a variety of sizes) and forcing them into a plain text string of a single size.

We have to determine the number of rows and columns in the plain text output, and we need enough of both to fit all the text.

To determine the number of columns I've been using the mean of the character widths on the page. That mostly works OK, but on pages with a handful of large characters (like headings) the mean can skew high, which leaves too few columns for the body text. By using the median, we can be confident that about 50% of the text runs have characters at least as wide as the number we use to calculate columns in the output.
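As a rough illustration of the skew (not the code in this PR), here's a minimal sketch comparing the two statistics. The helper names (`mean`, `median`, `column_count`), the sample widths, and the "page width divided by character width, rounded down" formula are all assumptions made for the example.

```ruby
# Hypothetical helpers for illustration only; not pdf-reader's implementation.

def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  if sorted.size.odd?
    sorted[mid].to_f
  else
    (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

# Assumed formula: how many characters of the given width fit across the page.
def column_count(page_width, char_width)
  (page_width / char_width).floor
end

# Assumed sample: nine body-text characters 6pt wide plus one 30pt heading character.
widths = [6.0] * 9 + [30.0]
page_width = 612.0 # US Letter width in points

mean_cols = column_count(page_width, mean(widths))     # mean is 8.4   => 72 columns
median_cols = column_count(page_width, median(widths)) # median is 6.0 => 102 columns

puts "mean: #{mean_cols} columns, median: #{median_cols} columns"
```

With these made-up numbers a single heading character drags the mean up to 8.4pt and costs 30 columns, while the median stays at the body-text width.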

This is still a super naive algorithm. It's good enough for the basic approach though, and anyone with more advanced needs will still need to build their own layout algorithm.

In the long term, it'd be nice to change this algorithm to avoid silently overwriting characters.

Partial fix for #354