When extracting text from a page we're taking all the characters from that page (which can be a variety of sizes) and forcing them into a plain text string of a single size.
We have to determine the number of rows and columns in the plain text output, and we need enough of both to fit all the text.
To determine the number of columns I've been using the mean of the character widths on the page. That mostly works OK, but on pages with a handful of large characters (like headings) the mean skews high, which underestimates the column count. By switching to the median, we know roughly half of the text runs have characters at least as wide as the value we divide by, so a few oversized glyphs can't throw off the result.
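For reference, the idea is roughly this (a hedged sketch; `char_widths`, `page_width`, and `column_count` are illustrative names, not identifiers from this codebase):

```python
import math
import statistics

def column_count(char_widths, page_width):
    # The median is robust to a handful of oversized characters
    # (e.g. headings), whereas the mean gets skewed high by them.
    median_width = statistics.median(char_widths)
    return math.ceil(page_width / median_width)

# Mostly body text at width 5, plus a few wide heading glyphs:
widths = [5] * 20 + [12, 12, 14]
print(column_count(widths, 500))  # median is 5, so 100 columns
```

With the mean, those three heading glyphs would pull the average width up to about 6 and shrink the output to 84 columns, cramping the body text.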
This is still a super naive algorithm. It's good enough for the basic approach though, and anyone with more advanced needs will still need to build their own layout algorithm.
In the long term, it'd be nice to change this algorithm to avoid silently overwriting characters.
Partial fix for #354