wp-cli / php-cli-tools

A collection of tools to help with PHP command line utilities
MIT License

Nepali character width calculated incorrectly #103

Closed: danielbachhuber closed this issue 7 years ago

danielbachhuber commented 7 years ago

[screenshot: term]

From https://github.com/wp-cli/wp-cli/issues/3038#issuecomment-230158804

johnbillion commented 7 years ago

Yay! Character encoding!

This is due to combining marks in the string. mb_strlen() counts code points, but when a combining mark is printed it is combined with the preceding character to form a single grapheme that occupies the width of one character. Graphemes are what we actually want to count in a situation like this, where the number of printed columns matters.

mb_strlen() will overcount by one for every combining mark present in the string, so the computed padding ends up one column short each time.
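To make the mismatch concrete, here is a minimal sketch comparing the two counts for a string containing one combining mark. The sample string and variable names are illustrative assumptions and do not come from php-cli-tools.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one printed column.
// The "\u{...}" escape needs PHP 7 or later.
$str = "e\u{0301}";

// Code points, as mb_strlen() counts them (regex fallback if mbstring is missing).
$code_points = function_exists( 'mb_strlen' )
    ? mb_strlen( $str, 'UTF-8' )
    : preg_match_all( '/./su', $str );

// Extended grapheme clusters, as \X matches them.
$graphemes = preg_match_all( '/\X/u', $str );

printf( "code points: %d, graphemes: %d\n", $code_points, $graphemes );
// prints "code points: 2, graphemes: 1"
```

A padding routine that subtracts the first count from the column width would add one space too few for this string.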

You can count the number of graphemes in a string with grapheme_strlen() but this requires the intl extension, and I've no idea how widespread that is. Stack Overflow tells me that preg_match_all( '/\X/u', $str) is an alternative.
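A rough sketch of how the two approaches could be combined: use grapheme_strlen() when the intl extension is loaded and fall back to the \X regex otherwise. The function name count_graphemes() is hypothetical, and the byte-length fallback for invalid UTF-8 is an assumption about what a caller might want, not something stated in this thread.

```php
<?php
// Hypothetical helper: count grapheme clusters with or without the intl extension.
function count_graphemes( $str ) {
    if ( function_exists( 'grapheme_strlen' ) ) {
        // grapheme_strlen() returns false or null on failure.
        $count = grapheme_strlen( $str );
        if ( is_int( $count ) ) {
            return $count;
        }
    }

    // \X matches one extended grapheme cluster; the match count is the grapheme count.
    $count = preg_match_all( '/\X/u', $str );

    // preg_match_all() returns false for invalid UTF-8; fall back to the byte length (assumption).
    return false === $count ? strlen( $str ) : $count;
}

echo count_graphemes( "e\u{0301}" ), "\n"; // prints "1"
```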

The character counting issue in php-cli-tools is in \cli\safe_strlen().

johnbillion commented 7 years ago

This might be a solution, but it's untested. I'll take a proper look later:

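function safe_strlen( $str ) {
    // \X matches one extended grapheme cluster, so the match count is the grapheme count.
    // Note: preg_match_all() returns false if $str is not valid UTF-8.
    return preg_match_all( '/\X/u', $str );
}

danielbachhuber commented 7 years ago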

> Yay! Character encoding!

I think I detect sarcasm here but I'm not quite sure... ;)

> This might be a solution, but it's untested. I'll take a proper look later

Sounds good, thanks!

johnbillion commented 7 years ago

Wow. Turns out that the font makes a difference to how some combined characters appear. Here's the same output in two different fonts:

SF Mono:

screenshot 2017-02-17 17 20 54

Menlo:

screenshot 2017-02-17 17 21 02

danielbachhuber commented 7 years ago

> Turns out that the font makes a difference to how some combined characters appear.

:(

gitlost commented 7 years ago

I don't think there's much one can do about fonts not displaying stuff correctly, but I got the original Nepali example working (on Ubuntu at least). A new function, strwidth(), uses the suggested grapheme_strlen() (with preg_match_all( '/\X/u' ) as a backup) and adjusts for East Asian Width; safe_str_pad() calls it.

screenshot from 2017-07-22 23-43-54

PR to follow.

Edit: just noticed the padding for post_title is off so pushed a fix for that.
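For readers who want a feel for the approach described above, here is a simplified sketch of a width function and a padding function built on it. It is not the code from the PR: the names sketch_strwidth() and sketch_str_pad() are hypothetical, and the wide-character ranges are a rough, incomplete subset of East Asian Wide and Fullwidth code points chosen for illustration.

```php
<?php
// Hypothetical helper, not the php-cli-tools strwidth(): estimate terminal columns.
function sketch_strwidth( $str ) {
    // Start with one column per extended grapheme cluster.
    $width = preg_match_all( '/\X/u', $str, $clusters );
    if ( false === $width ) {
        return strlen( $str ); // Invalid UTF-8: fall back to the byte length (assumption).
    }

    // Rough, incomplete subset of East Asian Wide/Fullwidth ranges (assumption).
    $wide = '/^[\x{1100}-\x{115F}\x{2E80}-\x{A4CF}\x{AC00}-\x{D7A3}'
        . '\x{F900}-\x{FAFF}\x{FF00}-\x{FF60}\x{FFE0}-\x{FFE6}]/u';

    // Add a second column for each cluster that starts with a wide character.
    foreach ( $clusters[0] as $cluster ) {
        if ( preg_match( $wide, $cluster ) ) {
            $width++;
        }
    }
    return $width;
}

// Hypothetical helper: right-pad with spaces up to $columns terminal columns.
function sketch_str_pad( $str, $columns ) {
    $missing = $columns - sketch_strwidth( $str );
    return $missing > 0 ? $str . str_repeat( ' ', $missing ) : $str;
}

echo sketch_strwidth( "e\u{0301}" ), "\n"; // prints "1"
echo sketch_strwidth( '日本' ), "\n";       // prints "4"
```

As the screenshots earlier in the thread show, the terminal font still has the final say over how combined characters are drawn, so any width estimate of this kind is a best effort.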

schlessera commented 7 years ago

Resolved (as well as possible) through #107.