whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

`input[maxlength]` breaks a grapheme cluster down into pieces #7861

Open saschanaz opened 2 years ago

saschanaz commented 2 years ago
<input maxlength="7">

Copy-pasting 🏳️‍⚧ into that input twice results in 🏳️‍⚧🏳 in both Firefox and Chrome, which is fairly unexpected for users and websites can't really control this. (https://github.com/mastodon/mastodon/issues/18038)

Can we specify that the cluster should not be broken down and that the partial insertion should instead be prevented altogether, so that the result is 🏳️‍⚧ instead of 🏳️‍⚧🏳? (One counterpoint is that this can't be done consistently across browsers, since the set of clusters changes every time a Unicode update happens, but I'm not sure how that would cause an actual interoperability issue in this case.)

(#1467 shows that maxlength is complicated, as WebKit counts the emoji as a single character, but that issue is about how characters are counted rather than how they are truncated.)
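The truncation described above can be reproduced outside a browser. The following is a minimal sketch, not any browser's actual implementation. It assumes the flag is the sequence U+1F3F3 U+FE0F U+200D U+26A7 (5 UTF-16 code units), which is consistent with the reported result, and the helper names `truncateByCodeUnits` and `truncateByGraphemes` are hypothetical. Cluster boundaries come from `Intl.Segmenter`, so they depend on the Unicode version of the runtime.

```typescript
// Assumption: the flag is U+1F3F3 U+FE0F U+200D U+26A7 (5 UTF-16 code units).
const flag = "\u{1F3F3}\uFE0F\u200D\u26A7";

// Cut at a UTF-16 code unit offset, which matches the reported behavior.
function truncateByCodeUnits(text: string, max: number): string {
  return text.slice(0, max);
}

// Hypothetical alternative: keep only the whole grapheme clusters that fit.
function truncateByGraphemes(text: string, max: number): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  for (const { segment } of segmenter.segment(text)) {
    // Stop before the first cluster that would exceed the limit.
    if (result.length + segment.length > max) break;
    result += segment;
  }
  return result;
}

const pastedTwice = flag + flag; // 10 code units

// The second flag is cut after its first code point, leaving a bare white flag.
console.log(truncateByCodeUnits(pastedTwice, 7) === flag + "\u{1F3F3}"); // true

// The second flag does not fit as a whole, so it is dropped entirely.
console.log(truncateByGraphemes(pastedTwice, 7) === flag); // true
```

Both functions still measure the limit in UTF-16 code units; only the cut position differs.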

domenic commented 2 years ago

This kind of falls at the boundary between HTML and UI Events, and UI Events is unfortunately not that well maintained... Maybe the best we can do is throw something into HTML.

Is this kind of limitation something browsers are interested in implementing?

saschanaz commented 2 years ago

I can take a look at the implementation some day if @annevk is okay with that.

annevk commented 2 years ago

Yeah, seems like a reasonable thing to fix.

I suppose an argument could be made that how maxlength is enforced for user input is a UI decision and browsers should be allowed to decide how to trim excessive inputs. (Websites could grab the paste event and set value directly if they don't like that.)
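As a rough illustration of the workaround mentioned in parentheses, a page could intercept the paste and write the value itself. This is only a sketch under stated assumptions: `InputLike` and `PasteLike` are hypothetical structural stand-ins for `HTMLInputElement` and `ClipboardEvent` so that the code runs outside a browser, and `fitGraphemes` and `handlePaste` are illustrative names, not part of any API.

```typescript
// Hypothetical stand-ins for the DOM types; real elements and events have the same members.
interface InputLike {
  value: string;
  maxLength: number; // -1 when no maxlength attribute is set
  selectionStart: number | null;
  selectionEnd: number | null;
}
interface PasteLike {
  clipboardData: { getData(format: string): string } | null;
  preventDefault(): void;
}

// Keep only the whole grapheme clusters of `text` that fit in `budget` code units.
function fitGraphemes(text: string, budget: number): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  for (const { segment } of segmenter.segment(text)) {
    if (result.length + segment.length > budget) break;
    result += segment;
  }
  return result;
}

// Replace the current selection with as much of the pasted text as fits.
function handlePaste(input: InputLike, event: PasteLike): void {
  if (input.maxLength < 0) return; // no limit: leave the default paste alone
  const pasted = event.clipboardData?.getData("text/plain") ?? "";
  event.preventDefault();
  const start = input.selectionStart ?? input.value.length;
  const end = input.selectionEnd ?? start;
  const before = input.value.slice(0, start);
  const after = input.value.slice(end);
  const budget = Math.max(input.maxLength - before.length - after.length, 0);
  input.value = before + fitGraphemes(pasted, budget) + after;
}

// Demo with fake objects; the flag is assumed to be 5 code units long.
const flag = "\u{1F3F3}\uFE0F\u200D\u26A7";
const demoInput: InputLike = { value: "ab", maxLength: 7, selectionStart: 2, selectionEnd: 2 };
const demoPaste: PasteLike = { clipboardData: { getData: () => flag }, preventDefault: () => {} };

handlePaste(demoInput, demoPaste);
console.log(demoInput.value === "ab" + flag); // true: 2 + 5 code units fit in 7
```

In a page this would presumably be wired up with something like `input.addEventListener("paste", (e) => handlePaste(input, e))`. Setting `value` from script also moves the caret and does not fire an `input` event, so a real handler would likely need to deal with both.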

aphillips commented 2 years ago

I tend to agree with @annevk that this could be browser specific. It might also be useful for the browser to indicate to the user visually that truncation has occurred. For example, some emoji sequences can be really long. This family emoji (👨🏻‍👩🏼‍👧🏾‍👧🏿) has 11 code points (don't forget the skin tone modifiers and the joiners). With maxlength=7 and grapheme truncation, the input or paste would just appear to fail.

I think that specifying maxlength in terms of graphemes rather than code points would probably be best for end-users (that is, I think WebKit in #1467 is more user-friendly). However, this is probably not consistent with the expectations of page authors (if I said maxlength=7, I don't expect to get 7 x 11-code-point family emoji = 77 code points as input).

Note that while the example features emoji, this also affects languages that use combining marks to form e.g. syllables. For example, <input maxlength=5> and यूनिकोड results in यूनिक (the last conjunct should be को). While grapheme clusters in language don't tend to reach the absurd lengths that emoji sequences do, they can still be reasonably long (3-4 code points and rarely more), and truncating them in the middle damages the meaning. The definition of grapheme clusters in Unicode is imperfect (leading to various permutations, such as "extended" grapheme clusters, and ongoing work to fully describe cluster boundaries). @r12a can provide more detail.
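The different ways of measuring length discussed in this comment can be compared directly. The sketch below is illustrative only: `lengths` is a hypothetical helper, the strings are built from explicit escapes so the code points are unambiguous, and the grapheme count relies on `Intl.Segmenter`, whose results follow the Unicode data shipped with the runtime.

```typescript
// Report the length of a string in three different units.
function lengths(text: string): { codeUnits: number; codePoints: number; graphemes: number } {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length, // UTF-16 code units, what String.prototype.length reports
    codePoints: Array.from(text).length, // string iteration goes code point by code point
    graphemes: Array.from(segmenter.segment(text)).length, // extended grapheme clusters
  };
}

// Man, woman, girl, girl, each followed by a skin tone modifier, joined by U+200D.
const family =
  "\u{1F468}\u{1F3FB}\u200D\u{1F469}\u{1F3FC}\u200D\u{1F467}\u{1F3FE}\u200D\u{1F467}\u{1F3FF}";

// The Devanagari word यूनिकोड, spelled out code point by code point.
const word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921";

console.log(lengths(family)); // { codeUnits: 19, codePoints: 11, graphemes: 1 }
console.log(lengths(word)); // { codeUnits: 7, codePoints: 7, graphemes: 4 }
```

Which of the three numbers a limit is compared against decides whether the family emoji counts as 19, 11, or 1 toward it.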

domenic commented 2 years ago

Thanks @aphillips for the great reminder about how complicated the world of text is :).

For the HTML Standard, I guess the question is whether we say anything, and if so, how. I was thinking of expanding the existing text:

User agents may prevent the user from causing the element's API value to be set to a value whose length is greater than the element's maximum allowed value length.

by adding a paragraph or sentence like:

If user agents implement such a restriction, they should take special care in cases where multiple code units are entered at once, such as via pasting or using an input method editor. For example, if pasting यूनिकोड into an <input maxlength=5> field truncates the value to the first 5 code units, the result is यूनिक, but a more semantically correct (????) truncation would be यूनिको. Similarly, [... give examples of emoji situations ...]. The best user interface for such situations is not clear, so user agents might want to experiment and report back to the spec on what they find meets users' expectations.
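To make the trade-off in the proposed paragraph concrete, here is a minimal sketch of three possible truncation policies applied to यूनिकोड with a limit of 5 code units, the limit at which a plain code-unit cut yields यूनिक. The `Policy` type and the `truncate` function are hypothetical names for illustration; nothing here is specified behavior.

```typescript
type Policy = "code-units" | "round-down" | "round-up";

function truncate(text: string, max: number, policy: Policy): string {
  if (text.length <= max) return text; // nothing to trim
  if (policy === "code-units") return text.slice(0, max);

  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let result = "";
  for (const { segment } of segmenter.segment(text)) {
    if (result.length + segment.length > max) {
      // "round-up" keeps a cluster that starts inside the limit, even though
      // the value then exceeds it; "round-down" drops that cluster.
      if (policy === "round-up" && result.length < max) result += segment;
      break;
    }
    result += segment;
  }
  return result;
}

// यूनिकोड as explicit code points: four clusters of 2, 2, 2, and 1 code units.
const word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921";

console.log(truncate(word, 5, "code-units")); // यूनिक (5 code units, last cluster broken)
console.log(truncate(word, 5, "round-down")); // यूनि (4 code units)
console.log(truncate(word, 5, "round-up")); // यूनिको (6 code units, over the limit)
```

Rounding down never exceeds the limit but can make a paste appear to do nothing, while rounding up preserves the cluster at the cost of exceeding the author's stated maximum.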