whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Add hyphenation-hint as HTML-element #6326

Open OskarGrosser opened 3 years ago

OskarGrosser commented 3 years ago

Currently there is &shy; to hint at a possible line-break opportunity. But because &shy; is its own character (U+00AD SOFT HYPHEN), copying the underlying text copies that character as well.
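To make the copy problem concrete, here is a minimal TypeScript sketch. The helper name `stripSoftHyphens` is hypothetical and not part of any proposal; it only illustrates that the soft hyphen is a real character in the copied string and would have to be removed by hand.

```typescript
// U+00AD SOFT HYPHEN is the character that the &shy; entity decodes to.
const SOFT_HYPHEN = "\u00AD";

// Hypothetical helper: remove every soft hyphen from a string.
function stripSoftHyphens(text: string): string {
  return text.split(SOFT_HYPHEN).join("");
}

// "hy&shy;phen&shy;ation" as it exists in the text and on the clipboard.
const copied: string = "hy" + SOFT_HYPHEN + "phen" + SOFT_HYPHEN + "ation";

console.log(copied.length);                              // 13: 11 letters plus 2 soft hyphens
console.log(copied === "hyphenation");                   // false
console.log(stripSoftHyphens(copied) === "hyphenation"); // true
```

In a browser, a helper like this could be called from a `copy` event listener, but that is a page-side workaround rather than the markup-level solution asked for here.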

When &shy; is inside an element, it will not display a hyphen, even if it is at a line-break.
"Inside an element" includes:

This rules out the possibility of styling it to behave only as a line-break opportunity, without copying it.

Changing the behavior of &shy; is out of the question, as it has been in use for a long time and for specific purposes. Its definition has also been debated, but HTML and Unicode seem to have settled on one. Hence there is still a need to hint at line-break opportunities without the hint itself encoding a character. Since &shy; cannot be styled to mimic that behavior, another solution is required.

There is a similar case: U+200B ZERO WIDTH SPACE can be represented by <wbr> or &#x200B;. The former is unselectable, while the latter is the character itself.

I suggest, as in the zero-width space case, adding a complementary HTML element for &shy;. It should:

Basically be &shy; but unselectable, as in not part of the rendered text, like <wbr>.

r12a commented 3 years ago

I'm still working through this in my mind, but I'm beginning to think that it would be better to use the existing wbr markup, rather than create more, but make it stylable using CSS. (So this would become a CSS issue, rather than an HTML one.)

wbr is not specified in great detail, and it only appears to be associated with ZWSP because it currently doesn't produce a hyphen when the line breaks. I haven't discovered anything yet that prevents its use within words, as well as between them. And its sole function appears to be to indicate a line break opportunity.

One problem with using &shy; is that it's limited in use, unless the browser adds some smarts. It doesn't currently cope well with the following use cases:

  1. In some cases the spelling of a word needs to be changed around hyphenation, for example in Dutch cafeetje → café-tje and skiërs → ski-ers, and in Hungarian Összeg → Ösz-szeg.
  2. The symbol used to indicate that a word was broken at the end of a line is not always one that looks like a hyphen. Cree uses ᐀ [U+1400 CANADIAN SYLLABICS HYPHEN], Armenian uses ֊ [U+058A ARMENIAN HYPHEN], Balinese uses ᭠ [U+1B60 BALINESE PAMENENG], etc.
  3. The location of the mark is not always at the end of the line – some languages put it at the start of the following line.
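The three cases above can be modeled with a small data structure. The following TypeScript sketch is only an illustration: every type and function name in it (`BreakOpportunity`, `HyphenStyle`, `renderBroken`, and so on) is an assumption made for the example, not an existing browser or CSS API.

```typescript
// Where the visible mark goes when a break is taken (case 3).
type MarkPosition = "end-of-line" | "start-of-next-line";

// One break opportunity inside a word (all names here are illustrative).
interface BreakOpportunity {
  head: string;        // text before the break point when the word is not broken
  tail: string;        // text after the break point when the word is not broken
  brokenHead?: string; // spelling of the head when the break is taken (case 1)
  brokenTail?: string; // spelling of the tail when the break is taken (case 1)
}

interface HyphenStyle {
  mark: string;        // "-", "\u058A" (Armenian), "\u1400" (Cree), or "" (case 2)
  position: MarkPosition;
}

function renderUnbroken(b: BreakOpportunity): string {
  return b.head + b.tail;
}

// Returns [end of the first line, start of the next line].
function renderBroken(b: BreakOpportunity, style: HyphenStyle): [string, string] {
  const head = b.brokenHead !== undefined ? b.brokenHead : b.head;
  const tail = b.brokenTail !== undefined ? b.brokenTail : b.tail;
  if (style.position === "end-of-line") {
    return [head + style.mark, tail];
  }
  return [head, style.mark + tail];
}

const latin: HyphenStyle = { mark: "-", position: "end-of-line" };

// The Dutch and Hungarian examples from the list above.
const cafeetje: BreakOpportunity = { head: "cafee", tail: "tje", brokenHead: "café" };
const skiers: BreakOpportunity = { head: "ski", tail: "ërs", brokenTail: "ers" };
const osszeg: BreakOpportunity = { head: "Ös", tail: "szeg", brokenHead: "Ösz" };

for (const word of [cafeetje, skiers, osszeg]) {
  console.log(renderUnbroken(word), "->", renderBroken(word, latin).join(" | "));
}
```

A plain &shy; corresponds to the simplest instance of this model: no spelling change, a fixed "-" mark, and the mark always at the end of the line.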

My expectation is that when a browser gets around to providing support for hyphenation for a given language, and when the content author sets hyphens:auto in CSS, the browser would automatically apply rules for hyphenation related to the language of the text, such that all of the above variations are addressed.

It seems to me, then, that the browser could do the same when a wbr tag appears inside a word. The key starting point, in all cases, is to know where the break point opportunity lies (which is what the wbr tag does), and what the language is.

So as not to break legacy content, it would probably be necessary to continue to produce no hyphenation behaviour by default for wbr, other than a line-break. However, CSS could be used to activate the browser smarts so that the full hyphenation behaviour is produced automatically by the browser.

We could also go further and give authors CSS properties that would allow them to style the result of using wbr, at least to some extent, in the absence of browser smarts. For example, we could allow authors to indicate what character should be used for the mark (or that no mark should be used) by styling the wbr tag. This may help where hyphenation is not yet implemented for a browser+OS+language combination. For example, it would allow someone authoring Plains Cree to specify that the mark to use is ᐀ [U+1400 CANADIAN SYLLABICS HYPHEN], and thereby allow some degree of manual hyphenation to occur for Cree, well before the browser gets around to implementing the necessary dictionaries or rules required by hyphens:auto.

It would also allow authors to do the equivalent of &shy; for a language like Telugu, which browsers don't currently hyphenate, but which has complicated morphology and long words, and needs hyphenation (using the typical '-'). In this case, the advantage of using wbr instead of &shy; would be that, as you originally wanted, the break points wouldn't be copied with the text.

r12a commented 3 years ago

cc @fantasai @frivoal

frivoal commented 3 years ago

I don't think <wbr> could serve that purpose as is without breaking the purpose it currently serves, as there's no way to tell the difference between <wbr> in the middle of a word and <wbr> separating two words. However, I suspect we could make it gain this new ability via an attribute.

If the attribute is absent, <wbr> behaves like a zero-width space, as it does today. If the attribute is present, <wbr> would behave as a soft hyphen, and the value of the attribute would tell the browser which character to inject when breaking the line, so that it can do the right thing in languages it doesn't know how to handle. Let's call the attribute hyphen:
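The semantics described above can be modeled roughly as follows. This is a sketch under stated assumptions: the `hyphen` attribute exists only as a proposal in this thread, the `Wbr` type and `breakAtWbr` function are hypothetical names, and placing the mark at the end of the first line is a simplification.

```typescript
// Model of a <wbr> element: `hyphen` is null when the attribute is absent.
interface Wbr {
  hyphen: string | null;
}

// Returns [end of the first line, start of the next line] when the break is taken.
function breakAtWbr(before: string, after: string, wbr: Wbr): [string, string] {
  if (wbr.hyphen === null) {
    // No attribute: behaves like a zero-width space, so nothing is injected.
    return [before, after];
  }
  // Attribute present: behaves like a soft hyphen and injects the attribute value.
  return [before + wbr.hyphen, after];
}

const plain: Wbr = { hyphen: null };         // <wbr>
const latinHyphen: Wbr = { hyphen: "-" };    // <wbr hyphen="-">
const armenian: Wbr = { hyphen: "\u058A" };  // <wbr hyphen="֊"> (U+058A ARMENIAN HYPHEN)

console.log(breakAtWbr("line", "break", plain).join(" | "));         // "line | break"
console.log(breakAtWbr("hyphen", "ation", latinHyphen).join(" | ")); // "hyphen- | ation"
console.log(breakAtWbr("a", "b", armenian)[0].length);               // 2
```

In either case nothing is added to the text when the line does not break, and no character is stored in the document, which is the property the original request asked for.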

r12a commented 3 weeks ago

I wouldn't want to have to add <wbr hyphen="᭠"> to every word that needs a soft hyphen. Apart from the bulk, which would affect readability of the source code, if you wanted to use a different hyphen character you'd have to edit the HTML source for all the documents where this was used. Specifying the expected appearance using a line of CSS would provide much more flexibility, e.g. for translations, where a different hyphen would be needed and the change could be made by altering a single line of CSS code.