whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Reassess the fallback encoding for Greek #4558

Open hsivonen opened 5 years ago

hsivonen commented 5 years ago

In the table under step 8 of https://html.spec.whatwg.org/#determining-the-character-encoding , the fallback encoding suggested for Greek is ISO-8859-7, which comes from early Mozilla localizations and was later adopted by Safari and Chrome.

The notable difference between ISO-8859-7 and windows-1253, the old IE fallback, is the byte used for Ά. (Lore has it that windows-1253 reassigned Ά in order to keep the pilcrow sign allocated to the same byte in windows-1253 as in windows-1252 so that Word didn't need to change its hard-coding of the pilcrow sign.)
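To make the byte-level difference concrete, here is a minimal sketch using Python's built-in codecs. It assumes that Python's `iso8859_7` and `cp1253` codecs are close enough to the Encoding Standard's ISO-8859-7 and windows-1253 indexes for illustration; they may differ for some bytes (for example in the 0x80-0x9F range). The helper names `differing_bytes` and `misdecode` are made up for this example.

```python
# Illustrative only: compare how the two Greek legacy encodings map single bytes.
# "iso8859_7" and "cp1253" are Python's codec names for ISO-8859-7 and windows-1253;
# they are assumed here to approximate the Encoding Standard's indexes.

def differing_bytes():
    """Return {byte: (iso_char, win_char)} for high bytes the two codecs decode differently."""
    diffs = {}
    for b in range(0x80, 0x100):
        raw = bytes([b])
        # errors="replace" turns unassigned bytes into U+FFFD instead of raising.
        iso = raw.decode("iso8859_7", errors="replace")
        win = raw.decode("cp1253", errors="replace")
        if iso != win:
            diffs[b] = (iso, win)
    return diffs


def misdecode(text):
    """Encode text as ISO-8859-7, then decode the bytes as windows-1253."""
    return text.encode("iso8859_7").decode("cp1253", errors="replace")


if __name__ == "__main__":
    for b, (iso, win) in sorted(differing_bytes().items()):
        print(f"0x{b:02X}: ISO-8859-7={iso!r} windows-1253={win!r}")
    # A word starting with Ά, as it would look if ISO-8859-7 bytes
    # were decoded as windows-1253.
    print(misdecode("Άρθρο"))
```

With these codecs, byte 0xB6 is Ά in ISO-8859-7 but the pilcrow sign in windows-1253, which puts Ά at 0xA2 instead. An ISO-8859-7 page decoded as windows-1253 therefore shows a pilcrow wherever a word begins with Ά.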

These days, when presented with legacy-encoded unlabeled Greek text, Chromium (including Chrome and the new Edge) uses content-based detection that appears to pick windows-1253 whenever the text looks Greek enough, without trying to distinguish further between ISO-8859-7 and windows-1253. This means that Chromium has effectively changed its alignment from Firefox and Safari to IE. (Unlabeled ISO-8859-7-encoded test case that Chromium decodes as windows-1253.)

It seems appropriate to find out what the new Chromium behavior is based on, e.g. whether Google knows from Web crawling that unlabeled Greek legacy content is more often windows-1253 than ISO-8859-7. If windows-1253 is more common than ISO-8859-7 for unlabeled content, the spec should change its advice.

hsivonen commented 5 years ago

@jungshik Is the Chromium detector's bias towards windows-1253 based on evidence from Web crawls?

hsivonen commented 5 years ago

(FWIW, when the autodetector in IE11 is enabled, it, too, uses windows-1253 for unlabeled ISO-8859-7 content. Both detectors do support the ISO 8859 series in the Central European and Cyrillic cases, so neither Chromium nor IE is restricted to Windows encodings in general.)

littledan commented 5 years ago

cc @frankyftang

hsivonen commented 5 years ago

cc @JinsukKim

hsivonen commented 5 years ago

There appear to be people complaining about the Google detector's results on ISO-8859-family input.