Open hsivonen opened 5 years ago
@jungshik Is the Chromium detector's bias towards windows-1253 based on evidence from Web crawls?
(FWIW, when the autodetector in IE11 is enabled, it, too, uses windows-1253 for unlabeled ISO-8859-7 content. Both detectors do support the ISO 8859 series in the Central European and Cyrillic cases, so neither Chromium nor IE is restricted to Windows encodings in general.)
cc @frankyftang
cc @JinsukKim
It seems that there are people complaining about the Google detector's results on ISO-8859-family input.
In the table under step 8 of https://html.spec.whatwg.org/#determining-the-character-encoding , the fallback encoding suggested for Greek is ISO-8859-7, which comes from early Mozilla localizations and was adopted by Safari and Chrome.
The notable difference between ISO-8859-7 and windows-1253, the old IE fallback, is the byte used for Ά: 0xB6 in ISO-8859-7 versus 0xA2 in windows-1253, where 0xB6 is the pilcrow sign instead. (Lore has it that windows-1253 reassigned Ά in order to keep the pilcrow sign allocated to the same byte in windows-1253 as in windows-1252 so that Word didn't need to change its hard-coding of the pilcrow sign.)
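To make the byte difference concrete, here is a minimal sketch using Python's built-in codecs (`iso8859_7` and `cp1253`). The sample word and the helper name `decode_both` are only for illustration and are not taken from any browser's code.

```python
# Python's standard codec names for the two encodings under discussion.
ISO = "iso8859_7"   # ISO-8859-7
WIN = "cp1253"      # windows-1253

def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as ISO-8859-7 and as windows-1253."""
    return raw.decode(ISO), raw.decode(WIN)

# A Greek word starting with Ά (U+0386), encoded as ISO-8859-7.
sample = "Άρθρο".encode(ISO)

iso_text, win_text = decode_both(sample)
print(sample[:1].hex())  # the byte that ISO-8859-7 uses for Ά
print(iso_text)          # decoded with the encoding it was written in
print(win_text)          # same bytes read as windows-1253: Ά turns into a pilcrow
```

Apart from this byte and a handful of punctuation and symbol positions, the Greek letters occupy the same bytes in both encodings, which is why a mislabeled page is still mostly readable.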
These days, when presented with legacy-encoded unlabeled Greek text, Chromium (including Chrome and the new Edge) uses content-based guessing that appears to guess windows-1253 if the text looks Greek enough without refining the guess further between ISO-8859-7 and windows-1253. This means that Chromium has effectively changed its alignment from Firefox and Safari to IE. (Unlabeled ISO-8859-7-encoded test case that Chromium decodes as windows-1253.)
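For illustration only, a refinement step between the two encodings could look roughly like the sketch below. This is a hypothetical heuristic, not what Chromium, Firefox, or IE actually do: the function name `refine_greek_guess`, the voting rule, and the tie-breaking default are all assumptions. It relies on the fact that the unaccented lowercase Greek letters occupy bytes 0xE1 to 0xF9 in both encodings.

```python
# Hypothetical heuristic: decide between ISO-8859-7 and windows-1253 by
# looking at which candidate byte for Ά appears directly before a
# lowercase Greek letter. Not taken from any real detector.

# Unaccented lowercase Greek letters (α..ω) share these bytes in both encodings.
GREEK_LOWER = range(0xE1, 0xFA)

ISO_ALPHA_TONOS = 0xB6  # Ά in ISO-8859-7 (pilcrow in windows-1253)
WIN_ALPHA_TONOS = 0xA2  # Ά in windows-1253 (right single quote in ISO-8859-7)

def refine_greek_guess(data: bytes) -> str:
    """Return a label for the more plausible of the two Greek encodings."""
    iso_votes = 0
    win_votes = 0
    for current, following in zip(data, data[1:]):
        if following not in GREEK_LOWER:
            continue
        if current == ISO_ALPHA_TONOS:
            iso_votes += 1
        elif current == WIN_ALPHA_TONOS:
            win_votes += 1
    # Assumption: with no evidence either way, keep the windows-1253 default
    # that the thread describes as the current Chromium behavior.
    return "ISO-8859-7" if iso_votes > win_votes else "windows-1253"

print(refine_greek_guess("Άρθρο".encode("iso8859_7")))
print(refine_greek_guess("Άρθρο".encode("cp1253")))
```

A real detector would need to weigh more signals, since 0xA2 followed by a Greek letter can also be an apostrophe in ISO-8859-7 text; the sketch only shows that the Ά byte gives some evidence to work with.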
It seems appropriate to find out what the new Chromium behavior is based on, for example whether Google knows from Web crawling that unlabeled Greek legacy content is more often windows-1253 than ISO-8859-7. If windows-1253 is more common than ISO-8859-7 for unlabeled content, the spec should change its advice.