wixtoolset / issues

WiX Toolset Issues Tracker
http://wixtoolset.org/

{iso639-1}[-{iso15924}][-iso3166-1 alpha-2] for Locale Identification #7896

Open hollowaykeanho opened 9 months ago

hollowaykeanho commented 9 months ago

Feature requests

If this issue is a feature request:

When packaging with WiX 4 with UI support, I found that the internationalization feature uses its own language identification format, shown in https://wixtoolset.org/docs/tools/wixext/wixui/#localization. Although this does not impede innovation, it would be better to adopt the common {iso639-1}[-{iso15924}][-iso3166-1 alpha-2] format, which is used in many industries, especially in web and network applications.

This has 2 benefits:

  1. No need to guess or refer to another table just to identify a language + location.
  2. No need to maintain another ID table set (which is a huge effort when it covers all the known probable combinations).

The ID format is:

{iso639-1}[-{iso15924}][-iso3166-1]
{language_code}[-language_variant][-country_code]

NOTE:
1. All must be in lowercase.
2. Only dash (-) is used for separation.

LEGEND:
1. { … } - Compulsory Element
2. [ … ] - Optional Element

Examples:

  1. en – International English.
  2. en-us – English (United States).
  3. fr – International French.
  4. fr-ca – Canadian French.
  5. zh-hans – International Simplified Chinese.
  6. zh-hant – International Traditional Chinese.
  7. zh-hans-tw – Taiwanese Simplified Chinese.
  8. zh-hans-hk – Hong Kong SAR Simplified Chinese.
  9. zh-hans-cn – Mainland China Simplified Chinese.
  10. zh-hant-tw – Taiwanese Traditional Chinese.
  11. zh-hant-hk – Hong Kong SAR Traditional Chinese.
  12. zh-hant-cn – Mainland China Traditional Chinese.
  13. zh-hans-us – United States Simplified Chinese.
  14. zh-hant-us – United States Traditional Chinese.
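The format described above can be sketched as a small parser. This is a minimal illustration in Python, not part of WiX; the names `LocaleTag` and `parse_locale_tag` are hypothetical, and the pattern assumes a 2-letter language code, an optional 4-letter script code, and an optional 2-letter country code, all lowercase.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of the proposed format: language, optional script,
# optional country, all lowercase, separated by dashes only.
_TAG = re.compile(
    r"(?P<language>[a-z]{2})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<country>[a-z]{2}))?"
)


class LocaleTag(NamedTuple):
    language: str
    script: Optional[str]
    country: Optional[str]


def parse_locale_tag(tag: str) -> LocaleTag:
    """Split a tag such as 'zh-hans-tw' into its parts, or raise ValueError."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid locale tag: {tag!r}")
    return LocaleTag(match["language"], match["script"], match["country"])


print(parse_locale_tag("zh-hans-tw"))
print(parse_locale_tag("fr-ca"))
```

Because the script part is always 4 letters and the country part always 2, the optional elements can be told apart without a lookup table.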

References

  1. https://developers.google.com/search/docs/specialty/international/localized-versions
  2. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
  3. https://unicode.org/iso15924/iso15924-codes.html
  4. https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2

This request concerns only the ID values at WiX's wxs layer, where we developers use WiX.

It is understandable that the backend MSI format is dictated by Microsoft, so the underlying low-level layers should not be affected (help us help you).

barnson commented 8 months ago

The localizations that ship with WiX use the then-current ids as used by Windows and .NET. Other ids could be used, as long as backward compatibility is maintained for culture names and resulting code pages.

BMurri commented 8 months ago

IIRC, .NET has accepted (for decades) the culture format proposed in this issue, so one could close this issue with no changes and call it completed, but adding a link in the documentation to this page might be helpful: https://learn.microsoft.com/dotnet/api/system.globalization.cultureinfo?view=net-8.0#culture-names-and-identifiers

hollowaykeanho commented 8 months ago

Thanks @BMurri. Quoting:

A culture name that includes a script uses the pattern languagecode2-scripttag-country/regioncode2.

I will give it a try before closing the case, since I already have the MSI CI up (notably en and zh-hans in lowercase). However, we cannot avoid an LCID translation table, am I right?

BMurri commented 8 months ago

Dotnet's parsing of culture names is always case-insensitive, but if you require everything to be lowercase, case might not round-trip. Good thing Windows filesystems are (usually) case-insensitive!

Dotnet's CultureInfo has an LCID property, so LCID is covered.

Note that for bundles, none of what follows applies. Bundles use Unicode internally. What follows applies to Windows Installer files (no matter how created).

The hard one is the "code-page" (also this) that's required in every Windows Installer file type (like MSI). There are two places where codepages are used: the database table fields and the Summary Information, and each has its own restrictions.

The codepage(s) selected cannot be 16-bit (1200, 1201) or 32-bit (12000, 12001) Unicode encodings (but can be DBCSs), and must be supported on the Windows system that you (or your customers) deploy on (support has historically varied by version, edition, and optional add-ons). You cannot use any character that isn't in the selected codepage (you will get an error while building).
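To check ahead of a build whether a localized string fits a given codepage, something like the following sketch could be used. It relies on Python's built-in codecs (`cp1252`, `cp936`) as stand-ins for the Windows codepages, which is an assumption: they may differ from what Windows Installer accepts in edge cases. The function name `unencodable_chars` is hypothetical.

```python
from typing import List


def unencodable_chars(text: str, codec: str) -> List[str]:
    """Return the characters of `text` that `codec` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Plain English text fits the Western European codepage.
print(unencodable_chars("Installer", "cp1252"))  # []
# Chinese text does not fit it...
print(unencodable_chars("石头", "cp1252"))  # ['石', '头']
# ...but fits the Simplified Chinese DBCS codepage.
print(unencodable_chars("石头", "cp936"))  # []
```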

For the database portion of the output file, "UTF-8" (65001) can be used (and in fact must be used for many of the cultures added to Windows after Windows 7) but doing so when another codepage is historically associated with the related culture has often resulted in weird font issues (Windows Installer still uses ~25 year-old APIs that predate most Unicode support). IIRC the docs say Unicode isn't supported.

For the SummaryInformation record (which, among other things, produces all of the textual "File Properties" information except for the filename) no version of Unicode works (when I discovered that about 15 years ago, it would build but either wouldn't display anything or would display gibberish).

In Windows Installer you can use a value of 0 for codepage, but it means 7-bit ASCII. When I managed projects that included Unicode-only locales, I mandated that the Summary Information was always in English for all packages, to deal with that difference in codepage support.

Given the general lack of non-security changes in Windows Installer in the past couple of decades I don't think the behaviors I've described above have changed.

hollowaykeanho commented 8 months ago

Just came back with test results. Nope. It fails badly. These are the results and artifacts from complying with the strict .NET codes:


Control sample

  1. Build log from CI (specifically MSI package) - https://github.com/wixtoolset/issues/files/13763003/msi_automataci-msi_any-any.log
  2. Actual Test Run - https://github.com/corygalyna/AutomataCI/actions/runs/7314959938
  3. Generated WXS by CI automation
    1. en-US - https://github.com/wixtoolset/issues/files/13763007/automataci_1.7.0_en-US_windows-amd64.txt
    2. zh-CN - https://github.com/wixtoolset/issues/files/13763009/automataci_1.7.0_zh-CN_windows-amd64.txt

Build status: success



Actual Experiment

  1. Build log from CI (specifically MSI package) - https://github.com/wixtoolset/issues/files/13763018/msi_automataci-msi_any-any.log
  2. Actual Test Run - https://github.com/corygalyna/AutomataCI/actions/runs/7317290928/job/19932655672
  3. Generated WXS by CI automation
    1. en-us - https://github.com/wixtoolset/issues/files/13763021/automataci_1.7.0_en-us_windows-amd64.txt
    2. zh-hans - https://github.com/wixtoolset/issues/files/13763024/automataci_1.7.0_zh-hans_windows-amd64.txt

Build status: failed


What we can learn from this run is that:

  1. it can be lowercase; AND
  2. The specific language code has to be used (e.g. en-us, with a country code, instead of en). This is why zh-CN passed and zh-hans failed. CN means China, but Mandarin speakers can be from anywhere, not just China (e.g. Taiwan TW, Singapore SG, or in my case, Malaysia MY).

I'm aware of the requirement to maintain backward compatibility at the compiled MSI layer, which is why I filed this against the WiX layer only. To support i18n right now, a dedicated table must be created (see: https://github.com/ChewKeanHo/AutomataCI/blob/experimental/src/.ci/_package-msi_windows-any.ps1) just to interact properly (notice that I have to maintain $__i18n, $__var_LANGUAGE_ID, and $_wxs for a single language). This is only seen with the MSI packager. Other packaging systems do not have this problem.

Language shouldn't be geo-fenced without an apparent trade or cultural reason (usually a political embargo or a specific target market). Here's a case study: 'stone' in International Mandarin (zh-hans) is 石头 or 石子, which CN, TW, SG, and MY all recognize. However, in MY, certain regions also accept "Ba-tu" due to local cultural fusion, which makes folks from CN, TW, and some from SG scratch their heads.

I don't think we should mark this as done.

BMurri commented 7 months ago

The actual error messages are no longer available (⦗ ERROR ⦘ package failed - ... is useless, I don't even have the command-line passed to the wix tool anywhere). There's no difference between the two zh WXS files (apart from component guids changing from build-to-build, which I'm assuming is deliberate).

I'm curious about permutations like zh-hans-cn (which should definitely work), en, zh-hant-cn, zh-hans-tw, etc.

As far as I can tell, the LCIDs need to end up being one of the "hex without prefix" names in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Locale (or HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Locale\Alternate Sorts) registry key. None of those values are neutral languages. So, the issue isn't necessarily WiX or .Net, it's likely Windows Installer. However, there should be something we could do here to make sensible "defaults" when neutral languages are requested that don't break the system (Windows itself by default uses some locales as fallbacks for others, even when they are not neutral locales).

robmen commented 7 months ago

I think the comments are going into unnecessary technical detail.

Many .wxl files are available in WiX. For example, you can see the WiX UI .wxl files in this folder: https://github.com/wixtoolset/wix/tree/develop/src/ext/UI/wixlib

Those files have culture identifiers and codepages in them. I understand some of the .wxl files are not using the most modern culture identifiers. That's okay. We should keep the existing files with their current names for backward compatibility reasons. (If there are translation mistakes in those files, we would appreciate fixes by people who speak those languages).

Therefore, the request is to add more .wxl files with "more modern" culture identifiers and (presumably) more translations. That would be great, except that @barnson and I are only proficient in en-US so others must step up.

BMurri commented 7 months ago

I just did an experiment of my own. Taking a WXS with no reference to the UI extension and no Package/@Language attribute, passing a -culture of en to the wix tool, I looked at the summary information and the ProductLanguage property and both were "9".

I ran the MSI and there were no problems or other weirdness. I added a reference to the UI extension and added a UI set. Rebuilt. Got errors about not having several localization variable values. I downloaded the wxl file for English from https://github.com/wixtoolset/wix/blob/develop/src/ext/UI/wixlib/WixUI_en-us.wxl, changed the WixLocalization/@Culture to "en", removed the ExtensionDefaultCulture attribute (since the file is no longer in an extension), and rebuilt.

I looked at the summary information and the ProductLanguage property and both were "9". I ran the MSI and no problems (everything showed up as expected and looked great).

So, Windows Installer has no problems with neutral languages. With some testing, we could confirm that if we use neutral locales in the localization files in the extensions while using "specific locales" everywhere else everything will just continue working as it does today.

@robmen I'm interested in doing that experiment and writing a WIP as well as a PR including tests to verify backward compatibility. I don't care if it ends up in v5 or v6, personally, but having done quite a bit of globalization/localization work over the years, and being near-native fluent in 2 languages (with varying degrees of exposure to several others) I know that I can get this working.

@hollowaykeanho The -culture argument is only needed when the build has multiple cultures within the scope of the build files presented to filter them down to just one. You won't need to pass it with the way you are integrating WiX into your CI system.

robmen commented 7 months ago

@BMurri I'm not sure what you're thinking so a feature request seems like a good place to start.

Unless it's a really small change it won't make v5 but there is always v6.

BMurri commented 7 months ago

@robmen This issue already is a feature request. Would you entertain a PR (in wixtoolset/web, where the other WIPs are) comprising a WIP for this feature request? I'd love to engage both you and @hollowaykeanho in refining that WIP before I go too far down this particular rabbit hole.

hollowaykeanho commented 7 months ago

Hi all, sorry for the late reply, I am busy with a customer's project =.=".

The -culture argument is only needed when the build has multiple cultures within the scope of the build files presented to filter them down to just one. You won't need to pass it with the way you are integrating WiX into your CI system.

Thanks for the input. I'll run a test and get back to you ASAP. So the difference would be removing the -culture argument.

Side-note: the current implementation is definitely working and if your finding works, I will make it an enhancement feature on my side (don't wish to block the nearest release).

the request is to add more .wxl files with "more modern" culture identifiers and (presumably) more translations.

The request is more about using the language and country code fragments correctly, not about "adding a feature". At the moment, country codes are being misused: even Global English (en) is assumed to be United States English (en-US), which is entirely different from United Kingdom English (en-GB) or, worse, Singlish (en-SG).

None of those values are neutral languages. So, the issue isn't necessarily WiX or .Net, it's likely Windows Installer.

Yeah. Definitely MSFT Windows' problem. I'm aware that Microsoft was the only one spearheading i18n features back in the 1990s. To keep things backward compatible, I think an interpreter table would make sense:

  1. The WiX customer layer feeds in existing, proper language codes.
  2. The WiX packager uses the table to determine the closest Windows Installer LCID and fonts.
  3. Generated output.
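The interpreter table idea can be sketched as a lookup with fallback. This is only an illustration in Python, not how WiX works today: the table is a tiny subset using LCID values quoted in this thread, the name `closest_lcid` is hypothetical, and the "closest" rule assumed here is simply dropping trailing subtags until a match is found. Fonts and codepages are not covered.

```python
from typing import Optional

# Illustrative subset only; LCID values as quoted in this thread.
LCID_TABLE = {
    "en": 0x0009,
    "en-us": 0x0409,
    "zh-hans": 0x0004,
    "zh-cn": 0x0804,
    "zh-hant": 0x7C04,
    "zh-tw": 0x0404,
}


def closest_lcid(tag: str) -> Optional[int]:
    """Try the full tag first, then drop trailing subtags until one matches."""
    parts = tag.lower().split("-")
    while parts:
        lcid = LCID_TABLE.get("-".join(parts))
        if lcid is not None:
            return lcid
        parts.pop()
    return None


print(hex(closest_lcid("en-us")))       # 0x409
print(hex(closest_lcid("en-sg")))       # 0x9 (falls back to en)
print(hex(closest_lcid("zh-hans-my")))  # 0x4 (falls back to zh-hans)
print(closest_lcid("fr"))               # None (not in this subset)
```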

BMurri commented 2 months ago

I want to fix this issue

robmen commented 2 months ago

@BMurri this issue still needs a WIP. I don't know what the proposed fix is. Open a new issue to create a WIP and feel free to reference this issue as background information.

hollowaykeanho commented 2 months ago

hi @BMurri , sorry for the hold up... I'm currently working on refactoring the pipeline on my side.

hollowaykeanho commented 2 months ago

Hi @BMurri, I was building this library and auditing its datasets for the last 2 months to make sure I'm not straying too far into fantasy. The algorithms are unit-tested. The caveat is that it's currently in TypeScript and has yet to be translated into C/C++ and other languages.

hestiaLOCALE.zip

I'd like to point out that the following support my case:

  1. In data_language.ts, the data structure for each language (iso639-1) can be scanned for its associated optional scripts ([-{iso15924}]). Some good examples are: en (English), zh (Chinese), ko (Korean), ar (Arabic).
  2. The country is independent of the previous 2, but its official language is backward traceable.
  3. The general "Parse" and "To_String" functions are inside Vanilla.ts.

The next step is integrating Microsoft's LCID into it for backward compatibility. As for how, I'm still looking into it.


For the CI part, I'm currently trying to simplify the job recipes because the implementations for UNIX and Windows are way too different.

hollowaykeanho commented 2 months ago

Hi @BMurri, I refactored my CI implementations and investigated this issue on my side.

The -culture argument is only needed when the build has multiple cultures within the scope of the build files presented to filter them down to just one. You won't need to pass it with the way you are integrating WiX into your CI system.

So far, no luck without using -culture argument even when the Language property is present (Ref: https://stackoverflow.com/questions/75879120/package-languages-in-wix-4 and https://wixtoolset.org/docs/tools/wixext/wixui/#translated-strings).

Is this a bug?


Case I: without using -culture argument but use en (LCID: 9)

Result: FAILED - have to fill in all manually


Case II: without using -culture argument but trick the compiler to use en-us (LCID: 1033) for en (LCID: 9)

Result: FAILED - have to fill in all manually


Case III: using -culture argument and en (LCID: 9)

Result: SUCCESS


hollowaykeanho commented 2 months ago

Regarding this issue, I went through Microsoft's LCID identifiers and they DO support {iso639-1}[-{iso15924}] identification (see the table in section 2.2, "Language ID (2-bytes)", at https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f).

In fact, the current list in https://wixtoolset.org/docs/tools/wixext/wixui/#translated-strings would supposedly map as follows:

  1. ar-SA (LCID: 0x0401) -> ar (LCID: 0x0001)
  2. bg-BG (LCID: 0x0402) -> bg (LCID: 0x0002)
  3. ca-ES (LCID: 0x0403) -> ca (LCID: 0x0003)
  4. cs-CZ (LCID: 0x0405) -> cs (LCID: 0x0005)
  5. da-DK (LCID: 0x0406) -> da (LCID: 0x0006)
  6. de-DE (LCID: 0x0407) -> de (LCID: 0x0007)
  7. el-GR (LCID: 0x0408) -> el (LCID: 0x0008)
  8. en-US (LCID: 0x0409) -> en (LCID: 0x0009)
  9. es-ES (LCID: 0x0C0A) -> es (LCID: 0x000A)
  10. et-EE (LCID: 0x0425) -> et (LCID: 0x0025)
  11. fi-FI (LCID: 0x040B) -> fi (LCID: 0x000B)
  12. fr-FR (LCID: 0x040C) -> fr (LCID: 0x000C)
  13. he-IL (LCID: 0x040D) -> he (LCID: 0x000D)
  14. hi-IN (LCID: 0x0439) -> hi (LCID: 0x0039)
  15. hr-HR (LCID: 0x041A) -> hr (LCID: 0x001A)
  16. hu-HU (LCID: 0x040E) -> hu (LCID: 0x000E)
  17. it-IT (LCID: 0x0410) -> it (LCID: 0x0010)
  18. ja-JP (LCID: 0x0411) -> ja (LCID: 0x0011)
  19. kk-KZ (LCID: 0x043F) -> kk (LCID: 0x003F)
  20. ko-KR (LCID: 0x0412) -> ko (LCID: 0x0012)
  21. lt-LT (LCID: 0x0427) -> lt (LCID: 0x0027)
  22. lv-LV (LCID: 0x0426) -> lv (LCID: 0x0026)
  23. nb-NO (LCID: 0x0414) -> nb (LCID: 0x7C14)
  24. nl-NL (LCID: 0x0413) -> nl (LCID: 0x0013)
  25. pl-PL (LCID: 0x0415) -> pl (LCID: 0x0015)
  26. pt-BR (LCID: 0x0416), pt-PT (LCID: 0x0816) -> pt (LCID: 0x0016)
  27. ro-RO (LCID: 0x0418) -> ro (LCID: 0x0018)
  28. ru-RU (LCID: 0x0419) -> ru (LCID: 0x0019)
  29. sk-SK (LCID: 0x041B) -> sk (LCID: 0x001B)
  30. sl-SI (LCID: 0x0424) -> sl (LCID: 0x0024)
  31. sq-AL (LCID: 0x041C) -> sq (LCID: 0x001C)
  32. sr-Latn-RS (LCID: 0x081A) -> sr (LCID: 0x7C1A)
  33. sv-SE (LCID: 0x041D) -> sv (LCID: 0x001D)
  34. th-TH (LCID: 0x041E) -> th (LCID: 0x001E)
  35. tr-TR (LCID: 0x041F) -> tr (LCID: 0x001F)
  36. uk-UA (LCID: 0x0422) -> uk (LCID: 0x0022)
  37. zh-CN (LCID: 0x0804), zh-SG (LCID: 0x1004) -> zh-Hans (LCID: 0x0004), zh (LCID: 0x7804)
  38. zh-HK (LCID: 0x0C04), zh-TW (LCID: 0x0404), zh-MO (LCID: 0x1404) -> zh-Hant (LCID: 0x7C04)
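Most rows in the list above follow one pattern: the low 10 bits of the language identifier are the primary language ID and the next 6 bits are the sublanguage ID. A minimal Python sketch of that split (the function names are hypothetical) shows both where the pattern holds and where it does not:

```python
PRIMARY_LANGUAGE_MASK = 0x03FF  # low 10 bits of a language identifier


def primary_language_id(lcid: int) -> int:
    """Return the primary language ID (bits 0-9)."""
    return lcid & PRIMARY_LANGUAGE_MASK


def sublanguage_id(lcid: int) -> int:
    """Return the sublanguage ID (bits 10-15)."""
    return (lcid >> 10) & 0x3F


print(hex(primary_language_id(0x0409)))  # 0x9, matches en (0x0009)
print(hex(primary_language_id(0x0804)))  # 0x4, matches zh-Hans (0x0004)
print(hex(primary_language_id(0x0414)))  # 0x14, but nb is listed as 0x7C14
print(hex(primary_language_id(0x0404)))  # 0x4, but zh-TW maps to zh-Hant (0x7C04)
```

Masking alone therefore does not reproduce every row (nb, sr, and zh-Hant are exceptions in the list), which suggests that a lookup table is still needed for those cases.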

For backward compatibility purposes, I propose grouping each line as 1 language pack, since MSFT LCID does not micro-manage the scripts (e.g. ja-Jpan, ja-Hira for Hiragana, and ja-Kana for Katakana). Then, in the future, you can orient your translations towards the base language symbol (e.g. ja).

Note that for zh, I can only confirm that zh-CN and zh-SG are indeed using the zh-Hans script. I assume Hong Kong, Taiwan, and Macao are still using the zh-Hant script. Community-wise, younger generations tend to skew towards zh-hans while older generations lean towards zh-hant (even in China and Singapore). Politics (e.g. Taiwan vs China) can be complicated. That's why zh-Hant and zh-Hans are far better than the location versions and the generic version (zh).


I'm curious about permutations like zh-hans-cn (which should definitely work), en, zh-hant-cn, zh-hans-tw, etc.

The number of permutations is quite big, since language and country are independent of each other.

Maths-wise, it's a combination, and each language has associated scripts. If we insist on assigning an LCID to every language in its scripted form, for each country, it would be something like:

T = (S(a)C1 . S(b)C1 . S(c)C1 . ... )C249 

a, b, c ... --> total scripts of each language
T: total cases
C: combination operator
249 registered countries so far

So by estimate, scoping down to only official languages (Set 1) and countries without script:

T = 249C183 = 1.961551522E+61

T: total cases
C: combination operator
249 registered countries
183 languages (Set 1)
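The estimate above can be reproduced with Python's standard library. The second figure is not from the original estimate: it is added only for comparison, under the different assumption that one identifier is needed per (language, country) pair.

```python
import math

countries = 249  # registered countries, from the estimate above
languages = 183  # official languages (Set 1), from the estimate above

# The combination used in the estimate: choose 183 out of 249.
combinations = math.comb(countries, languages)
print(f"{combinations:.9e}")

# For comparison (an assumption, not the original estimate):
# one identifier per (language, country) pair, scripts ignored.
pairs = languages * countries
print(pairs)  # 45567
```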

BMurri commented 2 months ago

@hollowaykeanho thank you, that is quite helpful.

I'm going to build out some scenarios in the context of the current codebase of the WiX toolset and after combining those investigations I will proceed with a WIP for this issue, as requested by @robmen.

Expect a WIP by the end of the northern hemisphere summer.