wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Add support for BCP 47 and output IANA language subtags #30

Closed rhythmus closed 8 years ago

rhythmus commented 8 years ago

By default, Franc returns ISO-639-3 three-letter language tags, as listed in the Supported Languages table.

We would like Franc to alternatively support outputting IANA language subtags as an option, in compliance with the W3C recommendation for specifying the value of the lang attribute in HTML (and the xml:lang attribute in XML) documents.

(Two- and three-letter) IANA language codes are used as the primary language subtags in the language tag syntax defined by the IETF’s BCP 47, which may be further specified by adding subtags for “extended language”, script, region, dialect variants, etc. (RFC 5646 describes the syntax in full). Adding such fine-grained secondary qualifiers is, I guess, out of Franc’s scope, but it would nevertheless be very helpful if Franc were able to at least return the IANA primary language subtags, which, used stand-alone, still comply with the spec.

On the Web, as the IETF and W3C agree, IANA language subtags and BCP 47 seem to be the de facto industry standard (at least more so than ISO 639-3). Moreover, the naming convention for TeX hyphenation pattern files (as used by, among others, OpenOffice) uses ISO 639-1 codes, which overlap better with IANA language subtags, too.

If Franc output IANA language subtags, the return values could be used as-is, without any further post-processing or re-mapping, in, for example, CSS rules specifying hyphenation:

@media print {
  :lang(nl) { hyphenate-patterns: url(hyphenation/hyph-nl.pat); }
}
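
To illustrate the point about using a detected tag as-is, here is a minimal sketch that builds such a rule from a language subtag. The helper name `hyphenationRule` and the `hyphenation/hyph-<tag>.pat` path layout are assumptions modelled on the rule above; neither is part of Franc.

```javascript
// Hypothetical helper: builds a print-only hyphenation rule for one language.
// Assumes pattern files are stored as hyphenation/hyph-<tag>.pat.
function hyphenationRule(tag) {
  // Accept only a bare two- or three-letter primary language subtag.
  if (!/^[a-z]{2,3}$/.test(tag)) {
    throw new RangeError('Expected a primary language subtag, got: ' + tag)
  }

  return [
    '@media print {',
    '  :lang(' + tag + ') { hyphenate-patterns: url(hyphenation/hyph-' + tag + '.pat); }',
    '}'
  ].join('\n')
}

console.log(hyphenationRule('nl'))
```

The strict check on the tag is a design choice for this sketch: it keeps arbitrary detector output from being pasted into a stylesheet unvalidated.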

@wooorm :

  1. What is the rationale for Franc to default to ISO-639-3 (only)? Is it a “better” standard, and, if so, why?
  2. If you agree it would be a good idea for Franc to support BCP 47 and to output IANA language subtags as an option, how would you prefer it to be implemented, and would you accept a PR? (We’d happily contribute.) Would it suffice to add and map them in data/support.json?

rhythmus commented 8 years ago

To further clarify: the rationale for not supporting the output of ISO 639-1 codes (as discussed in issue #10 ) does not apply to this request to return IANA primary language subtag codes, since they (certainly if further specified using BCP 47’s syntax of additional qualifiers) do have the largest coverage of the world’s languages, AFAIK, and hence, of what Franc is capable of recognizing.

wooorm commented 8 years ago

Thanks for the detailed question, great to see!

Franc uses ISO 639-3 because that specification can represent every language used in Franc. Yes, BCP 47 can do that too, through the primary language subtag. The reason BCP 47 uses both 2-character codes and 3-character codes is that it draws on ISO 639-1, ISO 639-2, ISO 639-3, and ISO 639-5 (whichever comes first), and registers those in the IANA registry.
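
The “whichever comes first” rule can be sketched as follows. This is a simplified illustration, not the registry’s actual process: the record shape and the field names (`iso6391`, `iso6392T`, `iso6393`) are assumptions, ISO 639-5 is left out, and the three records are a hand-picked excerpt.

```javascript
// Hand-picked excerpt; field names are assumptions made for this sketch.
const records = [
  {name: 'English', iso6391: 'en', iso6392T: 'eng', iso6393: 'eng'},
  {name: 'Scots', iso6391: null, iso6392T: 'sco', iso6393: 'sco'},
  {name: 'Yue Chinese', iso6391: null, iso6392T: null, iso6393: 'yue'}
]

// Pick the first code that exists, trying the oldest (and shortest) standard first.
function primarySubtag(record) {
  return record.iso6391 || record.iso6392T || record.iso6393
}

for (const record of records) {
  console.log(record.name + ': ' + primarySubtag(record))
}
```

Only languages that have an ISO 639-1 code end up with a two-letter subtag; every other language keeps its three-letter code.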

Now, 639-5 is dead, and to my knowledge all 639-1 and 639-2T codes are also in 639-3, so there wouldn’t be more (possibly supported) codes if Franc switched to BCP 47 language subtags.

The reason to use 639-3 is that it’s a single list of codes, each of three characters, large enough to contain all languages used in franc and small enough to include nothing else. BCP 47, on the other hand, is huge. If you pull down the IANA registry, that’s a lot more data than needed, because multiple specs are involved.

What are you planning to use Franc for? I have quite some knowledge of BCP 47 and ISO 639. Maybe I can help in some other way.

P.S. thanks for offering to PR!

rhythmus commented 8 years ago

Thanks for your swift response and clarification!

I can see why, from a development/design perspective, ISO 639-3 (which has both complete and concise coverage, with uniform tags) is to be preferred over BCP 47 / IANA (which have more-than-necessary coverage, with tags of unpredictable length and form).

From a practical viewpoint, however, it is still desirable to have Franc (optionally) return IANA primary language subtags, since the W3C recommends those as the preferred value for the lang attribute and the :lang() selector in HTML, XML and CSS. Without such an output option, we are required to post-process Franc’s output and hook into some ad-hoc mapping from the ISO 639-3 tags returned by Franc to W3C-compliant tags.

This is not to say that Franc should pull down the entire IANA registry, let alone reckon with BCP 47’s complex syntax. If I’m not mistaken, it would suffice to “just” add (and maintain…) a simple mapping from the ISO 639-3 codes to their corresponding IANA language tags for the 75 ≤ n ≤ 335 languages that Franc supports? (After all, and as far as we are concerned, both sets are just strings.)
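
A minimal sketch of what such a restricted mapping could look like, assuming a table of ISO 639 records and a list of the codes the detector can return. Both arrays below are hypothetical, hand-written excerpts; they are not Franc’s actual data files.

```javascript
// Hypothetical excerpt of ISO 639 data; a real table would cover every language.
const records = [
  {iso6393: 'eng', iso6391: 'en'},
  {iso6393: 'nld', iso6391: 'nl'},
  {iso6393: 'sco', iso6391: null}, // Scots has no two-letter code.
  {iso6393: 'jpn', iso6391: 'ja'}
]

// Hypothetical list of the codes the detector can return.
const supported = ['eng', 'nld', 'sco']

// Keep only supported languages that have a shorter code.
function buildMapping(records, supported) {
  const wanted = new Set(supported)
  const mapping = {}
  for (const {iso6393, iso6391} of records) {
    if (wanted.has(iso6393) && iso6391) mapping[iso6393] = iso6391
  }
  return mapping
}

const mapping = buildMapping(records, supported)
console.log(mapping) // { eng: 'en', nld: 'nl' }
```

Codes without a shorter form (here `sco`) are simply absent from the mapping, so the mapping stays no larger than the set of supported languages.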

We are developing a typesetting service (Textus) which converts Markdown files into (html5 compliant) responsive webpages and (ISO 19005 compliant) PDF documents. We would like to use Franc to do automatic language detection, after which proper hyphenation can be applied. (BTW, we’re big fans of your remark.js too!)

wooorm commented 8 years ago

OK, thanks!

First off, I do understand why you want BCP-47 tags. That’s a good use case. And, I agree that the solution would be pretty light, as it would not need the complete IANA registry.

But, I do think the solution would be better placed in another module, instead of in the core of Franc. E.g., the following (not yet working) code:

var franc = require('franc');
var toBCP47 = require('iso-639-3-to-bcp-47');

var lang = toBCP47(franc('An English language document with words.'));
console.log(lang);

Yields:

'en'

Would that work?

rhythmus commented 8 years ago

That would do great for our use case[^†], thanks!

How would you plan to implement (m.m. like to see implemented) such an iso-639-3-to-bcp-47 mapper?

I’d be happy to do some grunt work to make this happen. Just let me know how you’d like me to be of assistance!

[^†]: FYI: Pandoc too (and LaTeX) default to BCP 47 instead of ISO 639-3. An iso-639-3-to-bcp-47 mapper for use with Franc would be a great feature addition for a lot of workflows, I take it.

wooorm commented 8 years ago

Sorry for the late response.

I’d suggest either creating a module specifically for franc (all theoretically possible codes are in the wooorm/trigrams module), mapping those to IANA entries (where possible, and needed). Then write a function which looks its input up in that map and, if there’s an entry, returns that, and otherwise returns the input.
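
The lookup-with-fallback function described here could be sketched like this. The name `toBcp47` and the three-entry map are assumptions for illustration; a real module would generate the map from the IANA registry.

```javascript
// Hand-written excerpt; a real map would be generated, not typed by hand.
const iso6393ToBcp47 = new Map([
  ['eng', 'en'],
  ['nld', 'nl'],
  ['fra', 'fr']
])

// Return the mapped subtag if there is one, otherwise return the input unchanged.
function toBcp47(code) {
  return iso6393ToBcp47.has(code) ? iso6393ToBcp47.get(code) : code
}

console.log(toBcp47('eng')) // en
console.log(toBcp47('sco')) // sco (no entry, so the input is returned)
```

Returning the input when there is no entry means three-letter codes without a shorter form pass through untouched.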

An alternative would be to do this for all 639-3 codes, mapping them to ISO 639-1. More useful for non-franc users, but maybe the IANA registry has different values.

Great that you’re willing to investigate. Thanks!

wooorm commented 8 years ago

@rhythmus Ping!

wooorm commented 8 years ago

@rhythmus I’m closing this due to no response. Let me know if I can help you further or if I should re-open this!

timdiggins commented 6 years ago

@wooorm is there no programmatic converter between iso-639-3 and bcp 47? I presume most of the work of this PR would be actually creating this (separate) converter?

(And a pleasant converter might, I think, be able to attempt conversion to iso-639-1, which might ease some people's confusion.) -- EDIT: could easily do this with an existing library like https://github.com/adlawson/nodejs-langs

As a side note, I feel like mentioning the iso-639-3 code format in the README (maybe with a link here?) would be helpful (I wasn't sure whether it was iso-639-3 or iso-639-2 and had to work it out) -- have drafted a PR, feel free to ditch that and word it yourself.

PS thanks for this great library (and CLI is particularly handy!)

wooorm commented 6 years ago

> is there no programmatic converter between iso-639-3 and bcp 47? I presume most of the work of this PR would be actually creating this (separate) converter?

Semantically, an ISO 639-3 code is a valid BCP 47 tag: just not the suggested shortest canonical version. But yes, you’re right on the work going into that!
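
One way to check the first claim, assuming a JavaScript runtime with the built-in `Intl` API (for example a recent Node.js): `Intl.getCanonicalLocales` throws a `RangeError` for tags that are not structurally valid, so a three-letter code that passes through it is at least well-formed. The helper name `isStructurallyValid` is hypothetical. Whether the returned canonical form is shortened to the two-letter code may depend on the runtime and its locale data, so this sketch does not rely on it.

```javascript
// Hypothetical helper: true when the runtime accepts the tag as well-formed.
function isStructurallyValid(tag) {
  try {
    Intl.getCanonicalLocales(tag)
    return true
  } catch (error) {
    // A RangeError signals a malformed tag; anything else is unexpected.
    if (error instanceof RangeError) return false
    throw error
  }
}

console.log(isStructurallyValid('eng')) // true
console.log(isStructurallyValid('en')) // true
console.log(isStructurallyValid('not a tag')) // false
```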

> EDIT: could easily do this with existing library like https://github.com/adlawson/nodejs-langs

Yes, or with my own https://github.com/wooorm/iso-639-3. I’d suggest a new project though, iso-639-3-to-1 or so?

> As a side note, I feel like mentioning the iso-639-3 code format in the README (maybe with a link here?) would be helpful (I wasn't sure whether it was iso-639-3 or iso-639-2 and had to work it out) -- have drafted a PR, feel free to ditch that and word it yourself.

Yes, I’d like that! Awesome! 👍

> PS thanks for this great library (and CLI is particularly handy!)

Thank you :)

davidar commented 6 years ago

> Yes, or with my own https://github.com/wooorm/iso-639-3

For reference:

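// Assumes CommonJS releases of both packages, with iso-639-3 exporting an array of records.
const franc = require('franc')
const iso639 = require('iso-639-3')

// Build a lookup from ISO 639-3 codes to ISO 639-1 codes (where one exists).
const shortLang = {}
for (const {iso6391, iso6393} of iso639) shortLang[iso6393] = iso6391

// `md` is the text to analyse (for example, the Markdown source).
let lang = franc(md)
// Prefer the two-letter code when available; otherwise keep the ISO 639-3 code.
if (shortLang[lang]) lang = shortLang[lang]

amitbend commented 6 years ago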

I created a new slim package to convert from iso-639-3 to iso-639-1, also for languages without iso-639-1 that have a "macro language". PRs are welcome! See the iso-639-3-to-1 package.