xigt / lgid

language identification of linguistic examples
MIT License

Languages with ??? language codes #2

Closed (elirnm closed 7 years ago)

elirnm commented 7 years ago

The following languages currently have ??? for their language code and should probably get a code of their own:

Most of these are pretty obvious and we just need to come up with a code for them. Ethnologue lists 50+ Zapotec languages, though, so I don't know whether Colonial Valley Zapotec is another name for one of those or its own thing.

rgeorgi commented 7 years ago

I'll just note that this isn't exactly as cut and dried as it might seem. While Lardil is lbz, for instance, Old Lardil is evidently at least partially distinct from Lardil, so it is unclear whether to use lbz for both.

I propose that we use the most closely related ISO code we can and, if need be, append -??? to keep these special-case languages distinct (while still being able to collapse them into a similar language).
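
A minimal sketch of how a suffixed code could be collapsed back to its base code, assuming the suffix is joined to the ISO code with a hyphen. The function name `collapse_code` and the example suffix `old` are hypothetical and not part of lgid.

```python
def collapse_code(code):
    """Return the base ISO 639-3 part of a possibly suffixed code.

    Assumes (hypothetically) that special-case languages are written as
    '<iso code>-<suffix>', e.g. 'lbz-old' for Old Lardil.
    """
    # partition() splits at the first hyphen; with no hyphen present,
    # the whole string is returned as the base.
    base, _, _suffix = code.partition("-")
    return base


print(collapse_code("lbz-old"))  # lbz
print(collapse_code("lbz"))      # lbz
```

With a scheme like this, the distinct variety stays recoverable from the full string, while anything that only cares about the modern language can compare on the base code.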

goodmami commented 7 years ago

Also, ??? should not be used for a language code, while it could be used for an unknown language name. The code for an unknown language is und (undetermined), while for a known language without a code it's mis. See: https://en.wikipedia.org/wiki/ISO_639-3#Special_codes
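
To illustrate the distinction, here is a rough sketch of how a lookup could choose between the two special codes. The helper `language_code` and the `known_codes` table are assumptions made for the example, not lgid's actual API.

```python
UNDETERMINED = "und"  # ISO 639-3 special code: language not determined
UNCODED = "mis"       # ISO 639-3 special code: known language, no code


def language_code(name, known_codes):
    """Pick a code for a language name (hypothetical helper).

    known_codes maps lowercase language names to ISO 639-3 codes.
    """
    # An empty name or the '???' placeholder means the language is unknown.
    if name is None or name.strip() in ("", "???"):
        return UNDETERMINED
    # A named language that is missing from the table gets 'mis'.
    return known_codes.get(name.strip().lower(), UNCODED)


codes = {"lardil": "lbz"}
print(language_code("Lardil", codes))                 # lbz
print(language_code("Taimyr Pidgin Russian", codes))  # mis
print(language_code("???", codes))                    # und
```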

elirnm commented 7 years ago

So should I use mis for those (I think most of them at least don't have ISO codes), or do we want to distinguish these languages even though ISO doesn't?

goodmami commented 7 years ago

Using mis might be the most correct.

If we were using BCP 47 tags instead of ISO 639-3 codes, we could include that information, e.g. zap-colonial-valley or something. But if you can determine that there's an ISO 639-3 code that's pretty close to the language in question, we can probably use that.

elirnm commented 7 years ago

Ok, I plan to use mis for Taimyr Pidgin Russian.

For Colonial Valley Zapotec, I'll either use mis or zap (the code for the Zapotec Macrolanguage), depending on your preference.

The rest are all historical forms of languages with codes. I'll use mis for those unless you think it'd be better to just use the code for the modern language. It seems like mis probably makes more sense there.

Old Lardil might be a special situation, however, since the Wikipedia article on Lardil indicates that the "Old" variety didn't die out until 2007, so there might actually be articles that use "Old Lardil" to refer to just basic Lardil (lbz).

Let me know what you prefer for Zapotec, Lardil, and the other Old/Classic languages, and then I'll update the language table.

goodmami commented 7 years ago

I think your conclusions are good. Thanks for doing the research.

zap for Colonial Valley Zapotec is probably ok. I saw there was a subcategory of "Valley Zapotecs", so if there's a more refined macrolanguage, that would probably be appropriate. Otherwise let's go with zap.

Lardil is indeed special, as you say. Wikipedia makes it sound like the language had nearly died and then was revitalized, with the pre-near-death variant now being called "Old Lardil" and the new variant (or perhaps both in general) being just "Lardil". I'd say we use the code lbz for Old Lardil (and "regular" Lardil as well).
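
For reference, the codes agreed on in this thread can be written down as a small table. This is only an illustrative sketch: the names `AGREED_CODES` and `agreed_code` are hypothetical, and the real language table in lgid may be stored differently.

```python
# Codes settled on in this thread for the languages named explicitly.
AGREED_CODES = {
    "Taimyr Pidgin Russian": "mis",   # known language without an ISO code
    "Colonial Valley Zapotec": "zap", # Zapotec macrolanguage code
    "Old Lardil": "lbz",              # same code as "regular" Lardil
    "Lardil": "lbz",
}


def agreed_code(name):
    """Return the agreed code for a language name, or None if not covered."""
    return AGREED_CODES.get(name)


# None of the agreed codes is the '???' placeholder.
assert "???" not in AGREED_CODES.values()
print(agreed_code("Old Lardil"))  # lbz
```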