xxyzz / mediawiki_langcodes

Convert MediaWiki language names and language codes.
GNU General Public License v3.0

[el] Some gaps in the data in the Greek code_to_names #22

kristian-clausal closed this issue 4 days ago

kristian-clausal commented 6 days ago

This is some of my own debug print output:

Wikiarticle: WARNING: ... lang_code ... output name ... language heading in Greek wiktionary
άντε: WARNING: Language code 'tsd' Greek name 'tsakoniu' does not matchoriginal string 'Τσακωνικά (tsd)' at ['άντε']

So `code_to_name("tsd", "el")` returns "tsakoniu".

I've left out every entry whose output was in any kind of Greek; what remains are just the gaps.

Other occurrences, mostly less well-known languages that obviously just have gaps in the data:

'aaa':  '戈圖奧語' -> 'Γλώσσα Ghotuo (aaa)' at 
'alq':  'အယ်လ်ဂါမ်ခှန်' -> 'Αλγκονκίν (alq)' at 
'arz':  'مصرى' -> 'Αιγυπτιακά αραβικά (arz)' at 
'avk':  'Kotava' -> 'Κοτάβα (avk)' at 
'bar':  'Boarisch' -> 'Βαυαρικά (bar)' at 
'bcl':  'Bikol Central' -> 'Φιλιππινέζικα της κεντρικής Μπικόλ (bcl)' at 
'cpg':  '卡帕多細亞希臘語' -> 'Καππαδοκικά (cpg)' at   <<< This one's weird.
'diq':  'Zazaki' -> 'Ζαζάκι (diq)' at 
'ext':  'estremeñu' -> 'Εξτρεμαδουρικά (ext)' at 
'frp':  'arpetan' -> 'Γαλλοπροβηγκιανά (frp)' at 
'gal':  'Galoli' -> 'Γκαλό (gal)' at 
'gan':  '贛語' -> 'Κινεζικά γκαν (gan)' at 
'gmy':  'ဂရိမာဲသဳနဳယာန်' -> 'Μυκηναϊκή διάλεκτος (gmy)' at 
'hak':  '客家語/Hak-kâ-ngî' -> 'Χάκα (hak)' at 
'hbo':  'ဟဳဘရဝ် သၠပတ်' -> 'Αρχαία εβραϊκά (hbo)' at 
'hif':  'Fiji Hindi' -> 'Φίτζι χίντι (hif)' at 
'jam':  'Patois' -> 'Τζαμαϊκανά κρεολικά (jam)' at 
'lij':  'ligure' -> 'Λιγουριανά (lij)' at  <<< Some are lowercase
'liv':  'Līvõ kēļ' -> 'Λιβονικά (liv)' at 
'lld':  'Ladin' -> 'Λαδινικά (lld)' at 
'lmo':  'Lombard' -> 'Λομβαρδικά (lmo)' at 
'mlq':  '西曼丁哥語' -> 'Δυτική μαλίνκε (mlq)' at 
'mo':  'молдовеняскэ' -> 'Μολδαβικά (mo)' at 
'nah':  'Nāhuatl' -> 'Νάουατλ (nah)' at 
'nci':  '古典納瓦特爾語' -> 'Κλασικά νάουατλ (nci)' at 
'nhn':  '中納瓦特爾語' -> 'Κεντρικά νάουατλ (nhn)' at 
'nov':  'Novial' -> 'Νόβιαλ (nov)' at 
'nrm':  'Nouormand' -> 'Νορμανδικά (nrm)' at 
'otk':  'တူရကဳတြေံ' -> 'Παλαιά τουρκικά (otk)' at 
'pdc':  'Deitsch' -> 'Γερμανικά της Πενσυλβανίας (pdc)' at 
'pfl':  'Pälzisch' -> 'Γερμανικά του Παλατινάτου (pfl)' at 
'pms':  'Piemontèis' -> 'Πιεμοντέζικα (pms)' at 
'pnb':  'پنجابی' -> 'Δυτική παντζάμπι (pnb)' at 
'rue':  'русиньскый' -> 'Ρουθηνικά (rue)' at 
'stq':  'Seeltersk' -> 'Ανατολικά φριζικά (stq)' at 
'tsd':  'tsakoniu' -> 'Τσακωνικά (tsd)' at 
'vec':  'vèneto' -> 'Βενετικά (vec)' at 
'vep':  'vepsän kel’' -> 'Βεψικά (vep)' at 
'vls':  'West-Vlams' -> 'Φλαμανδικά (vls)' at 
'xcl':  '古典亞美尼亞語' -> 'Παλαιά αρμενικά (xcl)' at 
'xld':  '呂底亞語' -> 'Λυδικά (xld)' at 
'xno':  'Anglo-Norman' -> 'Αγγλονορμανδικά (xno)' at 
'yua':  'yukatansk maya' -> 'Μάγια του Γιουκατάν (yua)' <<< This one's also interesting because "yukatansk maya" ("Yucatan Maya") is obviously from a Scandinavian language
'zea':  'Zeêuws' -> 'Ζηλανδικά (zea)' at 

I'm posting these because the output names are sometimes in unexpected languages, like the seemingly random Chinese and the Swedish "yukatansk maya".

xxyzz commented 4 days ago

More Greek language data were added in the v0.2.11 release, but if you already have both the language name and the code, then you don't need this library...

kristian-clausal commented 4 days ago

I was using the code -> Greek name lookup as a way to check that the heading is correct, i.e. that the code and the name match.
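
A minimal sketch of that check, assuming headings have the shape `Name (code)` as in the debug output above. `parse_heading` and `heading_matches` are hypothetical helper names, not part of mediawiki_langcodes. The lookup is passed in as a parameter so that the library's `code_to_name` could be plugged in; the two stub lookups below only stand in for it.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed heading shape: "<name> (<code>)", e.g. "Τσακωνικά (tsd)".
HEADING_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<code>[a-z-]+)\)$")


def parse_heading(heading: str) -> Optional[Tuple[str, str]]:
    """Split a heading into (name, code); return None for any other shape."""
    m = HEADING_RE.match(heading.strip())
    if m is None:
        return None
    return m.group("name"), m.group("code")


def heading_matches(
    heading: str, lookup: Callable[[str, str], str], lang: str = "el"
) -> bool:
    """True if the name `lookup` returns for the heading's code equals the heading's name."""
    parsed = parse_heading(heading)
    if parsed is None:
        return False
    name, code = parsed
    # casefold() so that lowercase outputs such as 'ligure' still compare equal.
    return lookup(code, lang).casefold() == name.casefold()


# Stand-ins for code_to_name: the first reproduces the result reported above,
# the second assumes the gap has been filled with the heading's own name.
def stub_with_gap(code: str, lang: str) -> str:
    return {"tsd": "tsakoniu"}.get(code, "")


def stub_filled(code: str, lang: str) -> str:
    return {"tsd": "Τσακωνικά"}.get(code, "")
```

With these stubs, `heading_matches("Τσακωνικά (tsd)", stub_with_gap)` is false and the same call with `stub_filled` is true, which is the mismatch the warning above reports.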

xxyzz commented 4 days ago

Seems unnecessary? In the el edition, the language code is either the language template's name or one of the template's arguments, and the language name can be obtained from the expanded template.

I think this issue could be considered resolved? Most languages in the el edition should have been added by now.

kristian-clausal commented 4 days ago

Yeah, it turned out to be unnecessary at the time.

kristian-clausal commented 4 days ago

No wait, I forgot why I posted this in the first place: it's weird that Yucatan Mayan gets the Swedish name "yukatansk maya", and it's weird that other languages get Chinese names. This could be a little more consistent; the algorithm for picking which language to return seems odd.

xxyzz commented 4 days ago

`code_to_name` can return the language name in another language if it can't find the name in the requested language. I think it is implemented this way to mimic some MediaWiki Lua APIs. Usually this isn't a problem in extractor code, especially once the data extracted from an edition has been added.
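
To illustrate the behaviour described above, here is a rough sketch of a lookup that falls back to a name in some other language, plus a script check for spotting such fallbacks. This is not the library's actual implementation or fallback order: `lookup_with_fallback`, `has_greek_letters`, and the `NAMES` table are hypothetical. The two names are taken from the list in this issue, while the "sv" and "zh" keys are assumptions.

```python
import unicodedata
from typing import Dict, Tuple

# Stand-in data keyed by (code, language the name is written in).
# The names come from this issue; the "sv" and "zh" keys are assumed.
NAMES: Dict[Tuple[str, str], str] = {
    ("yua", "sv"): "yukatansk maya",
    ("aaa", "zh"): "戈圖奧語",
}


def lookup_with_fallback(
    code: str, lang: str, names: Dict[Tuple[str, str], str]
) -> str:
    """Return the name in `lang` if present, else the first name stored for `code`."""
    if (code, lang) in names:
        return names[(code, lang)]
    # Fallback: any other language's name for the same code (insertion order).
    for (other_code, _other_lang), name in names.items():
        if other_code == code:
            return name
    return ""


def has_greek_letters(name: str) -> bool:
    """True if at least one character has a Unicode name starting with GREEK."""
    return any(unicodedata.name(ch, "").startswith("GREEK") for ch in name)
```

In this sketch, asking for `("yua", "el")` yields "yukatansk maya", and `has_greek_letters` on that result is false, so a caller could flag the value as a fallback rather than a Greek name.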

kristian-clausal commented 4 days ago

If it's emulating some precedent, I guess that's good.