xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0

Incomplete phone inventory for iso gup #1

Open complinger opened 4 years ago

complinger commented 4 years ago

Description:

The phone inventory for Kunwinjku (iso gup) is incomplete. The output of python -m allosaurus.list_phone --lang gup is:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ']

However, Phoible lists the complete inventory as:

segment  allophones   description_name
m        m            Gunwinggu (PH 883)
i        ɪ i          Gunwinggu (PH 883)
j        j            Gunwinggu (PH 883)
u        ʊ u          Gunwinggu (PH 883)
a        ʌ ai au a    Gunwinggu (PH 883)
w        w            Gunwinggu (PH 883)
n        n            Gunwinggu (PH 883)
l        l            Gunwinggu (PH 883)
b        p pʰ b       Gunwinggu (PH 883)
ŋ        ŋ            Gunwinggu (PH 883)
e        ɛ æ e        Gunwinggu (PH 883)
o        ɔ ɒ o        Gunwinggu (PH 883)
ɡ        k kʰ ɡ       Gunwinggu (PH 883)
r        r            Gunwinggu (PH 883)
ɲ        ɲ            Gunwinggu (PH 883)
ʔ        ʔ            Gunwinggu (PH 883)
d̪        t̪ t̪ʰ d̪       Gunwinggu (PH 883)
ɳ        ɳ            Gunwinggu (PH 883)
ɭ        ɭ            Gunwinggu (PH 883)
ɻ        ɻ            Gunwinggu (PH 883)
ɖ        ɖ            Gunwinggu (PH 883)
ɽ        ɽ            Gunwinggu (PH 883)
ʎ        ʎ            Gunwinggu (PH 883)
dʲ       tʲ tʲʰ dʲ    Gunwinggu (PH 883)

https://phoible.org/inventories/view/883

Expected behavior

I would expect the allosaurus model inventory for iso gup to be:

['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w', 'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ', 'ɪ', 'ʊ', 'ʌ', 'ai', 'au', 'b', 'p', 'pʰ', 'ɛ', 'æ', 'ɔ', 'ɒ', 'ɡ', 'k', 'kʰ', 'ɲ', 'd̪', 't̪', 't̪ʰ', 'ɖ', 'ɽ', 'ʎ', 'dʲ', 'tʲ', 'tʲʰ']
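
The expected list is the current output plus every Phoible allophone that is not already in it. A minimal sketch of that comparison follows, with both lists copied by hand from this comment; the variable names are illustrative and not part of allosaurus.

```python
# Current output of `python -m allosaurus.list_phone --lang gup` (copied from above).
current = ['a', 'e', 'i', 'j', 'l', 'm', 'n', 'o', 'r', 'u', 'w',
           'ŋ', 'ɭ', 'ɳ', 'ɻ', 'ʔ']

# Allophone entries of Phoible inventory 883, one string per row (copied from above).
phoible_allophones = [
    "m", "ɪ i", "j", "ʊ u", "ʌ ai au a", "w", "n", "l", "p pʰ b", "ŋ",
    "ɛ æ e", "ɔ ɒ o", "k kʰ ɡ", "r", "ɲ", "ʔ", "t̪ t̪ʰ d̪", "ɳ", "ɭ", "ɻ",
    "ɖ", "ɽ", "ʎ", "tʲ tʲʰ dʲ",
]

seen = set(current)
missing = []
for row in phoible_allophones:
    for phone in row.split():
        # Keep each allophone once, in the order Phoible lists it.
        if phone not in seen:
            seen.add(phone)
            missing.append(phone)

expected = current + missing
print("missing:", missing)
print("expected:", expected)
```

The order of the added phones may differ from the list above, but the set of phones is the same.
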
xinjli commented 4 years ago

Hi

Thanks for sending an issue with very clear descriptions! I will take a look at this very soon.

Thanks! Xinjian

xinjli commented 4 years ago

Hi, sorry for the late reply.

The main cause of the issue is that we built the PHOIBLE inventory from the Segment column rather than the allophone column, because the allophone column is empty for many languages.

I think using the allophone column (when nonempty) should be the expected behavior as you suggested. In the next pretrained model update, I will fix the inventory to solve this issue.

Thanks!
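
The behavior proposed in this reply (use the allophone column when it is nonempty, otherwise fall back to the Segment column) could be sketched roughly as below. The row layout (dictionaries with `segment` and `allophones` keys) and the function name `build_inventory` are assumptions made for illustration; this is not the actual allosaurus build code or the PHOIBLE file format.

```python
def build_inventory(rows):
    """Collect the phones of one language from PHOIBLE-like rows.

    Each row is a dict with a 'segment' string and an 'allophones' string
    of space-separated phones, which may be empty (hypothetical layout).
    """
    seen = set()
    for row in rows:
        allophones = (row.get("allophones") or "").split()
        # Fall back to the segment itself when no allophones are listed.
        phones = allophones if allophones else [row["segment"]]
        seen.update(phones)
    return sorted(seen)


rows = [
    {"segment": "b", "allophones": "p pʰ b"},
    {"segment": "m", "allophones": "m"},
    {"segment": "ɲ", "allophones": ""},  # empty allophone column
]
print(build_inventory(rows))
```

With the fallback, a language whose allophone column is empty still gets the same inventory as before, so only languages with allophone data would change.
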

BrendanJohnson commented 3 years ago

I noticed a discrepancy between the Allosaurus phone inventory and the allophones listed in Phoible, for example for Cantonese (yue):

Output of allosaurus.list_phone: a a̞ e f h i j k kʰ kʷ kʷʰ l l̥ l̪ l̪̥ m m̩ n n̪ o p pʰ r s sʰ t tʰ t̠ t̪ t̪ʰ u w y æ ŋ ŋ̩ œ œ̞ ɐ ɔ ɛ ɪ ɪ̞ ɵ ʃ ʃʰ ʊ ʊ̟ β

Output of Phoible: m i k j u a p w n t l s ŋ h f ɛ ɔ ts kʰ pʰ ɪ ʊ tʰ kʷ tsʰ y ai œ au ɐ kʷʰ ui ei ɵ iu ou ɔi ɐi ɐu ɛu ɵy

It seems like the two-character phones (e.g. "ts", "ui", "ei", "iu") are missing from Allosaurus. Is this an intentional design decision, or a problem with the way the inventory lists were built? (The Allosaurus phone inventory for Mandarin, cmn, also lacks two-character phones.)
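
The comparison above can be made concrete with a set difference. The sketch below uses the two Cantonese lists exactly as quoted in this comment; the helper `base_letter_count` is a hypothetical name, and it assumes that modifier letters (such as ʰ and ʷ) and combining diacritics should not count as separate characters.

```python
import unicodedata

allosaurus_phones = (
    "a a̞ e f h i j k kʰ kʷ kʷʰ l l̥ l̪ l̪̥ m m̩ n n̪ o p pʰ r s sʰ t tʰ t̠ t̪ t̪ʰ "
    "u w y æ ŋ ŋ̩ œ œ̞ ɐ ɔ ɛ ɪ ɪ̞ ɵ ʃ ʃʰ ʊ ʊ̟ β"
).split()

phoible_phones = (
    "m i k j u a p w n t l s ŋ h f ɛ ɔ ts kʰ pʰ ɪ ʊ tʰ kʷ tsʰ y ai œ au ɐ "
    "kʷʰ ui ei ɵ iu ou ɔi ɐi ɐu ɛu ɵy"
).split()


def base_letter_count(phone):
    """Count base letters, skipping modifier letters (Lm) and combining marks (M*)."""
    count = 0
    for ch in phone:
        category = unicodedata.category(ch)
        if category != "Lm" and not category.startswith("M"):
            count += 1
    return count


known = set(allosaurus_phones)
missing = [p for p in phoible_phones if p not in known]
multi_letter = [p for p in missing if base_letter_count(p) >= 2]

print("in Phoible but not in Allosaurus:", missing)
print("of those, written with two or more base letters:", multi_letter)
```

For the lists quoted here, every phone reported as missing is written with two base letters, while single segments with a modifier, such as kʰ and kʷ, appear in both lists.
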