xigt / lgid

language identification of linguistic examples
MIT License

Crubadan.csv isn't complete #15

Closed elirnm closed 7 years ago

elirnm commented 7 years ago

The Crubadan.csv file ends at BCP 47 code wrs (sorted alphabetically), but Crubadan provides data for a number of languages with codes past that, and those languages are in our language table. The CSV file has 2,000 language entries, but the downloads table on the website lists 2,124 entries.

goodmami commented 7 years ago

It's possible the Crubadan data has been updated since the CSV file was created. I copied the CSV file from the previous language identifier, and I'm not sure how it was originally obtained or created. If I download the WritingSystems.csv file from http://crubadan.org/writingsystems, I only see ~2000 rows. I do note that some rows have a "child_ws" field filled in; for example, abt has abt-x-maprik abt-x-wosera as its child_ws value. Perhaps that's where the other ~124 entries come from?

elirnm commented 7 years ago

The download information button there says the CSV download has maximum 2000 rows, so it probably just cuts off at that point.

The entries missing from the CSV are those whose BCP 47 code sorts alphabetically after the point where the CSV cuts off, which for the file in res/ is wrs. The missing entries include things like zho for Chinese and xh for Xhosa, so it's not just variants or obscure or not-fully-recognized items.
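A quick way to confirm which codes are affected is to compare the codes in the CSV against the codes we expect from the language table. This is only a rough sketch: the `code` column name, the `missing_codes` helper, and the sample rows are assumptions for illustration, not the actual layout of Crubadan.csv.

```python
import csv
import io


def missing_codes(csv_text, expected_codes, code_column="code"):
    """Return the expected codes that do not appear in the CSV text.

    `code_column` is a hypothetical header name; adjust it to whatever
    the real file uses for the BCP 47 code.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    present = {row[code_column] for row in reader}
    return sorted(set(expected_codes) - present)


# Illustrative data only: a tiny CSV that stops at wrs.
sample = "code,name\nabt,Ambulas\nwrs,Waris\n"
print(missing_codes(sample, ["abt", "wrs", "xh", "zho"]))  # ['xh', 'zho']
```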

goodmami commented 7 years ago

> The download information button there says the CSV download has maximum 2000 rows, so it probably just cuts off at that point.

Oh, good catch. I hope there's a solution besides manually entering the remaining items.

elirnm commented 7 years ago

I was able to get the missing entries by sorting the table in reverse order and then downloading the CSV again. I'll merge the two CSVs and then commit the combined file.
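For reference, the merge could look roughly like the sketch below. It assumes both downloads share the same header row and that a column named `code` holds the BCP 47 code; that column name and the `merge_csv_texts` helper are hypothetical, and the sample rows are made up. Rows that appear in both downloads are kept once, and the result is sorted by code.

```python
import csv
import io


def merge_csv_texts(first, second, key="code"):
    """Merge two CSV texts with identical headers, de-duplicating on `key`.

    `key` is an assumed column name for the BCP 47 code. When a code
    appears in both inputs, the row from `first` is kept.
    """
    merged = {}
    fieldnames = None
    for text in (first, second):
        reader = csv.DictReader(io.StringIO(text))
        if fieldnames is None:
            fieldnames = reader.fieldnames
        for row in reader:
            merged.setdefault(row[key], row)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for code in sorted(merged):
        writer.writerow(merged[code])
    return out.getvalue()


# Illustrative data only: one download sorted ascending, one descending,
# overlapping at wrs.
ascending = "code,name\nabt,A\nwrs,W\n"
descending = "code,name\nzho,Z\nwrs,W\n"
merged_text = merge_csv_texts(ascending, descending)
print(merged_text)
```

Keying on the code rather than concatenating the files avoids duplicated rows where the two 2000-row downloads overlap.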