Trying to add Ukrainian and failing miserably

dgisser commented 5 months ago

Thanks for creating this project! I'm trying to add Ukrainian, here's what I got so far:

.env file

DEBUG_WORD=критика
MAX_MEMORY_MB=16384000
DICT_NAME=test

added {"iso": "uk", "language": "Ukrainian", "flag": "🇺🇦"}, to languages.json

ran ./auto.sh Ukrainian English

This creates 2 zips, which if I put into Yomitan, suck. If you go to a random Ukrainian wiki page, very few of the words highlight, including words that are for sure in kaikki like критика.

We are skipping a ton of term tags, e.g.

{
  "alt-of": 439,
  "alternative": 282,
  "morpheme": 219,
  "broadly": 115,
  "collective": 115,
  "by extension": 114,
  "predicative": 110,
  "third-person": 105,
  "with-genitive": 94,
  "in-plural": 88,
  "no-comparative": 84,
  "with-dative": 79,
  "letter": 72,
  "third person only": 71,
  "with-instrumental": 69,
  "it is": 55,
  "noun-from-verb": 37,
  "plural-normally": 36,
  "uppercase": 33,
  "lowercase": 33,
  "Western Ukraine": 33,
  "proscribed": 30,
  "genitive": 26,

etc. as well as skipped parts of speech

  "name": 2290,
  "adv": 854,
  "num": 129,
  "intj": 125,
  "prep": 105,

so maybe this is part of the problem. Look forward to any advice on how to resolve!

StefanVukovic99 commented 5 months ago

You did everything right (except setting max memory to roughly 16,000 GB :sweat_smile: the value is in MB, so 16 GB would be 16384). Words are likely not getting matched because Wiktionary puts diacritics on the headwords and they aren't being handled yet. We'll need to add a case for Ukrainian to the normalizeOrthography function (like #67).
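
To illustrate the kind of normalization meant here, a minimal Python sketch follows. The project itself is JavaScript and this is not its actual normalizeOrthography code: the name strip_stress_marks is hypothetical, and the sketch assumes the only diacritic to drop is the combining acute accent (U+0301) used as a stress mark. Stripping every combining mark would be wrong for Ukrainian, because й and ї decompose into a base letter plus a combining mark.

```python
import unicodedata

# Assumption: headword stress is marked with the combining acute accent.
STRESS_MARK = "\u0301"


def strip_stress_marks(word: str) -> str:
    """Remove stress marks while keeping letters such as й and ї intact."""
    # Decompose so every accent becomes a separate code point.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop only the stress mark; the breve of й and the diaeresis of ї stay.
    stripped = decomposed.replace(STRESS_MARK, "")
    # Recompose so й and ї are single code points again.
    return unicodedata.normalize("NFC", stripped)


print(strip_stress_marks("кри\u0301тика"))  # prints критика
```

With something along these lines applied to headwords, a stressed headword and the plain spelling found in running text end up as the same lookup key.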

As for the skipped term tags/parts of speech, that's normal. The parts of speech don't matter unless/until there are deinflection rules written for that language. Adding a tag to a tag_bank_term controls whether it stays in parentheses or gets moved to a Yomitan tag: for example, anatomy gets recognized and parsed out, while the rest are left as-is. I'm not too happy with how the tags look in Yomitan; it may have been better to leave them all in parentheses. There are also some tags that are invisible on Wiktionary but that kaikki deduces somehow. These won't be shown in the Yomitan dict unless they are added to a tag_bank_term.
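
The split described above could look roughly like this in Python. This is only a sketch: the real logic lives in the project's JavaScript, and RECOGNIZED_TAGS, split_tags, and the parenthesized gloss format are all hypothetical stand-ins for whatever a tag_bank_term actually holds.

```python
# Hypothetical stand-in for the tag names listed in a tag_bank_term.
RECOGNIZED_TAGS = {"anatomy"}


def split_tags(raw_tags: list[str], gloss: str) -> tuple[list[str], str]:
    """Return (tags promoted to Yomitan tags, gloss with leftover tags in parentheses)."""
    recognized = [tag for tag in raw_tags if tag in RECOGNIZED_TAGS]
    leftover = [tag for tag in raw_tags if tag not in RECOGNIZED_TAGS]
    if leftover:
        # Unrecognized tags stay in the gloss text, in parentheses.
        gloss = f"({', '.join(leftover)}) {gloss}"
    return recognized, gloss


tags, text = split_tags(["anatomy", "broadly", "collective"], "heart")
print(tags, text)  # prints ['anatomy'] (broadly, collective) heart
```

Under this reading, a "skipped" tag is not lost: it simply stays in the gloss text instead of becoming a separate tag.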

P.S. I remember reading this issue of yours back when the official policy in the yomitan readme was 'no other languages'. I might not have even tried to merge my fork with yomitan and do all this if it wasn't for that hint that there would be support for it, so thanks :pray:

dgisser commented 5 months ago

Thanks!! Just copying the Russian normalizeOrthography rule greatly improves the matching. Let me know if you would like me to submit a PR with these very minor changes. Also, I'm amazed that you remember that issue, in Korean no less! I'm so happy that Korean is available in Yomitan, and it is so powerful; much better than any other Chrome extension out there!

StefanVukovic99 commented 5 months ago

Feel free to PR, then Ukrainian dicts will be included automatically from the next release!

Also check out the language docs to properly add Ukrainian to Yomitan. Texts with no diacritics or full diacritics should work with these dicts, but you'll want to add the same diacritics processing to Yomitan (like https://github.com/themoeway/yomitan/pull/1057) so that texts with partial diacritics, and other dicts, will work too.

dgisser commented 5 months ago

Yeah, normally I would be really into doing something like that but I'm just doing this for a friend who is learning Ukrainian. I don't have any knowledge of Ukrainian (the most I can do is read the Russian alphabet and read a few basic words) so just getting a dictionary set up is sufficient for my needs.

yomidevs / kaikki-to-yomitan

Trying to add Ukrainian and failing miserably #73