Generating dictionary wordforms/unmunch

exander77 commented 2 years ago

The original Hunspell had two important utilities:

# print all forms for all words whose roots are given in `roots.dic`
# and make use of affix rules defined in `affixes.aff`:
unmunch   roots.dic affixes.aff
# print the forms of ONE given word (a single root with no affix rule)
# which are allowed by the reference dictionary defined by the pair of
# `roots.dic` and `affixes.aff`:
wordforms affixes.aff roots.dic word

How can I achieve this in spylls.hunspell? I use Hunspell to generate Scrabble dictionaries, and I am looking into replacing it with spylls.hunspell.

zverok commented 2 years ago

There is an examples/unmunch.py. The comments there explain its limitations. Unfortunately, I haven't had the resources to work on something more robust :shrug:

exander77 commented 2 years ago

Works superbly compared to running wordforms over all roots (which took me three days); unmunch has not been supported for a while. This took about 30 seconds. But there are some differences: I am missing 2851 words and I have 319493 new words.
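A minimal sketch of how such "missing" and "new" counts can be computed from two word lists. The function name and the tiny in-memory lists are hypothetical stand-ins for the real Hunspell and Spylls output files; the sketch also reports words that one tool prints more than once.

```python
from collections import Counter


def compare_wordlists(reference_lines, candidate_lines):
    """Return (missing, new, duplicates) for two iterables of words.

    missing:    words in the reference but not in the candidate
    new:        words in the candidate but not in the reference
    duplicates: words the reference lists more than once
    """
    ref_counts = Counter(w.strip() for w in reference_lines if w.strip())
    candidate = {w.strip() for w in candidate_lines if w.strip()}
    reference = set(ref_counts)
    missing = sorted(reference - candidate)
    new = sorted(candidate - reference)
    duplicates = sorted(w for w, n in ref_counts.items() if n > 1)
    return missing, new, duplicates


# Hypothetical sample data standing in for the two output files.
hunspell_out = ["seniorní", "seniorních", "seniorních", "nejseniornější"]
spylls_out = ["seniorní", "seniorních", "seniorněji"]

missing, new, duplicates = compare_wordlists(hunspell_out, spylls_out)
print(len(missing), "missing;", len(new), "new;", len(duplicates), "duplicated")
```

With real data, the two arguments could simply be open file objects, since the function only iterates over lines.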

exander77 commented 2 years ago

Running Czech hunspell: http://www.translatoblog.cz/wp-content/uploads/2021/03/hunspell_cs.zip

$python3 examples/unmunch.py -d cs_CZ -w seniorní
Unmunching only words with stem seniorní

Unmunching Word(seniorní /Y,y)

seniorní
seniorních
seniorního
seniorním
seniorníma
seniorními
seniornímu
seniorněji
seniornější
seniornějších
seniornějšího
seniornějším
seniornějšíma
seniornějšími
seniornějšímu

$wordforms cs_CZ.aff cs_CZ.dic seniorní
seniorněji
seniornějším
seniorních
seniorních
seniorníma
seniornějšímu
seniorního
seniornějšími
seniorní
seniorním
seniorním
seniornějšíma
seniornějších
seniornějšího
seniornímu
seniorní
seniorních
seniorními
seniornější
seniorním
nejseniorněji
nejseniornějším
nejseniornějšímu
nejseniornějšími
nejseniornějšíma
nejseniornějších
nejseniornějšího
nejseniornější

exander77 commented 2 years ago

All words missed by Spylls: cs_CZ.txt.missing.txt

This most likely means that they will not be accepted as correct during spellchecking.

exander77 commented 2 years ago

The new words created by Spylls seem to be due to a deficiency in the original wordforms.

exander77 commented 2 years ago

Basically, I see two kinds of missing words. The first kind has the prefix nej (roughly the same as the suffix -est in English: rychlejší => nejrychlejší, faster => fastest). But a lot of words with nej are present, so maybe it depends on some combination of properties? The second kind is forms of some foreign surnames.

Spylls:

Žukrowského
Žukrowském
Žukrowskému
Žukrowski
Žukrowskiová
Žukrowskiové
Žukrowskiovou
Žukrowskiových
Žukrowskiovým
Žukrowskiovými
Žukrowský
Žukrowským

Hunspell:

Žukrowského
Žukrowském
Žukrowskému
Žukrowski
Žukrowskiho
Žukrowskim
Žukrowskimu
Žukrowskiová
Žukrowskiové
Žukrowskiovou
Žukrowskiových
Žukrowskiovým
Žukrowskiovými
Žukrowský
Žukrowským

Maybe the surnames are correct in Spylls and wrong in Hunspell? But the missing nej prefix is definitely a bug in Spylls.

exander77 commented 2 years ago

Running:

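import sys

from spylls.hunspell import Dictionary

# Load cs_CZ.aff and cs_CZ.dic (path given without extension)
dictionary = Dictionary.from_files('cs_CZ')

# Is the word given on the command line accepted as correct?
print(dictionary.lookup(sys.argv[1]))
# Print the suggestions for the same word, one per line
for suggestion in dictionary.suggest(sys.argv[1]):
    print(suggestion)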

Produces:

$python3 suggest.py nejseniornější
True
nejseniornější
nejseniornějším
nejseniorštější
nejsenilnější
neseniorštější
nejinferiornější

So this looks more like an unmunch bug and not a general Spylls bug.
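The same check can be run over a whole list of missing words instead of one word at a time. A minimal sketch: `split_by_lookup` is a hypothetical helper, and the `known` set is only a stand-in for a real dictionary so the example runs without any dictionary files.

```python
def split_by_lookup(words, lookup):
    """Partition words into (accepted, rejected) using a lookup callable."""
    accepted, rejected = [], []
    for word in words:
        (accepted if lookup(word) else rejected).append(word)
    return accepted, rejected


# Stand-in for a real dictionary: a hypothetical set of accepted words.
known = {"nejseniornější", "nejrychlejší"}

accepted, rejected = split_by_lookup(
    ["nejseniornější", "xyzzy"], known.__contains__
)
print("accepted:", accepted)
print("rejected:", rejected)
```

With Spylls, the `lookup` method of the `Dictionary` object from the script above can be passed as the `lookup` argument. Words that end up in `accepted` but are absent from the unmunch output point to the word-list generator rather than to the spellchecker.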

exander77 commented 2 years ago

$python3 unmunch.py -d cs_CZ -w rychlejší
nejrychleji
nejrychlejší
nejrychlejších
nejrychlejšího
nejrychlejším
nejrychlejšíma
nejrychlejšími
nejrychlejšímu
rychleji
rychlejší
rychlejších
rychlejšího
rychlejším
rychlejšíma
rychlejšími
rychlejšímu

vs

$python3 unmunch.py -d cs_CZ -w seniorní
seniorní
seniorních
seniorního
seniorním
seniorníma
seniorními
seniornímu
seniorněji
seniornější
seniornějších
seniornějšího
seniornějším
seniornějšíma
seniornějšími
seniornějšímu

Seniorní is missing the nej variants that rychlý has.

exander77 commented 2 years ago

Found obviously missing code: https://github.com/zverok/spylls/pull/23

The suffix crossproduct is not analysed for prefixes. Btw, maybe the crossproduct of secondary suffixes needs to be analysed as well?
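For readers unfamiliar with the term: in Hunspell affix files, a prefix and a suffix may be combined on the same stem only when both rules permit cross product. The following is a toy model of that rule, not the code of unmunch.py or of the PR; the `Affix` class, the `unmunch_toy` function and the English sample data are all hypothetical, and stripping and conditions are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affix:
    add: str            # text attached to the stem
    crossproduct: bool  # may be combined with an affix of the other kind


def unmunch_toy(stem, prefixes, suffixes):
    """Generate stem, suffixed, prefixed and cross-product forms."""
    forms = {stem}
    for suffix in suffixes:
        forms.add(stem + suffix.add)
    for prefix in prefixes:
        forms.add(prefix.add + stem)
        for suffix in suffixes:
            # Combine only when BOTH rules allow cross product.
            if prefix.crossproduct and suffix.crossproduct:
                forms.add(prefix.add + stem + suffix.add)
    return forms


prefixes = [Affix("un", True)]
suffixes = [Affix("ly", True), Affix("s", False)]
forms = unmunch_toy("kind", prefixes, suffixes)
print(sorted(forms))
```

Skipping the inner loop is the kind of omission that silently drops every prefixed-and-suffixed form, which matches the symptom of the missing nej words.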

exander77 commented 2 years ago

Btw, I am not even sure the code in unmunch takes the right approach. Shouldn't it be a recursive check?

After each prefix or suffix is added, shouldn't it check whether new prefixes or suffixes can be added on top of that?

I can imagine a word where, once you add a prefix (prefword), you can add a suffix (prefwordsuf) even though wordsuf would not be valid. And then you can add another prefix (pref2prefwordsuf) even though pref2prefword would not be valid? And so on?
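A minimal sketch of such a recursive check, written as a worklist search over a toy rule model. Nothing here is Spylls or Hunspell API: the `Rule` type, the `expand` function and the three rules are hypothetical and only mirror the prefword / prefwordsuf / pref2prefwordsuf scenario. Each rule is used at most once per derivation so that the search terminates.

```python
from collections import deque
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    allowed: Callable[[str], bool]  # may the rule apply to this form?
    apply: Callable[[str], str]     # build the new form


def expand(stem, rules):
    """Apply rules repeatedly until no new form appears."""
    seen = {stem}
    queue = deque([(stem, frozenset())])
    while queue:
        form, used = queue.popleft()
        for rule in rules:
            if rule.name in used or not rule.allowed(form):
                continue
            new_form = rule.apply(form)
            if new_form not in seen:
                seen.add(new_form)
                queue.append((new_form, used | {rule.name}))
    return seen


rules = [
    Rule("pref", lambda f: True, lambda f: "pref" + f),
    # "suf" is only valid once "pref" is present.
    Rule("suf", lambda f: f.startswith("pref"), lambda f: f + "suf"),
    # "pref2" is only valid once "suf" is present.
    Rule("pref2", lambda f: f.endswith("suf"), lambda f: "pref2" + f),
]

print(sorted(expand("word", rules)))
# ['pref2prefwordsuf', 'prefword', 'prefwordsuf', 'word']
```

Note that Hunspell's documentation describes a bounded number of affix layers (twofold suffix stripping by default, or twofold prefix stripping with COMPLEXPREFIXES), so a fully open-ended recursion like this one could generate forms that the spellchecker itself would not accept.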

zverok commented 2 years ago

@exander77 unmunch.py is a quick hack I did while discussing a similar question in #10. I don't consider it feature-complete (that's why it is in examples/; it just shows the direction in which one should go to use spylls to produce a word list).

ATM I, unfortunately, don't have much resource to discuss/debug it (I am in Kharkiv, Ukraine, splitting my days between volunteering, my dayjob, and doomscrolling).

I'm thankful for your PR and I'll merge it if it works for you :) In case you are willing to work on improving unmunch.py, it might make sense to "promote" it to a real feature with code in spylls/hunspell/, script in bin/ and maybe some tests to make sure it works (and improve it when it doesn't), WDYT?

exander77 commented 2 years ago

With that PR, unmunch pretty much works for cs_CZ. I support turning it into a real feature and offer my help with improving it. Even now it definitely works better than Hunspell's wordforms and unmunch. Putting it into bin/ and adding tests etc. sounds reasonable.

I don't want to get political on Github, but I am sending my:

Putin of Russia, go fuck yourself!

from the Czech Republic. We had Soviet occupation here in 1968... I hope Czech Republic and whole European Union and NATO by extension will send enough support including weapons, so Ukraine can put Russia in its place. I think the hearts and minds of most Czech people are with Ukraine.

zverok / spylls

Generating dictionary wordforms/unmunch #22