Hey!
Thanks for your kind words about my work. The question of linearizing the dictionary (Hunspell and its predecessors call this "unmunching" for some reason) is a quirky one.
Basically, if we ignore word compounding, it is more or less trivial: look at word flags, see which suffixes and prefixes it might have, and produce all valid combinations. Here is a proof-of-concept Spylls-based script for doing so: https://gist.github.com/zverok/c574b7a9c42cc17bdc2aa396e3edd21a. I haven't tested it extensively, but it seems to work and produce valid results, so you can try to use it for your idea!
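For a flavor of what the gist does, here is a heavily condensed sketch of the suffix half of the idea. This is not the gist itself: it assumes Spylls' internal data model (`dictionary.dic.words`, `aff.SFX`, and the affixes' `strip`/`add`/`cond_regexp` fields), and it skips prefixes, cross-products, and flags like NEEDAFFIX entirely:

```python
# Condensed sketch, not the linked gist: suffixes only, no prefixes,
# no cross-products, no NEEDAFFIX/FORBIDDENWORD handling.
from spylls.hunspell import Dictionary

dictionary = Dictionary.from_files('en_US')

def suffixed_forms(word):
    yield word.stem  # the bare stem is (usually) a valid word itself
    for flag in word.flags:
        for suffix in dictionary.aff.SFX.get(flag, []):
            if not suffix.cond_regexp.search(word.stem):
                continue  # this suffix's condition doesn't match the stem
            # assuming strip is '' when there is nothing to strip
            stem = word.stem[:-len(suffix.strip)] if suffix.strip else word.stem
            yield stem + suffix.add

for word in dictionary.dic.words:
    for form in suffixed_forms(word):
        print(form)
```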
The problem is, this script can't be made universal enough to handle any possible dictionary, because of word compounding. In languages with word compounding, the list of "potentially possible" words is incredibly huge. (The topic is covered in this article.)
For example, in the nb_NO dictionary (one of two variants of Norwegian), there are 11k words with COMPOUNDFLAG. This means that any combination of those 11k words is "theoretically possible" (not only every pair, but also three, four, ... in a row). The dictionary contains enough information to tell whether a given compound word is valid, but not enough to produce all really existing/used compounds (see the example in the article I've linked: "røykfritt" ("smoke-free" = "non-smoking") and "frittrøyk" are both "valid" for Hunspell, but the second one is not actually an existing word... though I assume some poet might use it, and Norwegian readers might understand what it means in context).
That's why "unmunching" the list of words for languages with compounding is impossible—so, to have just a list of "all actually existing Norwegian words" Hunspell's dictionary is useless.
Another curious example: the English dictionary has "compounding rules" used to check numerals, like "11th is valid, and 10000011th is valid, but 10000011nd is not". An attempt to produce "all possible" combinations from this rule will obviously lead to an infinite loop :)
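If memory serves, the relevant part of en_US looks roughly like this (paraphrased from memory, so treat flag names and entries as illustrative rather than exact):

```
# en_US.aff (excerpt, paraphrased): COMPOUNDRULE patterns combine
# specially flagged dictionary entries
COMPOUNDRULE 2
COMPOUNDRULE n*1t
COMPOUNDRULE n*mp

# en_US.dic (excerpt, paraphrased): digits and ordinal endings carry
# the flags the rules refer to
1/n1
2/nm
3/nm
1th/tc
2th/tc
3th/tc
```

The `n*` is a regex-like repetition over flagged entries (any number of digit words in a row), which is exactly why the set of words these rules describe is infinite.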
As far as I understand, that's the biggest obstacle for "full list"-based spellcheckers: AFAIK, the LanguageTool grammar-checking project uses the morfologik spellchecker, which encodes word lists as an FSA very efficiently, but falls back to Hunspell for languages like German or Norwegian due to compounding.
> Here is a proof-of-concept Spylls-based script for doing so: gist.github.com/zverok/c574b7a9c42cc17bdc2aa396e3edd21a
Awesome! Thanks a lot.
> In languages with word compounding, the list of "potentially possible" words is incredibly huge.
> The English dictionary has "compounding rules" used to check numerals, like "11th is valid, and 10000011th is valid, but 10000011nd is not". An attempt to produce "all possible" combinations from this rule will obviously lead to an infinite loop :)
That's indeed a big problem 🤔 Thanks for pointing that out.
Potentially that could be solved by taking a hybrid approach: using an FSA for regular words and some dynamic logic to account for compounding. However, with a hybrid approach I can't just ignore Hunspell's logic entirely, I can't just throw away all the flags after parsing, and accounting for compounding will most probably be tough at best.
I'll have to think more about this.
> Potentially that could be solved by taking a hybrid approach
Yes, that's what comes to mind:
- use the unmunched word list for regular (non-compound) words;
- reimplement only the compound-checking part of the lookup (instead of calling `affix_forms`, it should just look in the unmunched word list and deduce whether the compound is good); a rough sketch follows below.

In general, it all looks bearable, yet not a small amount of work. If you take this road and need more info on how this mad mechanism works, feel free to ping me (here or by email); I spent a lot of intimate moments with it and will be glad if my acquired knowledge, however incomplete, is of some use.
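Something in the spirit of this toy simplification, which ignores most of Hunspell's compounding directives (per-dictionary COMPOUNDMIN, COMPOUNDBEGIN/MIDDLE/END, CHECKCOMPOUNDPATTERN, and so on):

```python
# Toy hybrid lookup: `words` is the full unmunched set, `compoundable`
# the subset of forms allowed to appear inside compounds.
def is_compound(word, compoundable, min_part=3, max_parts=3):
    if word in compoundable:
        return True
    if max_parts < 2 or len(word) < 2 * min_part:
        return False
    # Try every split point: the head must be compoundable, and the tail
    # must be a compoundable word or (recursively) a shorter valid compound.
    return any(
        word[:i] in compoundable
        and is_compound(word[i:], compoundable, min_part, max_parts - 1)
        for i in range(min_part, len(word) - min_part + 1)
    )

def is_valid(word, words, compoundable):
    return word in words or is_compound(word, compoundable)
```

Real Hunspell compounding is far pickier than this, but the shape of the check (split, test the parts against the word list, recurse) is the same.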
> In general, it all looks bearable, yet not a small amount of work.
Yeah, another problem is that I don't really have a clue about how to build a serialized DA-FSA that can be worked with directly, without requiring it to be deserialized first. There's some somewhat approachable material on serializing a trie, but the compression ratios would be much better with a DA-FSA, and for that I've only been able to find some papers that talk about it and some probably-buggy implementations nobody seems to use.
> If you take this road and need more info on how this mad mechanism works, feel free to ping me (here or by email); I spent a lot of intimate moments with it and will be glad if my acquired knowledge, however incomplete, is of some use.
Thank you, the knowledge you've distilled in what you've posted online has already been very useful to me. I'll let you know if I actually end up working on that; for now I'm thinking it would probably be unwise for me to dedicate that much time to it. I think the next step for me should be to check the performance characteristics of a WASM port of Hunspell; I probably overlooked a few things the first time I looked into that. If memory usage with that isn't insane like it is with nspell, I'll probably just use that for now, but even in that scenario I'll still very much be interested in making something better for my use case.
I'm potentially open to looking into alternative approaches, but I care about memory usage and startup time, and for that, working with a succinct DA-FSA really seems optimal (just load a compressed string into memory and that's it).
> I don't really have a clue about how to build a serialized DA-FSA that can be worked with directly, without requiring it to be deserialized first
You might be interested in morfologik, BTW. They solve exactly this problem. I actually did a port of the algorithm to Ruby (you might notice a repeating motif in my activities :joy:); even a naive implementation in a "known to be slow" language is quite effective. That reimplementation was "blind" (I just ported Java to Ruby, cleaned it up, and debugged it to work exactly like the Java), so I don't have a deep understanding, but I assume at the top level it is laid out like this:
1. Assume we have words to encode: "aardvark", "abacus", "abbot", "abbreviation", ...
2. In the FSA, it would be coded (logically) this way:

```
a -> a -> rdvark
     b -> a -> cus
          b -> o -> t
               r -> eviation
```
3. Physically, it would be laid out like this:
```
<start point>
  <address to "A" as a first letter>
  <address to "B" as a first letter>
  ...
<"A"-as-a-first-letter point>
  <address to "A" as a second letter>
  <address to "B" as a second letter>
  ...
<"A"-as-a-second-letter point>
  <address to "A" as a third letter...>
```
So, when we want to check "aardvark", we do this: start at the start point, find the address of "a" as a first letter, jump there, find the address of "a" as a second letter, jump again, and so on until the whole word is consumed.
Something like that
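To make that walk concrete, here is a toy dict-based model of it; real formats replace the nested dicts with byte addresses, but the traversal is the same:

```python
# Build the "logical" structure for the four example words; "$" marks
# the end of a word (a final arc).
words = ["aardvark", "abacus", "abbot", "abbreviation"]

trie = {}
for w in words:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node["$"] = {}

def check(word):
    node = trie  # the start point
    for ch in word:
        if ch not in node:   # no address for this letter here: not a word
            return False
        node = node[ch]      # "jump" to that letter's node
    return "$" in node       # did we land on a final point?

assert check("aardvark") and check("abbot") and not check("abb")
```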
I'll look into morfologik, thanks for the pointer.
The way you explained it seems fairly approachable, perhaps I should try implementing something like that.
However, there might be a couple of problematic details with that:

- The best documentation I've found so far about this is this lecture, but it stops at encoding tries, not full-blown DA-FSAs.
- The addresses themselves are expensive: with, say, a million addressable positions, that's at minimum 20 bits of information just for each of those pointers.

Well, I guess each pointer could encode the distance from itself rather than the distance from the start, saving a good amount of memory, but since in theory one doesn't need to encode this information at all, perhaps it'd be better just to try to implement the more optimal data structure.
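One common way to shrink those pointers is exactly that: self-relative offsets plus variable-length integers, so near jumps cost one byte and only far jumps cost three or more. A minimal sketch of the generic technique (not tied to any particular library):

```python
def varint(n: int) -> bytes:
    # 7 bits of payload per byte; the high bit says "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

assert varint(5) == b"\x05"                   # a nearby node: 1 byte
assert varint(1_000_000) == b"\xc0\x84\x3d"   # a distant one: 3 bytes
```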
Bear in mind that I'm retelling "how it works" from memory of a loose reimplementation I did 2 years ago :) But I know that morfologik's encoding is quite efficient:
> Example: for Polish, the 3.5 million word list becomes a less than 1MB file in the FSA format.
Interesting, perhaps then it would be pretty good for my use case 🤔 I definitely have to look into that.
Apparently the current size of that Polish dictionary is closer to 3MB, but that's still pretty good; maybe they just added more words to it: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-polish/src/main/resources/morfologik/stemming/polish/polish.dict
I've started dedicating some time to this. Some random thoughts:

- Running the script under PyPy (hoping to speed it up) fails with the following error:
```
Traceback (most recent call last):
  File "src/unmunch.py", line 32, in <module>
    from spylls.hunspell.dictionary import Dictionary
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/__init__.py", line 1, in <module>
    from .dictionary import Dictionary
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/dictionary.py", line 11, in <module>
    from spylls.hunspell.algo import lookup, suggest
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/algo/lookup.py", line 37, in <module>
    import spylls.hunspell.algo.permutations as pmt
AttributeError: module 'spylls' has no attribute 'hunspell'
```
- The vi-VN dictionary seems utterly useless: the whole thing weighs like 40kb and just over 6000 words get extracted from it; there's no way that's complete by any means. Removing it from my app would probably be better than using this dictionary, just because I'd imagine the false negatives would be through the roof.
- The rw-RW dictionary looks really weird: its aff file weighs ~38MB and its dic file weighs just half a kilobyte. I don't know what's the deal with that.
- Not that many words get extracted from the fr-FR dictionary, but I think I was getting millions of words out of it with nspell, so maybe a lot of them are encoded via compound rules, or nspell was just doing something wrong; I'll have to check how many words hunspell's own unmunch command can extract.
- The hu-HU dictionary (ironically) takes forever to process: I let it run for over half an hour and it still wasn't done, it just kept consuming more and more memory, so there might be an endless loop in there somewhere. I'll have to check what the size of the stdout buffer was, but I'm guessing it was well over 5GB, which would be unworkable for my approach anyway.
- The es-DO dictionary leads to the following error:

```
Traceback (most recent call last):
  File "/Users/fabio/Desktop/notable-spellchecker/src/unmunch.py", line 34, in <module>
    dictionary = Dictionary.from_files(sys.argv[1])
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py", line 141, in from_files
    aff, context = readers.read_aff(FileReader(path + '.aff'))
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 105, in read_aff
    dir_value = read_directive(source, line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 161, in read_directive
    value = read_value(source, name, *arguments, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 251, in read_value
    for line in _read_array(int(count))
```
- The hy-AM dictionary leads to the following error:

```
Traceback (most recent call last):
  File "/Users/fabio/Desktop/notable-spellchecker/src/unmunch.py", line 34, in <module>
    dictionary = Dictionary.from_files(sys.argv[1])
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py", line 141, in from_files
    aff, context = readers.read_aff(FileReader(path + '.aff'))
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 105, in read_aff
    dir_value = read_directive(source, line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 161, in read_directive
    value = read_value(source, name, *arguments, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 251, in read_value
    for line in _read_array(int(count))
ValueError: invalid literal for int() with base 10: 'ցվեցին/CH'
```
- The cs-CZ dictionary leads to the following error:

```
Traceback (most recent call last):
  File "/Users/fabio/Desktop/notable-spellchecker/src/unmunch.py", line 34, in <module>
    dictionary = Dictionary.from_files(sys.argv[1])
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py", line 141, in from_files
    aff, context = readers.read_aff(FileReader(path + '.aff'))
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 105, in read_aff
    dir_value = read_directive(source, line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 161, in read_directive
    value = read_value(source, name, *arguments, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 249, in read_value
    return [
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 250, in <listcomp>
    make_affix(directive, flag, crossproduct, *line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 294, in make_affix
    return kind_class(
  File "<string>", line 9, in __init__
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/data/aff.py", line 266, in __post_init__
    self.cond_regexp = re.compile(self.condition.replace('-', '\\-') + '$')
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 549, in _parse
    raise source.error("unterminated character set",
re.error: unterminated character set at position 36
```
- The eu-ES dictionary also takes forever to process.
- he-IL takes forever too.
- ko-KR takes forever too.

Short summary:
- es-DO, hy-AM, cs-CZ: hopefully these can all be fixed in Spylls, as they seem to be errors related to encoding.
- eu-ES, he-IL, hu-HU, ko-KR, rw-RW: I don't know if they just produce gigabytes-long word lists or if there's an infinite loop somewhere; I killed the process after a while. Whatever the culprit is, it'd be good if Spylls could somehow fix it or ignore the flags that cause the combinatorics to explode. I should try running hunspell's own unmunch over them too.

Overall the unmunch script seems to have worked pretty well. I'd say the next steps are:
Here's the result of running Hunspell's unmunch over the problematic dictionaries:

- es-DO: it made an 11MB word list.
- hy-AM: it gave me a segmentation fault.
- cs-CZ: it made a 45MB word list.
- eu-ES: it gave me a segmentation fault.
- he-IL: it made a huge 1.3GB word list.
- hu-HU: after a looong time I had to stop it because I was running out of disk space; in the meantime it had written a gigantic 22GB word list to disk. I guess that's one of the languages that makes heavy use of compound rules.
- ko-KR: it gave me a gazillion errors and a segmentation fault.
- rw-RW: after a while it made a 2MB word list, weird.

Better than nothing I guess. It'd be great if I could regenerate these word lists with Spylls though, disabling compound rules entirely, as I'll handle them specially.
@fabiospampinato Interesting investigation! I'll definitely dive into fixing/investigating all the problems sometime in the upcoming week (can't promise I'll investigate the PyPy one, but broken dictionaries shouldn't stay broken). Just to be in sync: where did you get the dictionaries from? (So I'll be testing on the exact same ones.)
> can't promise I'll investigate the PyPy one, but broken dictionaries shouldn't stay broken
No worries, that doesn't actually matter to me, I'd be happy to leave this running overnight or something, I care much more about it working well.
> Just to be in sync: where did you get the dictionaries from? (So I'll be testing on the exact same ones.)
I'm using the dictionaries listed here: https://github.com/wooorm/dictionaries
Update on the progress with the succinct trie:
Basically I'm kind of ecstatic, this data structure is amazing.
I'm just hoping the situation regarding compound words has a reasonably simple solution, because potentially until I get that working I can't ship this.
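For context, the succinct-trie trick I mean is LOUDS-style: the tree shape is flattened into a bit string (plus a parallel array of edge labels), and navigation is done with rank/select over the bits, so there are no pointers to deserialize at all. A minimal sketch of the encoding side (a generic illustration, not my actual layout):

```python
from collections import deque

def louds(trie):
    # Breadth-first: for each node emit one '1' per child and a closing '0'.
    # The tree *shape* becomes a plain bit string; edge labels go into a
    # parallel array, and navigation works via rank/select over the bits.
    bits, labels = ["10"], []          # conventional super-root
    queue = deque([trie])
    while queue:
        node = queue.popleft()
        bits.append("1" * len(node) + "0")
        for label, child in sorted(node.items()):
            labels.append(label)
            queue.append(child)
    return "".join(bits), "".join(labels)

# Tiny shape-only demo:
trie = {"t": {"o": {}, "e": {"a": {}, "n": {}}}}
print(louds(trie))  # ('1010110110000', 'teoan')
```

The real engineering is in the rank/select index over the bit string, but the point stands: "loading" the structure is just keeping the string around.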
A little analysis of the alphabets, as extracted from the word lists, with letters ordered by frequency:
bg-BG
аиеотнрвялспзкдмъугбщхчшцжйфюьКСМБДПТВГАРНЛХИЧЕЗЦОФШЯЙЖУЮЩЪ
br-FR
eanoi'rthdluzmsgcpjkfvùbñMDPTNwyCXIêVLüGKB’âFZASEWRHJOYéqïUë-áôûxíóåà QèÅÂãç10
ca-ES
es-alinortum'cdgphvbfqxzjyDàïAEL·íSéòèçIOHóüUkúCMBPÀTGRVwFNÈÒÍÉKJāXWÚZáQöY1ūšīñžōÓşčäǧ935µâłḤøΩ42ṭṣŏýḥČåēëı7680śãîŠęôǦḫŭķ³²êŽÅÁ+ţŌńÖÇûǎŚŁĀìřù₂ăṬḍźżĒØÑÆṇňėćḗǒőṛ
ca-ES
es-alinortum'cdgphvbfqxzjyDàïAEL·íSòéèçIOHóüUkúCMBPÀTGRVwFNÈÒÍÉKJāXWÚZáQöY1ūšīñžōÓşčäǧ3µâ9ł5ḤøΩ4ṭ2ṣŏýḥČåēëıśã76îŠęôǦḫ80ŭķ³²êŽÅÁ+ţŌńÖÇûǎŚŁĀìřù₂ăṬḍźżĒØÑÆṇňėćḗǒőṛ
cs-CZ
noeavtimlrsíupkájcdhzěšyřýbčéžfgůňťKxúBHPSMVóďŠLDRJTCFAZNGOČwEŽIUWŘüöqÚĎäYXÁÍľQĽÓ;
da-DK
esrntdliaogkmuvbpføyhæjSNåc'M’VØ-zAwBéHKTL.xERGFDCJPIOqUWZüYÅö"XQ óèáÆäúłßíðëęśđńÖ3μ6Áܵï2ωψχφυτσςρποξνλκιθηζεδγβαΩΨΧΦΥΤΣΡΠΟΞΝΜΛΚΙΘΗΖΕΔΓΒΑ85ÞôñêçÓàãšîć,1ʻžŽÒÍýòÿőğșż/90
de-AT
enrstialhugc-dmobkfpzwväüISöABKßyEPVGMTFDxRHWjLZNUqCOJÜQÄéÖ. YXñêâ92180:63à_,>@<¶Ã)(7
de-CH
enrstialhugc-dmobkfpzwväüISöABKßyEPVGMTFDxRHWjLZNUqCOJÜQÄéÖ. YXñêâ92180:63à_,>@<¶Ã)(7
de-DE
enrstialhugc-dmobkfpzwväüISöABKßyEPVGMTFDxRHWjLZNUqCOJÜQÄéÖ. YXñêâ92180:63à_,>@<¶Ã)(7
el-GR
αοετνισρμκυπλςηόγωδίέθάύφχζώήβξψϊΑΚΜΣΠϋΒΤΛΐΓΕΝΔΦΙΧΡΟΘΆΖΗΰΈΊΌΥΞΨΩΉΎ
en-AU
seiarntolcd'ugpmhbyfkvwxzMSCqAjBPTLHDGREKNFWJOIVUYZQX3219876540
en-CA
seiarntolcd'ugpmhbyfkvwzxMSCAqjBPTLHDGRENKFWJOIVUYZQX3219876540
en-GB
seiarntolcd'ugpmhbyfkvwxzMSCqAjBPTLHDGRENKFWJOIVUYZQX3219876540
en-PH
aersoinctldmubpágízfvhéjqóñxyCúPTAMSüBGHLkVRNIOJEDYFQUZwXKWÁÑ-ÚÉÓ.Í/ _
en-US
seiarntolcd'ugpmhbyfkvwzxMSCAqjBPTLHDGREKNFWJOIVUYZQX3219876540
en-ZA
esianrtolcdumgp'hbyfvkw-zxSMqjBCATRNLGKDWHPEOFVIJUéZYQ.èXêñâç!üïôóöáàä31û4ë5Åîã862
es-AR
aersionctldmubpágízfvhéjqóñxyúükCAMSwBGPLTEDIRNJVHFOUYKQZWÁ./ÓÑÍX
es-BO
aersonictldmupbágfzívhjéqóñxyúükMCSwABGIEPTNDVRHLJFUYOKÁZWQ/.ÓÍX
es-CL
aersonicltdmupbágfzívhjéqóñxyúükCMPwASLBTGINEHDRVFJQUOKYÑÁZW/.ÓÍX
es-CO
aersonictldmubpágífzvhjéqóñxyúCPMüASTGBkRVLNEDJIHFOUYwQZKÁWÚÓÑÍÉ/X.
es-CR
aersonictldmupbágzfívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-DO
aesornlitdcmáupbgéífvzhjqóxñyúS|/ükwMACGBEINDPJLFV RHTUOKYÁZW.X ÓÍ
es-EC
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-ES
aersoinctdlmubpágífzvhéjqóñxyúükAMCSPBGwVRLTEINDJFHOUÁZKYWÉ/.ÚÓÍX
es-GT
aersonictldmupbágzfívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-HN
aersonictldmupbágzfívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-MX
aersonictldmubpágífzvhjéqóñxyúTCüAkMSPHNIBJEGLZODYwRVXFUQKÁW/.ÓÑÍ
es-NI
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-PA
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-PE
aersonictldmupbágífzvhjéqóñxyúCükPAMSHTLBIVRwOJYNQGEUDFKZ-ÁW/.ÓÑÍX _
es-PR
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-PY
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-SV
aersonictldmupbágzfívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-US
aersionctldmubpágízfvhéjqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-UY
aersonictldmupbágfzívhjéqóñxyúükwMSACGBEINDPJLRFVHTUOKÁYZW/.ÓÍX
es-VE
aersonictldmubpágífzvhjéqóñxyúükMCASGwBPTEILRDNVFJUHOZKYÁW/Q.ÓÍX
et-EE
iatesuklnmgrvodphjõäbüöfðþz-yAxCcBLID.MEKRGTSFHUOPNVÜWXw
fa-IR
اینرتهمدوشسبلکگزفخپجعقآحچصغطذضژثظئٔءأُؤًِةَّ
fo-FO
anrisultkgmeðvdobfpjøóyháíúæýSHEKTMBAGJVNF-RcDÁOLwz.PIxÍCUÚØÓYéWÆ/Qå+ü761qZ#83
fr-FR
esiarntolucémdp'gbhfvqz-yâLDQxèCAOjSHk₂MBNPFI₃TîçRG₄EêwïVKû₁JWô₆₅ÉœUYZ₇₈₀à₉XëöüæᵉóŒá231íµÎ4ÀùΩℓΩˢʳÆ976ɴꜱć0ÈÂᵈñ85äÅωψχφυτσρποξνμλκιθηζεδγβαΔܳ²ãńśÿᴀᴍᴏšÔÊōˡåúᵛᵐᵍ_ⁿ
fy-NL
esntriaklodjfpgbmhuwcyûAOUzúvâôê-'.SBëHGïWNFMTKJRELDPIäöYxéqCèVü341QX6520åàZ8
ga-IE
ahinrtceoldsímgbfáó'uéúp-CASBMIGvFTPDEyOLRHNUkÉwzVKÚXÁWJxÍÓjZQqYüöèñćëâêãùäìěòîçğà
gd-GB
ahin-esrcdgltbomufà'pòìùèéóACSMBIEGFTDPLáONRUykvÈÉÀHÌwzÒxÙWVKqJjY.XZ‑Qñíōüā1öäëÁūÓ,úłđ⁊âôçŌćĐÅē=ïęøãīıễîŷʻğŭŏăşåþ
gl-ES
elsoanrcmávidthubpéígfzxóñqúïükAySCMBP.RTLjGDEHKNJFIVw-OWZYUXаиQ'норöōелвèëсšткοςçãÁдäАαÉмłàιйū/ρčνÖôчяěıλâ1уêðşηøВžτгСśМσКő2ь3κεŽğРДυδПфř0ФИпāЛГòбΑίŞхάåÓæÆНЕμܵŠŌİБýćīέзшγΩ4πΆÍªăТΝόÇ+ω9βÅЯОÞňîΔφìť5ΣΠώĐ Čõ8ἈіήĽØ´ďùþżюΕύÚęÑ6ա́ΜθЮЗыёΛΚΓχńû7Θξē:ХъΡΙΒđė@Οϊßəţ|アորբэцΧΦΤζǺºů’ǧ伯ラのეկЭЦУЙΖŻÐÂÀ²ʿ][ạșąǎ™亞亚ドーハブἌიაՎնմԱШЖџжῑΈŚŁÌijịảț–ĉ&이아友千道罕拉东台中ッキインルベムナ海しἸთქნਨਮիգտեՀհЧєІΥȘŘŐĀÙÕÒÎűʻ⁺ệ¡ģọẩṭ₂ǫũề),ǽ$찬해지함라브市走網叉夜犬島路淡毅正クジ社会式株政李之励方狼場戦内寺利実族氏尼安川野吉门厦隠神尋と北子未來當当九丹视电央国ザオバカシウュ・へユチニァヴスャエヌワトィデぎふΩ€‰”“ῊἩϐỈᒃᓗᑖᖅᑭᕿვლდღობრუზมัดาอ්මදආਘੰਿਸਹੋमहार्बअսՊՅյւձռփўїқљЁϋῖȣᾱῐἀὕἰΉƏŬźĪĢĒứộỗẹĆÊÈȳ°¥£′ʾṢṣḤ
ḇửỳÝ‘…ļ_ɘṛ!ờưņḥķĭǔᵥữŏœểŕ%`
he-IL
וכשי"מלתבהנרםפקןאחסטדעגצךז'ףץ
hr-HR
aionerjmtsuvklpdgzčbšchćžfđASB-KMPTLHJEDVGNIRFCZOŠČUwŽyĐxéëXYĆQW:
is-IS
anrisultgemkfðödvjbhóápoæíyúþýéxHSABEGFVÁKMJÞR-DLNTÍIPc.ÓÚOUÖCWqzYwÆÝZ
it-IT
leiatn'ocrsmugvdAIpbzfqhDEQCOBSNULMTVòàéìHxkyPwRGFù èjZ/0KW2JY,.1":-)(3;><654Xçôâ7
ka-GE
აიესნბრმოლდვთუგცკტშხზქფპწყღძჩჯჭჰჟ- ietosanTEOIrRNAShl,dcf#bHuLpgwFCmDMWGyPUY/B.v):("VKjx_kX40
lb-LU
enrtsialohcugmpdkfébzwäëvSKBPGAFMyDHREVTxWLIjONCZüqUJËèÄöQ-ÉêûôîçYÖXâÜñ'ïàí&/320Šó!51
lt-LT
iseanuotmjrbkpdlvėčšgyžąįųūzęfchPKSGBVMADLRJTŠNEŽUZIČOFCHŪĄYĖĮwxWXQq
lv-LV
aieānsojmtkupšrēzldīvgbcķūļņžfčģhA.MSVDLKBIERGJTZPHNOFUCĀŠĶŽĢĒČŅ-Ī)(ĻŪ3ōyżW2/w:XYx
mk-MK
аеоинвтрслпкудмзбчгјшцњжфхќџѓљѕ-АК’СМБПИТВЕГДШРНЛХФЈЦУОЗ!Ч.ЃѐЖѝЊЉ
mn-MN
агэлдйхүуниросчтөцмвбшзжыяекьпфёъюБКМАСГЛДХПТФНВРЭОУШИЧЖЦЗЯЮӨЕЙщҮЁ
nb-NO
ensrtialkogdmpuvfbjhøyåæc-éSzwABTVMFHELOKNGxDRPIC.ôèJUqWØÅYZüQäçêöòXëÆÄ
ne-NP
नरकसेलुहमदछोतयइँपबगटथीखौैएचभजउडवधअफूघझठङशढणषईआओञृंऊऔ१९८७६५४३२०ऐ ः-ऋ24,#/:31)(ळ5
nl-BE
enrsatioldgkpumhbcv-fwjijz'ySBHMxAKLëRDWVGTENCPOJFqIJZIïé.UèöYXQüêç6130245áäóûàîúôñ+í 8â97/Å
nn-NO
earnstilkogdmupvfbjøyhåæc-SzBwéHêTAMKFNGVELRxDôO.PòCèIJUWqØÅóYQZçüäXÄÆë
pl-PL
iaeonyzwrcmsktpuldjłbghąśężfóńćSKźPMB-WGCDRALTJZNOFHvEŁIŚUŻxVqYXQĆ'éŹ.
pt-BR
aesor-inmltcdhuvpbgfzájqíçãxõéóêôâMCBASPúIGTELNRFJUVDXHy.kZOQwKÁW'Í üèÉöàʺªÂëYÚÓïµñîøØÒ³²åÔ
pt-PT
aers-iolmntcdhuvápgbfzíãjqçxêéóõúâCAMPôSBTELGRkFVDNHIJOwyUKZWQXÁàYÍÉ .ÓèÂîÚü?;:,)("!
ro-RO
eiranotlucăsdm-pzțșgbfvâhîjxkCwBySMPADGTRLVIFqNHOEȘJUZKWȚXQYöüÎ/éíş82ţáÂè6543
ru-RU
оаеивнрстмлпукяшыдюзгбйчхьщжцфэъ-АКВМСБПГЛНДТИРЕФОЗЭХШУЯЧЖЮЦЙЩё
rw-RW
anir|XS/tkeyumbghozcsdwfpvj
sk-SK
enoaivrtjmksšlupcdhzáíbčyýúéťžľgfňóäxôďŕĺ-MBSKPwATLHVDGNRJŠOFZEICUČŽöĽqWXÁQYĎÚÍüŤ ÓěÉêÝŇĹàëÖ
sl-SI
aienormvtkljsčpduzgšhbcžSfPMKBVLGDRTŠCJćZHAIONČŽFXEUywWxüqYöĐQđ.Ć
sr-Cy
аиоенртусмвклпдјзгбшчцћњхљжфђСМБКџПВАДЈРТНЛГХИЕОФШЗЂЧЦУЖЉЋЏЊ
sr-La
aioenrtulsjmvkpdzgbščcćhžfđSMBKPVADJRTLNGHIEOFŠĐZČCUŽĆ
sv-FI
snraetilokgdmpubväföhåyjcx-SzéBHwMKLTAVRGFNEPDCOIJ:ÖUWÅ qÄ.YZü3241èáQ7X860ç5'ćóíê,9łøôâàæńšëñžşúûãřÉωψχφυτσρποξνμλκιθηζεδγβαΩΔęňāčğ㜌źŁµőï
sv-SE
snraetilokgdmpubväföhåyjcx-SzéBHwMKLTAVRGFNEPDCIOJ:ÖUWÅ qÄ.YZü3241èáQ7X860ç5'ćóíê,9łøôâàæńšëñžşúûãřÉωψχφυτσρποξνμλκιθηζεδγβαΩΔęňāčğ㜌źŁµőï
tk-TM
aylirdemngkýňtzsşbäjouüpçwhö-fžAGMSOHBNKcRPJTDIÝÇUÖ ZLvYEŞ
tr-TR
aeilrnımdktysuzşobüğcgçpvhöf'jAKSTBMEİHDGYÖONRFUŞPCÇVÜILZJ.âWĞw
uk-UA
оанивіретсмкулпьдзягйбюшчхцжєфїщ-'КСБПМГВДЛАТРНґОЗШФЧІХЯЖУЄЕЦЮҐЩЙЇИ
vi-VN
nhgticmuorpalbkáđyxưsvdạàếảốêệóớấúộqắéôụíơồềậeọợờâùịòẹèầãặăỏứìổẩủểẻựỉằởừửẫễũỡỗõĩẽẳữýỵẵỷỳNIVTHCỹPURDLMGAQFSBwEJĐK
Interestingly, all alphabets but one seem to have <= 256 letters, which should allow me to use at most 1 byte per letter to store them. One should probably be able to pack more than one letter into a byte for languages with very short alphabets; I'll have to investigate that.
That language with more than 256 letters may be a bit of an issue encoding-wise 🤔
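As a sanity check of the 1-byte-per-letter idea, the mapping is just a frequency-ordered table (here using the en-AU alphabet from the list above):

```python
# Each dictionary's alphabet, already ordered by frequency, maps to
# byte values 0..255; frequent letters get the small codes.
alphabet = "seiarntolcd'ugpmhbyfkvwxzMSCqAjBPTLHDGREKNFWJOIVUYZQX3219876540"  # en-AU

encode = {ch: i for i, ch in enumerate(alphabet)}
decode = dict(enumerate(alphabet))

def pack(word: str) -> bytes:
    return bytes(encode[ch] for ch in word)

def unpack(data: bytes) -> str:
    return "".join(decode[b] for b in data)

assert pack("tea") == bytes([6, 1, 3])
assert unpack(pack("tea")) == "tea"
```

For very short alphabets (16 symbols or fewer), two letters could share a byte as 4-bit nibbles, which is the packing I'd want to investigate.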
Update on the encoded trie, which is largely done. The encoded output for pl-PL zips down to ~4MB. This should be ok for my use case, but morfologik seems to get ~2.7MB and ~1.8MB respectively, which is a lot better; I guess deduplicating suffixes too is really worth it (I don't know if their dictionary is case-insensitive though, some implementations I've seen just ignore casing). I'd like to try to encode a full FSA eventually too, but that's too complicated for me for now.

I'll put the finishing touches on this thing, and then I'll try to understand what the situation is regarding compound words; any guidance on that would be greatly appreciated.
Mainly I'd say I'd like to have a way to extract compound words up to a limit just to gauge the situation across dictionaries, then I can think about how to address it.
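To illustrate why deduplicating suffixes (the trie-to-DAWG step) buys so much, here's a toy: identical subtrees, i.e. shared suffixes like "ed"/"ing", get merged bottom-up by hash-consing:

```python
words = ["walk", "walked", "walking", "talk", "talked", "talking"]

# Plain trie; "$" is an end-of-word pseudo-edge.
trie = {}
for w in words:
    node = trie
    for ch in w + "$":
        node = node.setdefault(ch, {})

# Merge identical subtrees bottom-up (hash-consing): shared suffixes
# like "ed$" and "ing$" collapse, turning the trie into a DAWG.
registry = {}
def minimize(node):
    for ch, child in node.items():
        node[ch] = minimize(child)
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)

dawg = minimize(trie)

def count(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count(child, seen) for child in node.values())

print(count(dawg))  # 9 unique nodes, down from 25 in the plain trie
```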
Two random thoughts:
NB: I follow your comments, but during the weekdays I can only spend shallow attention on them; I'll probably read more deeply on the weekend. But two things you might find interesting that I know about morfologik's approach:

- They operate on raw bytes, not characters (the word is UTF-8-encoded first, so it might be stored as [208, 188, 208, 176, 208, 188, 208, 176]). It is probably harder to debug, but easier for memory layouting, so the encoder/decoder can be absolutely "dumb".

(I still suspect that digging into their code/links/docs could be fruitful for your tasks, too.)
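(For the record, that byte sequence is just the UTF-8 encoding of a Cyrillic word, e.g. in Python:)

```python
>>> list("мама".encode("utf-8"))
[208, 188, 208, 176, 208, 188, 208, 176]
```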
I'll investigate dictionary bugs in the next few days, and I'll hopefully be able to provide some guidance for compounding, when the time comes.
This might be a good starting point for their approach to FSA building: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/main/java/morfologik/fsa/builders/FSABuilder.java
Sorry I'm dumping too many words in this thread 😅.
An update/summary:

- The suggestions I generate aren't as good as nspell's, but it's ~20x faster. I could just throw more compute at this, and/or perhaps walk the trie smartly rather than computing edit distances blindly, but for now that's good enough for me.
- I've looked at morfologik's code a bit, but I didn't really learn a lot; I would need to spend much more time on this to really understand it.

About dictionary parsing (hmm, it would be cool to split this ticket into several discussions actually... Maybe we can go with a private repo for your project with me as a contributor, so I can see/answer issues and discussions there? Or WDYT?):

- cs-CZ's aff file has an unterminated character set in an affix condition (if you add the missing ] at the end manually, unmunch.py works fine)

I'll investigate the "forever to complete" (infinite loops or too-long lists) dictionaries next.
PS: About this one:
> It'd be great if you could update the unmunch script to extract compound words
I don't believe it is the right way, and I'm not sure I have the resources to do it. If you want to try approaching this yourself, I might help with some advice; if you go the preprocessing way (try to split the word, then see if it is a proper compound; which I believe is more promising), I can be of some advice, too.
> About dictionary parsing (hmm, it would be cool to split this ticket into several discussions actually... Maybe we can go with a private repo for your project with me as a contributor, so I can see/answer issues and discussions there? Or WDYT?)
Sure, I'll open up a repo for this tomorrow.
In the meantime I'll close this issue as I would consider the original request addressed.
Regarding the throwing dictionaries:

- es-DO: now I'm not getting the error anymore, though strangely I don't recall updating Spylls. Interestingly, Spylls made me a 6.5MB file while Hunspell made me an 11.6MB one; I'm not sure what the reason behind that difference is.
- cs-CZ: it works after the patch.
- hy-AM: it works after the patch.

> I don't believe it is the right way, and I'm not sure I have the resources to do it. If you want to try approaching this yourself, I might help with some advice; if you go the preprocessing way (try to split the word, then see if it is a proper compound; which I believe is more promising), I can be of some advice, too.

I think I'll put this on hold for now, as all the possible paths seem complicated and I can't dedicate too much time to this at the moment; I'll probably get back to it in a couple of months or so, or if a simpler path materializes. In the meantime I'll ship what I've got and see how it behaves in practice.
Hello there! First of all, thanks a lot for writing Spylls, all the documentation, and the blog series. I haven't read it all in depth yet, but so far it has been wonderfully informative; from what I've seen it's the best documentation on how Hunspell works available, and Spylls itself is probably the best port of Hunspell to another language.

I have a use case where I basically need to perform spell-checking in the browser, and from what I've seen, mainly from playing around with nspell, a Hunspell-like approach to spell-checking would be pretty prohibitive for my use case, for some languages at least: it takes too long to generate words or even just to parse dictionaries, plus keeping all those strings in memory can blow up memory usage fairly rapidly, which is a concern for my use case.

I think it's possible to approach the problem differently: pre-parsing dictionaries into a compact ~binary representation and working with that directly, trading some lookup performance, hopefully not too much, for amazing memory usage and startup times. There's some shallow documentation about that here, but I think you alluded to something like this in a blog post when mentioning another spell checker, so you might already be familiar with this.

Anyway, I'd like to spend some time implementing something like that. To do that I would first of all need to extract all valid words from all dictionaries, and that sounds like a task Spylls could be well suited to address; it's not a problem if it takes hours for some dictionary, as I care more about extracting all valid words as correctly as possible than about doing it quickly.

So, could Spylls expose an API for doing that?