Not playing well with 'dictionary-it' (2)

writemonkey commented 5 years ago

Hi, thanks for the wonderful library. The Italian (also Portuguese) languages hang my app that uses nspell. Is there anything new on that front (there is a closed issue on Italian dic). Will that problems be fixed in the future? Is there a workaround? Thanks, i.

wooorm commented 5 years ago

There’s no news. The Italian dictionary/affix combination results in a huge list of possibilities, and I don’t know Italian so I can’t check if changing the code would still properly spellcheck it!

If you know Portuguese or Italian, we could start off where #9 ended?

writemonkey commented 5 years ago

Thanks for your quick answer! Unfortunately not. All languages that I speak work very well. Then again, you can find some Italian text on the web and you can see if spelling works as intended ...

I also tried this package:

https://www.npmjs.com/package/typo-js

and it has similar problems with same languages. Maybe vanilla js approach just won't cut it for those dictionaries :( Will investigate further ... i.

leonidasv commented 5 years ago

dictionary-pt-br has the same issue. :disappointed:

It leaks to an insane 1,4GB of RAM at runtime.

It worth checking Hunspell's official algorithm/implementation as it never runs above 50 MBs of RAM while checking against the same dict and process lots of words very fast. I'm currently trying to understand how it works, but sadly can't promise I will come back with a solution soon.

wooorm commented 5 years ago

It leaks to an insane 1,4GB of RAM at runtime.

I’d appreciate a fix! PRs welcome!

It worth checking Hunspell's official algorithm/implementation as it never runs above 50 MBs of RAM while checking against the same dict and process lots of words very fast. I'm currently trying to understand how it works, but sadly can't promise I will come back with a solution soon.

We/I can’t, because that would taint this project as being a derivative work of GPL code. nspell is reverse-engineered.

leonidasv commented 5 years ago

We/I can’t, because that would taint this project as being a derivative work of GPL code. nspell is reverse-engineered.

Switching to GPL wouldn't be a problem for my usage, but I understand your concerns.

However, I found that Myspell (from which Hunspell is based on) is licensed permissively and is probably the base for the comparing algorithm. As I didn't looked into Hunspell's source yet, I will try checking out Myspell's one first, in order to keep any fork/PR compatible with the MIT license (in case I get any useful insight).

nathanlesage commented 5 years ago

I just did some research on what happens during the load of the Italian dictionary, and the algorithm just gives up at line 6720: appuntare/ALKhlTXqrI.

If I comment out, it stops at the next line, so I think that's where the limit is. I don't know whether or not that helps, but as the Italian dictionary is rather large, maybe it's indeed required that the algorithm applies some memory saving options … I've seen that the ALKhlTXqrl-flag is used quite often, and it seems long, so maybe this helps?

I'll continue to research where the algorithm is defeated, so maybe I'll come up with a solution at some point!

PlNG commented 5 years ago

It should also be noted that hunspell-spellchecker has the same issue (excessive memory consumption with the use of the IT dictionary).

wooorm commented 4 years ago

@leonidasv pt-br now uses a new source. Hope this one works.

I suggest trying the Italian dictionary with nodehun to see if it works natively. The Italian source hasn’t been updated in 5 years, so maybe it’s just bad.

fabiospampinato commented 4 years ago

I don’t know Italian so I can’t check if changing the code would still properly spellcheck it!

@wooorm Italian is my mother tongue, if there's anything that I can do to help get this issue resolved I'd be happy to do that.

wooorm commented 4 years ago

Thanks Fabio, appreciate that!

Currently focussing on other stuff, but when I get ready for another push on nspell I’ll ping you!

fabiospampinato commented 3 years ago

I've taken a closer look at this, with the recent performance improvements I can get nspell to stop hanging after a good but manageable amount of seconds, until eventually it throws this error for me:

RangeError: Value undefined out of range for undefined options property undefined

Which according to this SO answer means something ran our of keys available, and according to this v8 engineer's SO answer Maps have an hard-coded 2^24 max number of keys possible.

Basically as I'm understanding it the Italian dictionary is generating so many combination that the engine becomes incapable of handling them.

This should be fixable by using a pool of Map objects internally rather than a single one. It would be useful to measure how many actual keys the Italian dictionary needs though, if instead of 16M it's 16B or something then the whole effort would be pointless.

wooorm commented 3 years ago

Alternative idea: the source of the dictionary hasn’t been updated in years. Do any of you Italian-speaking folk know if there are alternative groups making Italian dictionaries? Or is there a cap we can put on this expression to not generate a bazillion entries?

nathanlesage commented 3 years ago

Thanks for digging into it, @fabiospampinato!

Or is there a cap we can put on this expression to not generate a bazillion entries?

I think yes: I mean, why don't we try to simply read in all contents of the .dic-file, and we already have the flags behind it, so why do we need to resolve these flags immediately? Wouldn't lazy-loading them be easy? I am currently thinking of the following (please tell me if I forget something that prevents this):

The files are loaded, but this "loading" means "Save them to an internal line array"
Once a word is looked up (e.g. "giorno", to stick with the italian example), we could take all possible candidates from the line array (we can be generous with this, because some words may have modifications as a prefix, the goal should only be to not parse all entries) and resolve them until we find a match. If the word is written wrongly, we won't get a match, but the upper cap for this would be once all candidates are resolved.
Then save those resolved words in an internal map (maybe even distinguished by first letters, so that we have, e.g., 26 maps from A to Z containing all resolved words)

This would increase the computational requirements a little bit at the beginning, but this might first reduce the insane times we currently achieve even with English dictionaries during load, and enable loading any DIC file.

Any thoughts? What obvious thing did I overlook?

And thanks again for caring :)

fabiospampinato commented 3 years ago

Alternative idea: the source of the dictionary hasn’t been updated in years. Do any of you Italian-speaking folk know if there are alternative groups making Italian dictionaries?

I don't know anything about that.

Or is there a cap we can put on this expression to not generate a bazillion entries?

Doable, but ideally one wouldn't want the spell checker to understand 50% of a language and ignore the remaining 50% or something. That being said I think I got exactly 10 of those errors reported in my console, I don't know if the engine just gave up or if maybe we are only going like 0.00006% over the limit, which would be interesting.

In any case using multiple maps seems like a better option to me, as with a cap on the number of words we'll get nspell not recognizing some words or having a hard limit on the number of custom words that it can handle.

but this might first reduce the insane times we currently achieve even with English dictionaries during load

@nathanlesage with the recent performance optimizations I submitted I'm getting the English dictionary loaded in my ~6yo laptop in ~150ms or so, I think that's pretty good already, if you are running nspell in a worker it doesn't really matter much if it takes 0ms or 150ms to load a dictionary.

Any thoughts? What obvious thing did I overlook?

I'm not very familiar with the details of generating a dictionary currently, I think potentially something like that could be doable, but finding which words to lazily evaluate without parsing the entire thing sounds difficult to me, and it may backfire performance-wise, like performing that computation may pretty quickly take longer than just parsing the entire thing.

IMO further performance optimizations routes might be:

Using character scanning rather than regexes for rules, this may reduce init time by ~15% or so, or not.
Using a bloom filter or similar structure rather than asking the Map if a word is already present in it, as for example ~93% of the times for the french dictionary we are dealing with new words, maybe this could reduce init time by another ~20% or so, but I've never used bloom filters before so maybe I'm talking nonsense.
Pre-parsing dictionaries, this could potentially remove most of the time required for initialization but the problem is things like the french dictionary become a 70MB file from a ~2MB raw dictionary file, so shipping that with the app or downloading that is not really feasible.
Parallelizing initialization so that for example 4 worker threads might work on their own chunk of the dictionary, this could be doable and would reduce init time by an amount proportional to the number of workers and actual cores available, but chunking the dictionary is not trivial because some words can be added as prefix/suffix to other words for example.
Ordering dictionaries by word usage, so like nspell could first load the first 10000 words and if mispelled words are found, requiring some suggestions, the next 10k words could be loaded, and then the next ones etc. until some suggestions are found. As a negative side effect this could reduce the number of suggestions provided, but maybe that would be a good enough trade off for some languages. But then again partially parsing dictionaries like that is not trivial.
Maybe memory usage could be reduced significantly still by not joining words together, like if instead of storing "buongiorno" nspell stored "buon" and "giorno" in an array or something I think v8 could allocate those strings only once in memory for the entire dictionary, saving a good deal of memory for languages with lots of computed words, but then everything internally that currently works on strings would need to be rewritten to work on these kind of disjoined strings, which wouldn't be easy probably.

wooorm commented 3 years ago

Folks, I think performance is an XY problem. There are two variables here: a) dictionaries, b) hunspell vs. nspell. Most dictionaries work fine with nspell. Some don’t. As mentioned before, Hunspell does okay on Italian. There is a bug in the code here which causes lots of words, which I don’t believe need to be added (but I don’t speak Italian and hence can’t help here), to be added. The solution is to look at what is going on in the code that shouldn’t be going on. Perhaps it’s depending on a feature that isn’t supported here?

fabiospampinato commented 3 years ago

@wooorm are you certain there's a bug generating unnecessary words? 16M+ words sounds like a lot to me but given how lots of words are spelled differently in italian depending on their gender or pluralization I can see how a much greater number of words could be generated for italian compared to english. I'll skim the generated list of words, maybe I can spot some seemingly problematic ones.

wooorm commented 3 years ago

I thought that that was the problem, because (excuse me if I'm off) roughly related languages like Spanish or so don't have this problem.

Dictionaries are written by authors who sometimes aren't that fluent in programming. Some include bugs. I'm under the impression that this one isn't written well 🤷‍♂️ and that maybe hunspell is capping certain ginormous expansions whereas nspell isn't

fabiospampinato commented 3 years ago

I think I have some new interesting findings to share:

Generating way fewer words:
- I've taken a closer look at the italian dictionary, and it generates about 30 million words, exceeding the ~16M limit significantly.
- The number of words generated by the italian dictionary explodes because italian has these silly prefixes that can be attached to words, for example in english one might say "that tree", in italian "that" can be translated as "quello"/"quella" when referring to a noun in the singular form, but it's common to truncate the trailing vowel and attach the remaining characters to the noun, so in italian one would translate that english bit with "quell'albero" rather than "quello albero". Another major issue is that words in italian, but I guess in all languages with a latin alphabet really, can start with an uppercase character if they are at the start of a sentence, that helps making the list of words generated by nspell explode because now in the list you'll have: albero, Albero, quell'albero, quell'Albero, Quell'albero and Quell'Albero.
- Potential fix number 1: I think it would be totally reasonable to consider those prefixes as their own words, which they are grammatically technically, and instead just having "quell'", "Quell'", "albero" and "Albero" in the dictionary.
- Potential fix number 2: Duplicating the number of words just for accounting for the case where "albero" is at the start of a sentence sounds super wasteful to me, there shouldn't be "albero" and "Albero" in the dictionary really, but just "albero", I think we can probably account for this dynamically when computing suggestions and checking for correctness.
- Potential fix number 3: "Quell'Albero" is not actually correct spelling because "albero" is not a name like "Fabio" or "Italy", unless somebody explicitly gives something the "Albero" name for some reason, so it shouldn't be present in the final list of words at all.
- I think all these potential fixes together could cut down on the number of words generated, and by consequence CPU and RAM required, by up to ~80%.
- French also has a similar issue and that's why its dictionary contains millions of words too.
Mind-blowing unbelievably amazing truly outstanding performance:
- Computing the list of words dynamically is where most of the CPU time is spent when loading a dictionary, and most of the RAM is spent just storing this huge list of words which contains largely redundant information (i.e. it compresses to a high degree), apparently it's possible to cut down on CPU time almost entirely by pre-parsing dictionaries, and cut down on RAM usage massively by storing the list of words in a fancy ~compressed data structure that doesn't need to be decompressed before being used, more details on how the data structure works here.
- With the optimization mentioned above I can get the English dictionary to load in ~2ms in my machine, rather than the current ~150ms, and, unbelievably, the gzipped precomputed list of words weighs less than the gzipped raw dictionary used to generate it in the first place, so it would take less time to download it too.
- With the same optimization I can get the French dictionary to load in ~20ms in my machine, rather than ~4s, milliseconds not seconds, and the ~compressed list of words weighs just 1.2MB, which doesn't require nearly as much RAM to keep in memory (currently instead we are at ~240MB for the french dictionary) which goes down to ~800kb when gzipped, and with the potential reduction on the number of words generated these figures would go down too.
- Also if for some reason a language legitimately requires 50 million words to be stored I think as a side effect of how the data structure required to generate this ~compressed list of words works we probably won't be hitting the hard-coded 16M keys limit anyway.
- What's the catch?
- Generating the ~compressed list of words is expensive, it took half an hour for the french dictionary on my machine, but that must be done only once.
- If the way lists of words are generated changes, i.e. if nspell gets better, the ~compressed lists of words should be calculated again.
- The ~compressed list of words can't be mutated dynamically, meaning adding and removing words from the dictionary must be implemented on top of that, but it should be doable I think.
- There aren't many libraries that I could find that could generate the "succinct", as they call them, data structure that we need, none of them widely used, so the available ones might be buggy and difficult to fix.
- It might be difficult to extend these libraries to work with arbitrary alphabets (i.e. they might not currently support storing accented characters for example, or vietnamese characters or cyrillic characters etc.).

I think first of all we should get those optimizations I mentioned in to cut down massively on the number of words generated, it would be great if @wooorm could take care of that as I'm not nearly as familiar with the library, that would bring down the time required to load even problematic dictionaries like the french or italian ones under 1s, which we can work with. Secondly I, or some other volunteer, will spend some time attempting to achieve that truly "final" mind-blowing optimization. Thirdly we will all share on Twitter our happiness for the outstanding performance achieved, especially in the face of the Rust-all-the-things freaks.

Thoughts?

nathanlesage commented 3 years ago

This sounds amazing! Thank you for digging into it!

If the complete (!) list of computed words (without affixes, if I understand you correctly? Just the words we can search for?) is THAT useable, we could ditch hunspell completely, and just write a small program that generates these compressed lists from hunspell dictionaries. We can leverage Github Actions for that.

Then we would just need mechanisms to load them and do searches in these dictionaries. Then users could download our generated dictionaries and need to load them using nspell.

With regard to the generating program -- if you tell me how you did it, I could volunteer to write such a generator in Rust, so a faster system programming language. We could then run it via push commits, to keep the lists up to date.

This would split up the work and we would be able to implement this faster? @wooorm could then provide guidance with regard to the algorithms.

What do you think?

fabiospampinato commented 3 years ago

without affixes, if I understand you correctly?

I think affixes are used during word generation, so affixes should already be taken into account when serializing parsed dictionaries, the affixes themselves are not included in the serialized dictionaries but they weigh next to nothing compared to the actual words generated (just 24kb zipped for the French one, I think the English one is 1kb or something) so they can just be downloaded too with no significant performance implications essentially.

If the complete (!) list of computed words is THAT useable

It's truly amazing, the magic data structure allows for fetching efficiently all possible ways to complete a word too, fetching known possible suffixes essentially, it might allow for fetching known prefixes too, I'm not sure.

we could ditch hunspell completely, and just write a small program that generates these compressed lists from hunspell dictionaries. We can leverage Github Actions for that.

We need something to generate the final list of words (nspell), something to pack that into the magic data structure (wizardry), something that can work directly with that packed data structure (dark wizardry), finally something implemented on top of that that allows for adding and removing words (a future version of nspell maybe, I guess it depends on which path @wooorm wants the project to take).

Then we would just need mechanisms to load them and do searches in these dictionaries.

That's all included in the magic library I tried earlier already.

With regard to the generating program -- if you tell me how you did it, I could volunteer to write such a generator in Rust, so a faster system programming language. We could then run it via push commits, to keep the lists up to date.

All the magic comes from this library: https://github.com/mckoss/dawg, computer science boys.

I wouldn't bother rewriting that in Rust honestly, it takes seconds to generate a packed dictionary for English, we can probably get it to generate packed dictionaries for any languages under 1m with some optimizations both to it and nspell. And with a rewrite subtle bugs may be introduced.

If you want to volunteer some time you should probably look into how to make that library work with arbitrary unicode words rather than just ASCII ones. Other than that there doesn't seem to be really much more work necessary I think.

This would split up the work and we would be able to implement this faster? @wooorm could then provide guidance with regard to the algorithms.

I don't currently have too much time to dedicate to this, but I absolutely want to get this done eventually because the resulting performance is incredible. So I don't know, maybe ideally @wooorm could update nspell to generate less words, @nathanlesage could patch the succinct DAWG implementation for our needs, and I could write the wrapper library that ties everything together (meaning patching nspell on writing something else)?

I would probably also want to look into using hunspell directly for words generation as well.

Also some words have some "codes" associated with them in nspell, I'm not too sure what's the deal with them, we probably shouldn't just throw them away, and we can't store them in the DAWG itself either. But that's a secondary problem we can figure out later.

nathanlesage commented 3 years ago

Aaaaah a tree (graph) — I've dealt with these during my last project, and as I will probably be using them in my PhD, I could have it count as work time porting that thing to a system language and splitting up the work into two libraries? But we'd probably use the TS DAWG library for nspell to use them as well.

In any way, while the Math behind that looks cool, it's actually not THAT difficult to implement (we could draw on a lot of arXiv papers as the algorithm is already there). This might give us the freedom to also add non-ascii support (ergo increasing the amount of graph starting points from 26 to, idk, 128 or so)

We just need to know how the data structure of the library looks so that we can adapt the output. (Btw the reason for me insisting on a non-js program is simply because it's easier to have a command line program running on a GitHub worker, it could lift work for us to do in the future. And, I don't like utilizing JavaScript for Computational-heavy tasks due to its runtime overhead, but maybe I'm just the stereotypical German in this respect)

What do you think of the following:

Workload command line application:

Hunspell dic/aff
    --> expand internally, resolving all words
    --> map Capitalized and non capitalized words
        and other optimizations you suggested
    --> build the tree
    --> save as tarball/zip (?)

Nspell:

Read the tarball
    --> read in the tree
    --> use for spelling correction immediately

fabiospampinato commented 3 years ago

@nathanlesage It's not just the graph, it's the combination of the graph and its compact string representation that doesn't really need to be decompressed that gives the amazing performance numbers I reported, if you didn't bother with the succinct string representation and instead required decompressing the dictionary and rebuilding the graph I don't think you will see nearly as much performance improvements, there's a good chance it may backfire even, ~2s of the ~3.3s that nspell currently takes to parse the french dictionary it's just adding words to a JS Map, I really don't think you can just do that 100x faster, you need to work with the ~compressed graph directly.

I think going with TS would be a better choice, mainly because we have a reference implementation already that looks well written well tested and well commented, we need a JS version at least for working with the string anyway because we need this to work in the browser, and a CLI wrapper for it can be made trivially.

But feel free to reimplement the packing part of it in Rust or whatever language you like, as long as we can get this thing to work I don't care if portions of the succinct DAWG library are written in Rust.

save as tarball/zip (?) Read the tarball

The plan looks good but we can't afford zipping the graph or any other form of string serialization that requires decompressing before using, the performance gains come from using the string representation directly.

fabiospampinato commented 3 years ago

Also an implementation of the DAWG that we can actually use in the browser would allow for storing custom words to add and remove from the dictionary in a memory efficient way potentially, so I don't really see any inherent advantages of using Rust other than potentially saving a few minutes once when generating these graphs.

nathanlesage commented 3 years ago

Ah so you were just talking about the string representation itself - apologies. I'll have a look at that library myself and report back!

fabiospampinato commented 3 years ago

Potentially a Rust implementation could be useful in the future to traverse the graph in parallel using SIMD instructions in WebAssembly, but like SIMD instructions aren't here yet, it would probably be difficult to implement that on top of the string representation, and checks for presence in the graph seem instant to me anyway so any optimization in that regard may be pointless.

fabiospampinato commented 3 years ago

hunspell's unmunch command can dump all known words from a dictionary, but I think it doesn't work with non-ASCII languages.

Interestingly it seems to generate roughly the same number of words as nspell, so nspell may reimplement enough of it to be able to generate the words list we need.

hunspell's interpretation of the italian dictionary also has the same issue I mentioned with the prefixes, it generates 36 million words, which together make a ~650MB file, which zipped goes down to ~90MB, it'd be interesting to see what the gzipped version of the packed DAWG for this words list is, but that would take hours if not days to compute on my machine, most probably it's going to be unreasonably large, those problematic prefixes need to be handled specially.

fabiospampinato commented 3 years ago

It just finished generating the succinct DAWG for the first 6 million words in the italian dictionary, it ~compressed 100MB into 60kb (99.94% compression ratio), which screamed "that's buggy for sure" to me, but then again opening the text file containing those 6 million words everything looks kind of the same to me, it's a mess of different prefixes that blow up the combinatorics, plus the DAWG library currently is case insensitive so it deduplicated probably something like 75% of those words anyway.

That's to say that that DAWG library looks really like a piece of magic, and I guess perhaps having lists of words in the hundreds of megabytes might not be that big a deal actually.

andreafeccomandi commented 3 years ago

Hi everyone, I'm following this issue with great interest because I would like to use the spellchecking functionality in pure javascript.

I'm the author of bibisco, a desktop app made in Electron.

Currently, I am implementing the spellchecking features using electron-spellchecker which however uses native modules, which from Electron version 9 are deprecated.

I can't use the spellchecker shipped with Electron, because it doesn't allow you to specify the language to use on Mac.

I find the @fabiospampinato's idea of having an exploded vocabulary of terms optimized using DAWG very interesting. The same idea is also implemented by the simple-spellchecker library without DAWG optimization.

However, this solution presents the problem of suggestions; how would they be implemented? In the simple-spellchecker library, a binary search with the damerau-levenshtein distance is used.

Some other notes:

I'm Italian, so I am available in case of linguistic doubts.
The hunspell's unmunch command works well also on UTF-8, but the command execution environment must support UTF-8. I was able to generate Russian and Czech vocabulary.
The hunspell's unmunch command has performance problems with some vocabularies; I was unable to generate Turkish vocabulary. Any suggestions?

fabiospampinato commented 3 years ago

However, this solution presents the problem of suggestions; how would they be implemented? In the simple-spellchecker library, a binary search with the damerau-levenshtein distance is used.

You could convert the succinct DAWG into whatever tree structure simple-spellchecker is using, so this issue doesn't really exist as you can map one problem onto the other and at least one of the two is solved in your mind.

In practice though it wouldn't make sense not to use the succinct DAWG directly as that's where the magic is. In the context of the DAWG searching for suffixes means walking down the graph listing all paths to all terminal nodes (fast), searching for prefixes means walking up the graph and listing all paths leading to the staring node (it could be made fast too but requires some preprocessing to kind of invert the graph), searching for spelling suggestions in general could mean exploring the space of strings at reasonable edit distances from the query string and checking for presence in the graph (i.e. walking down the graph in search for a terminal node associated with the edited string in question, which is fast enough but probably slower than binary search), but we could also look for suggestions potentially more efficiently by walking the graph directly as we know already which paths lead to valid words (I think this could potentially be faster than naively computing edit distances and binary search on the whole list).

The hunspell's unmunch command works well also on UTF-8, but the command execution environment must support UTF-8. I was able to generate Russian and Czech vocabulary.

There's an issue on hunspell's repo about implementing another command for dumping all words, I don't think unmunch per se is slow, it extracted half a gigabyte of words in seconds in my machine from the italian dictionary.

The hunspell's unmunch command has performance problems with some vocabularies; I was unable to generate Turkish vocabulary. Any suggestions?

You can attempt to extract the words via nspell or some other somewhat compatible library.

nathanlesage commented 3 years ago

searching for prefixes means walking up the graph and listing all paths leading to the staring node (it could be made fast too but requires some preprocessing to kind of invert the graph)

Actually, not a problem: the search algorithms on acyclic graphs work fast in both directions. I read a few things yesterday night and I think it's really fairly simple to generate and traverse such trees.

But we're still lacking a fast file format for these trees, as I still think it's wasted CPU time to recompile the tree everytime we load the app (plus the additional disk IO) -- even transistors have a limited lifetime and I'd like to efficientiate the hell out of this very, very good idea.

As for the mapping; pretty easy: as far as I know distance algorithms can give you some alternatives, and we can then traverse the tree yielding the necessary words using topological search.

An additional idea, which does not have any priority, but nevertheless: We could add a simple form of machine learning, as one instance will only be run on one computer by one user, so we could save wrongly typed words and their replacements in a hashmap, which can be offloaded and loaded using a simple function call, which can then speed up spell checking even more, if we, e.g. limit suggestions to 10 max or smth like that.

fabiospampinato commented 3 years ago

We could add a simple form of machine learning, as one instance will only be run on one computer by one user, so we could save wrongly typed words and their replacements in a hashmap, which can be offloaded and loaded using a simple function call, which can then speed up spell checking even more, if we, e.g. limit suggestions to 10 max or smth like that.

I'm not sure what you are suggesting here, sharing a single spell-checker instance across processes? It would be somewhat interesting but you can't do that in the browser so I personally have no interest in that. Treating commonly spelled/mispelled words specially for faster suggestions? Sounds like a good idea but the performance improvements may only be marginal (i.e. not like making a seconds thing a milliseconds one).

IMHO the next potential steps if somebody would like to strive for spell-checking perfection would be:

Adding another layer of checks in the form of some neural nets trained to understand grammar and potentially higher-level features too (words repeated too many times, plagiarism, good writing in general etc.), which would be too difficult to instead manually code, especially for languages we don't even understand.
Making plugins for common editors and saying "look, it's like Grammarly but you don't need to send us everything you type, what do you say?".
Having a high-accuracy model for detecting the language used, enabling reliable automatic multi-dictionary/multi-grammar checks.
Rebuilding all the dictionaries taking into account the available written corpus of text for each language, which means potentially supporting every language on the planet too.

That's a bit too far ahead of our current state though, let's get the thing to start up in reasonable times first 😄

nathanlesage commented 3 years ago

Adding another layer of checks in the form of some neural nets trained to understand grammar and potentially higher-level features too

You don't need neural nets for that. You can solve this much faster with trees as well.

Having a high-accuracy model for detecting the language used, enabling reliable automatic multi-dictionary/multi-grammar checks.

Nice but I think this overshoots by a wide margin. Plus this will increase the computational burden for many users by far. Neural nets are not end user playtoys, I'm afraid :/

I'm not sure what you are suggesting here, sharing a single spell-checker instance across processes? It would be somewhat interesting but you can't do that in the browser so I personally have no interest in that.

No, just aomething like a per-user cache. Would help make these suggestions the top-ones, and users may thank us for that. Plus, while I understand that you want to focus on your own needs, I would suggest we should keep this library open for many people. Plus, as Notable is Electron-based, you could make use of that yourself ;)

But you are right, we should first focus on making nspell as it is more usable!

fabiospampinato commented 3 years ago

You don't need neural nets for that. You can solve this much faster with trees as well.

You don't strictly need neural nets for anything, but how else is someone going to come up with those trees for russian, hebrew and tens of other languages? 🤔

Nice but I think this overshoots by a wide margin. Plus this will increase the computational burden for many users by far. Neural nets are not end user playtoys, I'm afraid :/

I think eventually that would become a viable strategy, if it actually isn't already in some cases, I'm currently using @wooorm franc for document-level language detection and I'd be surprised if one can't get a model of comparable size even just for a handful of languages (perhaps asking the user which languages he/she speaks or something) with much higher accuracy and comparable performance, plus eventually I'd imagine most laptops and smartphones will have some form of accelerators for doing those operations, maybe regular GPUs are already good enough for this.

Plus, while I understand that you want to focus on your own needs, I would suggest we should keep this library open for many people.

Sure, I didn't mean to imply that some ideas are inherently bad if I won't make use of them.

nathanlesage commented 3 years ago

You don't strictly need neural nets for anything, but how else is someone going to come up with those trees for russian, hebrew and tens of other languages? 🤔

Well, easy. You have a fixed set of characters that are possible, let's take a Japanese example:

こんにちは, こんばんは, and ここ can be represented in this graph:

こ --> ん --> に --> ち --> は --> [EOW]
|     |
|     +--> ば --> ん --> は --> [EOW]
|
+---> こ --> [EOW]

No need for any neural net here. You only need a list of all words ;)

plus eventually I'd imagine most laptops and smartphones will have some form of accelerators for doing those operations, maybe regular GPUs are already good enough for this.

No, the trend actually goes towards having neural nets on remote servers, and using smartphones more and more only as GUIs towards SaaS. There's no neural net (and no model) running locally on almost all computers. But there's a lot of hype as many services advertise using "Artificial Intelligence", but most of it is just hot air without any actual neural net behind the curtains. Don't let yourself be fooled by that ;D

fabiospampinato commented 3 years ago

Well, easy. You have a fixed set of characters that are possible, let's take a Japanese example:

We might be referring to two different things, the graph you made looks to me like what the DAWG for those 3 japanese words would look like (if that は character at the end of two different words got deduplicated too), but I don't see that kind of graph telling me anything about grammar, i.e. I really don't think you can make something that understands how sentences work on top of a graph representation of a vocabulary.

No, the trend actually goes towards having neural nets on remote servers, and using smartphones more and more only as GUIs towards SaaS. There's no neural net (and no model) running locally on almost all computers. But there's a lot of hype as many services advertise using "Artificial Intelligence", but most of it is just hot air without any actual neural net behind the curtains. Don't let yourself be fooled by that ;D

iPhones and iPads at the very least, of which there are probably hundreds of millions around, have been shipping with dedicated ML circuitry since a few years now, so if anything some devices are going to have ML accelerators because some of them have them already, whether most Android smartphones will follow suit or not I don't have a crystal ball to say for certain.

To get another data point I just tried a node module for google's cld (Compact Language Detector), node-cld, and it gives me a result using the sample snippet listed in the project's readme in about half a millisecond, I think that's something we can work with already. And my machine it's a 6yo laptop with, to the best of my knowledge, no dedicated circuitry for the kind of operations these models perform, I don't even have an external GPU, not that that model is probably running in the GPU anyway. And the entire model weighs ~6MB, for 160 languages, that's kind of a lot but let's not forget that to this day I'm still wasting ~200MB storing french words inefficiently in RAM.

So I'm pretty sure that can work already today for some devices without any particular trick, and eventually computers will get faster, will have more ram, probably dedicated ML accelerators too in my opinion, and we don't actually need a single model that can detect 160 languages, for which we don't even have dictionaries for.

But hey, I'm not arguing for you to add language detection on Zettlr, I'm just saying that sounds like the right path to take to me and not a particularly far fetched one 🤷‍♂️

nathanlesage commented 3 years ago

But hey, I'm not arguing for you to add language detection on Zettlr, I'm just saying that sounds like the right path to take to me and not a particularly far fetched one 🤷‍♂️

No no, I really like that, but I just wanted to highlight that neural nets should be treated with caution. It's due to my job, which is to look at neural nets, and I've seen so much bullshit going around 🙈

Because, for instance, "ML circutry" is simply an additional CPU or GPU, because apart from CPUs, GPUs, and Google's TPUs, there's nothing out there to compute on. It's just an advertising thing from Apple …

fabiospampinato commented 3 years ago

@nathanlesage Did you look into what it would take to have a succinct DAWG implementation that we can use for spell checking?

nathanlesage commented 3 years ago

Hey @fabiospampinato! No unfortunately not, I'm currently drowning in work :/ The only good thing is that I'm currently researching a lot on NLP techniques, among them said graphs, so while I can't tend to an implementation yet, I think this is a good way forward (to quote one of the papers, "Computer Science is basically all about trees").

Setting myself a reminder now for end of January and then I can have a look at it! :)

fabiospampinato commented 3 years ago

FWIW another issue with the italian dictionary is that it generates numbers textually up to 99k or something like that, so a big percentage of the words generated has to do with that.

With saner languages like English that's not that much of a problem because one may write "nine thousand five hundred and three" (maybe without "and"?), which is just made up of nice short atomic words, but in Italian one may write "novemilacinquecentotre", slamming everything together like a crazy person, which causes lots more words to be generated by the dictionary.

wooorm commented 3 years ago

🤌

To recap, is most of the problem storing those in memory, or generating them?

fabiospampinato commented 3 years ago

To recap, is most of the problem storing those in memory, or generating them?

Kind of both, one shouldn't generate millions of words at startup to begin with because that takes multiple seconds already, and one shouldn't store millions of words in memory because that takes hundreds of megabytes (if done naively anyway, but non-naive alternatives are much more computationally expensive).

Potentially the italian dictionary could be patched to remove those numeric words entirely, but it would still generate millions of words. To give a sense of perspective the italian dictionary currently produces a ~700MB file containing ~30 million words, with some post-processing I've shaved that down by 90% and it still contains ~3 million words, it's just not feasible to generate all of them at startup quickly with JS.

There's a reimplementation of Hunspell in Python called Spylls that's really well documented, and from reading some of its documentation I think I learned that Hunspell bypasses both problems kind of by cheating, it generates all the various variations of a stem on demand depending on which word is being spell-checked so it has no major startup or memory issues, and it just stops generating these words early after an hard-coded amounts of milliseconds to keep the process fast so it never spends more than 200ms or whatever on any single word, so it's basically fast by definition.

I'm personally currently pursing another approach, parsing all dictionaries at build time in a way that gives you 0 startup delay, extremely low memory consumption, really fast lookup performance, and you still account for ~all words, but it's pretty complicated and it's unclear how it should work with compound words because as I understand it dictionaries like the one for Finnish encode for potentially billions and billions of potential words that aren't actually all valid in Finnish but the dictionary treats them all as valid, and you can't possibly generate and encode them efficiently.

fabiospampinato commented 3 years ago

One ~final comment on this: I've written a custom JS spell checker for a commercial product based on the succinct trie approach mentioned above (using a succinct fsa might work even better), I can't open-source the spell checker itself but if somebody else would like to pursuit the same approach here's some words of encouragement for you:

Italian dictionary:
- For the Italian dictionary this new spell checker can now ~parse (it's not really "parsing", it's just deserializing a few simple structures) the dictionary in ~10ms, nspell crashes after processing the dictionary for many seconds. That's essentially a ~100% reduction in startup delay, unless you are using a similar approach whatever language your spell checker is written in this one blows it out of the water, in terms of startup performance at least.
- It consumes a grand total of ~9.5MB of memory, nspell crashes after consuming hundreds of megabytes of memory.
- It can lookup 7 million words (half hits and half misses) in ~16s, that's ~42 words per millisecond.
English dictionary:
- For the English dictionary this new spell checker can ~parse the dictionary in ~5ms, I think that took nspell ~150ms to do in my machine.
- It consumes 500KB of memory, nspell used something like ~13MB for it if I recall correctly.
- It can lookup ~170k words (half hits and half misses) in ~350ms, that's almost 500 words per millisecond.
- It correctly provides suggestions for 85% of the common misspelling cases listed here in ~1.3s (the remaining 15% it's either no suggestions are provided at all or not all the ones that Wikipedia mentions are provided) (and it's kind of impossible to get 100% on that because some examples are in the form of [correct word] -> [other correct word] which is undetectable with regular word-based spell checkers), where I think nspell took something like ~35s to pass ~83% (I don't recall the exact number now) of those tests on my machine. Having words organized in a Trie or an FSA rather than a hash table just allows you to explore nearby words much more efficiently.

Monkatraz commented 3 years ago

I've made an NPM package that is a near complete port of Hunspell, in pure TS. It passes the relevant Hunspell tests. It's named Espells, and you can find the repository here.

With no ill meaning whatsoever, I think this package can be effectively replaced with Espells in any usage. In particular, it handles this Italian dictionary (and others) without issue. This issue in fact is what drove me to make Espells.

@wooorm If you don't mind my continued forwardness, could I get some free advertising in, say, the dictionary package readmes? I'd love to get the word out.

wooorm commented 3 years ago

This looks like a clear license violation, both the python "port", and espells?

Monkatraz commented 3 years ago

This looks like a clear license violation, both the python "port", and espells?

I should be careful with my wording, I think. I use "port" because it's the best way I can think of to describe it. As far as I can tell, and how Spylls is demonstrated, was that it was reverse engineered from Hunspell's documentation and tests. After completing all of the main code, I finally took a look at some Hunspell source code and it looks very different, so I feel confident in this.

If Spylls is in violation, so is Espells, but if that happens I'll change the license.

EDIT: That isn't to say that there would have to be some legalistic examination - honestly, if anyone is uncomfortable, I'll just switch the license to the MPL 2.0, like Hunspell. The MIT license I chose was just force of habit.

wooorm commented 3 years ago

I think both are in violation:

Spylls is explained as a “explanatory rewrite”, this sounds like carefully reading through the code and then implementing it either the same of a more clear way, and in the posts the Spyll author also explicitly mentions reading through the source and what they found there. That means it’s not 100% a new body of work. Which means it has to follow the MPL 1.1/GPL 2.0/LGPL 2.1 with copyright statement of Hunspell and add to it. So the first issue is there, and it’s a big one: spyll uses MIT, which is very permissive, whereas the author of Hunspell chose a strong license, to specifically prohibit such permissiveness. On top of that, it doesn’t acknowledge the copyright that the Hunspell author(s) have over their code
As Espell is more clearly a port, it should at least have the copyright statement of Spylls.

But I think the Spylls violation is a big one.

honestly, if anyone is uncomfortable, I'll just switch the license to the MPL 2.0, like Hunspell. The MIT license I chose was just force of habit.

Every open source maintainer should be uncomfortable. I understand it happens due to force of habit but it’s still wrong. People’s hard work is stolen. Their few wishes aren’t honored. Their work is not acknowledged. That’s the tiny thing a license does: ask for an acknowledgement and a few small wishes. Hunspell is seriously underfunded: the solution is to fund them.

The goal of nspell is to be an explicit MIT work, at the cost of being far from perfect.

Monkatraz commented 3 years ago

As Espell is more clearly a port, it should at least have the copyright statement of Spylls.

Yeah, meant to do that. Housekeeping with the package, sorry.

People’s hard work is stolen. Their few wishes aren’t honored. Their work is not acknowledged. That’s the tiny thing a license does: ask for an acknowledgement and a few small wishes. Hunspell is seriously underfunded: the solution is to fund them.

I understand this, and I will switch to MPL 2.0, as I should've. However, I will be a bit... blunt. Hunspell's source is nigh-unreadable and difficult to contribute to. Commits haven't been made in years. Hunspell is good software, but it's not good for us software. It needs replacement, but very little has come close to parity with it, except Spylls and Espells. Lastly, I will state that I do not wish for my intentions to be read as more extreme than they are, my error was simply an absent-minded mistake.

The goal of nspell is to be an explicit MIT work, at the cost of being far from perfect.

I understand, good luck.

wooorm commented 3 years ago

I understand your good intentions and see them separately from the “absent-minded mistake”
Forking and rewriting is easy compared to maintaining a project for 10 years, in my experience, I remain of the opinion that a proper solution is to fund Hunspell
I don‘t think “not good software” is a good reason for porting without acknowledgement and under different rules
I would appreciate it if you could perhaps ask Spylls to fix their license
Once the license situation is solved, I’m fine linking from nspell to your alternative!

Monkatraz commented 3 years ago

I would appreciate it if you could perhaps ask Spylls to fix their license

I will make an issue on the repository.

Once the license situation is solved, I’m fine linking from nspell to your alternative!

Awesome!

I don‘t think “not good software” is a good reason for porting without acknowledgement and under different rules

Apologies if it came across like that, I agree with you.

zverok commented 3 years ago

(Spylls author here)

@wooorm Thanks for raising the concern. (And thanks @Monkatraz for bringing the problem to my attention.)

It is an important issue, and it just slipped my mind completely (I thought about it several times while working on the project in private, "dude, don't forget the licensing thing!"... and then, when I finished, I was so excited that I just forgot.)

It is my bad, and I'll be fixing it to MPL (and include an extended notice about Hunspell authors copyright) next week; on the road currently.

No bad intention was here, just underthinking of the important issue (not a really good excuse, but all I have).

wooorm / nspell

Not playing well with 'dictionary-it' (2) #11