xxyzz / WordDumb

A calibre plugin that generates Kindle Word Wise and X-Ray files for KFX, AZW3, MOBI and EPUB eBooks.
https://xxyzz.github.io/WordDumb/
GNU General Public License v3.0

Using spacy for POS detection when creating word wise epub #71

Closed · Vuizur closed this issue 1 year ago

Vuizur commented 2 years ago

First of all, thank you for all your hard work on this extension! It is a really impressive and cool feature.

If I understand it correctly, you currently use spaCy for named entity recognition for X-Ray. I think it would also be cool if you could use spaCy's POS detection to get the correct translation of a word for Word Wise. This is especially useful for languages like Spanish, where verbs and nouns are often written the same. I tested it a bit, and spaCy is pretty good at telling them apart, so this could improve the accuracy of the translations quite a bit.
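To illustrate the idea, here is a minimal sketch (not the plugin's code; the small Spanish model and the example sentences are just assumptions I picked for the demo):

```python
# Minimal sketch: use spaCy's POS tags to tell apart Spanish words that are
# spelled the same as a noun and as a verb (e.g. "vino" = wine vs. "vino" = came).
# Assumes es_core_news_sm is installed: python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")

for sentence in ("Bebió un vaso de vino.", "Vino a casa muy tarde."):
    for token in nlp(sentence):
        if token.text.lower() == "vino":
            # Expected (not guaranteed): NOUN/"vino" in the first sentence and
            # VERB/"venir" in the second, so the gloss could be chosen accordingly.
            print(sentence, "->", token.pos_, token.lemma_)
```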

xxyzz commented 2 years ago

The verb and noun glosses of the same word won't be much different, and applying the spaCy pipeline will also make the program slower (the current code creates the Word Wise file instantly). The benefits of using POS probably aren't worth the effort and the loss of speed.

I see you also use files from kaikki.org in your project; maybe you'll be interested in the pull requests I created for wiktextract and wikitextprocessor. These pull requests add support for parsing non-English Wiktionary dump files, which would provide the non-English Word Wise glosses that many users have requested.

Maybe POS will be used in the future, but currently parsing non-English Wiktionaries has higher priority.

Vuizur commented 2 years ago

> The verb and noun glosses of the same word won't be much different, and applying the spaCy pipeline will also make the program slower (the current code creates the Word Wise file instantly). The benefits of using POS probably aren't worth the effort and the loss of speed.

I think for Spanish it would be super beneficial. I looked at a random page on my tablet and think it could have fixed about 9 mistranslations. For other languages the improvement might be a lot smaller. But I can see that this might take a lot of effort, so of course one has to prioritise.

> I see you also use files from kaikki.org in your project; maybe you'll be interested in the pull requests I created for wiktextract and wikitextprocessor. These pull requests add support for parsing non-English Wiktionary dump files, which would provide the non-English Word Wise glosses that many users have requested.

Totally. I was always too afraid to try to edit this code, so it is really great that you are opening the path for more Wiktionaries to be parsed in the future. 👍 I wrote a program that reads the 40 GB HTML dump of the Russian Wiktionary only because I doubted too much that I could get an adapted wiktextract to run properly 😁.

xxyzz commented 1 year ago

Hi @Vuizur, I have added this feature to the master branch. Please download the zip file from GitHub Actions to test it: https://github.com/xxyzz/WordDumb/actions/runs/4081858346

The new POS feature can be enabled in the plugin configuration window.

Vuizur commented 1 year ago

This is amazing, I am really excited about this feature. 👍

It seems like there is still a small bug and some words get skipped in the output file (screenshots attached).

(Windows 11, Calibre 6.11.0)

Vuizur commented 1 year ago

I am not exactly sure if this is the reason, but when I have both an EPUB and a converted MOBI file and WordDumb asks me to select a format (in this case I clicked on MOBI), I get the following error message:

Starting job: Generating Word Wise and X-Ray for El Dragón Renacido 
Job: "Generating Word Wise and X-Ray for El Dragón Renacido" failed with error: 
Traceback (most recent call last):
  File "calibre\gui2\threaded_jobs.py", line 82, in start_work
  File "calibre_plugins.worddumb.parse_job", line 127, in do_job
  File "calibre_plugins.worddumb.deps", line 157, in download_word_wise_file
TypeError: spacy_model_name() missing 1 required positional argument: 'prefs'

Called with args: ((6, 'MOBI', 'C:\\Users\\hanne\\Calibre Library\\Robert Jordan\\El Dragon Renacido (6)\\El Dragon Renacido - Robert Jordan.mobi', <calibre.ebooks.metadata.book.base.Metadata object at 0x000002C076625120>, {'spacy': 'es_core_news_', 'wiki': 'es', 'kaikki': 'Spanish', 'gloss': False, 'has_trf': False}), True, True) {'notifications': <queue.Queue object at 0x000002C076625420>, 'abort': <threading.Event object at 0x000002C076625CF0>, 'log': <calibre.utils.logging.GUILog object at 0x000002C076625DB0>} 

Edit: I think this only occurs with the new feature enabled.

xxyzz commented 1 year ago

https://github.com/xxyzz/WordDumb/commit/bc32c7a533054521f246baf302d10134a5d64d2f fixes the error.

There are two cases where a word doesn't get a Word Wise gloss:

  • The customized lemmas table doesn't have that word or the same POS type

  • spaCy doesn't lemmatize the word

Vuizur commented 1 year ago

Thanks a lot for fixing the error!

> There are two cases where a word doesn't get a Word Wise gloss:
>
> • The customized lemmas table doesn't have that word or the same POS type
>
> • spaCy doesn't lemmatize the word

I don't mean the Word Wise information; the entire word seems to be missing. In the picture it is, for example, the third word (dejó).

xxyzz commented 1 year ago

Oh, I didn't notice that. https://github.com/xxyzz/WordDumb/commit/723634a2535c1ebbdb75255171270e5f3a9aa508 should fix this bug.

Vuizur commented 1 year ago

Awesome, the sentences are complete now. And the new feature fixed two errors in just the first sentence of my Spanish example book 😁; the glosses are super good now.

The lemmatization currently seems to be a bit broken by the new feature: inflected words don't get definitions.

Before and after screenshots attached.

I checked with the medium Spanish model; it seems to get the POS and lemmatization right (for example for dejó or ojos). All the words that are missing are inflected. For some reason the word pero also gets no gloss: spaCy classifies it as CCONJ, which gets mapped to conj by your code, so I have no idea why it should not work. (There is also a noun with the same string in the kaikki data, but the disambiguation works in the other cases. 🤔)

xxyzz commented 1 year ago

Could you upload the book in your screenshot?

Vuizur commented 1 year ago

Novelas y fantasias - Roberto Payro_comparison.zip

This is another (copyright free) Spanish book, but it shows the same behaviour.

Vuizur commented 1 year ago

I just tested some other languages. I think the spaCy POS approach is also good for languages where the flashtext algorithm is currently broken due to special characters (German, and probably all Cyrillic languages, I think). In those cases flashtext also matches substrings; for example, when the text contains the word "really" it would match the keyword "real".
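Here is a hedged sketch of the boundary problem I mean (not WordDumb's code; the exact behaviour may depend on the flashtext version):

```python
# flashtext's default "non-word boundary" set only contains ASCII letters, digits
# and "_", so Cyrillic (or umlaut) characters are treated as word boundaries and
# keywords can match inside longer words.
import string
from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword("реал")  # pretend this is a lemma from the Word Wise data

# This likely reports "реал" even though the text only contains the longer
# word "реально".
print(kp.extract_keywords("это реально работает", span_info=True))

# Possible workaround: declare the Cyrillic alphabet as word characters too.
kp.set_non_word_boundaries(
    set(string.ascii_letters + string.digits + "_")
    | set("абвгдежзийклмнопрстуфхцчшщъыьэюяёАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯЁ")
)
print(kp.extract_keywords("это реально работает", span_info=True))  # expected: []
```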

xxyzz commented 1 year ago

Ah, I forgot to use the lemma form; https://github.com/xxyzz/WordDumb/commit/a483fe5bd9c2c49b3ccb207b52b16bf01b18279e should fix this bug.

flashtext is pretty much abandonware, but I can't find an alternative library. Maybe I can enable the POS feature by default and get rid of this dependency. But spaCy is slower than flashtext, and spaCy's lemmatizer is not that accurate.

Vuizur commented 1 year ago

Great, it is fixed now. 👍 You are right that for languages like Spanish the POS version finds somewhat fewer definitions than the original flashtext version. This is mostly due to spaCy/Wiktionary disagreements (sometimes spaCy says something is AUX and Wiktionary says it is a verb) and spaCy's rule-based lemmatizer, which is bad for irregular verbs, for example.

In the original flashtext version you used the words plus inflections from the Kaikki data? The big advantage of this approach, as I understand it, is that in the end it could also support basically any language, not only the ones supported by spaCy. In the PR I linked there seem to be workarounds so that non-Latin languages are better supported, though nobody has benchmarked them.

I also think it might be a good idea to create a general-purpose library that performs the cleaning of the Kaikki inflections (or maybe to contribute it to wiktextract), which we both could contribute to, because it is potentially useful to many people (me among them) and requires language-specific knowledge.

xxyzz commented 1 year ago

Yeah, I was using flashtext and pyahocorasick (for Chinese, Korean and Japanese) before. The Kindle English inflected forms are created with LemmInflect, and for EPUB books the inflected forms come from kaikki.org's Wiktionary JSON file. This also guarantees that all known forms in the book will be found.
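Roughly, the two sources look like this (a hedged sketch, not the actual plugin code; the kaikki file path is a placeholder and the field names follow the public JSON Lines format):

```python
import json
from lemminflect import getAllInflections

# English (Kindle): LemmInflect enumerates inflected forms of a lemma, keyed by Penn tag.
english_forms = set()
for tag_forms in getAllInflections("read").values():
    english_forms.update(tag_forms)  # e.g. "read", "reads", "reading"

# Other languages (EPUB): kaikki.org's JSON Lines export keeps inflections in each
# entry's "forms" list as {"form": ..., "tags": [...]}.
def kaikki_forms(jsonl_path: str, lemma: str) -> set[str]:
    forms = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("word") == lemma:
                forms.update(x["form"] for x in entry.get("forms", []))
    return forms
```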

I'm not sure what you mean by a "library cleaning kaikki inflections data". The forms data in kaikki.org's JSON file doesn't need much cleaning; I don't change this data much.

Vuizur commented 1 year ago

> I'm not sure what you mean by a "library cleaning kaikki inflections data". The forms data in kaikki.org's JSON file doesn't need much cleaning; I don't change this data much.

I think there is quite a bit of stuff that comes together:

* Removing meta info tags (inflection table info/...)

* Removing inflections that actually aren't inflections (for example, paired verbs in Russian, or auxiliary verbs that are also sometimes in the table)

* Removing stress marks in Cyrillic languages (I think this, in addition to the flashtext bugs, currently causes weird behaviour for Cyrillic languages; there is a simplemma library that was also bitten by this, I think). Latin also has a similar problem

* Removing articles (relevant for German, I think)

xxyzz commented 1 year ago

I just realized what the issue you linked before really means... So flashtext is only useful for English. I think I could just use spaCy's PhraseMatcher in place of flashtext and pyahocorasick for all languages, by adding inflected forms to the matcher and matching the text attribute instead of the lemma attribute.
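Something along these lines (a rough sketch of the idea, not the actual plugin code; the forms and the lemma mapping are made up):

```python
# Add every known inflected form as a pattern and match on the token text
# (attr="LOWER") instead of relying on spaCy's lemmatizer.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("es")  # a blank pipeline is enough when POS isn't needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

lemma_of = {"dejó": "dejar", "dejaron": "dejar", "ojos": "ojo"}  # illustrative
matcher.add("WORD_WISE", [nlp.make_doc(form) for form in lemma_of])

doc = nlp("Dejó el libro y cerró los ojos.")
for _, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, "->", lemma_of[span.text.lower()])
```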

As for improving kaikki.org's data, that would ideally require changing the wiktextract code and editing Wiktionary pages, IIUC. Unfortunately I don't understand German or Russian, so I can't tell whether the inflections are correct for those languages. Wiktionary editors use bots to make changes like these a lot; I bet they already have bots that can remove duplicated words.

Vuizur commented 1 year ago

> As for improving kaikki.org's data, that would ideally require changing the wiktextract code and editing Wiktionary pages, IIUC. Unfortunately I don't understand German or Russian, so I can't tell whether the inflections are correct for those languages. Wiktionary editors use bots to make changes like these a lot; I bet they already have bots that can remove duplicated words.

True, duplicates are maybe not the best example; they only occur if you throw away the forms' tags. But in general the fixes I refer to are not mistakes in Wiktionary or wiktextract, because, for example, the stressed inflections for Russian are super useful in general. They only become problems if you try to use the Kaikki data directly for lemmatization, or to look up a word from its inflected form as it would appear in normal text. So it doesn't directly fit into the wiktextract core code.

xxyzz commented 1 year ago

Just to make sure I understand the problems you posted, I'll use this https://kaikki.org/dictionary/All%20languages%20combined/meaning/%D1%87/%D1%87%D0%B8/%D1%87%D0%B8%D1%82%D0%B0%D1%82%D1%8C.html page as an example:

> * Removing meta info tags (inflection table info/...)

"ru-noun-table", "hard-stem", "accent-c", "1a imperfective transitive" should be removed from forms. This should be fixed in wikiextarct code. But won't affect Word Wise because book texts usually don't have these words.

> * Removing inflections that actually aren't inflections (for example, paired verbs in Russian, or auxiliary verbs that are also sometimes in the table)

You mean "бу́ду" and its forms should be removed? But having this word won't affect Word Wise.

> * Removing stress marks in Cyrillic languages (I think this, in addition to the flashtext bugs, currently causes weird behaviour for Cyrillic languages; there is a simplemma library that was also bitten by this, I think). Latin also has a similar problem

Remove the stress mark so "чита́ть" becomes "читать", right? But book texts use "чита́ть", not "читать". And this won't be a problem for spaCy's PhraseMatcher. I should add the stress mark to kaikki's lemma form instead.

> * Removing articles (relevant for German, I think)

https://kaikki.org/dictionary/German/meaning/l/le/lesen.html

In this case, ignore forms that start with "haben"? Again, having this word doesn't matter; the form after it can still be matched.

Vuizur commented 1 year ago

"ru-noun-table", "hard-stem", "accent-c", "1a imperfective transitive" should be removed from forms. This should be fixed in wikiextarct code. But won't affect Word Wise because book texts usually don't have these words.

True, although it is theoretically conceivable that exceptions exist; I can't name any right now.

> You mean "бу́ду" and its forms should be removed? But having this word won't affect Word Wise.

In Russian, there are paired verbs that are written very similarly and differ in their "aspect" (a grammatical concept). These paired verbs are in the forms list. However, this causes, for example, the words for "try" and "torture" to be listed as inflections of each other, which we don't really want. For German this causes "haben" to be an inflection of almost everything (because it is listed as an auxiliary verb in the inflections).

> Remove the stress mark so "чита́ть" becomes "читать", right? But book texts use "чита́ть", not "читать". And this won't be a problem for spaCy's PhraseMatcher.

In books, stress marks are not written; they appear only in books for learners, and those are pretty hard to find. For Russian you have a point though, because in theory every learner should use my program to get them back 😁. For other Cyrillic languages that have unpredictable stress, those stressed inflections will cause the words not to be found. The same applies to Latin.

> I should add the stress mark to kaikki's lemma form instead.

This is actually a good point. I was also working a bit on dictionary creation, and concluded that ideally the form displayed to the user should be the inflection tagged as "canonical". It does have the drawback that some of these canonical forms can still be buggy and contain something like " f m". So for smaller languages where the data hasn't been looked over, this might lead to small mistakes, but in general it is a good idea.

> https://kaikki.org/dictionary/German/meaning/l/le/lesen.html
>
> In this case, ignore forms that start with "haben"? Again, having this word doesn't matter; the form after it can still be matched.

I think in my last tests there were some German nouns where the only inflection listed was "den Zauberern", but not "Zauberern" (don't nail me on this example, but it was something similar). As far as I understand it, this would cause the word not to be found if the article isn't used in the text.

I already wrote some untested code; I will link it here when it is OK-ish.
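Roughly, I have something like this in mind (an illustrative sketch only; the tag names and the article/auxiliary prefix list are assumptions, not a worked-out rule set):

```python
# Clean a kaikki.org "forms" list: drop table meta rows and strip leading
# articles/auxiliaries so the bare inflection can be matched in book text.
META_TAGS = {"table-tags", "inflection-template"}  # assumed markers of meta rows
PREFIXES = ("haben ", "sein ", "der ", "die ", "das ", "den ", "dem ", "des ")

def clean_forms(forms: list[dict]) -> list[str]:
    cleaned = []
    for form in forms:
        text = form.get("form", "")
        tags = set(form.get("tags", []))
        if not text or tags & META_TAGS:
            continue  # e.g. rows like "ru-noun-table"
        for prefix in PREFIXES:
            if text.startswith(prefix):
                text = text[len(prefix):]  # keep "Zauberern" from "den Zauberern"
                break
        cleaned.append(text)
    return cleaned
```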

xxyzz commented 1 year ago

I was completely wrong about the stress... I read the Russian example sentences on the English Wiktionary and thought normal Russian text also has those marks. I should have read the Stress Wikipedia page more carefully. So for Russian, Belarusian and Ukrainian, both the stressed and the unstressed forms should be added to the PhraseMatcher. But I guess your "add-stress-to-epub" library can't "un-stress" words.

Russian paired verbs seem complicated and require knowledge of the language, so I'll temporarily ignore that. And for "haben" or "den" in German forms, I can probably get away with it if most words have the ideal inflection form without them.

The unstressed forms will be addressed first; this seems more important than the other issues.
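Something like this should be enough to generate the unstressed variants (a small sketch, assuming the stress is encoded as the combining acute accent U+0301, as on Wiktionary/kaikki.org):

```python
import unicodedata

def unstress(word: str) -> str:
    # Decompose, drop the combining acute accent, and recompose.
    decomposed = unicodedata.normalize("NFD", word)
    without_accent = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", without_accent)

print(unstress("чита́ть"))  # читать
```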

Vuizur commented 1 year ago

I wrote a bit of not-very-well-tested code here (the API might change): https://github.com/Vuizur/wiktextract-lemmatization The idea is that you can pass a forms array in and get a fixed one back, so it should be easy to integrate into existing code.

xxyzz commented 1 year ago

Cool! I could use your code in the Proficiency repo for the Russian, Belarusian and Ukrainian languages.

xxyzz commented 1 year ago

I included your code in the v0.5.4dev Proficiency pre-release; it's currently used by the WordDumb code in the master branch. I also use spaCy's PhraseMatcher to find Word Wise words even when "Use POS type" is disabled. This adds a few seconds to load the PhraseMatcher object but should give better results than flashtext for Russian and German.

Vuizur commented 1 year ago

Nice, the word detection now works flawlessly for Russian. 👍