xxyzz / WordDumb

A calibre plugin that generates Kindle Word Wise and X-Ray files for KFX, AZW3, MOBI and EPUB eBooks.
https://xxyzz.github.io/WordDumb/
GNU General Public License v3.0

Fix word wise for stressed Russian epubs #192

Closed · Vuizur closed this 1 month ago

Vuizur commented 7 months ago

I fixed the code for stressed epubs by using, only in this special case, two spaCy docs: one containing the text and one for lemmatization/POS detection. I tested it on one Russian and one non-Russian book so far and it seemed to work. Should I add the same for Kindle? (I just can't test it myself.)
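
The patch itself uses two spaCy `Doc` objects; here is a minimal, spaCy-free sketch of the alignment idea (the strings and the `strip_stress` helper are made up for illustration). Stripping only the combining stress marks never changes the token count, so token *i* in the analysis text lines up with token *i* in the original:

```python
# Combining acute (U+0301) and grave (U+0300) accents used as stress marks.
STRESS_MARKS = {"\u0301", "\u0300"}

def strip_stress(text: str) -> str:
    """Drop combining stress marks; leave every other character intact."""
    return "".join(ch for ch in text if ch not in STRESS_MARKS)

stressed = "Он посмотре\u0301л на неё"
plain = strip_stress(stressed)

# Token counts (and order) match, so analysis results computed on `plain`
# can be mapped back to `stressed` by token index.
assert len(plain.split()) == len(stressed.split())
```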

xxyzz commented 7 months ago

Does only one character need to be removed? It could be added here: https://github.com/xxyzz/WordDumb/blob/460db47886fc1f6fe36e2329c659b3ee54a837a5/epub.py#L137-L144

Vuizur commented 7 months ago

Removing that one character works fine for books created by my program. For general use, one should maybe use a more sophisticated remove_accents function like the one in Proficiency, which can also remove grave accents.
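
For reference, a remove_accents-style helper (a sketch, not Proficiency's actual code) can NFD-decompose first so it also catches precomposed characters, then drop both acute and grave marks:

```python
import unicodedata

def remove_accents(text: str) -> str:
    # NFD-decompose so stress marks become separate combining characters,
    # drop combining acute (U+0301) and grave (U+0300), then recompose.
    # Note: this would also strip accents from e.g. French "é", so it is
    # only safe to apply to text already identified as stressed Russian.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in ("\u0301", "\u0300"))
    return unicodedata.normalize("NFC", stripped)

assert remove_accents("слу\u0301чай") == "случай"  # acute stress mark
assert remove_accents("где\u0300") == "где"        # grave stress mark
```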

I don't 100 percent understand the code, but removing it from the place you suggested will also remove the character from the output epub, right? I would want to keep it, so that in the end it looks like the attached screenshot. Currently spaCy can't perform lemmatization and POS analysis on stressed text.

xxyzz commented 7 months ago

Yes, that'll change the book text; I forgot you want to keep the stress marker...

This issue should be fixed in spaCy's Russian lemmatizer. Changing the text with str.replace breaks the word locations and would make the footnotes get added in the wrong places.
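
A tiny example of the offset problem (illustrative strings only): deleting a combining character shifts every later character offset, so a footnote anchored at a position computed on the original text lands one character off after the replace.

```python
text = "Он уже\u0301 ушёл"       # stressed Russian text
word_start = text.index("ушёл")  # footnote anchor computed on the original

cleaned = text.replace("\u0301", "")
# The deleted combining mark shifts all following offsets left by one,
# so the stored anchor no longer points at the start of the word.
assert cleaned.index("ушёл") == word_start - 1
```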

Vuizur commented 7 months ago

There is an issue on the spaCy repo related to fixing the lemmatizer: https://github.com/explosion/spaCy/issues/12530. It seems terribly complicated. I think my workaround works fine: it uses the token index positions, because these don't change between stressed and unstressed texts, so the alignment is kept.

xxyzz commented 7 months ago

I think it's better to wait for spaCy's PR, sorry. This patch runs the model pipelines a second time on words that have stress marks...

Vuizur commented 7 months ago

Only on Russian words with stress marks. So it doesn't affect any non-Russian books, and no normal Russian books either. It probably won't even affect normal Russian books that contain French citations like "À mauvais ouvrier point de bon outil.", because it only detects combining accent marks. So the performance impact should be negligible in all cases except stressed Russian books, where the program is currently broken.
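
That detection claim is easy to check: a precomposed "À" (U+00C0) is a single code point with no combining mark, so scanning for the combining accents alone leaves ordinary NFC French text on the fast path. A sketch of the check (not the PR's exact code):

```python
def has_combining_stress(text: str) -> bool:
    # Look only for combining acute (U+0301) and grave (U+0300) marks.
    return "\u0301" in text or "\u0300" in text

# Precomposed "À" (U+00C0) contains no combining character...
assert not has_combining_stress("\u00c0 mauvais ouvrier point de bon outil.")
# ...while a stressed Russian word does.
assert has_combining_stress("посмотре\u0301л")
```

If a book's text happened to be NFD-normalized, the French accents would decompose into combining marks and trigger the slow path too, but NFC is the norm in ebooks.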

The problem with the spaCy PR is that the original one has been sitting for almost a year, and it only fixes lemmatization, not POS detection. To fix POS detection we would apparently have to host our own unstressed_core_news_* models and implement a custom language, which would probably result in more convoluted code changes than this PR.

xxyzz commented 7 months ago

Doesn't the English Wiktionary have forms with stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.

Vuizur commented 7 months ago

> Doesn't the English Wiktionary have forms that have stress marks? Have you tried disable the "Use POS type" feature, stressed forms should be matched if they are in the Word Wise db.

True, this works.

xxyzz commented 7 months ago

Another way to fix this is to add the Word Wise notes first and then add the stress marks...

Vuizur commented 7 months ago

> Another way to fix this is add word wise notes first then add stress marks...

It is possible, although implementing the processing of word wise files would be hard. 🗿

xxyzz commented 1 month ago

I think this is fixed in the latest release, all enabled forms can be matched.

Vuizur commented 1 month ago

I think it should be, thanks!
