xxyzz / WordDumb

A calibre plugin that generates Kindle Word Wise and X-Ray files for KFX, AZW3, MOBI and EPUB eBooks.
https://xxyzz.github.io/WordDumb/
GNU General Public License v3.0

WordDumb Wrong Translation Issue #134

Open Dorisking opened 1 year ago

Dorisking commented 1 year ago

Hi guys, I'm trying to use WordDumb while reading Harry Potter, but I get a lot of wrong meanings for words. For example, it explains "drills" as a type of strong cotton cloth instead of a hand tool, power tool, or machine with a rotating cutting tip used for making holes. It sometimes chooses a very rare and useless meaning of a word. How can I adjust the translation settings? Is it related to the dictionary on the Kindle?

xxyzz commented 1 year ago

You could tap the "other meanings" button to select the correct definition. This plugin only matches each word with one definition; you could also change the default meaning in the plugin's "Customize Kindle Word Wise" window.

xxyzz commented 1 year ago

Maybe I could train a machine learning model to match a word to its gloss, and also match person names or locations to Wikipedia summaries or Wikidata items.

adriangc13 commented 1 year ago

The same thing happens to me: all the words it explains are wrong. In this picture, for example, we see that it's explaining the word 'work', but when I select it we see that the word it's actually explaining is 'envoy'.

[screenshot: photo-2023-07-06-18-16-23]

xxyzz commented 1 year ago

The same thing happens to me: all the words it explains are wrong. In this picture, for example, we see that it's explaining the word 'work', but when I select it we see that the word it's actually explaining is 'envoy'.

That's different from what this issue's author described. This issue is about a word's other gloss being shown by default, which is unsolvable at the moment.

The problem you mentioned could be caused by one of these: the book on the Kindle was changed somehow (e.g., sent via email), or the Word Wise file was created for a different language.

adriangc13 commented 1 year ago

The same thing happens to me: all the words it explains are wrong. In this picture, for example, we see that it's explaining the word 'work', but when I select it we see that the word it's actually explaining is 'envoy'.

That's different from what this issue's author described. This issue is about a word's other gloss being shown by default, which is unsolvable at the moment.

The problem you mentioned could be caused by one of these: the book on the Kindle was changed somehow (e.g., sent via email), or the Word Wise file was created for a different language.

I always do the following: add the book, convert it to KFX, check the metadata to make sure the language is correct, send it to my Kindle via cable, and finally hit the WordDumb button. I don't know what else to check.

xxyzz commented 1 year ago

I don't know what else to check.

Maybe you enabled the "Use Wiktionary definition" option but forgot to change the Word Wise gloss language to Chinese on the Kindle, see https://xxyzz.github.io/WordDumb/usage.html#create-files

Please create a new issue or discussion if the problem still exists, since it's not the same as this GitHub issue.

adriangc13 commented 1 year ago

Maybe you enabled the "Use Wiktionary definition" option but forgot to change the Word Wise gloss language to Chinese, see https://xxyzz.github.io/WordDumb/usage.html#create-files

Solved by unchecking this option, thank you.

Vuizur commented 1 year ago

Maybe I could train a machine learning model to match a word to its gloss, and also match person names or locations to Wikipedia summaries or Wikidata items.

It is a super interesting question. I randomly stumbled upon this problem for my thesis and tried using llama.cpp with an instruction-fine-tuned Llama-based model such as Wizard-Vicuna-7B. I simply gave it the task in this format:

Sentence: <sentence>
Question: Which definition of <word> is correct here?
1. <definition>
2. <another definition>
Answer only with a number.
Answer: 
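
For concreteness, here is a minimal sketch of wiring that prompt to a local model through the llama-cpp-python bindings; the GGUF filename, decoding settings, and example sentence below are placeholders rather than my actual setup:

# Minimal sketch (not a tested setup): feed the prompt above to a local
# instruction-tuned model through the llama-cpp-python bindings.
# The model filename, context size and example sentence are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="wizard-vicuna-7b.Q4_K_M.gguf", n_ctx=2048)

def pick_definition(sentence, word, definitions):
    numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(definitions))
    prompt = (
        f"Sentence: {sentence}\n"
        f"Question: Which definition of {word} is correct here?\n"
        f"{numbered}\n"
        "Answer only with a number.\n"
        "Answer: "
    )
    out = llm(prompt, max_tokens=4, temperature=0, stop=["\n"])
    digits = "".join(c for c in out["choices"][0]["text"] if c.isdigit())
    return int(digits) - 1 if digits else 0  # fall back to the first sense

print(pick_definition(
    "Uncle Vernon packed the drills into the back of the car.",
    "drills",
    ["a type of strong cotton cloth",
     "a tool with a rotating cutting tip used for making holes"],
))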

I benchmarked it for Russian (copying a WIP graphic here): [benchmark chart: WSD_results]

Disclaimer: I benchmarked the association of words with etymologies, not with senses.

(The accuracy in reality is maybe 5 percent higher; the test data has a few mistakes.) WV7 (Wizard-Vicuna 7B) runs on PCs with 8 GB of RAM and Manticore 13B on PCs with 16 GB of RAM. ChatGPT aced everything (except one example), but it might be a bit too expensive.

In English the results will surely be better. The runtime will probably suck, though; if the users are very patient it might be possible.

Of course, training a custom model, maybe on synthetic GPT-3.5/4 data, also looks pretty promising. But I have no idea.

This is maybe also interesting, but apparently it only works for English (I didn't test it): https://github.com/alvations/pywsd
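
In case it helps, this is roughly what pywsd's classic Lesk interface looks like (English/WordNet only; I haven't checked the quality of the output, and the sentence is just an example):

# Rough sketch of pywsd's Lesk-based disambiguation (English/WordNet only).
# Needs `pip install pywsd nltk` plus the NLTK wordnet data; untested here.
from pywsd.lesk import simple_lesk

sense = simple_lesk(
    "Uncle Vernon packed the drills into the back of the car.", "drills"
)
print(sense, "-", sense.definition() if sense else "no sense found")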

xxyzz commented 1 year ago

I think I'll need to take a deep learning course first...

Using an existing model is easier to start with, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (Kindle Word Wise database id or Wiktionary gloss). For that same reason, pywsd might not be suitable, or maybe I could replace the default gloss data it uses.

The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output should also include the token offset locations).
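
Something like this hypothetical interface is what I'm picturing; the names and fields below are made up for illustration, nothing like it exists in the plugin yet:

# Hypothetical interface sketch; names and fields are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    start: int                        # token offset in the text chunk
    end: int
    lemma: str
    gloss_id: Optional[int]           # Kindle Word Wise database id or Wiktionary gloss
    wikipedia_summary: Optional[str]  # for person and location names

def annotate(text: str) -> list[Annotation]:
    """Mark words in a chunk of text with the correct gloss or Wikipedia summary."""
    raise NotImplementedError  # this is the part that needs a WSD model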

Vuizur commented 1 year ago

Using an existing model is easier to start with, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (Kindle Word Wise database id or Wiktionary gloss). For that same reason, pywsd might not be suitable, or maybe I could replace the default gloss data it uses.

I think large language models such as Llama would work out of the box, but would be extremely slow. For WordDumb they would only be viable (and probably still a bit slow) if the user has a GPU with at least 8 GB of VRAM, which almost nobody has. Compared to English, Llama unfortunately has pretty mediocre multilingual skills.

pywsd uses old-school algorithms; if I understood it correctly, they might be applicable to the Wiktionary data and not even be too slow, but the accuracy will likely be garbage. (But I don't know a lot about this.)

The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output should also include the token offset locations).

True. I tried asking GPT-4 to add a short translation in [brackets] after each word of a given text, and it did what I asked. But it was still a bit buggy and will probably hallucinate a lot and give wrong answers for more exotic languages or rarer words.

It might only be a matter of time before something like this gets more viable. 👍

xxyzz commented 1 year ago

Using a large language model for WSD is maybe a little bit overkill IMO. I found this EWISER library: https://github.com/SapienzaNLP/ewiser, and they also have a spaCy plugin. Their paper is more recent, and I'll see how I could integrate their work; looks like I have a lot to learn...

The EWISER paper's authors' university also created babelfy.org, which has almost all the features I need, but it has an API limit (1,000 requests per day).
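
For quick experiments within that limit, the Babelfy HTTP endpoint can be called directly; the sketch below follows my reading of their online docs, so treat the parameter names and response fields as assumptions to verify:

# Sketch of a Babelfy REST call (free key, roughly 1,000 requests per day).
# The endpoint, parameters and response fields follow my reading of
# babelfy.org's docs and should be double-checked before relying on them.
import requests

def babelfy(text, lang="EN", key="YOUR_KEY"):
    resp = requests.get(
        "https://babelfy.io/v1/disambiguate",
        params={"text": text, "lang": lang, "key": key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

for item in babelfy("Harry handed the envoy a cup of tea."):
    frag = item.get("charFragment", {})
    print(frag.get("start"), frag.get("end"), item.get("babelSynsetID"))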

xxyzz commented 1 year ago

I found the state-of-the-art WSD models here: https://paperswithcode.com/sota/word-sense-disambiguation-on-supervised, and the best model is ConSeC: https://paperswithcode.com/paper/consec-word-sense-disambiguation-as

But I've never trained a model before and don't have a GPU, so this will take some time...

xxyzz commented 2 months ago

I tried the LLaMA-3-Instruct-8B llamafile. I think the accuracy is good, but performance is ridiculously slow on CPU; I killed the process after waiting 4 hours. Maybe it's more usable with a powerful GPU?

Code pushed to the wsd branch: https://github.com/xxyzz/WordDumb/tree/wsd