xxyzz / WordDumb

A calibre plugin that generates Kindle Word Wise and X-Ray files for KFX, AZW3, MOBI and EPUB eBooks.
https://xxyzz.github.io/WordDumb/
GNU General Public License v3.0
376 stars, 19 forks

Use locally downloaded Wiktionary data to generate Word Wise for epubs #43

Closed woaidangyang closed 2 years ago

woaidangyang commented 2 years ago

Describe the bug

Thank you so much for creating this great plugin!

I encountered the following error while trying to generate Word Wise for an epub file. I think it's because access to the website hosting the Wiktionary data is blocked.

Since I can download the database manually in a browser routed through a proxy, would it be possible to add an option that tells the WordDumb plugin to use the copy I downloaded to my computer?

System Information

calibre, version 6.1.0 (win32, embedded-python: True)

Error message

Tonnerre de Brest!: An error occurred, please copy error message then report bug at GitHub.

Starting job: Generating Word Wise for How to Lie with Statistics 
Job: "Generating Word Wise for How to Lie with Statistics" failed with error: 
Traceback (most recent call last):
  File "calibre\gui2\threaded_jobs.py", line 82, in start_work
  File "calibre_plugins.worddumb.parse_job", line 87, in do_job
  File "calibre_plugins.worddumb.parse_job", line 188, in dump_wiktionary_job
  File "calibre_plugins.worddumb.data.wiktionary", line 144, in download_and_dump_wiktionary
  File "calibre_plugins.worddumb.data.wiktionary", line 47, in extract_wiktionary
  File "json\__init__.py", line 346, in loads
  File "json\decoder.py", line 337, in decode
  File "json\decoder.py", line 353, in raw_decode
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 4954 (char 4953)

Reproduce steps

Called with args: ((517, 'EPUB', 'C:\Users\Administrator\Calibre Library\Darrell Huff; Irving Geis\How to Lie with Statistics (517)\How to Lie with Statistics - Darrell Huff; Irving Geis.epub', <calibre.ebooks.metadata.book.base.Metadata object at 0x00000284096F0A30>, {'spacy': 'en_coreweb', 'wiki': 'en', 'kaikki': 'English'}), True, False) {'notifications': <queue.Queue object at 0x00000284096F0BB0>, 'abort': <threading.Event object at 0x00000284096F08E0>, 'log': <calibre.utils.logging.GUILog object at 0x00000284096F0C40>}

Screenshots or videos

No response

xxyzz commented 2 years ago

On Windows, you could move the Wiktionary JSON file to ~/AppData/Roaming/calibre/plugins/worddumb-lemmas/ and the plugin will detect the file.

Maybe the request is blocked because I didn't add the User-Agent header. I'll add it later.
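
As a rough illustration of what sending that header looks like, here is a minimal sketch using Python's standard `urllib.request`. It is not the plugin's actual download code: the URL and agent string are placeholders, and `build_request` is a hypothetical helper name.

```python
import urllib.request

# Placeholder values for illustration only; neither comes from the plugin.
EXAMPLE_URL = "https://example.org/wiktionary-dump.json"
USER_AGENT = "ExampleDownloader/1.0"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Return a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(EXAMPLE_URL)
# urllib normalizes header names, so the key is stored as "User-agent".
print(request.get_header("User-agent"))
# To actually download, pass the request to urllib.request.urlopen(request, timeout=30).
```

Some servers reject requests that carry the default Python agent string, which is why setting one explicitly can help. It does nothing against blocking at the network level.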

woaidangyang commented 2 years ago

Thank you so much for your quick reply! I'm trying it now, so let's see if it works.

Not sure if this is the right place to ask, but I'm also having trouble using the plugin to generate the X-Ray file. It says "Is GitHub/Wikipedia/Fandom blocked by your ISP? You might need tools to bypass internet censorship." Do you have any suggestions for tools I could use to let the plugin bypass the censorship?

xxyzz commented 2 years ago

Some free software is mentioned on this page: https://en.wikipedia.org/wiki/Internet_censorship_circumvention

woaidangyang commented 2 years ago

All right. The Word Wise for epubs is working!

But most of the words in the book are underscored now! lol

It would be great if only the words above a user-defined difficulty level were marked.

Great work, anyway!

xxyzz commented 2 years ago

There is a "Customize EPUB Wiktionary" button in the plugin preferences dialog, where you can enable or disable words.

woaidangyang commented 2 years ago

I see. Got it. Thanks!

woaidangyang commented 2 years ago

Sorry. It seems that the settings (Enabled: true/false) can't be saved. Also, there are too many words marked "true" by default.

Maybe the words could be ranked by how frequently readers would encounter them, so that users can select which level of words they want WordDumb to label?

xxyzz commented 2 years ago

I forgot to save the Wiktionary JSON file... https://github.com/xxyzz/WordDumb/commit/55932f21c98d40a07fd8b7ef626522deedfd201e fixes this bug.

It's kind of hard to set a difficulty level for each word that fits most people's needs, so I expect users to customize the table themselves.

woaidangyang commented 2 years ago

OK. Thanks!

If I already have a list of words that I want the plugin to label (or to exclude), do you think there is an easy way to set those words to "true" in a batch, instead of changing their status one by one?

xxyzz commented 2 years ago

You could write some code to edit the Wiktionary JSON file stored in the worddumb-lemmas folder. The first item in each array is a bool value that enables or disables the word. Then you need to open the customize dialog and click the save button so that the plugin creates a pickle dump file.
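
A minimal sketch of such a script, assuming the JSON file is a single object that maps each word to an array whose first element is the enabled flag. That top-level layout and the helper names `set_enabled` and `update_file` are assumptions for illustration, so inspect the real file and keep a backup before running anything against it.

```python
import json
from pathlib import Path


def set_enabled(entries: dict, words: set, enabled: bool = True) -> int:
    """Set the leading bool of each listed word's array; return how many changed."""
    changed = 0
    for word, data in entries.items():
        # Assumed layout: data is a non-empty list and data[0] is the enabled flag.
        if word in words and data and data[0] != enabled:
            data[0] = enabled
            changed += 1
    return changed


def update_file(json_path: Path, words: set, enabled: bool = True) -> int:
    """Load the JSON file, flip the flags for the given words, and write it back."""
    entries = json.loads(json_path.read_text(encoding="utf-8"))
    changed = set_enabled(entries, words, enabled)
    json_path.write_text(json.dumps(entries, ensure_ascii=False), encoding="utf-8")
    return changed
```

To exclude a list of words instead, call `update_file(path, words, enabled=False)`. Either way, the customize dialog still has to be opened and saved afterwards, as described above.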

woaidangyang commented 2 years ago

Cool! Thank you so much!

xxyzz commented 2 years ago

The User-Agent header was added in https://github.com/xxyzz/WordDumb/commit/0d6559235a78468949721ee363a1706bd8848cd2. I hope it solves the download error.

xxyzz commented 2 years ago

Hi @woaidangyang, you may want to test the new feature for importing from an Anki .apkg or CSV file. Difficulty values are imported from the Anki card type. The CSV file should have at least one column of words and an optional difficulty column.
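
To make the CSV shape concrete, here is a small sketch of reading such a file with Python's `csv` module. It is not the plugin's import code. It assumes the word sits in the first column, the optional difficulty is an integer in the second column, and there is no header row. `DEFAULT_DIFFICULTY` and `read_word_list` are hypothetical names, and the default value is arbitrary.

```python
import csv
import io

# Assumed fallback when a row has no usable difficulty; not the plugin's value.
DEFAULT_DIFFICULTY = 1


def read_word_list(csv_text: str) -> dict:
    """Map each word (column 1) to its difficulty (column 2, optional)."""
    words = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and rows with an empty word cell.
        if not row or not row[0].strip():
            continue
        difficulty = DEFAULT_DIFFICULTY
        if len(row) > 1 and row[1].strip().isdigit():
            difficulty = int(row[1].strip())
        words[row[0].strip()] = difficulty
    return words


sample = "apple,3\nrun\n\ncat, 5\n"
print(read_word_list(sample))
```

Rows with only a word fall back to the default, so a plain one-word-per-line list is also accepted by this sketch.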

woaidangyang commented 2 years ago

Super! This is the dream feature I've been looking for for a long time. It will be very useful for language learners! Thank you so much!

xxyzz commented 2 years ago

I hadn't noticed that the download error was caused by Internet censorship. Since you said downloading through a proxy works fine, I'll close this issue now.

xxyzz commented 2 years ago

You may need to read the calibre manual and the Requests documentation.