Extra characters if a word gets annotated in the same sentence twice.

xxyzz / WordDumb

A calibre plugin that generates Kindle Word Wise and X-Ray files for KFX, AZW3, MOBI and EPUB eBooks.

https://xxyzz.github.io/WordDumb/

GNU General Public License v3.0


Extra characters if a word gets annotated in the same sentence twice. #193

Closed Vuizur closed 7 months ago

Vuizur commented 7 months ago

Checkboxes

[X] I have read the document at xxyzz.github.io/WordDumb.
[X] I have not found a similar issue or discussion on GitHub.
[X] Reboot doesn't fix the problem.

Describe the bug

For words that get annotated twice in the same sentence, some extra characters are appended to the second occurrence.

The screenshot shows this for the word "necessaire": the second time it appears, it is printed as "necessaireaire", so the end of the word is appended again.

Thanks for all the active development!

Operating System name and version

Windows 10

Python version

3.12

calibre version

7.5.1

WordDumb plugin version

3.31.1 (current master branch), but also on the last stable version I think

Error message

Plugin settings and reproduce steps

Default settings

Generated files, screenshots or videos

No response

xxyzz commented 7 months ago

Could you upload the book file or the original book xml?

Vuizur commented 7 months ago

In the following epub, one of the affected sentences is "Après cela, afin de voler les résultats de mes recherches, le responsable a envoyé un assassin et m’a tué." (Chapter 8, XHTML file 9). Le garcon.zip

It turns "responsable" into "responsableable".

So my original theory is not correct. I honestly have no idea about the cause. It might be related to some weird HTML characters...

I wrote a small piece of code to find the affected words:

import os

import regex as re  # third-party "regex" package, needed for \p{L}

FOLDER = "."  # placeholder: folder containing the extracted XHTML files

count = 0
for filename in os.listdir(FOLDER):
    if filename.endswith(".xhtml"):
        # Open the file
        with open(os.path.join(FOLDER, filename), "r", encoding="utf-8") as file:
            # Read the file
            data = file.read()

            # Find a closing </ruby> tag that is immediately followed by letters
            words = re.findall(r"</ruby>\p{L}+", data)
            # Iterate through all the matches
            for word in words:
                # Print the file name and the match
                print(filename)
                print(word)
                count += 1

xxyzz commented 7 months ago

Maybe the code here doesn't work properly, I'm not certain: https://github.com/xxyzz/WordDumb/blob/97394c94fdaf69f29ab097961ae34f01d6a37e0b/parse_job.py#L397

I recently realized I could process only the unescaped text and then write back escaped text, which would let me get rid of this function. But I should still find which code causes the bug...
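For readers following along, the "work on unescaped text, escape only when writing back" idea could look roughly like the sketch below. This is not the plugin's code: `annotate` and its `spans` argument are hypothetical names made up for illustration, and the sketch assumes the spans are sorted, non-overlapping offsets into the unescaped text.

```python
import html


def annotate(escaped_text, spans):
    """Wrap spans of the unescaped text in <ruby> tags (hypothetical helper).

    spans: sorted, non-overlapping (start, end, gloss) tuples whose offsets
    refer to the *unescaped* text.
    """
    text = html.unescape(escaped_text)
    parts = []
    last = 0
    for start, end, gloss in spans:
        # Escape the plain text between annotations when writing it back
        parts.append(html.escape(text[last:start], quote=False))
        word = html.escape(text[start:end], quote=False)
        rt = html.escape(gloss, quote=False)
        parts.append(f"<ruby>{word}<rt>{rt}</rt></ruby>")
        last = end
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)


result = annotate("Tom &amp; Jerry", [(6, 11, "mouse")])
print(result)
```

One trade-off of this approach: re-escaping does not reproduce the original entities exactly. A named entity such as `&rsquo;` comes back as the literal character, since `html.escape` only escapes `&`, `<` and `>` here.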

xxyzz commented 7 months ago

This is indeed a bug in the index_in_escaped_text function: str.index could find the wrong substring. Darn...
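To illustrate the pitfall in general terms (this is not the actual index_in_escaped_text code or the fix in the commit): `str.index` returns the first match, so looking up a word in the escaped text can land on an earlier occurrence. The sketch below shows one possible alternative, mapping each unescaped offset to its escaped offset. `unescaped_to_escaped_offsets` is a hypothetical helper, and it only handles entities that end in `;` and unescape to a single character.

```python
import html
import re

# Entities such as &amp; &#8217; &#x2019; (must end with a semicolon)
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|\w+);")


def unescaped_to_escaped_offsets(escaped):
    """offsets[i] is where unescaped character i starts in `escaped`.

    Hypothetical helper; assumes each recognised entity unescapes to
    exactly one character.
    """
    offsets = []
    pos = 0
    while pos < len(escaped):
        offsets.append(pos)
        match = ENTITY.match(escaped, pos)
        if match and len(html.unescape(match.group())) == 1:
            pos = match.end()  # the whole entity is one unescaped character
        else:
            pos += 1
    offsets.append(len(escaped))  # lets an end offset point past the last char
    return offsets


escaped = "responsable &amp; responsable"
unescaped = html.unescape(escaped)
word = "responsable"

# Offsets of the SECOND occurrence in the unescaped text
start = unescaped.index(word, 1)
end = start + len(word)

# Naive lookup: finds the first occurrence, not the one we want
naive_start = escaped.index(word)

# Offset mapping: translates the unescaped offsets directly
offsets = unescaped_to_escaped_offsets(escaped)
mapped_start, mapped_end = offsets[start], offsets[end]
print(naive_start, mapped_start, escaped[mapped_start:mapped_end])
```

Here the naive lookup points at the first "responsable", while the mapped offsets select the second one.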

xxyzz commented 7 months ago

https://github.com/xxyzz/WordDumb/commit/0b52bbf9993ab36d699545e0d2354fe72d73d325 should fix the bug.

Vuizur commented 7 months ago

Thanks a lot for fixing this! I just tested another book, and it seems to be working there as well.