Closed Vuizur closed 7 months ago
Could you upload the book file or the original book xml?
In the following epub, one of the affected sentences is "Après cela, afin de voler les résultats de mes recherches, le responsable a envoyé un assassin et m’a tué." (Chapter 8, XHTML file 9). Le garcon.zip
It turns "responsable" into "responsableable".
So my original theory is not correct. I honestly have no idea about the cause. It might be related to some weird HTML characters...
I wrote a small piece of code to find the affected words:
import os

import regex as re  # third-party "regex" package, needed for \p{L}

FOLDER = "path/to/extracted/epub"  # placeholder: folder with the book's XHTML files

i = 0  # number of affected words found
for filename in os.listdir(FOLDER):
    if filename.endswith(".xhtml"):
        # Open and read the file
        with open(os.path.join(FOLDER, filename), "r", encoding="utf-8") as file:
            data = file.read()
        # Find a closing </ruby> tag that is immediately followed by letters
        words = re.findall(r"</ruby>\p{L}+", data)
        # Print the file name and each match
        for word in words:
            print(filename)
            print(word)
            i += 1
Maybe the code at https://github.com/xxyzz/WordDumb/blob/97394c94fdaf69f29ab097961ae34f01d6a37e0b/parse_job.py#L397 doesn't work properly, I'm not certain. I recently realized I could process only the unescaped text and then write back the escaped text, which would let me get rid of this function. But I should still find which code causes the bug...
This is indeed a bug in the index_in_escaped_text function: str.index could find the wrong substring. Darn...
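To illustrate the kind of failure described here, the following is a minimal sketch and not the actual WordDumb code. The helper names naive_index and index_via_prefix are hypothetical, and the sketch assumes the escaped text is exactly html.escape of the unescaped text. It shows that str.index without a start offset always returns the first match, so a word that occurs twice in a sentence gets the offset of its first occurrence, and one possible way to avoid searching at all.

```python
import html


def naive_index(escaped_text: str, word: str) -> int:
    # Pitfall: str.index returns the FIRST match, even when the caller
    # is actually interested in a later occurrence of the same word.
    return escaped_text.index(html.escape(word))


def index_via_prefix(unescaped_text: str, start: int) -> int:
    # Assumption: the escaped text equals html.escape(unescaped_text).
    # Escaping only the prefix then yields the offset in the escaped text
    # without searching for the word at all.
    return len(html.escape(unescaped_text[:start]))


unescaped = "nécessaire & pas nécessaire"
escaped = html.escape(unescaped)  # "&" becomes "&amp;"

# Start of the second occurrence in the unescaped text
second_start = unescaped.rindex("nécessaire")

wrong = naive_index(escaped, "nécessaire")        # 0, the first occurrence
right = index_via_prefix(unescaped, second_start)  # 21, the second occurrence
print(wrong, right, escaped[right:right + len("nécessaire")])
```

Passing a start offset to str.index (for example text.index(word, previous_end)) is another option when the previous match position is known.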
Thanks a lot for fixing this! I just tested another book, and it seems to be working there as well.
Describe the bug
For words that are annotated twice in the same sentence, extra characters are appended to the second occurrence.
The screenshot shows this for the word "necessaire": the second time it appears, it is printed as "necessaireaire", so the end of the word is appended again.
Thanks for all the active development!
Operating System name and version
Windows 10
Python version
3.12
calibre version
7.5.1
WordDumb plugin version
3.31.1 (current master branch), but also on the last stable version I think
Error message
Plugin settings and reproduce steps
Default settings
Generated files, screenshots or videos
No response