wustho / epr

CLI Epub Reader
MIT License

words are missing or out of order #30

Open trzhong opened 4 years ago

trzhong commented 4 years ago

I read an EPUB in Chinese using epr on macOS 10.15.4 with Python 3.7. This is what epr displayed:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一眉立目的那么一款,没想到看上去很温婉。样子的时候,会觉得您是穿着警服有点横

And this is the same passage as displayed in iBooks:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的是第一次见到您,但是在我和傅见锋[2] 做的节目当中,我们好像都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一位女士!原来他们做点好采访,我没见到您样子的时候, 会觉得您是穿着警服有点横 眉立目的那么一款,没想到看上去很温婉。会觉得您是穿着警服有点横

The problem is not limited to this paragraph or this book; many others are affected too.

wustho commented 4 years ago

This is crucial. I will try a Chinese EPUB when I'm free. Originally epr only supported English, but I will have a look.

wustho commented 4 years ago

Hey there. I just looked into it, and it seems to be beyond my capability, sorry. I hope someone else makes a PR for this issue. It probably has something to do with the HTMLtoLines(HTMLParser) class, if anyone cares to help fix it.

trzhong commented 4 years ago

Since "textwrap.wrap()" cannot handle Chinese characters properly, I tried adding the code below to "HTMLtoLines.get_lines":

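            else:
                # Shrink the wrap width in proportion to the share of wide characters.
                # Assumes every non-ASCII character is 3 bytes in UTF-8 and
                # occupies 2 terminal columns.
                w = width
                l = len(i)
                cjk_l = len(i.encode(encoding='UTF-8'))
                # l * 3 - cjk_l is twice the number of ASCII characters
                asc_l = (l * 3 - cjk_l) // 2
                if cjk_l > l:
                    # l * 2 - asc_l is the paragraph's width in terminal columns
                    w = int(w * l / (l * 2 - asc_l))
                text += textwrap.wrap(i, w) + [""]
        return text, self.imgs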

Although this does display the content correctly, I don't think it is the best solution. I would prefer a better wrapping library.
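A likely reason the patch helps is that textwrap.wrap() measures lines in characters, while most terminals give each CJK character two columns. The sketch below illustrates that mismatch using only the standard library. The helper name display_width is hypothetical (it is not part of epr), and treating East Asian Wide and Fullwidth characters as two columns is an assumption about the terminal.

```python
import textwrap
import unicodedata


def display_width(text: str) -> int:
    """Estimate terminal columns: East Asian Wide/Fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


# 30 CJK characters with no whitespace
sample = "今天终于是见着真人了" * 3

# textwrap limits each line to 20 characters, not 20 columns
lines = textwrap.wrap(sample, width=20)

for line in lines:
    # Character count next to the estimated column count
    print(len(line), display_width(line), line)
```

Each wrapped line satisfies the character limit yet needs about twice as many columns, so it can overflow a window that is only `width` columns wide.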

wustho commented 4 years ago

Wow, that's impressive troubleshooting... After I read your comment, I did some googling, and found this: https://bugs.python.org/issue24665

Indeed, as you said, textwrap.wrap() cannot handle Chinese characters properly. It also seems that the issue about CJK support in textwrap was closed as rejected, apparently because of some confusion in the discussion. So I don't think we will get support for non-Latin scripts soon. For now I will add this issue as a limitation in the README while we wait for a better wrapping library, as you suggested.
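Until such a library is adopted, one possible stopgap is to wrap by terminal columns instead of characters. This is a minimal sketch under stated assumptions, not epr's implementation: wrap_by_columns and char_width are hypothetical names, wide characters are assumed to take two columns, and the wrap is greedy per character, so it ignores word boundaries in Latin text and any CJK line-breaking rules.

```python
import unicodedata
from typing import List


def char_width(ch: str) -> int:
    """Assume East Asian Wide/Fullwidth characters take 2 columns, others 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def wrap_by_columns(text: str, width: int) -> List[str]:
    """Greedy wrap: start a new line when the next character would exceed `width` columns."""
    lines: List[str] = []
    current = ""
    used = 0  # columns consumed on the current line
    for ch in text:
        w = char_width(ch)
        if current and used + w > width:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


for line in wrap_by_columns("今天终于是见着真人了", 8):
    print(line)
```

A real replacement would also need to keep Latin words intact, which is what a dedicated wrapping library would provide.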

wustho commented 4 years ago

@trzhong hey there, you might want to try https://github.com/aeosynth/bk as an alternative...

aeosynth commented 4 years ago

I added support for wide characters to bk. There may be other issues; for example, I don't know the line-breaking rules for Asian text.

1Q84 by Murakami rendered to 30 columns:

[Screenshot: 1q84]

trzhong commented 4 years ago

I'm still using my patch. Thanks for the information.

trzhong commented 3 years ago

Finally, I found [rich] as a replacement for [textwrap].

    from rich import cells

Then replace every [textwrap.wrap] call with [cells.chop_cells].

That's all.

wustho commented 3 years ago

Wow, https://github.com/willmcgugan/rich seems so powerful and feature-rich. Thanks for pointing that out, mate... I'll try to integrate it into epy...