wustho / epr

CLI Epub Reader
MIT License

Add separate logic for <pre> tags in HTMLtoLines #20

ghost closed 4 years ago

ghost commented 4 years ago

I went with separate logic for preformatted (<pre>) tags in the handle_data and get_lines functions of the HTMLtoLines class, for two reasons:

  1. No whitespace cleanup with re.sub() in handle_data: unlike other tags such as <blockquote>, which aren't completely whitespace-dependent, <pre> text needs its newlines and indentation/tabs preserved for... well, formatting.
  2. Different logic for <pre> text in get_lines: textwrap defaults to replace_whitespace=True for wrap(), and the textwrap docs suggest splitting the text with str.splitlines() and wrapping each piece separately, because with replace_whitespace=False newlines can end up in the middle of a line and produce inconsistent formatting. I think changing that setting for everything may cause issues with other indented tags, so rather than trying to balance parsing non-preformatted tags against preformatted text, it seemed more reasonable to parse the preformatted text separately.
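
To make the two points above concrete, here is a rough sketch of the idea. It is not the actual diff: PreAwareParser, blocks, in_pre and is_open are hypothetical names standing in for HTMLtoLines and its internals, and only <p> and <pre> are handled to keep it short.

```python
import re
import textwrap
from html.parser import HTMLParser


class PreAwareParser(HTMLParser):
    """Hypothetical stand-in for HTMLtoLines; only <p> and <pre> are handled."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (is_pre, text) tuples, in document order
        self.in_pre = False   # True while inside a <pre> element
        self.is_open = False  # True while inside any handled block

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "pre"):
            self.in_pre = tag == "pre"
            self.is_open = True
            self.blocks.append((self.in_pre, ""))

    def handle_endtag(self, tag):
        if tag in ("p", "pre"):
            self.in_pre = False
            self.is_open = False

    def handle_data(self, data):
        if not self.is_open:
            return  # ignore text outside the handled blocks
        is_pre, text = self.blocks[-1]
        if is_pre:
            text += data                       # point 1: keep whitespace as-is
        else:
            text += re.sub(r"\s+", " ", data)  # collapse whitespace elsewhere
        self.blocks[-1] = (is_pre, text)

    def get_lines(self, width):
        lines = []
        for is_pre, text in self.blocks:
            if is_pre:
                # Point 2: split on newlines first, then wrap each source line
                # on its own so no newline ends up in the middle of a line.
                for raw in text.splitlines():
                    wrapped = textwrap.wrap(raw, width, replace_whitespace=False)
                    lines += wrapped or [""]   # keep blank lines inside <pre>
            else:
                lines += textwrap.wrap(text.strip(), width)
            lines.append("")                   # blank line between blocks
        return lines
```

Feeding it something like "<p>Some   text</p><pre>def f():\n    return 1\n</pre>" and calling get_lines(40) should leave the paragraph collapsed to single spaces while the <pre> lines keep their indentation. Overlong <pre> lines still get wrapped, and continuation lines lose their leading whitespace, which is a limitation of this sketch.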

Aside from that, I tried to keep the variable names and logic consistent with the current code.

wustho commented 4 years ago

Whoaa thanks so much for this, this looks good....