Explore substituting external lxml parser for builtin html.parser

The largest current non-dev dependencies (above 1k) in my local venv are: (excluding small dist-info directories)

13344 lxml
10660 pip
10648 cryptography
2780 pytz
2224 setuptools
1820 urwid
1588 chardet (via requests)

Of the top 3, pip and cryptography are essential for the application, but on recently reviewing the bs4 (856) documentation I wondered if we could use the other backends, particularly since our input is fairly well constrained via the zulip rendered markdown. A potentially suitable backend would be the built-in html parser in the standard library, which comes for free in terms of space and would remove the need for this largest dependency.

Another reason to consider removal is that lxml is fast by virtue of its reliance on a C library; the downsides are this requires extra C-library build and runtime dependencies (eg. libxml2-dev libxslt1-dev) for installation/build in some cases (where wheels are not available?), including:

PyPy
Alpine dockfiles

I've drafted a WIP implementation of this at 2021-02-27-lxml-to-html.parser on my fork and it seems to behave OK and be "fast enough", though I'm interested to hear feedback!

One option we could consider is having lxml be an optional extra, albeit with added runtime complexity.

This was briefly discussed on chat.zulip.org in #zulip-terminal>lxml vs html.parser.

I've tagged this with area: optimization, which here is in terms of space/simplicity; I've not profiled how this impacts speed.

zulip / zulip-terminal

Explore substituting external lxml parser for builtin html.parser #1036