zulip / zulip-terminal

Official Zulip terminal client.
Apache License 2.0
650 stars 249 forks source link

Explore substituting external lxml parser for builtin html.parser #1036

Open neiljp opened 3 years ago

neiljp commented 3 years ago

The largest current non-dev dependencies (above 1k) in my local venv are: (excluding small dist-info directories)

Of the top 3, pip and cryptography are essential for the application, but on recently reviewing the bs4 (856) documentation I wondered if we could use the other backends, particularly since our input is fairly well constrained via the zulip rendered markdown. A potentially suitable backend would be the built-in html parser in the standard library, which comes for free in terms of space and would remove the need for this largest dependency.

Another reason to consider removal is that lxml is fast by virtue of its reliance on a C library; the downsides are this requires extra C-library build and runtime dependencies (eg. libxml2-dev libxslt1-dev) for installation/build in some cases (where wheels are not available?), including:

I've drafted a WIP implementation of this at 2021-02-27-lxml-to-html.parser on my fork and it seems to behave OK and be "fast enough", though I'm interested to hear feedback!

One option we could consider is having lxml be an optional extra, albeit with added runtime complexity.

This was briefly discussed on chat.zulip.org in #zulip-terminal>lxml vs html.parser.

I've tagged this with area: optimization, which here is in terms of space/simplicity; I've not profiled how this impacts speed.

neiljp commented 3 years ago

Notes comparing parser backends: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser