whatwg / html-build

Build scripts for https://github.com/whatwg/html
Other
62 stars 62 forks source link

Add a Rust implementation of the Perl preprocessing logic. #279

Closed jeremyroman closed 10 months ago

jeremyroman commented 1 year ago

This implementation uses a full HTML parser (Servo's, html5ever) and operates on a DOM. It ends up being somewhat more verbose than Perl, but hopefully also more extensible/maintainable.

jeremyroman commented 1 year ago

Currently, this only builds with Rust if PROCESS_WITH_RUST=1 exists in the environment. It also uses cargo run to invoke the binary, rather than assuming it always exists and is up to date (as we could perhaps do on the server).

Still, it works. It's verbose because it's hard to beat PCRE operating on text for conciseness when operating on an actual DOM tree.

@domenic @domfarolino

domenic commented 1 year ago

This is great!! I didn't realize there was so much preprocessing in those scripts...

I'm honestly not going to be able to provide great Rust review. Maye @domfarolino can help with that. But really, even not-well-reviewed Rust with unit tests is better than untested Perl (or Pascal). So we should bias toward merging this.

It seems like, given how you've set this up, there are no blockers to merging this. (Although maybe we should update the README to document things a bit?) Things to do before we ship this:

I hope to tackle these things myself.

domenic commented 1 year ago

I pushed a commit that removed the the custom <ref> stuff. Please take a look, and especially if there's more we could remove?

Also, I'm wondering how necessary parser.rs is. Could we just use read from stdio using non-async IO? Or could we use an existing package like https://crates.io/crates/html5ever-stream/0.1.0 ?

jeremyroman commented 1 year ago

There was indeed more that can be removed; I've done so. All that's left that could be replaced by such a crate is basically the 20-line read loop (which is basically the contents of Parser::read_from).

It could also be replaced by just doing this synchronously, which would be fine in practice. I generally like the idea of being async because it's a little more annoying to retrofit async than to do it up-front, and it's not a lot of code to do here. In principle I like the idea of the build server being able to one day do this processing in a single processing without needing to spawn a separate thread per request, too, even though in practice I realize the build server is not that busy.