whatwg / html-build

Build scripts for https://github.com/whatwg/html
Other
62 stars 62 forks source link

Output parse errors for the Rust part of the build step #291

Closed domenic closed 4 months ago

domenic commented 7 months ago

This fixes #290, by outputting the parse errors encountered by the Rust build step's parser. Previously they were being stored in the RcDom instance's errors vector, and ignored. Now they are threaded through to the final io::Result, and then output by main().

The hardest part of this was adding line numbers to the errors. Doing this necessitated creating a wrapper for RcDom, which I inventively called RcDomWithLineNumbers, which implements TreeSink with two methods parse_error and set_current_line given custom behavior, while the other many methods just delegate to RcDom's implementation.

(There doesn't appear to be any better way to do this in Rust, currently. If you search for "rust delegation" you'll find lots of discussion of the problem, and various solutions which work if you can modify the delegated-to class, i.e. RcDom in our case. But nothing seems to work if the delegated-to class is in third-party code.)

Additionally, this enables exact_errors as a parser option, which provides slightly more information in a couple of cases related to character references.

Original PR description I spent some time trying to address https://github.com/whatwg/html-build/issues/290 This sort of works. (Although the code needs a lot of cleanup; probably we want to bubble the errors to the top level instead of printing them immediately.) However, the biggest problem is that there is no line number information. So the output is terrible: ``` Compiling html-build v0.0.0 (/home/domenic/src/html-build) Finished release [optimized] target(s) in 3.14s Running `target/release/html-build` Attributes on an end tag No matching tag to close Error: Custom { kind: InvalidData, error: "Parse errors encountered" } ``` The upstream issue is https://github.com/servo/html5ever/issues/48 . I think this is possible to fix if we do something where we override the `RcDom` sink so that instead of [doing nothing when `set_current_line` is called](https://docs.rs/markup5ever/0.11.0/src/markup5ever/interface/tree_builder.rs.html#238), it stores that information somewhere. And then I guess we'd override [the `parse_error` implementation](https://docs.rs/markup5ever_rcdom/latest/src/markup5ever_rcdom/lib.rs.html#224-226) to use that stored information. Figuring out how to create some version of `RcDom` with these overrides seems likely to be something Rust supports, but I'm running out time for playing with Rust today. If anyone else wants to pitch in, patches are appreciated. /cc @noamr @jeremyroman
domenic commented 7 months ago

Figuring out how to create some version of RcDom with these overrides seems likely to be something Rust supports

Not... really, it seems. https://chat.openai.com/share/fa746c1a-403f-4611-bce7-7277e21114d8

jeremyroman commented 7 months ago

An alternative hacky option we could always have is to treat validation as a "lint" step as part of the GitHub actions, as opposed to a feature of the core build process itself. Then we don't need the core build's parser to track errors in detail. Obviously it's less elegant, but potentially an option. Or in between -- the core build simply tracks whether there were errors, and then we do a separate lint/diagnostic pass to describe the errors given there were some.

Either of these might be preferable to fully forking markup5ever_rcdom.

domenic commented 6 months ago

@jeremyroman @domfarolino for review if you all have time.