whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Consider using sectioning elements for the HTML spec itself #5649

Open domenic opened 4 years ago

domenic commented 4 years ago

The HTML singlepage spec currently has 10374 child nodes of <body>, because we just put a bunch of <hN>s and <p>s all together.

If we instead used sections (or even <div>s), we could get a few benefits:

If this is a good idea, some thoughts on implementation:

mithray commented 3 years ago

I would be happy to implement the sections in the way that is implied by heading elements according to the spec.

Advantages

My suggestion would be to do this in the source itself. If you give me the go-ahead, this is what I will do.

FYI: This would be my first issue on this standard.

domenic commented 3 years ago

@mithrayls sorry for the delay in responding! I'm excited that you're interested in tackling this. We'd love your help.

I'd be happy to have you do this in the source itself. However, I'd feel most comfortable if you did it in an automated fashion somehow. It would be much easier for a reviewer to audit a script than to audit hundreds of lines of diffs (all of which just add <section> or </section>). Do you have any thoughts on that?

P.S.

The spec, by definition, describes a standard, but it does not follow this standard itself. It would be really nice if people could look at the code of the spec to see how they can write HTML themselves! (At least, the outputted HTML from the source.)

To be clear, not using section is totally valid HTML, and still follows the standard. A flat list of elements is fine. But, some more structure does help other programs, as you mention.

mithray commented 3 years ago

My natural starting point would be to use the parse5 library: iterate through the nodes and wrap each heading and its content in section tags, closing a section wherever I hit a heading of equal or greater importance. Another approach, which I have already used successfully to solve my own problem parsing the standard, is a multiline regex, but that might be considered unprofessional ;-)
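The boundary rule described above can be sketched in isolation, without any parser. The following is a minimal illustration, not the script that was eventually proposed; `planSections` is a hypothetical helper name, and it assumes the heading levels have already been collected in document order.

```javascript
// Hypothetical helper: given heading levels in document order, work out how
// many </section> tags must be emitted before each heading's <section>.
function planSections(levels) {
  const open = []; // levels of the headings whose <section> is still open
  const plan = [];
  for (const level of levels) {
    let closeBefore = 0;
    // A heading of equal or greater importance (same or smaller number)
    // ends every open section at that level or deeper.
    while (open.length > 0 && open[open.length - 1] >= level) {
      open.pop();
      closeBefore++;
    }
    open.push(level);
    plan.push({ level, closeBefore });
  }
  // Whatever is still open must be closed at the end of the document.
  return { plan, closeAtEnd: open.length };
}

const outline = planSections([2, 3, 3, 2]);
console.log(outline.plan.map((step) => step.closeBefore), outline.closeAtEnd);
```

For the levels h2, h3, h3, h2, the second h3 closes one section and the final h2 closes two, leaving one section to close at the end.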

No worries about the delay. I noticed you blogging about the spammy Pull Requests! This was an awkward moment for me, as I knew I had this issue outstanding with you! :-p

[EDIT] I realize the source meets the standard it describes, but I think it would not meet best practices of semantic HTML? At any rate, section tags would make it easier to parse and locate sections.

[EDIT] I've tried both parse5 and jsdom to parse and then serialize, with and without passing the result through a prettifier, but the diff is very large because of what seem to be very minor changes, such as whitespace between tags. For this reason, it might be better to go with the regex idea, unless there is some kind of canonical prettification for the source that would let me change a parsed tree without creating a huge diff of irrelevant changes. (Those changes actually make the source harder to read by getting rid of helpful formatting.) I think a canonical prettifier would make more sense, and it would be useful for making automated changes in general.

domenic commented 3 years ago

Yeah, when I saw your pre-edit message this morning, I was afraid that parsing-then-serializing would cause too many diffs, since HTML generally does not roundtrip in that way.

Although I'm interested in canonical prettification of the source at some point, I don't think it's a good idea to block this project on that.

What about using parse5, but instead of using its serialization, using its node location info to textually insert into the source string? I.e. something like this pseudocode:

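const source = readSourceFile();
let output = source;

// parseIt() stands in for a parser that records each node's source offset.
const parsed = parseIt(source);

const opener = "\n<section>\n";

// Offsets refer to the original source, so track how far earlier insertions
// have shifted the text. This assumes getH1s() yields headings in document order.
let delta = 0;
for (const h1 of parsed.getH1s()) {
  const at = h1.nodeLocation + delta;
  output = output.substring(0, at) + opener + output.substring(at);
  delta += opener.length;
}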

I'm not sure if that's workable, or if it's better than regexes.
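A variation on the pseudocode above avoids the running delta by stitching the output together from slices of the original string. This is only a sketch under stated assumptions: `insertAtOffsets` is a hypothetical helper, and a regex stands in for the parser's node location info so that the example runs on its own.

```javascript
// Hypothetical helper: apply text insertions at offsets measured against the
// ORIGINAL string, leaving every other character untouched.
function insertAtOffsets(source, insertions) {
  // Sort by offset so the pieces can be stitched together in a single pass.
  const sorted = [...insertions].sort((a, b) => a.offset - b.offset);
  const pieces = [];
  let cursor = 0;
  for (const { offset, text } of sorted) {
    pieces.push(source.slice(cursor, offset), text);
    cursor = offset;
  }
  pieces.push(source.slice(cursor));
  return pieces.join("");
}

// Assumption: in a real run the offsets would come from the parser; here a
// regex finds the heading positions in a tiny sample.
const sample = "<h2>Intro</h2>\n<p>Text</p>\n<h2>Next</h2>\n";
const headingOffsets = [...sample.matchAll(/<h2>/g)].map((m) => m.index);
const spliced = insertAtOffsets(
  sample,
  headingOffsets.map((offset) => ({ offset, text: "<section>\n" }))
);
console.log(spliced);
```

Because the original text is only sliced and never reserialized, the resulting diff contains the inserted lines and nothing else, which is the property the parse-then-serialize approach lacked.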

Another route would be to use tools like parse5 to validate the output. In particular, I'm thinking something that verifies that each hN is contained in N-deep section elements. That sounds pretty easy. And then you could use regexes or any other technique; we'd just need to hand-check the validation code, then we could trust it.
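As a rough illustration of that validation idea, the sketch below checks that every hN heading sits inside the expected number of open section elements. It is not the checker used by the project: `checkSectionDepth` and its `baseLevel` parameter are hypothetical, requiring exactly (rather than at least) N sections is an assumption, and the regex scan is a stand-in for a real parser such as parse5, so it ignores comments, scripts, and other places where tag-like text can appear.

```javascript
// Hypothetical checker: returns a list of problems; an empty list means every
// heading was found at the expected section depth.
function checkSectionDepth(html, baseLevel = 1) {
  const problems = [];
  let depth = 0;
  // Only <section>, </section>, and <h1>..<h6> tags matter for this check.
  const tagPattern = /<(\/?)(section|h[1-6])\b[^>]*>/gi;
  for (const match of html.matchAll(tagPattern)) {
    const closing = match[1] === "/";
    const name = match[2].toLowerCase();
    if (name === "section") {
      depth += closing ? -1 : 1;
      if (depth < 0) {
        problems.push(`unmatched </section> at offset ${match.index}`);
        depth = 0;
      }
    } else if (!closing) {
      // With baseLevel 1, an hN heading is expected inside N open sections.
      const expected = Number(name[1]) - baseLevel + 1;
      if (depth !== expected) {
        problems.push(`<${name}> at offset ${match.index}: depth ${depth}, expected ${expected}`);
      }
    }
  }
  if (depth !== 0) problems.push(`${depth} <section> element(s) left unclosed`);
  return problems;
}

console.log(checkSectionDepth("<section><h1>A</h1><section><h2>B</h2></section></section>"));
console.log(checkSectionDepth("<h1>A</h1><section><h2>B</h2>"));
```

Passing `baseLevel = 2` would instead treat h2 as the outermost sectioned heading, which may suit a document whose h1 is only the title.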