Add unique ids to each parse error

gsnedders commented 8 years ago

It would be convenient for the sake of testing implementations to have unique ids on each parse-error.

I know @Hixie opposed this because the spec only requires parse errors to emitted at specific points, and some implementations merge multiple parse errors in the spec, and hence cannot distinguish them.

I don't think this is a good argument against it, as implementations can always have a mapping of their parse error x corresponds to a and b in the spec.

At the moment html5lib-tests essentially has position and some meaningless message.

Both html5lib and html5ever have unique messages for each parse error. It would be nice to make sure we have the right errors being raised.

We probably want ids something like the keys in the dictionary E in https://github.com/html5lib/html5lib-python/blob/master/html5lib/constants.py#L7.

gsnedders commented 8 years ago

cc @zcorpan @nox @hsivonen

zcorpan commented 8 years ago

Sounds like a good idea. (I also think the parser could use more ids in general.)

hsivonen commented 8 years ago

I'm OK with this.

haroldfredshort commented 8 years ago

I've gone through the microsyntax parse errors first. There are not very many but these are the hardest to properly identify. With one exception they can all fit the 'x-in-y' pattern. What do you think of these so far?

parse-error-comma-in-img-srcset parse-error-trailing-commas-in-img-srcset-url parse-error-invalid-value-in-img-descriptor parse-error-img-descriptor (this is an else-case parse error for img descriptors, not sure what to call it) parse-error-empty-size-in-sizes-attribute parse-error-invalid-size-in-sizes-attribute

Example: <span data-x="concept-microsyntax-parse-error" id="parse-error-comma-in-img-srcset">parse error</span>

Next I'll see about listing the generalized forms for the remaining parse errors.

sideshowbarker commented 8 years ago

I think we should consider actually uniquely naming each parse error in the body of the spec; like:

<code><dfn>CommaInImgSrcset</dfn></code><span
    data-x="concept-microsyntax-parse-error">parse error</span>

domenic commented 8 years ago

Gah, we keep giving poor @haroldfredshort conflicting information :(...

domenic commented 8 years ago

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

???

sideshowbarker commented 8 years ago

Gah, we keep giving poor @haroldfredshort conflicting information :(...

oofs, yeah, sorry for not re-reading the comments in the PR more carefully

Maybe people would all agree on something like

... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

Yeah, that would be fine by me.

zcorpan commented 8 years ago

I believe this bug is about parse errors in the "Parsing HTML documents" section -- not about errors in srcset.

haroldfredshort commented 8 years ago

Regarding the comment from @zcorpan, should I not add identifiers to the seven srcset parse errors, or would it be okay to identify these differently? Like this:

that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="microsyntax-parse-error-comma-in-img-srcset">comma in img srcset</dfn>).

The parse errors in the "Parsing HTML documents" section are simpler, but before I go further I am including an example here to make sure we are all agreed that this is how we want it to be:

...
<dt>EOF</dt><dd><span>Parse error</span>
(<dfn data-x="parse-error-eof-in-cdata-section-state">EOF in CDATA section state</dfn>)
...

sideshowbarker commented 8 years ago

should I not add identifiers to the seven srcset parse errors

The srcset parse errors are not errors that get implemented by HTML parsers. They instead are targeted just to the separate specific code that browsers implement in their srcset implementations.

So it would be good to have IDs for those as well, but those should be in a separate PR, because this issue was raised specifically just for the parse errors for HTML parsers as defined in the HTML parsing algorithm, and so the target is just HTML parser developers/implementations.

haroldfredshort commented 8 years ago

@sideshowbarker Okay, thank you for the clarification.

haroldfredshort commented 8 years ago

I think each of the HTML parse errors can be made to fit into one of the following generalized error forms: foo-in-bar foo-not-in-bar expected-foo-got-bar unexpected-foo invalid-foo As discussed previously, each error id will begin with parse-error-.

Please let me know what you think.

gsnedders commented 8 years ago

I think a few of the entity parse errors might not quite fit into that, but am open to suggestions there (also, it's only three or four, so that's not really wasted effort if we decide to change them after you've made a PR!).

I'm happy to go with @domenic's suggestion as to the format of the dfn, so I think we're agreed to it.

gsnedders commented 8 years ago

(Sorry for that having made this bug sufficiently clear before, by the way!)

inikulin commented 7 years ago

Is anyone working on it? If not, I would like to champion it if no one would mind.

haroldfredshort commented 7 years ago

I've given up, go for it!

On Sat, Mar 11, 2017, 11:11 Ivan Nikulin notifications@github.com wrote:

Is anyone working on it? If not, I would like to champion it if no one would mind.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/whatwg/html/issues/1339#issuecomment-285892593, or mute the thread https://github.com/notifications/unsubscribe-auth/AAa9GODVj5UhOLj_2mdr2RM4ZOMQHaU2ks5rkvHKgaJpZM4IpJlq .

hsivonen commented 7 years ago

Unique id per error in the HTML parsing section seems like a good idea. When a human-facing message would be parametrized with the particular element names at hand, those variables varying shouldn't cause the error identity to be multiplied.

inikulin commented 7 years ago

OK, so here is the plan: I'll stick with error format proposed by @domenic in https://github.com/whatwg/html/issues/1339#issuecomment-226713123 with - as a word separator for error ID. Along the way, I'll update html5lib-tests with expected error data, thereby we'll have functionality covered with tests. I will use parse5 as a reference implementation. I'm planning to split job into two big chunks which will be issued as separate PRs: one for tokenization errors and one for tree construction errors.

[x] Implement error reporting and testing infrastructure in parse5
[x] Implement errors for tokenization
[ ] Implement errors for tree construction

@caridy and @diervo have expressed their desire to participate: feel free to join once I'll sort out bootstrapping-related things (first point in the TODO-list) and we'll coordinate our work on this.

Does anyone has any concerns before I start?

gsnedders commented 7 years ago

@inikulin FWIW, there was various bits of debate about parse error format in html5lib-tests, especially giving positions (some people wanted tests for start/end positions of the token with the error, IIRC, so whatever format we have there should make sure it's extensible for that). On the whole I'm in favour of something like (line,col): fragment.

inikulin commented 7 years ago

@gsnedders I'm not sure we'll always be able associate error with the token correctly. However, from the top of my mind following should work in most cases:

For tokenizer error let (line, col) be position of consumed character that initiated an error;
For tree construction let (line, col) be position of first character of token that initiated and error.

What do you mean by fragment though? Error ID or source code representation of the token?

gsnedders commented 7 years ago

@inikulin currently I think some have for tree construction (line, col) being the end of the token; I meant the error ID by fragment.

inikulin commented 7 years ago

@gsnedders yeah, makes more sense.

diervo commented 7 years ago

@inikulin I spend the weekend reading all the threads and docs about the topic, so I'm ready to help with the implementation. One you have some idea/design draft with the first integration and implementation details, I will happily help with anything you need.

This is something we really need soon for our project at Salesforce so I would gladly co-champion it with someone that has more background with the spec related stuff.

inikulin commented 7 years ago

OK, I've finished preparation steps and and finally got my hands on the spec. Here is how it looks like:

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code. Maybe prefix form suggested in https://github.com/whatwg/html/issues/1339#issuecomment-226711788 will work better? Or we could just clarify this somehow in the parse error definition?

sideshowbarker commented 7 years ago

I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code.

Seems like another option is to put the error name in front of the words parse error, like this:

Any character that is a not a Unicode character, i.e. any isolated surrogate, is a non-unicode-character-in-input-stream parse error.

domenic commented 7 years ago

Yeah, I personally like the prefix form, with @sideshowbarker's most recent variation seeming like an even nicer variant than the earlier comment.

Very excited about this!

We should probably also add something about the purpose of these identifiers in the parse error definition section.

inikulin commented 7 years ago

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

sideshowbarker commented 7 years ago

I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:

Some parse errors have dedicated error codes that should/may be used by conformance checkers...

I’m supportive of using a normative “should” to state that requirement for conformance checkers.

zcorpan commented 7 years ago

When doing the next round to add codes to the tree builder, please also add a comma and space after "e.g." here: https://github.com/whatwg/html/pull/2701/commits/d6764a59e90694aad75b37de2f2ea4e3a49abc3b#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100416

Edit: Did this in https://github.com/whatwg/html/pull/2728

inikulin commented 7 years ago

@zcorpan Thank you!

whatwg / html

Add unique ids to each parse error #1339