Open gsnedders opened 8 years ago
cc @zcorpan @nox @hsivonen
Sounds like a good idea. (I also think the parser could use more ids in general.)
I'm OK with this.
I've gone through the microsyntax parse errors first. There are not very many but these are the hardest to properly identify. With one exception they can all fit the 'x-in-y' pattern. What do you think of these so far?
parse-error-comma-in-img-srcset
parse-error-trailing-commas-in-img-srcset-url
parse-error-invalid-value-in-img-descriptor
parse-error-img-descriptor
(this is an else-case parse error for img descriptors, not sure what to call it)
parse-error-empty-size-in-sizes-attribute
parse-error-invalid-size-in-sizes-attribute
Example:
<span data-x="concept-microsyntax-parse-error" id="parse-error-comma-in-img-srcset">parse error</span>
Next I'll see about listing the generalized forms for the remaining parse errors.
I think we should consider actually uniquely naming each parse error in the body of the spec; like:
<code><dfn>CommaInImgSrcset</dfn></code><span
data-x="concept-microsyntax-parse-error">parse error</span>
Gah, we keep giving poor @haroldfredshort conflicting information :(...
Maybe people would all agree on something like
... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).
???
Gah, we keep giving poor @haroldfredshort conflicting information :(...
oofs, yeah, sorry for not re-reading the comments in the PR more carefully
Maybe people would all agree on something like
... that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="parse-error-comma-in-img-srcset">comma in img srcset</dfn>).
Yeah, that would be fine by me.
I believe this bug is about parse errors in the "Parsing HTML documents" section -- not about errors in srcset
.
Regarding the comment from @zcorpan, should I not add identifiers to the seven srcset parse errors, or would it be okay to identify these differently? Like this:
that is a <span data-x="concept-microsyntax-parse-error">parse error</span>
(<dfn data-x="microsyntax-parse-error-comma-in-img-srcset">comma in img srcset</dfn>).
The parse errors in the "Parsing HTML documents" section are simpler, but before I go further I am including an example here to make sure we are all agreed that this is how we want it to be:
...
<dt>EOF</dt><dd><span>Parse error</span>
(<dfn data-x="parse-error-eof-in-cdata-section-state">EOF in CDATA section state</dfn>)
...
should I not add identifiers to the seven srcset parse errors
The srcset parse errors are not errors that get implemented by HTML parsers. They instead are targeted just to the separate specific code that browsers implement in their srcset
implementations.
So it would be good to have IDs for those as well, but those should be in a separate PR, because this issue was raised specifically just for the parse errors for HTML parsers as defined in the HTML parsing algorithm, and so the target is just HTML parser developers/implementations.
@sideshowbarker Okay, thank you for the clarification.
I think each of the HTML parse errors can be made to fit into one of the following generalized error forms:
foo-in-bar
foo-not-in-bar
expected-foo-got-bar
unexpected-foo
invalid-foo
As discussed previously, each error id will begin with parse-error-
.
Please let me know what you think.
I think a few of the entity parse errors might not quite fit into that, but am open to suggestions there (also, it's only three or four, so that's not really wasted effort if we decide to change them after you've made a PR!).
I'm happy to go with @domenic's suggestion as to the format of the dfn
, so I think we're agreed to it.
(Sorry for that having made this bug sufficiently clear before, by the way!)
Is anyone working on it? If not, I would like to champion it if no one would mind.
I've given up, go for it!
On Sat, Mar 11, 2017, 11:11 Ivan Nikulin notifications@github.com wrote:
Is anyone working on it? If not, I would like to champion it if no one would mind.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/whatwg/html/issues/1339#issuecomment-285892593, or mute the thread https://github.com/notifications/unsubscribe-auth/AAa9GODVj5UhOLj_2mdr2RM4ZOMQHaU2ks5rkvHKgaJpZM4IpJlq .
Unique id per error in the HTML parsing section seems like a good idea. When a human-facing message would be parametrized with the particular element names at hand, those variables varying shouldn't cause the error identity to be multiplied.
OK, so here is the plan: I'll stick with error format proposed by @domenic in https://github.com/whatwg/html/issues/1339#issuecomment-226713123 with -
as a word separator for error ID. Along the way, I'll update html5lib-tests with expected error data, thereby we'll have functionality covered with tests. I will use parse5 as a reference implementation. I'm planning to split job into two big chunks which will be issued as separate PRs: one for tokenization errors and one for tree construction errors.
@caridy and @diervo have expressed their desire to participate: feel free to join once I'll sort out bootstrapping-related things (first point in the TODO-list) and we'll coordinate our work on this.
Does anyone has any concerns before I start?
@inikulin FWIW, there was various bits of debate about parse error format in html5lib-tests, especially giving positions (some people wanted tests for start/end positions of the token with the error, IIRC, so whatever format we have there should make sure it's extensible for that). On the whole I'm in favour of something like (line,col): fragment
.
@gsnedders I'm not sure we'll always be able associate error with the token correctly. However, from the top of my mind following should work in most cases:
(line, col)
be position of consumed character that initiated an error;(line, col)
be position of first character of token that initiated and error.What do you mean by fragment
though? Error ID or source code representation of the token?
@inikulin currently I think some have for tree construction (line, col)
being the end of the token; I meant the error ID by fragment
.
@gsnedders yeah, makes more sense.
@inikulin I spend the weekend reading all the threads and docs about the topic, so I'm ready to help with the implementation. One you have some idea/design draft with the first integration and implementation details, I will happily help with anything you need.
This is something we really need soon for our project at Salesforce so I would gladly co-champion it with someone that has more background with the spec related stuff.
OK, I've finished preparation steps and and finally got my hands on the spec. Here is how it looks like:
I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code. Maybe prefix form suggested in https://github.com/whatwg/html/issues/1339#issuecomment-226711788 will work better? Or we could just clarify this somehow in the parse error definition?
I'm a bit hesitating if it's clear enough that what given in the parentheses is a error code.
Seems like another option is to put the error name in front of the words parse error, like this:
Any character that is a not a Unicode character, i.e. any isolated surrogate, is a non-unicode-character-in-input-stream parse error.
Yeah, I personally like the prefix form, with @sideshowbarker's most recent variation seeming like an even nicer variant than the earlier comment.
Very excited about this!
We should probably also add something about the purpose of these identifiers in the parse error definition section.
I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:
Some parse errors have dedicated error codes that should/may be used by conformance checkers...
I wonder if we should enforce usage of specification-provided error codes, i.e. should we use "should" or "may" RFC 2119 keywords in:
Some parse errors have dedicated error codes that should/may be used by conformance checkers...
I’m supportive of using a normative “should” to state that requirement for conformance checkers.
When doing the next round to add codes to the tree builder, please also add a comma and space after "e.g." here: https://github.com/whatwg/html/pull/2701/commits/d6764a59e90694aad75b37de2f2ea4e3a49abc3b#diff-36cd38f49b9afa08222c0dc9ebfe35ebR100416
Edit: Did this in https://github.com/whatwg/html/pull/2728
@zcorpan Thank you!
It would be convenient for the sake of testing implementations to have unique ids on each parse-error.
I know @Hixie opposed this because the spec only requires parse errors to emitted at specific points, and some implementations merge multiple parse errors in the spec, and hence cannot distinguish them.
I don't think this is a good argument against it, as implementations can always have a mapping of their parse error x corresponds to a and b in the spec.
At the moment html5lib-tests essentially has position and some meaningless message.
Both html5lib and html5ever have unique messages for each parse error. It would be nice to make sure we have the right errors being raised.
We probably want ids something like the keys in the dictionary
E
in https://github.com/html5lib/html5lib-python/blob/master/html5lib/constants.py#L7.