whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/
Other
8.16k stars 2.69k forks source link

Make &aaa; a parse error #1257

Closed zcorpan closed 7 years ago

zcorpan commented 8 years ago

See https://github.com/whatwg/html/pull/1220#discussion_r63329806

In particular, &aaa; or <p title="&ampersand;"> should be parse errors but are not right now because characters are only consumed so long as they match something in the named character references table, and then the temporary buffer is checked for a trailing ; (for the above examples it would contain &aa or &ampe, respectively).

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

Possible fixes:

  1. A new state that continues to consume (and append to the temporary buffer) alphanumerics when no match could be made, and that report a parse error on ';' and reconsume in character reference end state, and reconsume without parse error for anything else.

    This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...

  2. Use new states that insert stuff in the right places (emit a char token or append to attribute value), and parse error on ;.
  3. Like (1) but have a fixed length of characters to consume in the temporary buffer (i.e. like v.nu HTML parser).

Also need to update the note in the spec about &notit; in an attribute value.

@RReverser @sideshowbarker

RReverser commented 8 years ago

Option 2 sounds good to me - it allows for continuous streaming, yet reports parse errors if implementation chooses to do so.

sideshowbarker commented 8 years ago

I have no strong preference here but am willing in principle to write patches for the v.nu HTML parser to line it up with whatever we can agree on.

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

https://checker.html5.org/ is currently running from an off-master branch of the v.nu HTML parser that has a patch I wrote to align it as well as I could with the current HTML parsing algorithm requirements for dealing with ampersands and character references. There are some parts of the code that I found it pretty difficult to align fully to the spec without getting more review. But I wasn’t able to get the patch reviewed more thoroughly, so ended up landing it on a branch and building v.nu from that.

RReverser commented 8 years ago

This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...

@zcorpan A related thought:

Actually, if an implementation strictly follows the spec as a guide for the real algorithm (we don't in our implementation exactly to prevent infinite buffering, but some might do), this is already a case in other places. Just as an example - RCDATA end tag name state collects all the letters and only when it finds a non-letter, it either emits the temporary buffer or checks it against the last start tag name and makes further decisions. So if such a straightforward algorithm gets an input of endless HTTP stream starting with e.g. <textarea></aaaaaaaaaa..., in the best case it might be buffering all those letters until it runs out of memory.

Proper implementation would be comparing letters one by one against the last start tag name for the early bailout instead.

Anyway, my point is that we already have this form of a problem around the spec, and we need to either fix it in all those places in the spec or just add a top-level warning/note and use your proposal number 1.

zcorpan commented 8 years ago

Hmm, I think we should go with option (2) here and also fix the other problem you mention. Filed #1306 - thanks!