U+000D CARRIAGE RETURN handling

mylogin commented 2 years ago

Section 13.2.3.5 says that U+000D must be removed from the input stream. If it were not, the tokenizer would append U+000D to the current tag token's tag name in 13.2.5.8 (Tag name state); for example, <a\r\nhref="#"> would produce the tag name "a\r". Why does the tree construction stage still check for this character if it was removed earlier?

13.2.3.5 Preprocessing the input stream

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.
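For illustration, newline normalization as described above can be sketched in a few lines: every CRLF pair becomes a single LF, then every remaining CR becomes LF. The function name `normalize_newlines` is a hypothetical helper, not something taken from the spec or from any parser implementation.

```python
def normalize_newlines(text: str) -> str:
    """Replace each CRLF pair with LF, then each remaining CR with LF."""
    # Order matters: handling CRLF first avoids turning "\r\n" into "\n\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


# The example from the question: after preprocessing, the tokenizer sees
# LF (which ends the tag name) and never sees CR.
source = '<a\r\nhref="#">'
print(repr(normalize_newlines(source)))  # '<a\nhref="#">'
```

With this step in place, the tag name in the example is "a", since LF is one of the characters that ends the tag name state.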

13.2.6 Tree construction

A character token that is one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), or U+0020 SPACE

annevk commented 2 years ago

That's because of  (only a conformance error).
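As a hedged illustration of how a CR could still reach tree construction after preprocessing: a numeric character reference for U+000D is resolved by the tokenizer, which runs after newline normalization, so the resulting character is not normalized away. This is offered as one mechanism consistent with the "only a conformance error" remark, not as a restatement of the comment. The sketch uses Python's standard `html.unescape` as a stand-in for the tokenizer's character reference handling, which is an approximation; `normalize_newlines` is a hypothetical helper.

```python
import html


def normalize_newlines(text: str) -> str:
    """Hypothetical preprocessing step: CRLF -> LF, then lone CR -> LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


markup = "a&#13;b\r\nc"

# Preprocessing removes the literal CR but leaves the reference text alone.
preprocessed = normalize_newlines(markup)
print(repr(preprocessed))  # 'a&#13;b\nc'

# Decoding the reference afterwards reintroduces a CR character.
decoded = html.unescape(preprocessed)
print(repr(decoded))  # 'a\rb\nc'
```

Under that reading, the U+000D entry in the tree construction whitespace list covers CR characters that arrive as character tokens rather than as raw input.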

zcorpan commented 2 years ago

There could be a note about this in the "Preprocessing the input stream" section.

whatwg/html #7669