
DOM Standard
https://dom.spec.whatwg.org/

Valid/Invalid characters in document.createElement() #849

Open — josepharhar opened this issue 4 years ago

josepharhar commented 4 years ago

A difference between the spec and browser implementations was raised here: https://bugs.chromium.org/p/chromium/issues/detail?id=1061436

In the spec, document.createElement should allow names that "match the Name production", which links to this definition. That production allows characters in many ranges, including [#x10000-#xEFFFF], which I am guessing includes emoji and would therefore allow document.createElement('test-\u{1f602}').

However, Chrome, Firefox, and Safari all throw exceptions when this code snippet is run. Should the browsers update to match the spec, or should the spec be updated? Or am I misinterpreting the spec? If the spec should be updated, what should it say instead?
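
For reference, a minimal reproduction (the exception name is what the spec's validation steps specify):

try {
  document.createElement("test-\u{1F602}"); // "test-😂"
} catch (e) {
  console.log(e.name); // "InvalidCharacterError" in current Chrome, Firefox, and Safari
}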

domenic commented 4 years ago

I agree with your reading of the spec.

Normally I would say we should change the spec to match. However, note https://github.com/whatwg/dom/pull/449 and the various issues linked from there; web developers would really like it if the rules were more permissive. (Although I think what they really want is for the DOM APIs to accept at least a superset of the HTML parser; I haven't checked what the HTML parser does with your example.) So, maybe we could take this as an opportunity to solve that long-standing issue? :)

fasttime commented 4 years ago

I'm the reporter of the Chromium issue mentioned above. Just wanted to note that removing the character range [#x10000-#xEFFFF] from the Name production rules would create a discrepancy with the HTML rules for custom element names, where that range is included. It would also entrench the inconsistent behavior observed in current versions of Chrome and Firefox, where you are allowed to register an element with an emoji in its name but not to create it:

customElements.define("emotion-😍", HTMLElement); // OK
document.createElement("emotion-😍"); // Allowed per spec, error in Chrome and Firefox

So if the consensus is that supplementary Unicode characters should be banned, I would recommend changing the HTML rules accordingly as well.

josepharhar commented 4 years ago

Thanks for the response @domenic, I can see how this is just a part of making all the DOM APIs accept more characters. I saw another crbug about a similar issue here, and found a corresponding spec bug here. It sounds like if we do this, we should do it for all DOM APIs, as it looks like you suggested in #449. This seems like it would be a big change, right? What can we do to make this happen? Do people have strong opinions against it?

Thanks for the elaboration @fasttime, I agree that it would make sense for customElements.define() to match document.createElement().

domenic commented 4 years ago

> It sounds like if we do this, we should do it for all DOM APIs, as it looks like you suggested in #449. This seems like it would be a big change, right? What can we do to make this happen? Do people have strong opinions against it?

I think it's probably not that big. In #449 and previous issues we were aiming for alignment, i.e. making the parser and the DOM APIs match. It took us a good amount of work (use counters, etc.) to discover that was likely not web-compatible. (Also, looking at the bug, it seems the process also involved educating me about namespace-related things... fun times.)

But with that foundational work out of the way, I think this becomes simpler. I.e., now we know it's just a matter of updating the spec, writing some tests, and filing/fixing some browser bugs. The change is just one line in the spec, to accept more things than were accepted before. And there shouldn't be any compat worries.

I think most people are for such a change, although it's been low priority for browser vendors. (Until now? 😄)

josepharhar commented 4 years ago

Sounds great! I'd *imagine* that it wouldn't be that hard to implement either. Perhaps we should start by listing all of the particular APIs that we would need to change?

domenic commented 4 years ago

I think the APIs are in two sets: elements and attributes. You could do the two sets together, or separately.
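
For concreteness, the entry points presumably in scope (a non-exhaustive illustration, not an official list):

// Element-name validation:
document.createElement("my-tag");
document.createElementNS("http://www.w3.org/1999/xhtml", "my-tag");

// Attribute-name validation:
const el = document.createElement("div");
el.setAttribute("data-x", "1");
el.setAttributeNS(null, "data-x", "1");
document.createAttribute("data-x");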

josepharhar commented 4 years ago

Thanks, that makes sense. Should we just change the spec and implement it? Should we ask for support from Firefox and Safari?

domenic commented 4 years ago

My suggestion is to do a spec pull request to make it more concrete, and iterate on that (I'm happy to help) to make sure we've got it right. Then we can ask for broader support/review.

As for when to work on implementation and tests, that's a judgment call. The potential downside is doing implementation/test work and finding out that we missed something that makes it show-stoppingly objectionable, which would mean the implementation/test work is "wasted". But in this case I think it'd be relatively safe to do so in parallel, as I am hopeful that this change will not be controversial.

domenic commented 2 years ago

Here is an initial stab at rules that take the union of what current DOM APIs allow and what the parser allows. Someone double-checking would be great; if I can get confirmation that they seem right, I can probably spend some time on a spec PR.

Probably we should not touch custom element name rules. We could in theory make PCENChar similarly lenient to LenientNameChar, but I'm not sure that leniency is actually a good idea there, since customElements.define() basically gives us a single location at which to enforce good naming practices and, if they're met, grant custom element powers. It's not like the situation with parser-created vs. API-created elements.

Although I've phrased the above in terms of hypothetical grammar productions (e.g. LenientElementNameStartChar) the actual spec would probably be better as algorithms that loop over code units/code points, since that is how they're implemented. And per the OP of this thread the current implementations have bugs, which I suspect might be due to the attempt at translating from grammar specifications into algorithms.

domenic commented 2 years ago

Alternate approach suggested by @annevk: be maximally lenient in the DOM APIs. The difference from the above would be, roughly, to allow any name at all apart from a small blocklist of parser-significant code points (tab, LF, CR, FF, space, /, >, NULL).

The upside of this is that it simplifies the spec and implementation a good bit. The intention of the current restrictions from the XML spec seems to be a misguided segmentation of Unicode which hasn't really panned out in practice. It might also give small performance wins; it's hard to say since on the one hand this is definitely a microoptimization, but on the other hand createElement() / setAttribute() / etc. are called quite often in modern web apps.

The downside is that this allows more cases where DOM APIs can produce unparseable HTML. I.e., currently you can produce HTML which does not survive a parse-serialize roundtrip using something like document.createElement("_a"). That is kind of sad. This more-lenient version would expand the number of cases where that's possible, to include things like document.createElement("$a"). Whereas the version in https://github.com/whatwg/dom/issues/849#issuecomment-1007541209 would disallow $a (while continuing to allow _a).
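
To make the roundtrip failure concrete (the HTML parser's tag open state requires an ASCII letter after <):

const el = document.createElement("_a"); // allowed today by the Name production
console.log(el.outerHTML); // "<_a></_a>"
const doc = new DOMParser().parseFromString(el.outerHTML, "text/html");
console.log(doc.body.firstChild.nodeName); // "#text" — "<_a>" reparses as character data, not an element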

Additionally, if we're giving up on parse-serialize roundtrips anyway, it's not clear whether disallowing tab, LF, CR, FF, space, /, >, NULL is important either! We could just allow any string...

mfreed7 commented 2 years ago

So I'm generally supportive of the second, "maximally lenient" approach. It makes the spec, the implementation, and the developer understanding of this behavior straightforward, which is good for performance, bugs, and happiness. Except I do worry about this downside:

> The downside is that this allows more cases where DOM APIs can produce unparseable HTML. I.e., currently you can produce HTML which does not survive a parse-serialize roundtrip using something like document.createElement("_a"). That is kind of sad. This more-lenient version would expand the number of cases where that's possible, to include things like document.createElement("$a"). Whereas the version in #849 (comment) would disallow $a (while continuing to allow _a).
>
> Additionally, if we're giving up on parse-serialize roundtrips anyway, it's not clear whether disallowing tab, LF, CR, FF, space, /, >, NULL is important either! We could just allow any string...

My concern isn't so much that it'll allow non-roundtrippable DOM, but more that this might introduce mXSS type security concerns. For example, I'm quite sure we don't want to ever allow >, for this specific reason. But there might be other characters that render this pattern mXSS-dangerous?

domenic commented 2 years ago

For others who weren't aware of mXSS, it apparently is about malicious users supplying strings which are stored in a database, and then passed to innerHTML, such that they create unexpected effects because x.innerHTML = y; x.innerHTML doesn't always return y. (Because the parse/serialize roundtrip mutates the input.)
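
A minimal, non-malicious illustration of that roundtrip mutation:

const x = document.createElement("div");
x.innerHTML = "<p>1<p>2"; // the parser inserts the implied end tags
console.log(x.innerHTML); // "<p>1</p><p>2</p>" — not the string that was assigned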

So I guess the relevance here is, if someone does

const el = document.createElement(userSuppliedString);
await storeInServer(el.outerHTML);

// ... later ...
const html = await getFromServer();
container.innerHTML = html;

and userSuppliedString is p><script>alert(1);</script, then el.outerHTML would end up as <p><script>alert(1);</script></p><script>alert(1);</script>. That is indeed bad!

If we disallow tab, LF, CR, FF, space, /, >, NULL, then nothing immediately sticks out to me as problematic. But I worry that I'm just not doing enough security analysis. It would be a shame if this simplification fix ended up causing everyone security problems.

From that perspective, the approach in https://github.com/whatwg/dom/issues/849#issuecomment-1007541209 should be generally safer. It would make document.createElement(userSuppliedString) only as dangerous as it currently is, plus the danger of el.innerHTML = `<${userSuppliedString}>`. Which seems manageable.

Hmm.

annevk commented 2 years ago

That's a great point, though I'd like to dig into this a bit more before weighing the options.

It seems that currently you can put the parser in a different state by using :, _, or a code point greater than U+007F from NameStartChar. However, you don't have much room after that as the parser operates on ASCII code points and you only get ., -, and 0 through 9 in NameChar, none of which have any effect.
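
Sketching that in code (behavior per the current Name-based validation):

document.createElement(":a");    // ok: ":" is a NameStartChar
document.createElement("_a");    // ok: "_" is too
document.createElement("a.b-9"); // ok: ".", "-", and digits are NameChars
// document.createElement("1a"); // throws an "InvalidCharacterError" DOMException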

The HTML parser on the other hand requires a through z (ASCII case-insensitive) to get into the tag name state at which point most things go. However, before it gets there (in tag open state) it also has special handling for !, /, and ?, as well as anything that is not a through z. Meaning you could get into the data state at which point & and < have some special meaning.
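
For example (today's parser behavior, observable via DOMParser):

const doc = new DOMParser().parseFromString("<:a><?b>", "text/html");
console.log(doc.body.firstChild.nodeValue);   // "<:a>" — ":" cannot start a tag, so it is emitted as text
console.log(doc.body.childNodes[1].nodeType); // 8 — "<?b>" was swallowed into a bogus comment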

What does this mean? I'm not entirely sure.

It could be that inputs such as :&lt;script&gt;alert(1)&lt;&#47;script&gt; become problematic if we need to account for reparsing attacks. Also simpler inputs such as :<!--. (Though see https://lists.w3.org/Archives/Public/www-dom/2014JanMar/0175.html.)

I wonder if we can get to more lenient rules safely if we better account for ASCII.

If the first code point is a through z (ASCII case-insensitive), the following code points can be anything except for the code points that escape the tag name state. This ought to be safe as the HTML parser already operates this way.

If the first code point is not a through z, I think it's fine if it's greater than U+007F; below that, it ought to match what NameStartChar allows (i.e., : or _). Subsequent code points should then be what NameChar allows, though again I think anything greater than U+007F ought to be okay too, as it cannot influence the parser (unless we also need to account for Unicode normalization, but at that point it becomes rather bananas in my opinion).

This is both stricter and more lenient than what @domenic wrote above. It's more strict when it comes to ASCII which I think is a good thing as that is the risky area of potential state transitions. It's "everything goes" for non-ASCII which I think is good mainly from the point of reducing complexity of the types of checks we need to perform.

Thoughts?

bathos commented 2 years ago

@annevk I think that it would also be reasonable to disallow lone surrogates. It’s possible to get them into the input stream with document.write and they’re accepted as chardata unchanged via innerHTML, etc, but they’d normally end up converted to U+FFFD in html document parsing scenarios (e.g. if one “appeared” in a document parsed as UTF-8, WTF-8 style). Given the translation there would be non-ascii→non-ascii, it may not be a major hazard, but it’s pretty hard to imagine any benefit to allowing them in names and not too hard to imagine it causing problems.

securityMB commented 2 years ago

@mfreed7 asked me about my thoughts on security implications of being more lenient in tag names. So here are my two cents.

@domenic wrote:

> So I guess the relevance here is, if someone does
>
> const el = document.createElement(userSuppliedString);
> await storeInServer(el.outerHTML);
>
> // ... later ...
> const html = await getFromServer();
> container.innerHTML = html;
>
> and userSuppliedString is p><script>alert(1);</script, then el.outerHTML would end up as <p><script>alert(1);</script></p><script>alert(1);</script>. That is indeed bad!
>
> If we disallow tab, LF, CR, FF, space, /, >, NULL, then nothing immediately sticks out to me as problematic. But I worry that I'm just not doing enough security analysis. It would be a shame if this simplification fix ended up causing everyone security problems.

I think it's a reasonable assumption that the user would also be able to control at least some parts of the content if they can supply the tag name. If that's the case, then userSuppliedString might be just script. Or (if we're being lenient) _<script, which would serialize to <_<script>, in fact starting a SCRIPT element.
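
The reparse half of that can be demonstrated today with DOMParser (only the lenient createElement call is hypothetical):

const doc = new DOMParser().parseFromString("<_<script></_<script>", "text/html");
console.log(doc.body.firstChild.nodeValue);        // "<_" — the junk prefix becomes text
console.log(doc.querySelector("script") !== null); // true — the inner "<script" opened a real SCRIPT element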

That being said, I've never been under impression that the design goal of DOM APIs is to disallow reparsing attacks. There are countless examples of "manually" created DOM trees that will execute JavaScript after reparsing. Even the HTML spec admits the problem. Three examples of that:

// Example 1
const div = document.createElement('div');
div.append(document.createComment('--><img src onerror=alert(1)>'));
console.log(div.outerHTML);
// logs: "<div><!----><img src onerror=alert(1)>--></div>"

// Example 2
const style = document.createElement('style');
style.textContent = '</style><img src onerror=alert(1)>';
console.log(style.outerHTML);
// logs: "<style></style><img src onerror=alert(1)></style>"

// Example 3
const textarea = document.createElement('textarea');
const div2 = document.createElement('div');
div2.setAttribute('title', '</textarea><img src onerror=alert(1)>');
textarea.append(div2);
console.log(textarea.outerHTML);
// logs: "<textarea><div title="</textarea><img src onerror=alert(1)>"></div></textarea>"

Furthermore, going back to the example given by @domenic:

const el = document.createElement(userSuppliedString);
await storeInServer(el.outerHTML);

// ... later ...
const html = await getFromServer();
container.innerHTML = html;

I'd say that no matter what happens in document.createElement, this piece of code is still a security issue, because the HTML should be sanitized before assigning it to innerHTML.

So my personal take is that if we disallow tab, LF, CR, FF, space, /, >, NULL, we should be fine. This way, we disallow creating a reparsing issue from a single call to document.createElement. If some other conditions are met, then you can still have a mutation XSS but it's also the case for many other DOM APIs. I don't consider it that much of a problem, since in a real world mXSS usually happens when parser-created HTML is mutated on reparsing (check this research of mine to see an example).

annevk commented 2 years ago

For clarity, I was asked to write down https://github.com/whatwg/dom/issues/849#issuecomment-1058064183 a bit more formally. I still think that idea is reasonable: although I agree we do not defend against reparsing attacks, it still seems good not to introduce novel ones.

Using XML EBNF's notation:

NewCreateElementName ::= HTMLParserCompatibleName | CreateElementCompatibleName
HTMLParserCompatibleName ::= [a-zA-Z] [^#x00#x09#x0A#x0C#x0D#x20/>]*
CreateElementCompatibleName ::= CreateElementCompatibleNameStartChar (CreateElementCompatibleNameChar)*
CreateElementCompatibleNameStartChar ::= ":" | "_" | [#x80-#x10FFFF]
CreateElementCompatibleNameChar ::= CreateElementCompatibleNameStartChar | [a-zA-Z] | "-" | "." | [0-9]

(I would be okay with attempting to ban surrogates, but it feels a bit murky given that you can create elements in the HTML parser that contain them still. I'd be more comfortable closing that hole if we closed it there at the same time.)
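
For illustration only, that grammar translated into a JavaScript predicate (the function name and regexes are illustrative, not proposed spec text):

function isNewCreateElementName(name) {
  // HTMLParserCompatibleName: an ASCII letter, then anything except
  // NULL, tab, LF, FF, CR, space, "/", and ">".
  const htmlParserCompatible = /^[a-zA-Z][^\0\t\n\f\r \/>]*$/u;
  // CreateElementCompatibleName: ":", "_", or a non-ASCII code point to start;
  // continuation code points additionally allow ASCII letters, digits, ".", "-".
  const createElementCompatible = /^[:_\u{80}-\u{10FFFF}][:_\u{80}-\u{10FFFF}a-zA-Z0-9.-]*$/u;
  return htmlParserCompatible.test(name) || createElementCompatible.test(name);
}

isNewCreateElementName("div");        // true (HTML-parser-compatible)
isNewCreateElementName("_a");         // true (createElement-compatible)
isNewCreateElementName("emotion-😍"); // true
isNewCreateElementName("$a");         // false

Note that, like the grammar itself, this admits lone surrogates, since U+D800–U+DFFF fall inside the #x80–#x10FFFF range.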

otherdaniel commented 2 years ago

I'd prefer it if the API, HTML parser, and CSS naming rules became more aligned, rather than less, so I think newly allowed character sets should be the same across the HTML parser and document.createElement, and ideally CSS too. I'm less concerned about which character sets we pick, as long as we're consistent across the browser surfaces that deal with element and attribute names.

I'd quite strongly prefer that no existing HTML/XML meta characters be newly allowed. E.g. several proposals above allow "<" as part of element names, or single quotes.

The Unicode replacement character (U+FFFD) should probably be disallowed. This has caused browser bugs before. (Examples in the reference.)

Not sure if this already exists, but there should probably be some language about which Unicode canonicalization (not) to do, and how equality of names is determined. Ideally, this would also be aligned between HTML, JS, and CSS; I care less about the actual rules than about whether they're the same. (CVE-2000-0884 was a bug at the OS level, where one part of the system canonicalized one way and another part the other way.) I believe ECMAScript already has fairly specific rules about this.
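
An illustration of the canonicalization question: two names that render identically but compare unequal, since DOM and JS string comparisons are code-unit-exact:

const nfc = "caf\u00E9";  // "café" as a single precomposed code point (NFC)
const nfd = "cafe\u0301"; // "café" as "e" plus a combining accent (NFD)
console.log(nfc === nfd); // false
console.log(nfc.normalize("NFC") === nfd.normalize("NFC")); // true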

annevk commented 2 years ago

We're operating under the assumption that the HTML parser won't change, so I'm not sure what your comment means. Could you elaborate?

securityMB commented 2 years ago

> I'd quite strongly prefer that no existing HTML/XML meta characters be newly allowed. E.g. several proposals above allow "<" as part of element names, or single quotes.

Even if these characters are allowed, they shouldn't really undermine security, as it is currently also possible to get them into tag names created by the HTML parser.

Consider the following example:

const doc = new DOMParser().parseFromString(`<elem<'abc>test`, 'text/html');
const tagName = doc.body.firstChild.tagName;
// tagName is now equal to "ELEM<'ABC"

Even if this HTML is serialized and reparsed, the tag would still be reparsed as ELEM<'ABC.
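
The roundtrip claim is easy to verify (continuing from the snippet above):

const html = doc.body.innerHTML; // "<elem<'abc>test</elem<'abc>"
const doc2 = new DOMParser().parseFromString(html, "text/html");
console.log(doc2.body.firstChild.tagName); // still "ELEM<'ABC"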