Open securityMB opened 3 years ago
cc @whatwg/html-parser
Example like, DOMPurify bypass is website/email old virus when you click on advertisement image/button/hyperlink/text. It was resolved by checking malicious code during HTML DOM Parsing.
< and > to < and > Change at code level as per your requirement like *lt".
IMO, the true reason of the described behavior is that <style>
is a raw text element so there is no "elements" and "attributes" inside it from the HTML DOM perspective, only text content (as correctly shown in the 3rd DOM example, with the #text: <a title="
child node of the style
element). Changing this behavior doesn't seem to be web compatible since at least the >
character is widely used in in-page styles as a CSS child combinator.
Could you please explain why do you expect <svg></p><style><a title="</style><img src onerror=alert(1)>">
and <svg><p></p><style><a title="</style><img src onerror=alert(1)>"></style></svg>
to be parsed differently?
@SelenIT because there it's an SVG style
element and those parse differently. Foreign content is tricky.
I understand the difference between HTML style
and SVG style
. I didn't get why the same HTML p
element breaks the foreign element parsing in the second case and doesn't break in the first. Is it the difference introduced by the "fragment case" (https://html.spec.whatwg.org/#parsing-main-inforeign)?
@SelenIT
Parsing of <svg></p>
is actually a spec bug that still works in Safari. Check: https://github.com/whatwg/html/issues/5113
[edit]: But the spec bug isn't the only case in which escaping of <
and >
in attributes would help. Check this famous XSS in Google Search video.
It had the following payload:
<noscript><p title="</noscript><img src onerror=alert(1)>">
It abused the difference in NOSCRIPT parsing when scripting is enabled and disabled. If the code would be serialized to:
<noscript><p title="</noscript><img src onerror=alert(1)>"></p></noscript>
then the XSS would also not have been possible.
Thanks a lot for the explanation!
I agree that escaping these characters will prevent the XSS in these cases. But isn't the existence of parsing modes/cases where attribute-like sequences aren't actually parsed as attributes the significant part of the problem here?
@SelenIT are there known mutation XSS exploits with parsing modes that don't also use "<" or ">" in an attribute value?
I think this change is a good idea and is probably doable.
The main unknown is the web compat risk. In 2008, all browsers except IE escaped "<" and ">" in attribute values, and the spec was changed based on feedback from myself, citing a single page that was broken in Opera but worked in IE. I don't recall any fallout of newly broken pages from making that change back then. However, it's been over a decade, and new content might have come to rely on the current behavior.
Any breakage here seems hard to find through static analysis, since it requires innerHTML
or something to serialize some HTML, and then something else to expect unescaped "<"s or ">"s. The example from 2008 was something like this:
<a href="javascript: if (foo < 10) doSomething()" id="theLink">link</a>
...
<script>
theLink.onclick = function() {
eval(this.outerHTML.match(/href="([^"]+)"/)[1]);
return false;
}
</script>
Yes, extremely silly code, but it's stuff like this that can break.
Unless someone comes up with an idea of how to identify and measure regressions beforehand, I think I would suggest experimenting with making this change in a browser and see what if anything breaks during the dev/beta period.
cc @whatwg/security , is there interest to experimentally implement this?
It doesn't sound crazy to me; if it's possible to ship compatibly, great.
/cc @otherdaniel, who might be interested in poking at this since it's somewhat related to https://github.com/WICG/sanitizer-api.
Do I correctly understand that the proposed change is to only change the to-string serialization format? If so, I am in favor of doing this from a security perspective, would be interested if @hsivonen has a strong opinion.
Do I correctly understand that the proposed change is to only change the to-string serialization format?
Exactly. The whole thing is just to encode <
and >
in attributes when serializing to string.
Implementing this would break anonymous functions in Javascript:
<x-checkbox onchange="{
document.querySelectorAll('#options x-checkbox').forEach(el => { el.checked = this.checked })
}">Check All?</x-checkbox>
=>
would have to be escaped which leads to breaking JS linters.
I just finished writing Form Associated custom elements and realized I can insert one-liners in HTML, but my HTML linter complained about >
(when it shouldn't have). Lo and behold, [onchange]
works for my custom checkbox without any extra code, which is pretty neat. I'd hate to lose it.
As mentioned in https://github.com/whatwg/html/issues/6235#issuecomment-767416703, I don't believe we want to change parsing - just serialization.
FYI, an experiment landed in Chrome: https://chromium-review.googlesource.com/c/chromium/src/+/4362958 (it should be available in Canary in a few hours).
Currently, it's under #enable-experimental-web-platform-features. There's also a use counter AttributeValueContainsLtOrGt
that hits whenever a serialized attribute value contains either <
or >
.
Fingers crossed for this to be web-compatible!
@securityMB has the change shipped in Chrome Stable yet?
@zcorpan Yes. AFAIR since M113. Still under the #enable-experimental-web-platform-features flag though.
OK. The use counter https://chromestatus.com/metrics/feature/timeline/popularity/4527 is currently at 5.3% which seems high (number of pages in httparchive 2.15% or 2.39%, also high).
The use counter is only triggered for serializing, not parsing, right?
Correct, the use counter is triggered during a serialization whenever an attribute value contains “<“ or “>”.
I agree that the usage is high although the good news is that nobody seems to have complained so far.
We’d still probably need to investigate when/why attributes usually contain these characters.
I'd like to revive this topic and attempt to move forward with the proposal despite the high UseCounter. In this comment I'll try to explain why this UseCounter is misleading and why this proposal is still worth pursuing.
First things first, the new escaping has existed in Chromium for over a year now behind --enable-experimental-web-platform-features
. I'm using this flag every day as do many other developers. As far as I'm aware, there are zero bug reports in Chromium that this change can be attributed to a site breakage. This is obviously anecdotal but something to keep in mind.
Now, about the UseCounter. The UseCounter is hit every time an attribute containing either "<" or ">" is serialized. So for example, it'd be hit in the following code:
const div = document.createElement('div');
div.setAttribute('test', 'a<b>c');
console.log(div.outerHTML)
The UseCounter does not track (and it's not possible to write a meaningful use counter for that) what happens with this value afterwards.
Usually when we think about UseCounters, we assume that when we make the change in question, all websites (which hit the UseCounter) will break. This is not the case here. Many (or probably most) of these websites will still work fine. This, unfortunately, makes it more difficult to make an informed decision based on the UseCounter. So let me try to explain why I believe most of those websites will still work fine.
Assume that we have a div
element with an attribute data-test
with value <test>
, which is then serialized.
Check the table below to find out how the serialization will look like, depending on whether we make this change or not.
Current serialization | Serialization after change |
---|---|
<div data-test="<test>"></div> |
<div data-test="<test>"></div> |
Note that the escaping of <
to <
and >
to >
is performed already by a large number of frameworks.
No matter which serialization you use, all HTML5-compliant parsers would create exactly the same DOM tree:
<html>
<head />
<body>
<div data-test="<test>" />
Furthermore, when you use various methods to access the value of the attribute (such as elem.getAttribute
, elem.attributes
, elem.dataset
and so on), you also get the same value in both serializations.
The only thing that could possibly break would be a code similar to what was already posted by @zcorpan in https://github.com/whatwg/html/issues/6235#issuecomment-766916695:
<a href="javascript: if (foo < 10) doSomething()" id="theLink">link</a>
...
<script>
theLink.onclick = function() {
eval(this.outerHTML.match(/href="([^"]+)"/)[1]);
return false;
}
</script>
It’s very difficult to check this type of things with just a static analysis (or UseCounter) so I’ll try to move forward with an experiment to run this change by default for certain amount of users to see if we can observe any breakages.
While we are very much aligned on solution and reasoning, I am not able to make the call whether we'd be comfortable with a potential webste-specific regressions in Firefox that are as complicated to debug as this one. That being said, we'd be happy to fast-follow if this turns out web compatible. Thank you for providing this level of detail, @securityMB :)
Thank you @mozfreddyb for your support! I started work internally to run the Finch experiment; I'll keep you updated.
The experiment is on! Currently limited to 50% of Canary and Dev users. I'm planning to keep it for two months. If we don't see too many complaints, I'm planning to ramp up to also 50% of Beta. And then to 1% of Stable.
Status update: now the experiment is also on for 50% of Beta (so: 50% of Canary, Dev and Beta).
With no complaints reported last month, we're now enabling the experiment for 1% of Stable.
I also created a very simple check if you're the part of the experiment (or enabled #enable-experimental-web-platform-features): https://escape-lt-gt-check.glitch.me/
I'm submitting this issue after a short discussion on Twitter with @zcorpan today.
I think we should change the rules of escaping a string in attribute mode, and also escape
<
and>
to<
and>
respectively.The fact that these characters are not escaped led to some security issues in HTML parsers and sanitizers.
As an example, see this DOMPurify bypass. The bug was that the following markup
was parsed into the following DOM tree in Chromium and Safari:
Now because the typical usage of sanitizers is as follows:
it means that the markup is serialized and then reparsed.
Now consider the following markup:
which is parsed into the following DOM tree:
It doesn't contain any harmful markup, so it is serialized to:
However, after reparsing a different DOM tree is created:
Leading to cross-site scripting. The reason for that is in fact that
p
breaks out foreign content.Please note that if
<
and>
were escaped, then the markup would be serialized to:Making this particular bypass (and many similar ones) impossible.