yesodweb / haskell-xss-sanitize

prevent XSS attacks by sanitizing html (this is different then escaping!)
Other
20 stars 15 forks source link

Function to get rid of invalid html #19

Open tysonzero opened 2 years ago

tysonzero commented 2 years ago

I am looking into rendering some user-submitted html, so unsurprisingly I'm planning on using this library to sanitize it.

However:

There is no guarantee for how a browser will display invalid HTML, so there is no guarantee that this function will protect your HTML from being broken by a user's html.

This makes me pretty nervous. Preventing invalid HTML from breaking the rest of the page is pretty essential.

How difficult would it be to add a functionality that would get rid of or fix invalid html to avoid this problem? Or at least invalid html that causes problems in practice.

Are there any known examples of html that is problematic even after sanitization? I was hoping that a parent div with some overflow: hidden would be sufficient.

snoyberg commented 2 years ago

I don't remember offhand. But my recommendation for sanitizing HTML is to parse and render with html-conduit.

In practice, that comment is probably out of date, since HTML5 does standardize HTML parsing rules. But rendering something more idiomatic is probably a good move.

tysonzero commented 2 years ago

It does not appear that html-conduit is guaranteed to output valid html5.

I actually found some valid html5 that becomes invalid when round-tripped through html-conduit:

Specifically anything that contains something like <tr><td>foo<td>bar</tr>, which is legal html5 where the second opening <td> automatically closes the previous <td>.

Yeah given all of HTML5's standardized error correction we may just chance it with sanitizeBalance. As long as it's bulletproof from a security perspective, the occasional rendering bug which we can hunt down isn't the end of the world.

A fully html5-compliant parser would definitely be ideal though.

tysonzero commented 2 years ago

Seems like just about all Haskell html5 parsers have the same <td> bug, including tagsoup and xss-sanitize.

tysonzero commented 2 years ago

Made a separate issue #20 to address html5 parsing issues