Improve handling of malformed unicode bidi control characters

jamiebuilds commented 7 months ago

What is the issue with the HTML Standard?

Discussed this with @fantasai

TL;DR - Could <bdo>, <bdi>, and possibly dir be updated to deal with unbalanced/malformed Unicode bidi control characters to avoid them affecting their surrounding text?

When accepting user input, users will often paste text from other programs that sometimes contains malformed (unclosed, unopened) unicode bidi control characters. When you use this user input in HTML, wrapping them with <bdi>, <bdo>, and/or dir is not enough to avoid it breaking out of the supposed 'isolation', and it can wreak havoc on the surrounding text.

<!-- &#x202E; = 'RIGHT TO LEFT OVERRIDE' (unclosed) -->
<!-- Maybe pasted from another program (or maybe just malicious) -->
<input value="&#x202E;Jamie">

In the simplest case, the text is appears as expected:

<p>Hey <bdi>&#x202E;Jamie</bdi>, welcome!</p>
<!-- Rendered: Hey eimaJ, welcome! -->

But if we pop the directional isolate, you can break out of the <bdi>

<!-- &#x2069; = 'POP DIRECTIONAL ISOLATE' (unopened) -->
<p>Hey <bdi>&#x2069;&#x202E;Jamie</bdi>, welcome!</p>
<!-- Rendered: Hey !emoclew ,eimaJ -->

You could of course wrap it with another <bdo> and it will be fixed again:

<p>Hey <bdi><bdi>&#x2069;&#x202E;Jamie</bdi></bdi>, welcome!</p>
<!-- Rendered: Hey eimaJ, welcome! -->

What's missing is a balancing of these control characters. In order to combat this we've written a function to close fix-up these strings before they end up in the HTML.

// Wraps with 'FIRST STRONG ISOLATE' + text + 'POP_DIRECTIONAL_ISOLATE'
bidiIsolate('hello') // '\u2068hello\u2069'

// Closes unclosed isolates/overrides/formatting:
bidiIsolate('\u202Ahello') // '\u2068\u202Ahello\u202C\u2069'

// Strips out unopened 'POP' control chars:
bidiIsolate('\u2069hello') // '\u2068hello\u2069'

Source: https://github.com/signalapp/Signal-Desktop/blob/ce0fb220411b97722e1e080c14faa65d23165784/ts/util/unicodeBidi.ts#L124

See Code

```ts // Copyright 2024 Signal Messenger, LLC // SPDX-License-Identifier: AGPL-3.0-only export function _bidiIsolate(text: string): string { // Wrap with with first strong isolate so directional characters appear // correctly. return ( FIRST_STRONG_ISOLATE + balanceUnicodeDirControlChars(text) + POP_DIRECTIONAL_ISOLATE ); } function balanceUnicodeDirControlChars(input: string): string { // This gets called by i18n code on many strings, so we want to avoid // as much work as possible if (!hasAnyUnicodeDirControlChars(input)) { return input; } let result = ''; let formattingDepth = 0; let isolateDepth = 0; // We need to scan the entire input string and drop some characters as we // go in case they are closing something that was never opened. for (let index = 0; index < input.length; index += 1) { const char = input[index]; switch (char) { case LTR_EMBEDDING: case RTL_EMBEDDING: case LTR_OVERRIDE: case RTL_OVERRIDE: formattingDepth += 1; result += char; break; case POP_DIRECTIONAL_FORMATTING: formattingDepth -= 1; // skip if its closing formatting that was never opened if (formattingDepth >= 0) { result += char; } break; case LTR_ISOLATE: case RTL_ISOLATE: case FIRST_STRONG_ISOLATE: isolateDepth += 1; result += char; break; case POP_DIRECTIONAL_ISOLATE: isolateDepth -= 1; // skip if its closing an isolate that was never opened if (isolateDepth >= 0) { result += char; } break; default: result += char; break; } } // Ensure everything is closed let suffix = ''; if (formattingDepth > 0) { suffix += POP_DIRECTIONAL_FORMATTING.repeat(formattingDepth); } if (isolateDepth > 0) { suffix += POP_DIRECTIONAL_ISOLATE.repeat(isolateDepth); } return result + suffix; } ```

By running text through this function before it ends up in HTML, we can at least prevent the malformed text from affecting its surrounding text which better aligns with the stated goal of <bdi>:

From [4.5.24]

The bdi element represents a span of text that is to be isolated from its surroundings for the purposes of bidirectional text formatting. [BIDI]

@fantasai suggested that this could possibly grafted onto <bdi>, <bdo>, and possibly all other uses of dir as well

annevk commented 7 months ago

cc @whatwg/i18n

aphillips commented 7 months ago

I18N discussed this in our call of 2024-04-18 and I drew an action to respond.

The initial reaction I18N has was that this sounds like a problem and is worth solving. It feels like the idea is that element-level isolation is independent of text-level isolation.

I would probably not link it just to <bdi>/<bdo>, since we went to some effort to make the dir attribute isolating on all elements by default. Shouldn't this behavior be tied to the CSS style unicode-bidi: isolate? The intention is to isolate the textual contents.

The code looks like it might be effective.

jamiebuilds commented 3 months ago

Shouldn't this behavior be tied to the CSS style unicode-bidi: isolate? The intention is to isolate the textual contents.

@aphillips I'm not sure if that question was directed at me, but that makes enough sense to me. I don't have any strong opinions about specifically how this behavior is added, just a general problem statement.

annevk commented 3 months ago

cc @whatwg/css @fantasai

Manishearth commented 1 month ago

I wrote up some thoughts on this on Twitter, but I figure I'd write it up properly for this issue:

The thing that probably should be fixed is this text:

Following Unicode Bidirectional Algorithm clause HL3 [UAX9], values other than normal effectively insert the corresponding Unicode bidi control codes into the text stream at the start and end of the inline element before passing the paragraph to the Unicode bidirectional algorithm for reordering. (See § 2.4.2 CSS–Unicode Bidi Control Translation, Text Reordering.)

What you basically want this to do is for this to have the effect of inserting these control codes such that the corresponding push/pop directives are correctly retained in the bidi algorithm.

The bidi algorithm matches things up in BD8-BD11. A bidi implementation could potentially allow the caller to specify a priority as to how things get matched. The spec doesn't have any affordances for this, but it could (we could expand on HL3, for example). However I don't really see implementations moving to such a model easily; since the API for such a model is a mess.

Another way of fixing this is just sanitizing the content: remove any unmatched override/embedding/isolate initiators or pop directives. Not too complicated. Kind of annoying.

I think the best thing to do from the CSS spec standpoint is to spec these as basically creating a new bidi context such that their content is processed in a separate bidi algorithm run than their surroundings. In this case, both override and isolate would run the bidi algorithm on their content with strong directionality as the directionality of the override/isolate element, and their surrounding content treats them as opaque embedded content (similar to how inline images / other inline-block elements get handled today), with the override/embedding ones being treated as LRM/RLM in the surrounding content, and the isolates being treated exactly like other inline-block elements (as the spec says, use U+FFFC).

I believe isolate-override would be handled as an "auto" algorithm on the inner content and a neutral character (U+FFFC) on the surrounding content.

It's unclear to me exactly which spec (CSS2 visual formatting?) handles how inline-block elements in bidi text are handled so I'm not quite clear on how this is best written in spec text, but I think that's the best model to follow.

cc @eggrobin

mihnita commented 1 month ago

How is any other improperly paired element handled? Pasting "foo this is bold" or "foo non-italic"?

I don't see this as different. Other that this is potentially more "visually disruptive" than most other tags?

Well, some of the tags can also be disruptive than a simple bold. For example a  with a class or style that makes it hidden. Or with an id that is the same with another id that already exists in the target document?

And what about tags that are not explicitly included in the text I copy, but as a user I still expect the result to look the same:

<p style="color:red">Some text <b>from here on it is bold</b> and normal again.</p>
                     ^^^^^^^

If I copy-paste the marked section of text, when I paste it I expect it to be bold and red.

Are these already solved? If yes, can the bidi issue be solved the same way?

Manishearth commented 1 month ago

How is any other improperly paired element handled? Pasting "foo this is bold" or "foo non-italic"?

This is handled in the (X|XHT|HT)ML parser: closing tags are inserted automatically where necessary. (Unpaired closing tags are just ignored).

Unfortunately, handling bidi controls in the parser would be major scope creep.

It should absolutely be possible to represent unpaired bidi controls in XML, and it even should be possible to represent that in HTML, so sanitizing in the parser seems right out. Here, we want the application to be able to control where they actually "escape".

So this cannot be handled the same way, or at least cannot be handled in the same place.

The algorithm used could be the same: above I mention that one solution is to sanitize content by "remov[ing] any unmatched override/embedding/isolate initiators or pop directives", but a different way to do it would be to insert matching pop directives for unmatched control characters and remove lone pop directives.

It's not clear to me that sanitization is the best path forward. The main three paths I see are:

Sanitize the input text
- This can be done by removing all unpaired control characters, or
- This can be done by inserting pop characters where necessary and removing unmatched pops
Isolate the bidi algorithm run
Tweak the bidi algorithm to allow for higher-level control over pairing of bidi controls. I'm not in favor of this because I don't think bidi algorithm implementations will actually be able to provide a convenient API for this.

macchiati commented 1 month ago

I tend to agree with Manish, that #3 would be more difficult to get the lower-level implementations to handle.

As to either Sanitize or Isolate, I think either one could work. I think a goal for Sanitize should be to "produce the same results as Isolate.

sophiebits commented 1 month ago

How is any other improperly paired element handled? Pasting "foo this is bold" or "foo non-italic"?

I would also add from a web developer perspective: including untrusted HTML is one of the most well-known security vulnerabilities as it leads to XSS. As a result, typically all HTML special characters < > & " ' are escaped as entities < > & " ' before they are interpolated into markup so that they are interpreted literally. (If you are interacting with HTML elements via the JavaScript DOM API, code like div.textContent = "foo this is bold" is equivalent to div.innerHTML = "foo this is bold".) Web frameworks such as React do this automatically when rendering all text; if you do want to render user-provided HTML then you need to opt for APIs such as the scarily-named dangerouslySetInnerHTML.

In contrast, it's generally assumed that characters other than these are safe to include. The vast majority of web applications make no attempt to sanitize bidi control characters present in untrusted user input. This is both because many people are not aware that this is even a possible issue and because there is no clear, authoritative guidance on how to do this correctly.

For example @jamiebuilds's code at the top of this issue is a valiant effort but I understand based on my reading of UAX #9 that it may fail to reject strings with directional formatting characters invalidly interleaved with directional isolate characters. I would look to the Unicode consortium to clearly document a correct algorithm for doing this type of sanitization and would look to HTML to ideally make it easy to include this, such as by changing some of the existing bidi CSS properties to have an effect as others have mentioned.

I also mentioned on Twitter that I'm unsure if there are other similar Unicode "security" concerns to be aware of. For example I imagine I don't generally want to allow combining marks at the beginning of a user-provided string. What else? This is ideally something where simple guidance would exist somewhere so that all web content authors can easily get this right.

jfkthame commented 1 month ago

In theory, I guess a control character such as U+206C INHIBIT ARABIC FORM SHAPING in user input could also have undesirable effects on the context, though in practice (AFAIK) such codes are not widely implemented in current systems.

macchiati commented 1 month ago

They are also deprecated: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bdeprecated%7D

Manishearth commented 1 month ago

As to either Sanitize or Isolate, I think either one could work. I think a goal for Sanitize should be to "produce the same results as Isolate.

The Sanitize algorithm that works identically to Isolate would be one that drops unpaired closing characters, and inserts extra closing characters at the end of text as needed.

It would not need to insert extra closing characters in the middle of the text, for example LRI RLO PDI has an unclosed RLO, however the algorithm will automatically close it anyway. Probably best to do it to be safe anyway?

There may still be tricky ways to break this around the fact that the bidi algorithm has a stack size, so an unpaired control character that is fine may become not-fine because its surrounding context got ignored because of the stack. Unclear.

sophiebits commented 1 month ago

There may still be tricky ways to break this around the fact that the bidi algorithm has a stack size, so an unpaired control character that is fine may become not-fine because its surrounding context got ignored because of the stack. Unclear.

I may be misunderstanding but my interpretation was that this is likely not an issue due to these notes in UAX# 9:

Note that this algorithm assigns a matching PDI (or lack of one) to an isolate initiator whether the isolate initiator raises the embedding level or is prevented from doing so by the depth limit rules. Note that this algorithm assigns a matching PDF (or lack of one) to an embedding initiator whether it raises the embedding level or is prevented from doing so by the depth limit rules.

Manishearth commented 1 month ago

I may be misunderstanding but my interpretation was that this is likely not an issue due to these notes in UAX# 9:

Right, but that's for the entire run, if we're following the model of "sanitize and then run the algorithm on the entire text" rather than "isolate the algorithm runs", then the matching PDIs/etc may get inserted in places that cross the isolation boundary we want. UAX #9 does not contain any concept of strongly associated isolation boundary as desired here.

The sanitization algorithm I listed prevents things from crossing the isolation boundary but I haven't put thought into all of the cases where depth limit stuff may cause certain things to change which boundaries actually show up.

I'm not sure if there's a problem here, to be clear, but there might be.

aphillips commented 1 month ago

Interestingly, I ran into this problem this morning. I'm working on a presentation about bidi.

My input looks like this:

<p>You bought <q lang="ar" dir="rtl">\u2067HTML \u0648 CSS: \u062a\u0635\u0645\u064a\u0645 \u0648 
\u0625\u0646\u0634\u0627\u0621 \u0645\u0648\u0627\u0642\u0639 \u0627\u0644\u0648\u064a\u0628</q>
by <span lang="en" dir="ltr">Jon Duckett</span>.

My demo page unescapes the Arabic (the \u stuff). Note the U+2067 (RLI) inside of the  tag. It is unmatched by a U+2069 (PDI).

Try it

The resulting display looks like:

Whereas the expected display looks like this:

The challenge here is that the extraneous U+2067 starts an isolate run, which interferes with the ... in the markup (which is intended to define an isolate boundary). The missing PDI wants to be before the . Instead, the  appears to close the U+2067's isolate run and the  leaks over the rest of the string. If I understand @Manishearth's comment just above, the sanitize would work if it respected isolate boundaries.

If a string started life as ⁧... something something...⁩ and were naively truncated before being dynamically inserted into the page (as with the book title here...), you might very easily get this result.

macchiati commented 1 month ago

Very good example. To put in other terms, it sounds like the

...\\...\...\...\...

is, in Unicode terms, acting like:

...\\...\...\...\...

which effectively adds a termination to the end that matches the very first \ (since that is the unmatched one).

...\\...\...\...\...\

Whereas Manish's proposal would pin within to the isolate it belongs to, thus

...\\...\\...\...\...

whatwg / html

Improve handling of malformed unicode bidi control characters #10251

What is the issue with the HTML Standard?