xtermjs / xterm.js

A terminal for the web
https://xtermjs.org/
MIT License

web link addon doesn't parse URLs containing `%20` properly #4934

Closed szymonkaliski closed 7 months ago

szymonkaliski commented 7 months ago

https://github.com/xtermjs/xterm.js/blob/master/addons/addon-web-links/src/WebLinkProvider.ts seems to break on URLs containing %20 (and other UTF-8-encoded symbols).

The issue comes from this line: https://github.com/xtermjs/xterm.js/blob/master/addons/addon-web-links/src/WebLinkProvider.ts#L68

decodeURI decodes %20, so a URL like http://test.com?param=a%20b gets decoded to http://test.com?param=a b, which makes text not equal to urlText. I'd be happy to take a stab at fixing this and sending a PR, but I'm honestly not sure what this test is protecting us from: in which case would new URL(text) succeed, but text not equal urlText, without that being a false positive?

Details

(I don't think this matters for this specific bug)

Steps to reproduce

  1. Run `echo "http://test.com"; echo "http://test.com?param=a%20b"` in the terminal.
  2. The first link is clickable, the second is not.
jerch commented 7 months ago

Thx for finding this edge case, it's indeed not covered currently. Would an additional clause without the decodeURI already fix this?

Background on the weird-looking additional check: new URL sometimes does automagic transformations on the provided URL strings, which I deemed should not be silently accepted for security reasons (besides appending a slash, which happens pretty often). This might look paranoid, but since we don't control what new URL does here (up to a faulty URL construction implementation), this check tries to ensure basic equality. The decodeURI part is needed for UTF-8-encoded URL strings to match properly, and I kinda forgot to test URL-encoded strings in the first place.

Whether that additional test is really needed or sloppier handling would be sufficient, I don't know for sure. I normally try to go security-first, and since we outsource the URL construction to the browser with new URL, we would inherit all implementation differences, up to and including faults. I did not dig deeper into new URL and whether it can be used as an attack vector via malformed URL strings and auto-transformation; instead I went with that equality check as a basic security measure.
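A few of those automagic transformations can be observed directly in Node (illustrative inputs; the comments show the normalized output):

```typescript
// new URL normalizes its input before toString()/href returns it:
console.log(new URL('http://test.com').toString());        // 'http://test.com/' (trailing slash appended)
console.log(new URL('HTTP://TEST.COM/Path').toString());   // 'http://test.com/Path' (scheme and host lowercased)
console.log(new URL('http://test.com/a/../b').toString()); // 'http://test.com/b' (dot segments resolved)
```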

szymonkaliski commented 7 months ago

Would an additional clause without the decodeURI already fix this?

Not sure if I understand what you're proposing :) Can you give an example?

I think what we could do is test only the url.origin part, for example replacing the original if statement with:

if (!text.startsWith(url.origin)) {
  continue;
}

What do you think?

Additionally, I think having some counter-examples to test with would be a good idea: in what situations does the regex match, but new URL does not?

jerch commented 7 months ago

Not sure if I understand what you're proposing :) Can you give an example?

Sure, something like this (could be formatted more nicely :sweat_smile:):

try {
  const url = new URL(text);
  const urlText = url.toString();  // maybe url.href is better here?
  const urlTextDecoded = decodeURI(url.toString());
  if (text !== urlText
      && text + '/' !== urlText
      && text !== urlTextDecoded
      && text + '/' !== urlTextDecoded)
  {
    continue;
  }
} catch (e) {
  continue;
}

You are prolly right with your url.origin idea; still, we would have to do it more explicitly to also catch username/password URLs:

function urlLeftSide(url: URL): string {
  if (url.password && url.username) {
    return `${url.protocol}//${url.username}:${url.password}@${url.host}`;
  }
  if (url.username) {
    return `${url.protocol}//${url.username}@${url.host}`;
  }
  return `${url.protocol}//${url.host}`;
}
...
if (!text.startsWith(urlLeftSide(url))) {
  continue;
}

(Not quite sure if this will treat IPv6 URLs correctly; it most likely does...)

Blacklist testing would be awesome, please feel free to add test cases for that. Currently it's only whitelist-tested.

szymonkaliski commented 7 months ago

You are prolly right with your url.origin idea; still, we would have to do it more explicitly to also catch username/password URLs

Good point, this seems to work. Should I submit a PR?

(Not quite sure if this will treat IPv6 URLs correctly; it most likely does...)

It does seem fine, yes.

Blacklist testing would be awesome, please feel free to add test cases for that. Currently it's only whitelist-tested.

Yeah, I just can't come up with a URL that gets through the regex but should be blacklisted 😅

jerch commented 7 months ago

... should I submit a PR?

That would be nice, sure. :smiley_cat:

Yeah, I just can't come up with a URL that gets through the regex but should be blacklisted.

Yeah, that would need some reverse engineering of the URL parsing rules to find the nasty ones. Back then I was too lazy to do that :sweat_smile: I would not stress this too much here, unless you are eager to go through the official spec and also test against different browsers. It's definitely tedious to get this done.
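One concrete family of normalizations that could feed such blacklist tests (whether the addon's regex even matches these strings is a separate question): the WHATWG URL parser treats backslashes like slashes and strips tabs and newlines for http(s) URLs, so the href silently differs from the input text:

```typescript
// Both inputs are rewritten, so text !== new URL(text).toString():
console.log(new URL('http://test.com\\evil').toString()); // 'http://test.com/evil' (backslash becomes slash)
console.log(new URL('http://te\tst.com').toString());     // 'http://test.com/' (tab stripped)
```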

szymonkaliski commented 7 months ago

Great, thank you -- submitted the PR here: https://github.com/xtermjs/xterm.js/pull/4937