xtermjs / xterm.js

A terminal for the web
https://xtermjs.org/
MIT License

control characters incorrectly ignored in OSC string #5173


PerBothner commented 1 month ago

Most control characters (including CR and LF) are ignored in OSC_STRING.

This is inconsistent with xterm, which I think we should follow in this respect. See the sos_table in VTPrsTbl.c in the xterm sources: "The CASE_IGNORE entries correspond to the characters that can be accumulated for the string function (e.g., OSC)." I also verified this behavior with gdb.

A simple fix in EscapeSequenceParser.ts is to replace:

  table.addMany(EXECUTABLES, ParserState.OSC_STRING, ParserAction.IGNORE, ParserState.OSC_STRING);

with:

  table.addMany(EXECUTABLES, ParserState.OSC_STRING, ParserAction.OSC_PUT, ParserState.OSC_STRING);

Optionally, we can also tweak the handling of ParserAction.OSC_PUT at line 769:

     if (j >= length || (code = data[j]) < 0x20 || (code > 0x7f && code < NON_ASCII_PRINTABLE)) {

My suggestion is to not change that line.
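For illustration, the difference is observable from the public parser API (a minimal sketch, assuming the @xterm/xterm package; the OSC identifier 1234 is arbitrary):

  import { Terminal } from '@xterm/xterm';

  const term = new Terminal();

  // Log whatever payload the parser delivers for OSC 1234.
  term.parser.registerOscHandler(1234, data => {
    console.log(JSON.stringify(data));
    return true;
  });

  // The payload contains a literal LF. With the current IGNORE rule the
  // handler receives "line1line2"; with OSC_PUT it would receive "line1\nline2".
  term.write('\x1b]1234;line1\nline2\x07');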

I will submit a pull request if this looks reasonable.

jerch commented 1 month ago

We took the state handling from vt100.net, which lists for the OSC_STRING state:

event 00-17,19,1C-1F / ignore

Paul Williams has tested that against multiple DEC VT devices, so I tend to believe that xterm might be slightly off here.

PerBothner commented 1 month ago

The de-facto standard is xterm. It is more important to match xterm than essentially non-existent hardware. The project is called xterm.js, not vt100.js. The demo sets the TERM environment variable to xterm-256color.

jerch commented 1 month ago

Sorry, I strongly disagree. The parser is the best-tested thing we have in a continuity sense, thanks to Paul Williams. xterm, on the other hand, had to reverse-engineer most things back in the 80s/90s, when literally no docs were available, so there are still flaws hidden in it.

The project is called xterm.js, not vt100.js.

Yes, I am aware of the name conflict. I really don't like the name and have wished more than once that the project did not carry it.

The demo sets the TERM environment variable to xterm-256color.

I have voted more than once for a proper terminfo entry and for better TE feature discovery overall. I literally never got any serious or constructive response to any of that.

PerBothner commented 1 month ago

Perhaps, but that ship has sailed. Almost all who want to use xterm.js will want modern extensions. Nobody wants a vt100. Modern extensions have standardized on xterm or a superset thereof.

See my new example: man -Hcat is a command that emits HTML containing significant newlines. If newlines are ignored by the OSC parser, handling this becomes much more complicated. And for what purpose?
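To make the workaround concrete, here is a hedged sketch (writeHtmlOsc is a hypothetical helper; I use 72 as the OSC identifier, as domterm does for HTML insertion) of what an emitter has to do if the terminal drops LF inside OSC strings:

  // Hypothetical emitter for an HTML-carrying OSC sequence.
  function writeHtmlOsc(out: NodeJS.WritableStream, html: string): void {
    // If LF is ignored inside OSC strings, every newline must be re-encoded
    // first, e.g. as a character reference. That works inside <pre> content,
    // but breaks for newlines between tag attributes, so real escaping logic
    // is considerably more involved.
    const safe = html.replace(/\n/g, '&#10;');
    out.write(`\x1b]72;${safe}\x07`);
  }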

Maybe we need the equivalent of gcc's -pedantic mode for those who want strict vt100 compatibility.

For what it's worth, if you run domterm (with the native emulator, not xterm.js), you get TERM=xterm-domterm. It solves the ssh problem this way.

jerch commented 1 month ago

We cannot easily change the parser rules, as it will have unforeseen side effects on the shape of sequences. OSC is specced to safely transmit only ASCII printables, so no control chars. If your HTML snippets contain meaningful chars outside of that range (even higher UTF-8 or any other 8-bit or multibyte encoding) you are going to need to use a transport encoding like base64. That's the gist of it, and it is done that way by all more complex OSC sequences like iTerm's IIP.
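A minimal sketch of such a transport encoding (assuming Node's Buffer; the sender/receiver helpers are hypothetical):

  // Sender: base64 maps arbitrary bytes (LF, non-ASCII UTF-8, ...) onto
  // ASCII printables, which every OSC implementation passes through.
  function writeOscBase64(out: NodeJS.WritableStream, id: number, payload: string): void {
    const encoded = Buffer.from(payload, 'utf8').toString('base64');
    out.write(`\x1b]${id};${encoded}\x07`);
  }

  // Receiver: decode before interpreting the payload.
  function decodeOscPayload(data: string): string {
    return Buffer.from(data, 'base64').toString('utf8');
  }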

Btw, DCS is a bit more liberal in its payload, letting a few more control chars pass, but in general the same rule applies for arbitrary bytes. ECMA-35 allows specing custom sequence payload protocols with a special modifier sequence (DOCS, if I recall correctly), but that is effectively dead and not supported by most TEs I have tested (I played around with it when trying to create a sound transport sequence). For best support across TEs we are pretty much bound to OSC with ASCII printables (and to DCS, with less support across current TEs).

PerBothner commented 1 month ago

OSC is specced to safely transmit only ASCII printables, so no control chars. If your HTML snippets contain meaningful chars outside of that range (even higher UTF-8 or any other 8-bit or multibyte encoding) you are going to need to use a transport encoding like base64.

I believe we have to allow general UTF-8 printable characters. There are a number of existing "standard" OSC escape sequences that take raw text without base64 encoding, for example OSC 2 (change window title) and OSC 8 (set link URL). We cannot restrict these to ASCII without breaking applications and reasonable expectations.
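Concretely, both take their payload as raw text (a sketch of the well-known sequence shapes; the non-ASCII title is just an example):

  // OSC 2: set the window title - non-ASCII titles are common in practice.
  process.stdout.write('\x1b]2;Büro (xterm.js)\x07');

  // OSC 8: hyperlink - "link text" is displayed, the URI is the target.
  process.stdout.write('\x1b]8;;https://example.com\x1b\\link text\x1b]8;;\x1b\\');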

OSC and DCS payloads cannot pass arbitrary binary content through without encoding, but they should handle all valid "text" content. Pedantic compatibility with some obsolete pre-Unicode specification is not a good argument against that. "Text content" (as may appear in "text files") includes printable characters, joiners, and valid whitespace. Portable valid whitespace includes at minimum space, tabs, and standard line endings (LF or CR+LF). We might as well also allow at least VT, NEL, and FF, even if they are less "portable" or well-defined.
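As a sketch of that policy (illustrative only, not existing xterm.js code), the character predicate could look like this:

  // Accept printables plus a small, portable whitespace set.
  function isAllowedOscTextChar(code: number): boolean {
    if (code === 0x09 || code === 0x0a || code === 0x0d) return true; // HT, LF, CR
    if (code === 0x0b || code === 0x0c || code === 0x85) return true; // VT, FF, NEL
    if (code >= 0x20 && code <= 0x7e) return true;                    // ASCII printables
    return code >= 0xa0;                                              // non-ASCII printables
  }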

We cannot easily change the parser rules, as it will have unforeseen side effects on the shape of sequences.

I think it is unlikely that any application would misbehave if LF is passed through by OSC rather than ignored. If one does, either the application or the OSC handler can easily be fixed - and should be. However, having LF ignored rather than passed through is more likely to cause misbehavior, and it is more complicated to work around.

jerch commented 1 month ago

I just checked ECMA-48 - it also allows 00/08-00/13 (BS, HT, LF, VT, FF, CR) in OSC. Seems DEC had a stricter idea about OSC, idk why. So I am not totally opposed to allowing those, as it would bring us closer to ECMA conformance. We still need to check all OSC handlers to make sure they don't break badly (warning: the fact that DCS also allows LF caused us a CVE in the past).
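If we admitted only that ECMA-48 subset (rather than all EXECUTABLES as proposed in the first comment), the table change would narrow to something like this (a sketch against EscapeSequenceParser.ts, untested):

  // 00/08-00/13 in ECMA-48 notation, i.e. 0x08-0x0d (BS, HT, LF, VT, FF, CR).
  table.addMany([0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d], ParserState.OSC_STRING, ParserAction.OSC_PUT, ParserState.OSC_STRING);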

I think your argument about UTF-8 is a "false friend" for your HTML purpose. HTML itself can be encoded in any text encoding, not only UTF-8. Since you use system tools to generate those snippets, it is not even guaranteed that a custom system locale will not interfere here. Imho, as an OSC sequence creator you should be prepared for that, e.g. make the sequence more robust in that regard.

Beside that, OSC 8 is not UTF-8 - the spec speaks of "URI-encoded", which is still ASCII. And OSC 2 is cumbersome in xterm - it definitely allows other encodings and tries to convert them to UTF-8 internally (it passes a UTF-8 string to X11, which hopefully can turn it back into whatever the local encoding was - I just checked those pretty wonky-looking lines in the xterm source).

So why does UTF-8 work for us in OSC 2? Because when I wrote the parser, I followed the recommendation of Markus Kuhn to treat UTF-8/16 encodings as a "transport encoding" - the full stream gets decoded first, before the terminal sequences are parsed. Additionally, I let all codepoints > U+A0 pass, as they are not sequence-relevant. That processing is highly debatable and in fact no longer ECMA-35 conformant. The latter is the reason why it will not work that easily in other TEs (and why xterm has to go to those lengths to get UTF-8 strings working in OSC 2).
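A toy illustration of that decode-before-parse order (assumed names; not the actual xterm.js pipeline):

  const decoder = new TextDecoder('utf-8');

  function feed(bytes: Uint8Array, parse: (text: string) => void): void {
    // 1. Treat UTF-8 purely as a transport encoding: decode the stream first.
    const text = decoder.decode(bytes, { stream: true });
    // 2. Only then classify characters; codepoints above U+00A0 are never
    //    sequence-relevant, so the parser lets them pass as printable.
    parse(text);
  }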

In summary - if you don't want other TEs to choke on your OSC sequence, it should stay within ASCII printables.