Open svenvanhal opened 1 week ago
Thanks for the report!
One solution would be to "over-escape" all backslashes regardless of context, so \x07 becomes \x07 and \x07 becomes \\x07.
IIUC from looking at the JSON RFC, for control characters, the resulting JSON string should technically be "/non_printable_\u0007"
rather than using the \xXX
encoding.
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
https://www.ietf.org/rfc/rfc4627.html#section-2.5
I can for example no longer reliably transform the escapes back to their original bytes.
Using off-the-shelf libraries you'd still need some extra/custom decoding to reverse the custom Zeek encoding.
Was using \uXXXX
for control characters ever on the table? If we add it today, it might need to be opt-in/opt-out. Maybe there was prior discussion that that's not a model appropriate for Zeek logs that I'm unaware of. @timwoj ?
Currently, Zeek uses hex escapes (\x00) to escape non-printable characters (and possibly others) in both TSV and JSON logs. For TSV, this is fine. However, because JSON requires strict UTF-8 adherence, all backslashes are escaped again to prevent errors (
\x
->\\x
). Unfortunately, this is a destructive transformation that makes it impossible to reliably distinguish between text and a byte representation in edge cases.Take for example terminal control character
07
(BEL). When we include this character in a (clearly invalid) HTTP URI/non_printable_[byte 07]
, Zeek would log the following:TSV:
/non_printable_\x07
JSON:/non_printable_\\x07
Now include the (literal) text
\x07
instead of the control character in the HTTP URI:/non_printable_literal_\x07
. Zeek logs the following:TSV:
/non_printable_literal_\\x07
JSON:/non_printable_literal_\\x07
This results in identical output for both cases in JSON logs and means that we are unable to distinguish between the byte
\x07
and text\x07
. I can for example no longer reliably transform the escapes back to their original bytes.This issue is caused by the fact that
\\x07
is already valid JSON and does not need to be escaped again (case 2), as opposed to\x07
which has to become\\x07
(case 1).One solution would be to "over-escape" all backslashes regardless of context, so
\x07
becomes\\x07
and\\x07
becomes\\\\x07
. I'm not familiar enough with Zeek internals to propose a proper solution :)I've used HTTP for the examples because it was straightforward to create a sample PCAP, but this issue applies to every string type field in every Zeek log.
Attached are:
pcap.zip zeek6_http.json zeek6_http.tsv zeek7_http.json zeek7_http.tsv
If memory serves, this change was introduced in Zeek 3 when the log encoding changed from ASCII to UTF-8 (#458). #596 might also be related.