zeek / zeek

Zeek is a powerful network analysis framework that is much different from the typical IDS you may know.
https://www.zeek.org
Other
6.39k stars 1.21k forks source link

Discrepancy between JSON and TSV logs for non-printable bytes #3948

Open svenvanhal opened 1 week ago

svenvanhal commented 1 week ago

Currently, Zeek uses hex escapes (\x00) to escape non-printable characters (and possibly others) in both TSV and JSON logs. For TSV, this is fine. However, because JSON requires strict UTF-8 adherence, all backslashes are escaped again to prevent errors (\x -> \\x). Unfortunately, this is a destructive transformation that makes it impossible to reliably distinguish between text and a byte representation in edge cases.

Take for example terminal control character 07 (BEL). When we include this character in a (clearly invalid) HTTP URI /non_printable_[byte 07], Zeek would log the following:

TSV: /non_printable_\x07 JSON: /non_printable_\\x07

Now include the (literal) text \x07 instead of the control character in the HTTP URI: /non_printable_literal_\x07. Zeek logs the following:

TSV: /non_printable_literal_\\x07 JSON: /non_printable_literal_\\x07

This results in identical output for both cases in JSON logs and means that we are unable to distinguish between the byte \x07 and text \x07. I can for example no longer reliably transform the escapes back to their original bytes.

This issue is caused by the fact that \\x07 is already valid JSON and does not need to be escaped again (case 2), as opposed to \x07 which has to become \\x07 (case 1).

One solution would be to "over-escape" all backslashes regardless of context, so \x07 becomes \\x07 and \\x07 becomes \\\\x07. I'm not familiar enough with Zeek internals to propose a proper solution :)

I've used HTTP for the examples because it was straightforward to create a sample PCAP, but this issue applies to every string type field in every Zeek log.

Attached are:

pcap.zip zeek6_http.json zeek6_http.tsv zeek7_http.json zeek7_http.tsv

If memory serves, this change was introduced in Zeek 3 when the log encoding changed from ASCII to UTF-8 (#458). #596 might also be related.

awelzel commented 1 week ago

Thanks for the report!

One solution would be to "over-escape" all backslashes regardless of context, so \x07 becomes \x07 and \x07 becomes \\x07.

IIUC from looking at the JSON RFC, for control characters, the resulting JSON string should technically be "/non_printable_\u0007" rather than using the \xXX encoding.

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

https://www.ietf.org/rfc/rfc4627.html#section-2.5

I can for example no longer reliably transform the escapes back to their original bytes.

Using off-the-shelf libraries you'd still need some extra/custom decoding to reverse the custom Zeek encoding.

Was using \uXXXX for control characters ever on the table? If we add it today, it might need to be opt-in/opt-out. Maybe there was prior discussion that that's not a model appropriate for Zeek logs that I'm unaware of. @timwoj ?