zeek / spicy

C++ parser generator for dissecting protocols & files.
https://docs.zeek.org/projects/spicy
Other
251 stars 37 forks source link

decode() on bytes should support UTF-16 #1788

Open sethhall opened 4 months ago

sethhall commented 4 months ago

It looks like the current implementation only supports ASCII and UTF-8 to decode into a string and the current library being used is strictly for UTF-8. In order to support anything with Windows roots, it would be nice to support UTF-16.

I poked around for a few minutes and found a potential small library that might work for the use case to decode UTF-16 into a string type.... https://github.com/nemtrif/utfcpp

sethhall commented 3 months ago

This still needs to be done (because the decode() method was clearly built with this in mind, but as a stop gap, I have a UTF-16 string reader (and it converts to utf-8 internally) implemented natively in spicy here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy

Ethanholtking commented 2 months ago

Hello I'd like to help solve this issue, however, I'm having difficulty trying to find where exactly the problem is. Could someone please help me locate where the decode() is?

bbannier commented 2 months ago

Hello I'd like to help solve this issue, however, I'm having difficulty trying to find where exactly the problem is. Could someone please help me locate where the decode() is?

Implementing the runtime part would go roughly like the following:

  1. Adding a UTF16 Charset value here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/hilti/runtime/include/types/bytes.h#L42
  2. Implementing handling of Charset::UTF16 in Bytes::decode here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/hilti/runtime/src/types/bytes.cc#L105-L132 The C++ unit test for Bytes::decode should also be updated here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/hilti/runtime/src/tests/bytes.cc#L64-L83

To make this available in Spicy code it needs to be added to both HILTI as well as Spicy:

  1. Add it to HILTI here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/hilti/lib/hilti.hlt#L14 There should also be a new test case here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/tests/hilti/types/bytes/decode.hlt
  2. Add it to Spicy here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/spicy/lib/spicy.spicy#L33-L37 Adding a test case is not strictly needed since this just wraps HILTI functionality.
rsmmr commented 1 month ago

@Ethanholtking Did that help? Are you working on this?