y-scope / clp-ffi-java

Apache License 2.0

`ISO_8859_1` breaking UTF-8 in CLP logtype String #42

Open intr3p1d opened 4 months ago

intr3p1d commented 4 months ago

Bug

Cause

clp-ffi-java internally uses StandardCharsets.ISO_8859_1 in EncodedMessage.getLogTypeAsString(): https://github.com/y-scope/clp-ffi-java/blob/c4a74dbdeb09bd4e7e3d119826dddbe5005ccf53/src/main/java/com/yscope/clp/compressorfrontend/EncodedMessage.java#L30-L36 (getDictionaryVarsAsStrings does the same).
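To illustrate the cause in isolation, here is a minimal sketch that uses only the JDK, not clp-ffi-java itself. The class and method names are hypothetical, and it assumes the underlying bytes are UTF-8. Decoding them as ISO_8859_1 turns each byte into its own character, so every multi-byte character is split apart.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Hypothetical helper: decodes bytes the way the issue describes,
    // i.e. as ISO_8859_1 regardless of how they were encoded.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\uC774\uC0C1"; // the two Korean syllables "이상"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String broken = decodeAsLatin1(utf8);

        // Each syllable takes 3 bytes in UTF-8, so there are 6 bytes in total.
        System.out.println(utf8.length);             // 6
        // ISO_8859_1 maps every byte to exactly one char, giving 6 chars.
        System.out.println(broken.length());         // 6
        System.out.println(broken.equals(original)); // false
    }
}
```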

Effect

https://github.com/apache/pinot/blob/0a4398634be81cdbbe891b3da249134ef98743e7/pinot-plugins/pinot-input-format/pinot-clp-log/src/main/java/org/apache/pinot/plugin/inputformat/clplog/CLPLogRecordExtractor.java#L151-L154

This corrupts characters that ISO_8859_1 cannot represent. For example, the logtype

Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: /u0011 이상이어야 합니다

comes back as

Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from:  이상이어야 합니다

The message is intact again once it goes through the decode function, but the logtype on its own remains a broken string. That is a problem when the logtype is used directly (LIKE searches, etc.).
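Until the library changes its charset, a caller-side workaround might look like the sketch below. It is not part of clp-ffi-java, the class and method names are hypothetical, and it assumes the string was produced by decoding UTF-8 bytes as ISO_8859_1. Because ISO_8859_1 maps every byte value to exactly one character, re-encoding with it recovers the original bytes, which can then be decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class LogtypeRecovery {
    // Hypothetical helper: undoes an ISO_8859_1 decode of UTF-8 bytes.
    // Only valid for strings whose chars are all <= U+00FF, which holds
    // for anything produced by an ISO_8859_1 decode.
    static String recoverUtf8(String latin1Decoded) {
        byte[] raw = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Logtype from the report, with the Korean text written as escapes:
        // "getAgentsList.from: \u0011 이상이어야 합니다"
        String logtype = "getAgentsList.from: \u0011 "
                + "\uC774\uC0C1\uC774\uC5B4\uC57C \uD569\uB2C8\uB2E4";

        // Simulate what the issue describes: UTF-8 bytes decoded as ISO_8859_1.
        String broken = new String(
                logtype.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String recovered = recoverUtf8(broken);
        System.out.println(recovered.equals(logtype)); // true
    }
}
```

The placeholder character \u0011 is below U+0080, so it is a single byte in both charsets and passes through the round trip unchanged.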

clp-ffi version

0.4.4

Environment

Linux, Java https://github.com/apache/pinot/blob/1d490c1ac3268103a16d77ddfa70f8f8602f9e96/pom.xml#L160

Reproduction steps

1. Encode a message containing characters that ISO_8859_1 does not support:

Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from: /u0011 이상이어야 합니다

2. Then get the logtype:

Request processing failed: jakarta.validation.ConstraintViolationException: getAgentsList.from:  이상이어야 합니다