xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars, 293 forks

Use []byte instead of string for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY #434

Open hangxie opened 2 years ago

hangxie commented 2 years ago

I'd like INT96 (though deprecated), BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY to all use []byte as their primitive Go type (https://github.com/xitongsys/parquet-go#type). They are raw bytes and are not necessarily valid UTF-8 strings.

My use case is printing data from a parquet file as JSON without a predefined schema. The easiest way is to read the schema from the file, read the data, and transform values based on the schema where needed (e.g. the Decimal type), all in the JSON world. This does not work, however, because a BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY holding a decimal will most likely be an invalid UTF-8 string, so I get lots of U+FFFD (the replacement character for invalid UTF-8 sequences) in the JSON, and the conversion cannot be reversed.

hangxie commented 2 years ago

This is related to https://github.com/xitongsys/parquet-go/issues/321