xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Compatibility of Decimal values: Can't read decimal values written by go-parquet with other parquet readers #428

Closed TN1ck closed 2 years ago

TN1ck commented 2 years ago

Hey all,

I'm currently trying to save numbers as Parquet decimals. Reading and writing work within parquet-go, but problems surface when using other Parquet readers, such as https://pypi.org/project/parquet-tools/, which is based on the official Apache Java library.

When I saved my decimals using this annotation:

parquet:"name=my_var_name, type=BYTE_ARRAY, convertedtype=DECIMAL, scale=4, precision=16"

The schema is read correctly by another tool (for example, the linked parquet-tools):

> parquet-tools schema my_parquet_file.parquet
message parquet_go_root {
  required binary my_var_name (DECIMAL(16,4)) = 0;
}

But the numbers do not match up. When saving 100.00, this is the output:

> parquet-tools cat  my_parquet_file.parquet
my_var_name = 354438588167567.3648

I tried different encodings, but the resulting value is the same. When using another physical type, e.g. FIXED_LEN_BYTE_ARRAY, the parquet-tools reader crashes:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:mye_parquet_file.parquet

Not sure if I'm doing something wrong or how I'm supposed to read these values using other readers.

hangxie commented 2 years ago

Try https://github.com/xitongsys/parquet-go/blob/master/example/type.go#L96-L99, or you can post your code here.

TN1ck commented 2 years ago

Thanks @hangxie, that put me onto the right track to fix it. What I did wrong:

  1. I tried to save the string as is, i.e. "1234.5678" for a decimal with scale 4. I thought the library would handle it internally, and since reading and writing within parquet-go just worked with that approach, my assumption seemed correct. It broke when I tried reading those files with other readers. So I saved the unscaled digits instead: "12345678", or "1234000" for "123.4000".
  2. The string was not in binary notation yet, so I used the StrIntToBinary you linked to convert it to binary.
  3. I didn't do anything for reading, i.e. I just assumed it would serialise on its own into a readable string, but alas it did not. I used types.DECIMAL_BYTE_ARRAY_ToString(bytes, 0, scale) to actually read the decimal value.
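For anyone hitting the same thing: the odd value in my original report is what you get when the ASCII bytes of the string are read as a big-endian unscaled integer, which is how other readers interpret a DECIMAL stored as binary. The printed value matches the eight bytes of "100.0000" exactly, so presumably that is the string that was actually written. A minimal sketch using only the standard library (asciiAsDecimal is a made-up helper name, not part of parquet-go):

```go
package main

import (
	"fmt"
	"math/big"
)

// asciiAsDecimal is a hypothetical helper. It treats the raw bytes of s as a
// big-endian unscaled integer and inserts a decimal point `scale` digits from
// the right, mimicking how a reader decodes a binary DECIMAL column.
func asciiAsDecimal(s string, scale int) string {
	// SetBytes reads an unsigned big-endian value. That equals the two's
	// complement value here because ASCII bytes never have the top bit set.
	unscaled := new(big.Int).SetBytes([]byte(s))
	digits := unscaled.String()
	if scale <= 0 {
		return digits
	}
	// Left-pad so there is at least one digit before the decimal point.
	for len(digits) <= scale {
		digits = "0" + digits
	}
	cut := len(digits) - scale
	return digits[:cut] + "." + digits[cut:]
}

func main() {
	// The text "100.0000" stored as is, then decoded with scale 4.
	fmt.Println(asciiAsDecimal("100.0000", 4)) // 354438588167567.3648
}
```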

One gotcha I had with DECIMAL_BYTE_ARRAY_ToString was that I had to special-case 0, as it won't change when scaled and DECIMAL_BYTE_ARRAY_ToString doesn't handle that yet. Negative values were also problematic, as seen in your PR https://github.com/xitongsys/parquet-go/pull/433.
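To make the whole flow concrete, here is a rough standard-library sketch of the three steps: drop the decimal point to get the unscaled digits, encode them as big-endian two's complement bytes, and decode them again with the scale applied, including zero and negative values. It does not call parquet-go. The names toUnscaled, encodeBigEndian and decodeBigEndian are made up for illustration and only stand in for what StrIntToBinary and DECIMAL_BYTE_ARRAY_ToString are used for above.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// toUnscaled (hypothetical) turns "123.4" with scale 4 into "1234000".
func toUnscaled(dec string, scale int) (string, error) {
	neg := strings.HasPrefix(dec, "-")
	dec = strings.TrimPrefix(dec, "-")
	intPart, frac, _ := strings.Cut(dec, ".")
	if len(frac) > scale {
		return "", fmt.Errorf("%q has more than %d fractional digits", dec, scale)
	}
	digits := intPart + frac + strings.Repeat("0", scale-len(frac))
	digits = strings.TrimLeft(digits, "0")
	if digits == "" {
		return "0", nil
	}
	if neg {
		digits = "-" + digits
	}
	return digits, nil
}

// encodeBigEndian (hypothetical) encodes an integer string as big-endian
// two's complement bytes, always leaving room for the sign bit.
func encodeBigEndian(unscaled string) ([]byte, error) {
	n, ok := new(big.Int).SetString(unscaled, 10)
	if !ok {
		return nil, fmt.Errorf("not an integer: %q", unscaled)
	}
	size := n.BitLen()/8 + 1
	if n.Sign() < 0 {
		// Two's complement: add 2^(8*size) to the negative value.
		n.Add(n, new(big.Int).Lsh(big.NewInt(1), uint(8*size)))
	}
	return n.FillBytes(make([]byte, size)), nil
}

// decodeBigEndian (hypothetical) reverses encodeBigEndian and applies scale.
func decodeBigEndian(b []byte, scale int) string {
	n := new(big.Int).SetBytes(b)
	if len(b) > 0 && b[0]&0x80 != 0 {
		// Top bit set means the value is negative.
		n.Sub(n, new(big.Int).Lsh(big.NewInt(1), uint(8*len(b))))
	}
	digits := new(big.Int).Abs(n).String()
	if scale > 0 {
		// Left-pad so zero becomes "0.0000" rather than "0".
		for len(digits) <= scale {
			digits = "0" + digits
		}
		cut := len(digits) - scale
		digits = digits[:cut] + "." + digits[cut:]
	}
	if n.Sign() < 0 {
		digits = "-" + digits
	}
	return digits
}

func main() {
	for _, in := range []string{"1234.5678", "123.4", "0", "-1.2345"} {
		unscaled, err := toUnscaled(in, 4)
		if err != nil {
			panic(err)
		}
		raw, err := encodeBigEndian(unscaled)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s -> % x -> %s\n", in, unscaled, raw, decodeBigEndian(raw, 4))
	}
}
```

The decoder pads with leading zeros before placing the decimal point, which is what covers the 0 case, and it checks the top bit of the first byte to recover the sign.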

Would be nice if this were documented somewhere, but I guess Decimal is still a bit WIP.

I'll close this for now, as the problem is mostly about ergonomics right now and not an explicit bug.