xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Unexpected read result after write date as INT96 parquet type #453

Closed Mort4lis closed 2 years ago

Mort4lis commented 2 years ago

Hi everyone! I have a problem with writing and reading a parquet file.

Here is an example: I create a JSON writer with a schema containing one INT96 column and try to write one row with the current date. Before writing, I convert the time.Time to a string by calling types.TimeToINT96. But after reading the output parquet file back, I get the wrong result.

If I replace the JSONWriter with the usual ParquetWriter, it works correctly, but I need to write JSON. I will be glad for any help!

Code:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
    "github.com/xitongsys/parquet-go/types"
    "github.com/xitongsys/parquet-go/writer"
)

type Value struct {
    OrderDate string `json:"order_date" parquet:"name=order_date, type=INT96"`
}

const writeJSONSchema = `
{
  "Tag": "name=Schema, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=order_date, type=INT96, repetitiontype=OPTIONAL"}
  ]
}
`

func main() {
    now := time.Now()

    fw, err := local.NewLocalFileWriter("output.parquet")
    if err != nil {
        log.Fatalf("Can't create file: %v", err)
    }

    pw, err := writer.NewJSONWriter(writeJSONSchema, fw, 1)
    if err != nil {
        log.Fatalf("Can't create parquet writer: %v", err)
    }

    val := Value{OrderDate: types.TimeToINT96(now)}

    valBytes, err := json.Marshal(val)
    if err != nil {
        log.Fatalf("Can't marshal value: %v", err)
    }

    if err = pw.Write(valBytes); err != nil {
        log.Fatalf("Can't write value: %v", err)
    }

    if err = pw.WriteStop(); err != nil {
        log.Fatalf("Can't stop write: %v", err)
    }

    if err = fw.Close(); err != nil {
        log.Fatalf("Can't close file: %v", err)
    }

    fr, err := local.NewLocalFileReader("output.parquet")
    if err != nil {
        log.Fatalf("Can't read file: %v", err)
    }

    pr, err := reader.NewParquetReader(fr, new(Value), 1)
    if err != nil {
        log.Fatalf("Can't create parquet reader: %v", err)
    }

    num := int(pr.GetNumRows())

    vals := make([]Value, num)

    if err = pr.Read(&vals); err != nil {
        log.Fatalf("Read error: %v", err)
    }

    orderDate := types.INT96ToTime(vals[0].OrderDate)

    // Wrong OrderDate
    fmt.Printf("Expected = %v\n", now)
    fmt.Printf("Got = %v\n", orderDate)

    pr.ReadStop()
    _ = fr.Close()
}

hangxie commented 2 years ago

First of all, INT96 is deprecated, consider using something else if you can.

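The problem is that INT96 is stored as a string internally, even though its bytes are not a valid UTF-8 string. When json.Marshal serializes it as a JSON string, it does not return an error; it silently replaces every invalid byte with the Unicode replacement character (U+FFFD), so the original bytes are lost.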

This is related to https://github.com/xitongsys/parquet-go/issues/434 and https://github.com/xitongsys/parquet-go/issues/321, both of which are caused by the internal representation of []byte as string.

Mort4lis commented 2 years ago

Thank you for the reply! Yes, indeed, I store the Julian date as a byte representation in the INT96 column type, and these bytes are not valid Unicode code points.