xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

What is the recommended way of marshalling time.Time values to Parquet? #376

Closed. mattwelke closed this issue 3 years ago

mattwelke commented 3 years ago

I was testing out the library and wrote a struct like this:

type user struct {
    ID string `parquet:"name=id,type=BYTE_ARRAY,convertedtype=UTF8,encoding=PLAIN_DICTIONARY"`
}

This worked. I was able to marshal and unmarshal the Parquet data using code I got from a blog post:

func generateParquet(data *user) error {
    log.Println("generating parquet file")
    fw, err := local.NewLocalFileWriter("output.parquet")
    if err != nil {
        return err
    }
    // close the file on every return path, including early errors
    defer fw.Close()
    // parameters: file writer, schema struct, number of parallel goroutines
    pw, err := writer.NewParquetWriter(fw, new(user), 1)
    if err != nil {
        return err
    }
    // compression type
    pw.CompressionType = parquet.CompressionCodec_GZIP
    if err = pw.Write(data); err != nil {
        return err
    }
    // WriteStop flushes buffered rows and writes the file footer
    return pw.WriteStop()
}

func readParquet() ([]*user, error) {
    fr, err := local.NewLocalFileReader("output.parquet")
    if err != nil {
        return nil, err
    }
    // close the file on every return path, including read errors
    defer fr.Close()
    pr, err := reader.NewParquetReader(fr, new(user), 1)
    if err != nil {
        return nil, err
    }
    defer pr.ReadStop()
    // read one row, since the slice has length 1
    u := make([]*user, 1)
    if err = pr.Read(&u); err != nil {
        return nil, err
    }
    return u, nil
}

My main function just makes a user struct and writes it and then reads it back:

func main() {
    if err := generateParquet(&user{
        ID: "abc123",
    }); err != nil {
        log.Fatalf("could not generate Parquet: %v", err)
    }

    fmt.Printf("generated Parquet\n")

    users, err := readParquet()
    if err != nil {
        log.Fatalf("could not read Parquet: %v", err)
    }

    fmt.Printf("read Parquet: %v\n", users[0])
}

I get the following output:

2021/04/01 16:17:58 generating parquet file
generated Parquet
read Parquet: &{abc123}

But if I add a time.Time field, and choose a type from the README that made sense to me (I chose TIME_MICROS), it doesn't work. I get an error:

type user struct {
    ID        string `parquet:"name=id,type=BYTE_ARRAY,convertedtype=UTF8,encoding=PLAIN_DICTIONARY"`
    CreatedAt time.Time `parquet:"name=created_at, type=TIME_MICROS, encoding=PLAIN_DICTIONARY"`
}
...
    if err := generateParquet(&user{
        ID: "abc123",
        CreatedAt: time.Now(),
    }); err != nil {
        log.Fatalf("could not generate Parquet: %v", err)
    }
...
2021/04/01 16:19:55 generating parquet file
2021/04/01 16:19:55 could not generate Parquet: runtime error: invalid memory address or nil pointer dereference
exit status 1

It does work if I switch the type of CreatedAt in my struct to an integer type, like int64. But then I'd have to write my own code to convert between integers and time.Time whenever I marshal to or unmarshal from Parquet. This caught me off guard because time.Time is usually supported natively in Go, as with JSON marshalling, which converts it to a time string and can parse that string back.

Is this manual conversion between integer types and time.Time the way to use Parquet in Go?

mattwelke commented 3 years ago

Actually, I think I found what I was looking for: https://github.com/xitongsys/parquet-go/blob/master/types/converter.go

(the answer being yes - you need to store timestamps as integers and do the conversion yourself)

I just didn't expect to need this file when I was skimming the README. But I'm glad it's documented.

wwaldner-amtelco commented 3 years ago

You mention using converter.go to convert times. How exactly do you use it?
Do you have to define your struct like this?

type user struct {
    ID        string `parquet:"name=id,type=BYTE_ARRAY,convertedtype=UTF8,encoding=PLAIN_DICTIONARY"`
    CreatedAt int64 `parquet:"name=created_at, type=TIME_MICROS, encoding=PLAIN_DICTIONARY"`
}

mattwelke commented 3 years ago

@wwaldner-amtelco Yup, I remember writing code that used an integer data type. int64 sounds right. I figured that if I ever used this in the real world, I would create a Go struct just for marshalling to and from Parquet. If I wanted a time.Time representation of my int64 field, I'd either add a method to that struct or add mapping code between the "to/from Parquet" struct and a domain model struct, so I'd end up with the time.Time field I wanted.

This was just for tinkering on the side. I didn't put anything using this into production so I never got to test that idea.