xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Support for []byte as parquet type BYTE_ARRAY #321

Open lonnc opened 4 years ago

lonnc commented 4 years ago

I want to store some raw bytes (e.g. images) in a Parquet file. Currently I'm using:

type JpegTest struct {
  Jpeg []byte `parquet:"name=jpeg, type=UINT_8, repetitiontype=REPEATED"`
}

This works, but I'm assuming (parquet being new to me) that a more efficient route would be to use BYTE_ARRAY directly and not fall back to a LIST of UINT_8s. Perhaps:

type JpegTest struct {
  Jpeg []byte `parquet:"name=jpeg, type=BYTE_ARRAY, repetitiontype=REQUIRED"`
}

As expected, this doesn't work: WriteStop() returns the error:

invalid Parquet Schema?: runtime error: invalid memory address or nil pointer dereference

Would it be possible to add this functionality, or is there an alternative approach I should use?

As an aside, I've also tried using a string field in the struct and converting it to and from a []byte. That works, but it really doesn't feel right.
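For reference, the string-based workaround mentioned above can be sketched with the standard library alone. Go strings may hold arbitrary bytes, so converting a []byte to a string and back is lossless even when the payload is not valid UTF-8. The `JpegRow` type and its helpers are hypothetical names for illustration, and the tag `type=BYTE_ARRAY` on a string field is an assumption about how the workaround was declared; nothing in this sketch calls parquet-go.

```go
package main

import (
	"bytes"
	"fmt"
)

// JpegRow is a hypothetical row type for the string workaround.
// The tag is illustrative only; this sketch never passes it to parquet-go.
type JpegRow struct {
	Jpeg string `parquet:"name=jpeg, type=BYTE_ARRAY"`
}

// NewJpegRow copies the raw bytes into the string field.
func NewJpegRow(raw []byte) JpegRow {
	return JpegRow{Jpeg: string(raw)}
}

// Bytes returns a copy of the stored payload as a byte slice.
func (r JpegRow) Bytes() []byte {
	return []byte(r.Jpeg)
}

func main() {
	// 0xFF 0xD8 is not valid UTF-8, yet the conversion keeps every byte.
	raw := []byte{0xFF, 0xD8, 0xFF, 0xE0, 0x00}
	row := NewJpegRow(raw)
	fmt.Println(len(row.Jpeg), bytes.Equal(raw, row.Bytes()))
}
```

Each conversion copies the data, so a large image is duplicated once on the way in and once on the way out.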

lbjarre commented 3 years ago

I'm having the same problem, though now with v1.6.0 I can't even get the repeated UINT_8 version to work.

After looking at the docs, I wrote these test cases for the solutions I could think of:

package main

import (
    "fmt"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/writer"
)

func write(obj interface{}, filename string) error {
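    // Check the file-open error instead of discarding it, and close the file on return.
    fw, err := local.NewLocalFileWriter(filename)
    if err != nil {
        return fmt.Errorf("failed on open file: %v", err)
    }
    defer fw.Close()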
    pw, err := writer.NewParquetWriter(fw, obj, 4)
    if err != nil {
        return fmt.Errorf("failed on new writer: %v", err)
    }
    for i := 0; i < 10; i++ {
        err := pw.Write(obj)
        if err != nil {
            return fmt.Errorf("failed on write: %v", err)
        }
    }
    if err := pw.WriteStop(); err != nil {
        return fmt.Errorf("failed on write stop: %v", err)
    }
    return nil
}

func main() {

    bytes := []byte{0xDE, 0xAD, 0xBE, 0xEF}

    type ByteArray struct {
        Bytes []byte `parquet:"name=bytes, type=BYTE_ARRAY"`
    }
    fmt.Printf("byte array: %v\n", write(&ByteArray{bytes}, "bytearray.parquet"))

    type Uint8Repeated struct {
        Bytes []byte `parquet:"name=bytes, type=INT32, convertedtype=UINT_8, repetitiontype=REPEATED"`
    }
    fmt.Printf("uint8 repeated: %v\n", write(&Uint8Repeated{bytes}, "uint8repeated.parquet"))

    type Int32Repeated struct {
        Bytes []byte `parquet:"name=bytes, type=INT32, repetitiontype=REPEATED"`
    }
    fmt.Printf("int32 repeated: %v\n", write(&Int32Repeated{bytes}, "int32repeated.parquet"))

    type Uint8List struct {
        Bytes []byte `parquet:"name=bytes, type=MAP, convertedtype=LIST, valuetype=INT32, valueconvertedtype=UINT_8"`
    }
    fmt.Printf("uint8 list: %v\n", write(&Uint8List{bytes}, "uint8list.parquet"))

    type Int32List struct {
        Bytes []byte `parquet:"name=bytes, type=MAP, convertedtype=LIST, valuetype=INT32`
    }
    fmt.Printf("int32 list: %v\n", write(&Int32List{bytes}, "int32list.parquet"))
}

However, all of these fail, in different places:

$ go run main.go
byte array: failed on new writer: type : not a valid Type string
uint8 repeated: failed on write stop: reflect: call of reflect.Value.Int on uint8 Value
int32 repeated: failed on write stop: reflect: call of reflect.Value.Int on uint8 Value
uint8 list: failed on write stop: reflect: call of reflect.Value.Int on uint8 Value
int32 list: failed on write stop: runtime error: invalid memory address or nil pointer dereference
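The three `reflect.Value.Int on uint8 Value` messages match what the standard `reflect` package reports when `Int` is called on a value of an unsigned kind; unsigned values have to be read with `Uint`. The sketch below reproduces that message with the standard library alone. The `intOf` and `widen` helpers are hypothetical, and whether declaring the field as `[]int32` would satisfy parquet-go for an INT32 column is an untested assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// intOf reads v through reflect.Value.Int and turns a panic into an error.
func intOf(v interface{}) (n int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	return reflect.ValueOf(v).Int(), nil
}

// widen copies a byte slice into an []int32, a signed type that Int accepts.
func widen(b []byte) []int32 {
	out := make([]int32, len(b))
	for i, x := range b {
		out[i] = int32(x)
	}
	return out
}

func main() {
	_, err := intOf(uint8(0xDE))
	fmt.Println(err) // reflect: call of reflect.Value.Int on uint8 Value

	n, err := intOf(int32(0xDE))
	fmt.Println(n, err) // 222 <nil>

	fmt.Println(widen([]byte{0xDE, 0xAD})) // [222 173]
}
```

Widening every byte to an int32 takes four times the memory in Go, so this is at most a diagnostic aid, not a replacement for a real BYTE_ARRAY mapping.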

Do any of these tags sound more reasonable than the others? Is there something else that I've missed? Pinging @xitongsys for comments as well.
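One detail in the program above may explain why the last case fails differently: the `Int32List` tag has no closing double quote after `valuetype=INT32`. The `reflect` documentation leaves lookups on such non-conventional tags unspecified, and in current Go releases the lookup simply reports the key as absent. The sketch below shows this with the standard library only; `lookupParquet` is a hypothetical helper, and how parquet-go treats a field whose tag cannot be read is an assumption that this sketch does not test.

```go
package main

import (
	"fmt"
	"reflect"
)

// lookupParquet parses a raw struct tag string and looks up the "parquet" key.
func lookupParquet(tag string) (string, bool) {
	return reflect.StructTag(tag).Lookup("parquet")
}

func main() {
	// Well-formed: the value is wrapped in a pair of double quotes.
	good := `parquet:"name=bytes, type=MAP, convertedtype=LIST, valuetype=INT32"`
	// Malformed: same text as the Int32List tag, with no closing quote.
	bad := `parquet:"name=bytes, type=MAP, convertedtype=LIST, valuetype=INT32`

	v, ok := lookupParquet(good)
	fmt.Printf("good: ok=%v value=%q\n", ok, v)

	v, ok = lookupParquet(bad)
	fmt.Printf("bad:  ok=%v value=%q\n", ok, v)
}
```

Running `go vet` on the original program should flag the malformed tag through its struct tag check.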