xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars 293 forks source link

Not able to read a parquet file written by CSV writer #370

Closed optimistdk closed 3 years ago

optimistdk commented 3 years ago

Hello there !

Thanks for this lovely library. I have been playing around with the apis in the library and they mostly work. Due to dynamic nature of one our ingestion system, we wanted to use the CSV Writer for writing parquet files. But we consistently seem to have issues reading any type of bytes written in parquet file using CSVWriter.

Sample code:

package main

import (
    "fmt"
    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
    "github.com/xitongsys/parquet-go/writer"
)

type Name struct {
    First           string `parquet:"name=First, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN"`
    Middle          string `parquet:"name=Middle, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN"`
    Last            string `parquet:"name=Last, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN"`
    BirthCity       string `parquet:"name=BirthCity, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN"`
}

func main() {
    md := []string{
        "name=First, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN",
        "name=Middle, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN",
        "name=Last, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN",
        "name=BirthCity, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN",
    }
    // Write the csv parquet file
    fw, err := local.NewLocalFileWriter("csv.parquet")
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    cw, err := writer.NewCSVWriter(md, fw, 2)
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    err = cw.Write([]interface{}{"Harry", "S", "Truman", "Lamar"})
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    err = cw.WriteStop()
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    fw.Close()

    // Now read the csv parquet file
    fr, err := local.NewLocalFileReader("csv.parquet")
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    pr, err := reader.NewParquetReader(fr, new(Name), 4)
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }

    total_rows := int(pr.GetNumRows())

    records := make([]Name, total_rows) //read 1 row
    err = pr.Read(&records)
    if err != nil {
        fmt.Errorf("Error: %+v", err)
    }
    fmt.Printf("Number of records: %d\n", len(records))
    fmt.Printf("Record: %+v\n", records)

        fmt.Println("Using Parquet writer and reader\n")

        // Write the parquet file
        fw, err = local.NewLocalFileWriter("csv2.parquet")
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
        pw, err := writer.NewParquetWriter(fw, new(Name), 2)
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
    rec := Name {
        "Harry",
        "S",
        "Truman",
        "Lamar",
    }
        err = pw.Write(&rec)
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
        err = pw.WriteStop()
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
        fw.Close()

        // Now read the parquet file
        fr, err = local.NewLocalFileReader("csv2.parquet")
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
        pr, err = reader.NewParquetReader(fr, new(Name), 4)
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       

        total_rows = int(pr.GetNumRows())

        records = make([]Name, total_rows) //read 1 row
        err = pr.Read(&records) 
        if err != nil {
                fmt.Errorf("Error: %+v", err)
        }       
        fmt.Printf("Number of records: %d\n", len(records))
        fmt.Printf("Record: %+v\n", records)

}

On running this test file:

$ go run  ~/test.go
Number of records: 1
Record: [{First: Middle: Last: BirthCity:}]
Using Parquet writer and reader

Number of records: 1
Record: [{First:Harry Middle:S Last:Truman BirthCity:Lamar}]

Can you please help me what am i not doing correctly while using the CSV writer.

xitongsys commented 3 years ago

hi, @optimistdk For CSV writer, it only supports OPTIONAL fields in previous version. In old version(<1.6) it will change the schema to OPTIONAL even if the user don't define it. But in 1.6 I delete this statement which induce your issue.

I update the library which now supports REQUIRED field in CSV writer. And I test your code which runs well.

Number of records: 1
Record: [{First:Harry Middle:S Last:Truman BirthCity:Lamar}]
Using Parquet writer and reader

Number of records: 1
Record: [{First:Harry Middle:S Last:Truman BirthCity:Lamar}]

Please use the latest version