xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

NewCSVWriterFromWriter: release v1.6.0 does not work as expected; current master works fine. #399

Closed by dhawal1248 3 years ago

dhawal1248 commented 3 years ago

The latest release v1.6.0 and the current master behave differently when running the code below. The parquet file generated by release v1.6.0 is missing some rows. The parquet file generated by the current master looks fine and contains all rows.

I tested the current master via a replace directive in the project's go.mod file (replace github.com/xitongsys/parquet-go v1.6.0 => <path to lib code>).

The following code produces different parquet files depending on which of the two versions it is built against.

package main

import (
    "bufio"
    "fmt"
    "os"

    "github.com/xitongsys/parquet-go/writer"
)

func main() {
    md := []string{
        "name=Col1, type=BYTE_ARRAY, convertedtype=UTF8",
        "name=Col2, type=BYTE_ARRAY, convertedtype=UTF8",
        "name=Col3, type=BYTE_ARRAY, convertedtype=UTF8",
    }
    fname := fmt.Sprintf("%s/parquetTest/test.parquet", os.Getenv("HOME"))
    file, err := os.OpenFile(fname, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0660)
    if err != nil {
        fmt.Println(err)
        return
    }
    w := bufio.NewWriter(file)

    // The third argument is np, the number of goroutines the writer uses.
    pw, err := writer.NewCSVWriterFromWriter(md, w, 4)
    if err != nil {
        fmt.Println("Can't create csv writer", err)
        return
    }

    num := 10
    for i := 0; i < num; i++ {
        data2 := []interface{}{
            fmt.Sprintf("Col1Val%d", i),
            fmt.Sprintf("Col2Val%d", i),
            fmt.Sprintf("Col3Val%d", i),
        }
        if err = pw.Write(data2); err != nil {
            fmt.Println("Write error", err)
        }
    }
    if err = pw.WriteStop(); err != nil {
        fmt.Println("WriteStop error", err)
    }
    fmt.Println("Write Finished")
    err = w.Flush()
    if err != nil {
        fmt.Println(err)
    }
    err = file.Close()
    if err != nil {
        fmt.Println(err)
    }
}
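The program above wraps the file in a bufio.Writer and calls WriteStop, then Flush, then Close. As a side note on why that order matters, here is a minimal standard-library sketch. It is not a diagnosis of the v1.6.0 behaviour. The helper name bufferedLens and the in-memory sink are assumptions made for illustration only.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// bufferedLens (hypothetical helper) writes payload through a bufio.Writer
// into an in-memory sink and reports how many bytes had reached the sink
// before and after Flush.
func bufferedLens(payload string) (before, after int, err error) {
	var sink bytes.Buffer
	w := bufio.NewWriter(&sink) // default buffer size is 4096 bytes

	if _, err = w.WriteString(payload); err != nil {
		return 0, 0, err
	}
	before = sink.Len() // small payloads are still held in the buffer

	if err = w.Flush(); err != nil {
		return before, 0, err
	}
	after = sink.Len() // everything has now reached the sink
	return before, after, nil
}

func main() {
	before, after, err := bufferedLens("PAR1")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(before, after) // 0 4
}
```

A missing Flush would be expected to cut off the end of the file, which is a different symptom from blank rows interleaved with valid ones.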

When this code is built against the latest release (v1.6.0), the parquet file does not contain all rows: some rows come out empty and some values never appear. The file looks something like this:

+----------+----------+----------+
| Col1     | Col2     | Col3     |
|----------+----------+----------|
|          |          |          |
| Col1Val0 | Col2Val0 | Col3Val0 |
| Col1Val1 | Col2Val1 | Col3Val1 |
|          |          |          |
| Col1Val3 | Col2Val3 | Col3Val3 |
| Col1Val4 | Col2Val4 | Col3Val4 |
|          |          |          |
| Col1Val6 | Col2Val6 | Col3Val6 |
| Col1Val7 | Col2Val7 | Col3Val7 |
|          |          |          |
+----------+----------+----------+

When it is built against the current master, it works as expected and the file looks like this:

+----------+----------+----------+
| Col1     | Col2     | Col3     |
|----------+----------+----------|
| Col1Val0 | Col2Val0 | Col3Val0 |
| Col1Val1 | Col2Val1 | Col3Val1 |
| Col1Val2 | Col2Val2 | Col3Val2 |
| Col1Val3 | Col2Val3 | Col3Val3 |
| Col1Val4 | Col2Val4 | Col3Val4 |
| Col1Val5 | Col2Val5 | Col3Val5 |
| Col1Val6 | Col2Val6 | Col3Val6 |
| Col1Val7 | Col2Val7 | Col3Val7 |
| Col1Val8 | Col2Val8 | Col3Val8 |
| Col1Val9 | Col2Val9 | Col3Val9 |
+----------+----------+----------+
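To make the difference between the two outputs explicit, a small check can list which expected values never show up. The sketch below uses only the standard library and assumes the rows have already been read back into string slices by some means; the names missingRows and v160Rows are hypothetical, and the sample data is transcribed from the v1.6.0 table above.

```go
package main

import "fmt"

// missingRows returns every index i in [0, num) for which no row carries
// "Col1Val<i>" in its first column. Blank or empty rows are skipped.
func missingRows(rows [][]string, num int) []int {
	seen := make(map[string]bool, len(rows))
	for _, r := range rows {
		if len(r) > 0 && r[0] != "" {
			seen[r[0]] = true
		}
	}
	var missing []int
	for i := 0; i < num; i++ {
		if !seen[fmt.Sprintf("Col1Val%d", i)] {
			missing = append(missing, i)
		}
	}
	return missing
}

// v160Rows reproduces the rows shown in the v1.6.0 table, with blank rows
// represented as empty strings.
func v160Rows() [][]string {
	blank := []string{"", "", ""}
	row := func(i int) []string {
		return []string{
			fmt.Sprintf("Col1Val%d", i),
			fmt.Sprintf("Col2Val%d", i),
			fmt.Sprintf("Col3Val%d", i),
		}
	}
	return [][]string{
		blank, row(0), row(1),
		blank, row(3), row(4),
		blank, row(6), row(7),
		blank,
	}
}

func main() {
	fmt.Println(missingRows(v160Rows(), 10)) // [2 5 8 9]
}
```

Run against the rows from the master build, the same check would return an empty list, since all ten values are present there.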

Can you please create a fresh release if this was a bug in v1.6.0 which was later fixed? Thanks in advance :) @xitongsys

hstern commented 3 years ago

I am observing crashes with v1.6.0 caused by nil rows, presumably the same bug.

dhawal1248 commented 3 years ago

@hstern you can pin the dependency to a specific commit with go get github.com/xitongsys/parquet-go@8ed6152. This way it will always use the code at commit 8ed6152.

guenhter commented 3 years ago

@xitongsys Would you mind releasing a new version? The current 1.6.0 version (which you get by default with go get) is already quite old compared to the progress here on master. Also, the example code does not work out of the box with 1.6.

xitongsys commented 3 years ago

released v1.6.1