xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Incorrect null count with all null values #523

Closed hangxie closed 1 year ago

hangxie commented 1 year ago

parquet-go seems to write the wrong null count when all values of a field are null.

Generate all-nil.parquet with this program:

package main

import (
    "fmt"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/writer"
)

type AllTypes struct {
    F1 string  `parquet:"name=f1, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
    F2 *string `parquet:"name=f2, type=INT96"`
}

func main() {
    fw, err := local.NewLocalFileWriter("all-nil.parquet")
    if err != nil {
        fmt.Println("Can't create local file", err)
        return
    }

    pw, err := writer.NewParquetWriter(fw, new(AllTypes), 4)
    if err != nil {
        fmt.Println("Can't create parquet writer", err)
        return
    }

    for i := 0; i < 10; i++ {
        value := AllTypes{
            F1: fmt.Sprintf("f%d", i),
            F2: nil,
        }
        if err = pw.Write(value); err != nil {
            fmt.Println("Write error", err)
        }
    }
    if err = pw.WriteStop(); err != nil {
        fmt.Println("WriteStop error", err)
        return
    }
    fw.Close()
}

I'm expecting 10 null values, but both parquet-cli (https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md) and my parquet-tools (https://github.com/hangxie/parquet-tools) return 6. I also tested a couple of other cases:

  1. 1 record: reported null count is 1
  2. 6 records: reported null count is 3

SuperEdison commented 1 year ago

v1.6.2 is still incorrect.

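package main

import (
    "fmt"
    "testing"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/writer"
)

type AllTypes struct {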
    F1 string `parquet:"name=f1, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
    F2 string `parquet:"name=f2, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
}

func TestClient_HandleParquet(t *testing.T) {
    fw, err := local.NewLocalFileWriter("all-nil.parquet")
    if err != nil {
        fmt.Println("Can't create local file", err)
        return
    }

    pw, err := writer.NewParquetWriter(fw, new(AllTypes), 4)
    if err != nil {
        fmt.Println("Can't create parquet writer", err)
        return
    }

    for i := 0; i < 10; i++ {
        value := &AllTypes{
            F1: fmt.Sprintf("f%d", i),
            F2: fmt.Sprintf("f%d", i),
        }
        if err = pw.Write(value); err != nil {
            fmt.Println("Write error", err)
        }
    }
    if err = pw.WriteStop(); err != nil {
        fmt.Println("WriteStop error", err)
        return
    }
    fw.Close()
}
[image]

SuperEdison commented 1 year ago

@hangxie do something pls

hangxie commented 1 year ago

https://github.com/xitongsys/parquet-go/releases/tag/v1.6.2 was released almost 2 years ago; try the head of master.

SuperEdison commented 1 year ago

> https://github.com/xitongsys/parquet-go/releases/tag/v1.6.2 was released almost 2 years ago; try the head of master.

This version still has this problem.

hangxie commented 1 year ago

> This version still has this problem.

I don't know what your problem is. This issue is about an incorrect null_count, and since your code inserts no null values, the parquet file I got reports zero null values, which is the right behavior.

Feel free to open a new issue if you believe there is a problem, with minimal sample code and the expected output.