xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Nil values #245

Closed pjebs closed 4 years ago

pjebs commented 4 years ago

How can I save a series of strings where some values are nil?

I can't see how to do it in the docs.

xitongsys commented 4 years ago

Hi @pjebs, you can use a LIST whose elements are OPTIONAL.

pjebs commented 4 years ago

@propersam

pjebs commented 4 years ago
type Student struct {
    Name   *string  `parquet:"name=name, type=UTF8, encoding=PLAIN_DICTIONARY"`
    Age    *int32   `parquet:"name=age, type=INT32"`
    Id     *int64   `parquet:"name=id, type=INT64"`
}

Using a struct like the one above to encode nil values doesn't seem to work, according to this site: http://parquet-viewer-online.com/

Is encoding nil values a new feature that possibly isn't implemented on the site yet?

xitongsys commented 4 years ago

Hi @pjebs, here is some sample code:

package main

import (
    "log"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/parquet"
    "github.com/xitongsys/parquet-go/reader"
    "github.com/xitongsys/parquet-go/writer"
)

type Student struct {
    Name  string    `parquet:"name=name, type=UTF8, encoding=PLAIN_DICTIONARY"`
    Class []*string `parquet:"name=class, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
}

func main() {
    var err error
    fw, err := local.NewLocalFileWriter("output/a.parquet")
    if err != nil {
        log.Println("Can't create local file", err)
        return
    }

    // write
    pw, err := writer.NewParquetWriter(fw, new(Student), 4)
    if err != nil {
        log.Println("Can't create parquet writer", err)
        return
    }

    pw.RowGroupSize = 128 * 1024 * 1024 //128M
    pw.CompressionType = parquet.CompressionCodec_SNAPPY
    num := 10
    class0, class1 := "math", "physics"
    for i := 0; i < num; i++ {
        stu := Student{
            Name:  "StudentName",
            Class: []*string{nil, nil, &class0, &class1},
        }
        if err = pw.Write(stu); err != nil {
            log.Println("Write error", err)
        }
    }
    if err = pw.WriteStop(); err != nil {
        log.Println("WriteStop error", err)
        return
    }
    log.Println("Write Finished")
    fw.Close()

    // read
    fr, err := local.NewLocalFileReader("output/a.parquet")
    if err != nil {
        log.Println("Can't open file", err)
        return
    }

    pr, err := reader.NewParquetReader(fr, new(Student), 4)
    if err != nil {
        log.Println("Can't create parquet reader", err)
        return
    }
    num = int(pr.GetNumRows())
    stus := make([]Student, num)
    if err = pr.Read(&stus); err != nil {
        log.Println("Read error", err)
    }
    log.Println(stus)

    pr.ReadStop()
    fr.Close()
}

Results:

2020/05/09 15:15:04 Write Finished
2020/05/09 15:15:04 [{StudentName [<nil> <nil> 0xc0000f0790 0xc0000f07a0]} {StudentName [<nil> <nil> 0xc0000f07b0 0xc0000f07c0]} {StudentName [<nil> <nil> 0xc0000f07e0 0xc0000f07f0]} {StudentName [<nil> <nil> 0xc000101650 0xc0001016d0]} {StudentName [<nil> <nil> 0xc000101730 0xc0001017a0]} {StudentName [<nil> <nil> 0xc0001017b0 0xc000101820]} {StudentName [<nil> <nil> 0xc00058c010 0xc00058c020]} {StudentName [<nil> <nil> 0xc00058c030 0xc00058c040]} {StudentName [<nil> <nil> 0xc00058c050 0xc00058c060]} {StudentName [<nil> <nil> 0xc0000ba570 0xc0000ba580]}]
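The non-nil entries in each Class slice print as addresses because log.Println shows the pointer rather than the string it points to. A minimal sketch of a helper that renders such a slice readably follows; formatNullable is a hypothetical name used for illustration and is not part of parquet-go.

```go
package main

import (
	"fmt"
	"strings"
)

// formatNullable is a hypothetical helper (not part of parquet-go).
// It renders a []*string with nil entries shown as "<nil>" and every
// other entry dereferenced to its string value.
func formatNullable(vals []*string) string {
	parts := make([]string, len(vals))
	for i, v := range vals {
		if v == nil {
			parts[i] = "<nil>"
		} else {
			parts[i] = *v
		}
	}
	return "[" + strings.Join(parts, " ") + "]"
}

func main() {
	class0, class1 := "math", "physics"
	fmt.Println(formatNullable([]*string{nil, nil, &class0, &class1}))
	// prints: [<nil> <nil> math physics]
}
```

Calling formatNullable(stu.Class) for each element of stus would show the values that were read back instead of their addresses.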
pjebs commented 4 years ago

If I have a very long list, this approach seems to create a lot of memory pressure because of the pw.Write(stu) call.

I'm trying to write an export function for my package: https://github.com/rocketlaunchr/dataframe-go

The data could look like this (but with nullable values):

+-----+----------------+----------------+-----------+--------------------------------+
|     |      NAME      |     TITLE      | BASE RATE |          MEETING TIME          |
+-----+----------------+----------------+-----------+--------------------------------+
| 0:  | Cordia Jacobi  |   Consultant   |    84     |   2020-02-02 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 1:  | Nickolas Emard |      NaN       |    44     |   2020-02-03 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 2:  | Hollis Dickens | Representative |    44     |   2020-02-04 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 3:  | Stacy Dietrich |      NaN       |    86     |   2020-02-05 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 4:  |  Aleen Legros  |    Officer     |    42     |   2020-02-06 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 5:  |  Adelia Metz   |   Architect    |    36     |   2020-02-07 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 6:  | Sunny Gerlach  |      NaN       |    56     |   2020-02-08 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
| 7:  | Austin Hackett |      NaN       |    78     |   2020-02-09 23:13:53.015324   |
|     |                |                |           |           +0000 UTC            |
+-----+----------------+----------------+-----------+--------------------------------+
| 8X4 |     STRING     |     STRING     |   INT64   |              TIME              |
+-----+----------------+----------------+-----------+--------------------------------+
pjebs commented 4 years ago

These series could have millions of rows. With your approach, I would have to define a struct:

type Student struct {
    Name        []*string `parquet:"name=name, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
    Title       []*string `parquet:"name=title, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
    Rate        []*int64  `parquet:"name=rate, type=LIST, valuetype=INT64, valuerepetitiontype=OPTIONAL"`
    MeetingTime []*int64  `parquet:"name=meeting_time, type=LIST, valuetype=TimeMicros, valuerepetitiontype=OPTIONAL"`
}

Then I would save a single record with potentially millions of values in each slice field.

xitongsys commented 4 years ago

Hi @pjebs, you should define the struct as:

type Student struct {
    Name        *string `parquet:"name=name, type=UTF8, repetitiontype=OPTIONAL"`
    Title       *string `parquet:"name=title, type=UTF8, repetitiontype=OPTIONAL"`
    Rate        *int64  `parquet:"name=rate, type=INT64, repetitiontype=OPTIONAL"`
    MeetingTime *int64  `parquet:"name=meeting_time, type=TimeMicros, repetitiontype=OPTIONAL"`
}

Each object corresponds to one row of the table, not one column.
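For column-oriented data such as the series described earlier in the thread, the row-per-object model means walking the columns in lockstep and emitting one struct per index. The sketch below uses only the standard library; eachRow is a hypothetical helper, MeetingTime is left out for brevity, and the place where pw.Write would be called is marked in a comment rather than executed.

```go
package main

import "fmt"

// Student follows the row-oriented struct above (MeetingTime omitted).
type Student struct {
	Name  *string `parquet:"name=name, type=UTF8, repetitiontype=OPTIONAL"`
	Title *string `parquet:"name=title, type=UTF8, repetitiontype=OPTIONAL"`
	Rate  *int64  `parquet:"name=rate, type=INT64, repetitiontype=OPTIONAL"`
}

// eachRow is a hypothetical helper: it walks three equally long column
// slices and passes one Student per row to emit, so only a single row
// is built at a time. It stops at the first error returned by emit.
func eachRow(names, titles []*string, rates []*int64, emit func(Student) error) error {
	if len(names) != len(titles) || len(names) != len(rates) {
		return fmt.Errorf("column lengths differ: %d, %d, %d", len(names), len(titles), len(rates))
	}
	for i := range names {
		if err := emit(Student{Name: names[i], Title: titles[i], Rate: rates[i]}); err != nil {
			return fmt.Errorf("row %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	n0, n1 := "Cordia Jacobi", "Nickolas Emard"
	t0 := "Consultant"
	r0, r1 := int64(84), int64(44)

	names := []*string{&n0, &n1}
	titles := []*string{&t0, nil} // nil marks the missing title
	rates := []*int64{&r0, &r1}

	err := eachRow(names, titles, rates, func(s Student) error {
		// With parquet-go, this callback would instead be: return pw.Write(s)
		fmt.Println(*s.Name, s.Title == nil, *s.Rate)
		return nil
	})
	if err != nil {
		fmt.Println("export failed:", err)
	}
}
```

Because each Student holds only pointers into the existing column slices, the conversion itself does not copy the underlying values.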

pjebs commented 4 years ago

@propersam