Closed by pjebs 4 years ago
Hi @pjebs, you can use a LIST whose elements are OPTIONAL.
@propersam
type Student struct {
	Name *string `parquet:"name=name, type=UTF8, encoding=PLAIN_DICTIONARY"`
	Age  *int32  `parquet:"name=age, type=INT32"`
	Id   *int64  `parquet:"name=id, type=INT64"`
}
Using a struct like above to encode nil values doesn't seem to work according to this site: http://parquet-viewer-online.com/
Is encoding nil values a new feature that the site possibly doesn't support yet?
Hi @pjebs, here is some sample code:
package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/writer"
)

// Student has a required name and a list of optional (nullable) class names.
type Student struct {
	Name  string    `parquet:"name=name, type=UTF8, encoding=PLAIN_DICTIONARY"`
	Class []*string `parquet:"name=class, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
}

func main() {
	var err error
	fw, err := local.NewLocalFileWriter("output/a.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	// write
	pw, err := writer.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}
	pw.RowGroupSize = 128 * 1024 * 1024 // 128M
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 10
	class0, class1 := "math", "physics"
	for i := 0; i < num; i++ {
		stu := Student{
			Name: "StudentName",
			// nil entries are written as null list elements
			Class: []*string{nil, nil, &class0, &class1},
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	// read
	fr, err := local.NewLocalFileReader("output/a.parquet")
	if err != nil {
		log.Println("Can't open file", err)
		return
	}
	pr, err := reader.NewParquetReader(fr, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	stus := make([]Student, num)
	if err = pr.Read(&stus); err != nil {
		log.Println("Read error", err)
	}
	log.Println(stus)
	pr.ReadStop()
	fr.Close()
}
results:
2020/05/09 15:15:04 Write Finished
2020/05/09 15:15:04 [{StudentName [<nil> <nil> 0xc0000f0790 0xc0000f07a0]} {StudentName [<nil> <nil> 0xc0000f07b0 0xc0000f07c0]} {StudentName [<nil> <nil> 0xc0000f07e0 0xc0000f07f0]} {StudentName [<nil> <nil> 0xc000101650 0xc0001016d0]} {StudentName [<nil> <nil> 0xc000101730 0xc0001017a0]} {StudentName [<nil> <nil> 0xc0001017b0 0xc000101820]} {StudentName [<nil> <nil> 0xc00058c010 0xc00058c020]} {StudentName [<nil> <nil> 0xc00058c030 0xc00058c040]} {StudentName [<nil> <nil> 0xc00058c050 0xc00058c060]} {StudentName [<nil> <nil> 0xc0000ba570 0xc0000ba580]}]
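The non-nil entries above print as pointer addresses because Class is a []*string, which makes the round trip hard to check by eye. A minimal sketch of a helper that dereferences the entries before printing; formatClass is a hypothetical name used for illustration and is not part of parquet-go:

```go
package main

import (
	"fmt"
	"strings"
)

// formatClass is a hypothetical helper (not part of parquet-go). It renders a
// []*string with nil entries shown as "<nil>" and all other entries dereferenced.
func formatClass(class []*string) string {
	parts := make([]string, len(class))
	for i, p := range class {
		if p == nil {
			parts[i] = "<nil>"
		} else {
			parts[i] = *p
		}
	}
	return "[" + strings.Join(parts, " ") + "]"
}

func main() {
	class0, class1 := "math", "physics"
	fmt.Println(formatClass([]*string{nil, nil, &class0, &class1}))
	// prints: [<nil> <nil> math physics]
}
```

Calling it on each stus[i].Class after pr.Read should show the two null elements followed by the two class names, assuming the file was written as in the sample.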
If I have a very long list, this approach seems to create a lot of memory pressure, because the whole list has to be passed in a single call: if err = pw.Write(stu); err != nil {
I'm trying to write an export function for my package: https://github.com/rocketlaunchr/dataframe-go
The data could look like this (but nullable)
+-----+----------------+----------------+-----------+--------------------------------+
| | NAME | TITLE | BASE RATE | MEETING TIME |
+-----+----------------+----------------+-----------+--------------------------------+
| 0: | Cordia Jacobi | Consultant | 84 | 2020-02-02 23:13:53.015324 |
| | | | | +0000 UTC |
| 1: | Nickolas Emard | NaN | 44 | 2020-02-03 23:13:53.015324 |
| | | | | +0000 UTC |
| 2: | Hollis Dickens | Representative | 44 | 2020-02-04 23:13:53.015324 |
| | | | | +0000 UTC |
| 3: | Stacy Dietrich | NaN | 86 | 2020-02-05 23:13:53.015324 |
| | | | | +0000 UTC |
| 4: | Aleen Legros | Officer | 42 | 2020-02-06 23:13:53.015324 |
| | | | | +0000 UTC |
| 5: | Adelia Metz | Architect | 36 | 2020-02-07 23:13:53.015324 |
| | | | | +0000 UTC |
| 6: | Sunny Gerlach | NaN | 56 | 2020-02-08 23:13:53.015324 |
| | | | | +0000 UTC |
| 7: | Austin Hackett | NaN | 78 | 2020-02-09 23:13:53.015324 |
| | | | | +0000 UTC |
+-----+----------------+----------------+-----------+--------------------------------+
| 8X4 | STRING | STRING | INT64 | TIME |
+-----+----------------+----------------+-----------+--------------------------------+
These series could have millions of rows. With your approach I would have to define a struct:
type Student struct {
	Name        []*string `parquet:"name=name, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
	Title       []*string `parquet:"name=title, type=LIST, valuetype=UTF8, valuerepetitiontype=OPTIONAL"`
	Rate        []*int64  `parquet:"name=rate, type=LIST, valuetype=INT64, valuerepetitiontype=OPTIONAL"`
	MeetingTime []*int64  `parquet:"name=meeting_time, type=LIST, valuetype=TIMESTAMP_MICROS, valuerepetitiontype=OPTIONAL"`
}
Then I would save one record whose slice fields each hold potentially millions of values.
Hi @pjebs, you should define the struct as:
type Student struct {
	Name        *string `parquet:"name=name, type=UTF8, repetitiontype=OPTIONAL"`
	Title       *string `parquet:"name=title, type=UTF8, repetitiontype=OPTIONAL"`
	Rate        *int64  `parquet:"name=rate, type=INT64, repetitiontype=OPTIONAL"`
	MeetingTime *int64  `parquet:"name=meeting_time, type=TIMESTAMP_MICROS, repetitiontype=OPTIONAL"`
}
Each row of the table, not each column, is one object.
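To make the row-versus-column point concrete, here is a minimal sketch of turning column-oriented series into one struct per row. It uses only the standard library so it runs on its own; Row and rowAt are hypothetical names, only three of the four columns are shown, and the columns are assumed to have equal length with nil marking a null cell.

```go
package main

import "fmt"

// Row holds one value per column; a nil pointer means the cell is null.
// The tags follow the struct suggested in this thread.
type Row struct {
	Name  *string `parquet:"name=name, type=UTF8, repetitiontype=OPTIONAL"`
	Title *string `parquet:"name=title, type=UTF8, repetitiontype=OPTIONAL"`
	Rate  *int64  `parquet:"name=rate, type=INT64, repetitiontype=OPTIONAL"`
}

// rowAt is a hypothetical helper that builds the i-th row from column slices.
// It assumes all columns have the same length.
func rowAt(names, titles []*string, rates []*int64, i int) Row {
	return Row{Name: names[i], Title: titles[i], Rate: rates[i]}
}

func main() {
	n0, n1 := "Cordia Jacobi", "Nickolas Emard"
	t0 := "Consultant"
	r0, r1 := int64(84), int64(44)

	names := []*string{&n0, &n1}
	titles := []*string{&t0, nil} // the second title is null
	rates := []*int64{&r0, &r1}

	for i := range names {
		row := rowAt(names, titles, rates, i)
		// With parquet-go, each row would be passed to pw.Write(row) here,
		// so only one small row struct is built per iteration.
		fmt.Println(*row.Name, row.Title == nil, *row.Rate)
	}
}
```

Writing row by row this way avoids building a single record whose fields each hold the entire column.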
@propersam
How can I save a series of strings where some values are nil?
I can't see how to do it in the docs.