xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars, 293 forks

Out of memory error #411

Closed yukels closed 3 years ago

yukels commented 3 years ago

Hello

In our system we use Python (specifically the pyarrow.parquet package) to create Parquet files. We want to introduce a Go service that serves some information from those files. We use the latest version of your package, v1.6.1.

We have no issues loading and using the Parquet file in Python:

import pyarrow.parquet as pq
In [38]: table2 = pq.read_table('7e27377f15304257b3f83cae950d4a5f.parquet')

In [39]: table2.schema
Out[39]: 
X: float
  -- field metadata --
  PARQUET:field_id: '1'
Y: float
  -- field metadata --
  PARQUET:field_id: '2'
Z: float
  -- field metadata --
  PARQUET:field_id: '3'
timestamp_micro: double
  -- field metadata --
  PARQUET:field_id: '4'
reflectivity: int32
  -- field metadata --
  PARQUET:field_id: '5'
ray: int32
  -- field metadata --
  PARQUET:field_id: '6'
cycle: int32
  -- field metadata --
  PARQUET:field_id: '7'
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 964

In [40]: table2.to_pandas()
Out[40]: 
                 X         Y         Z  timestamp_micro  reflectivity  ray  cycle
0         0.000000 -0.000000 -0.000000     1.528806e+15             4    2    541
1         2.111819 -1.294615 -0.552844     1.528806e+15             0    0    541
2         2.110129 -1.225577 -0.552559     1.528806e+15             2    4    541

As you can see, the file has three float fields, three int32 fields, and one double field. A sample file can be downloaded from here.

Here is the code we run to load the Parquet file:

package main

import (
    "log"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
)

type Data struct {
    X              float32 `parquet:"name=X, type=FLOAT"`
    Y              float32 `parquet:"name=Y, type=FLOAT"`
    Z              float32 `parquet:"name=Z, type=FLOAT"`
    TimestampMicro float64 `parquet:"name=timestamp_micro, type=DOUBLE"`
    Cycle          int32   `parquet:"name=cycle, type=INT32"`
    Ray            int32   `parquet:"name=ray, type=INT32"`
    Reflectivity   int32   `parquet:"name=reflectivity, type=INT32"`
}

func main() {
    path := "7e27377f15304257b3f83cae950d4a5f.parquet"
    fr, err := local.NewLocalFileReader(path)
    if err != nil {
        log.Fatal("Can't open file")
    }

    pr, err := reader.NewParquetReader(fr, new(Data), 8)
    if err != nil {
        log.Fatalf("Can't create parquet reader %s", err)
    }

    num := int(pr.GetNumRows())
    log.Printf("row num %d", num)

    data := make([]Data, 10)
    if err = pr.Read(&data); err != nil {
        log.Fatalf("Read error %s", err)
    }

    log.Printf("%+v", data)
    pr.ReadStop()
    fr.Close()
}

We also tried loading the file with a partial struct containing only the X and Y fields. In this case the file loaded, but the records contained wrong data.

Please advise. Dmitry

yukels commented 3 years ago

Hi, did you have time to look into the issue? We are really stuck on this problem :((

hangxie commented 3 years ago

All fields in the Parquet file are optional, so the struct fields need to be pointers. Changing Data to the following works for me:

type Data struct {
    X              *float32 `parquet:"name=X, type=FLOAT"`
    Y              *float32 `parquet:"name=Y, type=FLOAT"`
    Z              *float32 `parquet:"name=Z, type=FLOAT"`
    TimestampMicro *float64 `parquet:"name=timestamp_micro, type=DOUBLE"`
    Cycle          *int32   `parquet:"name=cycle, type=INT32"`
    Ray            *int32   `parquet:"name=ray, type=INT32"`
    Reflectivity   *int32   `parquet:"name=reflectivity, type=INT32"`
}
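
With pointer fields, a null value in the file would presumably come back as a nil pointer, so code that consumes these records needs a nil check before dereferencing. The following is a minimal sketch of one way to handle that, not part of parquet-go: the `orZero` helper is a hypothetical name introduced here, and it requires Go 1.18 or later for generics. The struct is the one from above, filled in by hand rather than read from a file.

```go
package main

import "fmt"

// Data is the pointer-based struct from the comment above.
type Data struct {
	X              *float32 `parquet:"name=X, type=FLOAT"`
	Y              *float32 `parquet:"name=Y, type=FLOAT"`
	Z              *float32 `parquet:"name=Z, type=FLOAT"`
	TimestampMicro *float64 `parquet:"name=timestamp_micro, type=DOUBLE"`
	Cycle          *int32   `parquet:"name=cycle, type=INT32"`
	Ray            *int32   `parquet:"name=ray, type=INT32"`
	Reflectivity   *int32   `parquet:"name=reflectivity, type=INT32"`
}

// orZero is a hypothetical helper (not a parquet-go function).
// It returns the value p points to, or the zero value of T when p is nil.
func orZero[T any](p *T) T {
	if p == nil {
		var zero T
		return zero
	}
	return *p
}

func main() {
	// Build a row by hand; Y is left nil to stand in for a null value.
	x := float32(2.111819)
	cycle := int32(541)
	row := Data{X: &x, Cycle: &cycle}

	// Dereferencing row.Y directly would panic; orZero falls back to 0.
	fmt.Println(orZero(row.X), orZero(row.Y), orZero(row.Cycle))
}
```

Whether treating a null as zero is acceptable depends on the data; if nulls must be distinguished from real zeros, check the pointer against nil explicitly instead.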

yukels commented 3 years ago

Great! Thanks a lot!