xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.25k stars, 294 forks

reader.go exits at line 337 and stops processing after runtime error: index out of range [x] with length x #595

Open sbyrdsellGIT opened 1 month ago

sbyrdsellGIT commented 1 month ago

github.com/xitongsys/parquet-go v1.6.2

PROBLEM

I have a large, complex structure (fields of type struct, string, int64, bool, map[string]struct, and map[string][]*string) and 1 TB of records that I need to process.

If I run

fr, err := local.NewLocalFileReader(parquetFile)
if err != nil {
    log.Println("Can't open file:", parquetFile)
    os.Exit(1)
}

pr, err := reader.NewParquetReader(fr, nil, 10) // np = 10 (int64 parallel number)
if err != nil {
    log.Println("Can't create parquet reader:", err)
    return
}

stus := make([]myStruct, 10) // read 10 rows
if err = pr.Read(&stus); err != nil {
    log.Printf("Read error: %s", err)
}

If any of the 10 records fails, the failure in my case occurs at reader.go line 337, during the marshal.Unmarshal call, i.e.

if err2 := marshal.Unmarshal(&tmap, b, e, dstList[index], pr.SchemaHandler, prefixPath); err2 != nil {

After the error, Read returns runtime error: index out of range [x] with length x and does not return any of the records that marshal.Unmarshal decoded successfully, so the application loses all 10 records.
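To illustrate the behavior I would like to have, here is a minimal, stdlib-only sketch. It is not parquet-go code and not how reader.go is implemented. The names row, decodeRow, safeDecode, and decodeBatch are all hypothetical. It assumes each record can be decoded independently, turns a panic on one record into an ordinary error with recover, and keeps the records that decoded successfully.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a hypothetical stand-in for one raw record; it is not a parquet-go type.
type row []int

// decodeRow is a hypothetical decoder that panics on malformed input,
// the same way an out-of-range slice index does.
func decodeRow(r row) int {
	return r[0] + r[1] // panics when len(r) < 2
}

// safeDecode converts a panic raised by decodeRow into an ordinary error.
func safeDecode(r row) (v int, err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("decode failed: %v", p)
		}
	}()
	return decodeRow(r), nil
}

// decodeBatch decodes rows concurrently. Each goroutine writes only to its
// own index, so a failure on one row leaves the other results intact.
func decodeBatch(rows []row) ([]int, []error) {
	out := make([]int, len(rows))
	errs := make([]error, len(rows))
	var wg sync.WaitGroup
	for i, r := range rows {
		wg.Add(1)
		go func(i int, r row) {
			defer wg.Done()
			out[i], errs[i] = safeDecode(r)
		}(i, r)
	}
	wg.Wait()
	return out, errs
}

func main() {
	// The middle row is too short and triggers the index-out-of-range panic.
	out, errs := decodeBatch([]row{{1, 2}, {7}, {3, 4}})
	for i := range out {
		if errs[i] != nil {
			fmt.Printf("row %d skipped: %v\n", i, errs[i])
			continue
		}
		fmt.Printf("row %d = %d\n", i, out[i])
	}
}
```

In this sketch the caller gets one error slot per record, so only the malformed record is skipped and the other two are still usable.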

WORKAROUND

If I set

pr, err := reader.NewParquetReader(fr, nil, 1) // np = 1 (int64 parallel number)

and

stus := make([]myStruct, 1) // read 1 row
if err = pr.Read(&stus); err != nil {
    log.Printf("Read error: %s", err)
}

then I skip only the 1 record that marshal.Unmarshal failed on, but this makes the process 10x slower.
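One middle ground between the two settings is to read in batches of 10 and fall back to single-row reads only for a batch that fails. The sketch below is stdlib-only and shows just the control flow. readRange, readAll, and fakeReader are hypothetical names, not parquet-go APIs. It assumes the reader can be positioned at an arbitrary row so a failed batch can be re-read, which I have not verified against the library.

```go
package main

import (
	"errors"
	"fmt"
)

// readRange is a hypothetical function type: read n rows starting at row
// start. It is not part of parquet-go; wiring it to a real reader (for
// example by repositioning to start) is an assumption of this sketch.
type readRange func(start, n int) ([]string, error)

// readAll reads total rows in batches of batchSize. When a batch fails, it
// retries that batch one row at a time so that only the bad rows are skipped.
func readAll(read readRange, total, batchSize int) (rows []string, skipped []int) {
	for start := 0; start < total; start += batchSize {
		n := batchSize
		if start+n > total {
			n = total - start
		}
		batch, err := read(start, n)
		if err == nil {
			rows = append(rows, batch...)
			continue
		}
		// Slow path: isolate the failing rows within this batch only.
		for i := start; i < start+n; i++ {
			one, err := read(i, 1)
			if err != nil {
				skipped = append(skipped, i)
				continue
			}
			rows = append(rows, one...)
		}
	}
	return rows, skipped
}

// fakeReader simulates a file in which the rows listed in bad fail to decode.
func fakeReader(bad map[int]bool) readRange {
	return func(start, n int) ([]string, error) {
		out := make([]string, 0, n)
		for i := start; i < start+n; i++ {
			if bad[i] {
				return nil, errors.New("simulated decode failure")
			}
			out = append(out, fmt.Sprintf("row-%d", i))
		}
		return out, nil
	}
}

func main() {
	// 25 rows, batches of 10, row 13 is bad.
	rows, skipped := readAll(fakeReader(map[int]bool{13: true}), 25, 10)
	fmt.Println(len(rows), "rows read; skipped:", skipped)
}
```

With this shape, only the batches that contain a bad record pay the one-row-at-a-time cost. Every other batch keeps the faster setting.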

Does anyone have any suggestions to help me with this error or speed up this process?

-Stan