xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars, 293 forks

slice bounds out of range #462

Open programmerX1123 opened 2 years ago

programmerX1123 commented 2 years ago

Hi, I am parsing a Parquet file whose schema (generated by parquet-tools) is:

{
  "Tag": "name=Schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Timestamp, type=INT64, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=File_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Avro_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Offset, type=INT32, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=File_format, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Meta_data, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    }
  ]
}

I use the following struct to hold the contents of the Parquet file:

type Schema struct {
    Timestamp int64  `parquet:"name=timestamp, type=INT64"`
    AvroName  string `parquet:"name=avro_name, type=BYTE_ARRAY"`
    FileName  string `parquet:"name=file_name, type=BYTE_ARRAY"`
    Offset    int32  `parquet:"name=offset, type=INT32"`
}

When I try to parse a Parquet file that has 4905 rows, the program panics with the following error:

panic: runtime error: slice bounds out of range [:4905] with capacity 3072

But when I run the same code on a Parquet file that has only 5 rows, there is no error (both files are generated by the same script, so they share the same schema). Here is the result:

[{211297138286 Image0.avro 211297138286.png 269475} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58} 
{210997038286 Image0.avro 210997038286.png 58}]

So is there a limit on the size of the Parquet file? Also, when I omit the AvroName field, the first Parquet file can be read successfully (but AvroName is a field of file names just like FileName, so I don't think there is any difference between them). Moreover, I have tested several Parquet files with different numbers of rows, and they all produce the same slice bounds out of range error, so I don't think it is caused by an occasional mistake during generation of the Parquet file. I am really confused and wonder if you can help me fix this bug. Thank you in advance!

hangxie commented 2 years ago

The schema and the Go struct don't match: OPTIONAL fields should be defined as pointers so that they can be nil. If it still doesn't work after changing the definition of type Schema, a sample Parquet file (ideally with a snippet of your source code) would help with troubleshooting.
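
As a rough sketch of that suggestion, the struct from the question could be declared with pointer fields, as below. The tag names are copied unchanged from the original struct and are an assumption; they may need to match the column names in the file. `valueOr` and `describe` are hypothetical helpers that only show how to read pointer fields without dereferencing nil. The snippet uses only the standard library and does not call parquet-go itself.

```go
package main

import "fmt"

// Schema mirrors the four columns read in the question. Every column in the
// file is OPTIONAL, so every field is a pointer: a null value becomes nil.
// Assumption: the name= values are kept exactly as in the original struct.
type Schema struct {
	Timestamp *int64  `parquet:"name=timestamp, type=INT64, repetitiontype=OPTIONAL"`
	AvroName  *string `parquet:"name=avro_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	FileName  *string `parquet:"name=file_name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Offset    *int32  `parquet:"name=offset, type=INT32, repetitiontype=OPTIONAL"`
}

// valueOr returns *p, or def when p is nil (hypothetical helper).
func valueOr[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// describe formats one row, substituting zero values for missing columns.
func describe(r Schema) string {
	return fmt.Sprintf("%d %q %q %d",
		valueOr(r.Timestamp, 0),
		valueOr(r.AvroName, ""),
		valueOr(r.FileName, ""),
		valueOr(r.Offset, 0))
}

func main() {
	ts := int64(211297138286)
	avro := "Image0.avro"
	// A partially filled row: FileName and Offset stay nil.
	fmt.Println(describe(Schema{Timestamp: &ts, AvroName: &avro}))
	// A row in which every optional column is null.
	fmt.Println(describe(Schema{}))
}
```

With pointer fields, code that consumes the rows has to check for nil (or go through a helper like the one above) before dereferencing.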

ZhenSh commented 1 year ago

Hi @programmerX1123, I have run into the same issue. How did you get it resolved? Could you share the details? I'd appreciate it. Thanks.