xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Feature Request: Seek to RowGroup #461

Open zolstein opened 2 years ago

zolstein commented 2 years ago

In theory, one of the advantages of the parquet format is the ability to use metadata in the footer to avoid processing the entire file in order to locate specific records of interest. Specifically, one wants to use the RowGroup's min/max values per column to avoid processing RowGroups that don't contain records with particular values.
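
To make the idea concrete, here is a minimal sketch of min/max pruning for a range query. The `rowGroupRange` type and `overlapping` function are hypothetical stand-ins for footer metadata, not parquet-go types, and the sketch assumes every row group carries statistics.

```go
package main

import "fmt"

// rowGroupRange is a hypothetical, simplified view of one column's
// min/max statistics for a single row group. It is not a parquet-go type.
type rowGroupRange struct {
	Min, Max int64
}

// overlapping returns the indexes of row groups whose [Min, Max] interval
// intersects the query interval [lo, hi]. All other row groups cannot
// contain a matching value and can be skipped without being read.
func overlapping(groups []rowGroupRange, lo, hi int64) []int {
	var out []int
	for i, g := range groups {
		if g.Min <= hi && g.Max >= lo {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	groups := []rowGroupRange{
		{Min: 0, Max: 99},
		{Min: 100, Max: 199},
		{Min: 200, Max: 299},
		{Min: 300, Max: 399},
	}
	// Only row groups 1 and 2 can hold values in [150, 250].
	fmt.Println(overlapping(groups, 150, 250)) // prints [1 2]
}
```

The benefit only materializes if the reader can then jump straight to the selected row groups instead of scanning everything before them.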

In practice, I can't see a way to do that using this library. SkipRows does almost what is needed, but the API doesn't make it possible (or at least easy) to navigate between row groups, and it needs to process every page so it doesn't provide the performance benefit.

I propose a new method on the Reader and ColumnReader types: SeekRowGroup(index int64) error that logically moves the reader to the start of the row group. This, in conjunction with the metadata in the footer, can be used to efficiently skip RowGroups that are known not to contain desired records.
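
As a rough illustration of the intended semantics, the sketch below models a reader over in-memory row groups. `toyReader` and its `Read` method are hypothetical and are not parquet-go's reader or the proof-of-concept implementation. Only the `SeekRowGroup(index int64) error` signature comes from the proposal above.

```go
package main

import (
	"errors"
	"fmt"
)

// toyReader is a hypothetical in-memory model of a reader.
// rowGroups[i] holds the rows of row group i.
type toyReader struct {
	rowGroups [][]int64
	group     int // index of the current row group
	offset    int // next row to read within the current row group
}

// SeekRowGroup moves the cursor to the first row of the given row group
// without visiting any rows in the row groups that are skipped.
func (r *toyReader) SeekRowGroup(index int64) error {
	if index < 0 || index >= int64(len(r.rowGroups)) {
		return errors.New("row group index out of range")
	}
	r.group, r.offset = int(index), 0
	return nil
}

// Read returns up to n rows starting at the cursor, crossing
// row-group boundaries as needed.
func (r *toyReader) Read(n int) []int64 {
	var out []int64
	for len(out) < n && r.group < len(r.rowGroups) {
		rg := r.rowGroups[r.group]
		if r.offset >= len(rg) {
			r.group++
			r.offset = 0
			continue
		}
		out = append(out, rg[r.offset])
		r.offset++
	}
	return out
}

func main() {
	r := &toyReader{rowGroups: [][]int64{{1, 2, 3}, {10, 20}, {100, 200, 300}}}
	if err := r.SeekRowGroup(2); err != nil {
		fmt.Println("seek failed:", err)
		return
	}
	fmt.Println(r.Read(2)) // prints [100 200]
}
```

In a real file the seek would reposition each column reader at the row group's offset rather than reset two integers, but the contract seen by the caller would be the same.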

If you have any interest in including a feature like this, I have a proof-of-concept that seems to work and that I can flesh out.

hangxie commented 2 years ago

Something like this (note that this lacks lots of nil/empty checks), maybe? My personal opinion is that this is kind of "easy":

    for rgIndex, rg := range reader.Footer.RowGroups {
        for _, col := range rg.Columns {
            // TODO: check the full path, not just the leaf name
            path := col.MetaData.PathInSchema
            // the last element is at len(path)-1; indexing at len(path) would panic
            if path[len(path)-1] != "FieldToCheck" {
                continue
            }
            // check col.MetaData.Statistics.MaxValue and col.MetaData.Statistics.MinValue
            // and return rgIndex that matches criteria
        }
    }

There are definitely valid use cases for this, though I have never encountered one. Note that min and max are not mandatory so this functionality only works for a certain number of parquet files.
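
Because the statistics are optional, any check has to treat a missing value as "cannot rule this row group out". The sketch below assumes an INT64 column whose min/max are available as raw bytes in Parquet's PLAIN encoding (8 bytes, little-endian). `decodeInt64Stat`, `mayContain`, and `encodeInt64` are hypothetical helper names, not part of parquet-go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeInt64Stat decodes a PLAIN-encoded INT64 statistic
// (8 bytes, little-endian). ok is false when the statistic is
// absent (nil or empty) or has an unexpected length.
func decodeInt64Stat(b []byte) (v int64, ok bool) {
	if len(b) != 8 {
		return 0, false
	}
	return int64(binary.LittleEndian.Uint64(b)), true
}

// mayContain reports whether a row group with the given raw min/max
// could hold target. Missing statistics yield true, so a row group is
// never skipped unless the statistics prove it cannot match.
func mayContain(minRaw, maxRaw []byte, target int64) bool {
	lo, okLo := decodeInt64Stat(minRaw)
	hi, okHi := decodeInt64Stat(maxRaw)
	if !okLo || !okHi {
		return true
	}
	return lo <= target && target <= hi
}

// encodeInt64 builds the 8-byte little-endian form used in the example.
func encodeInt64(v int64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, uint64(v))
	return b
}

func main() {
	minRaw, maxRaw := encodeInt64(100), encodeInt64(199)
	fmt.Println(mayContain(minRaw, maxRaw, 150)) // prints true
	fmt.Println(mayContain(minRaw, maxRaw, 250)) // prints false
	fmt.Println(mayContain(nil, nil, 250))       // prints true (no statistics)
}
```

Other physical types need their own decoding and ordering rules, so a real implementation would switch on the column type before comparing.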

zolstein commented 2 years ago

Something like this (note that this lacks lots of nil/empty checks), maybe?

Yeah, that is (more or less) how you'd identify row groups you care about. To clarify, though, the issue is that having done those checks there's no (easy, non-super-invasive) way to seek the ParquetReader into the right spot to consume from the beginning of the row-group. That's what the SeekRowGroup method solves.

note that min and max are not mandatory so this functionality only works for a certain number of parquet files.

True, but the files being consumed are most likely generated using this library, and it does set the Min/MaxValue fields.

zolstein commented 2 years ago

I posted a draft PR of my PoC here: https://github.com/xitongsys/parquet-go/pull/469