xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Populating null_counts in ColumnIndex #473

Closed: NoHomey closed 2 years ago

NoHomey commented 2 years ago

Hi

The team I’m part of is working on a project in which we use parquet-go to write Parquet files that are then consumed by Trino, a popular open source SQL query engine. After upgrading Trino, we found that it can no longer read the Parquet files we write unless we disable the use of statistics, which degrades query performance. The exception we were getting occurs because newer versions of Trino assume that the null_counts field in the ColumnIndex is populated: Trino reads the statistics from the ColumnIndex and not from the ColumnMetaData.

We have a small working fix that writes null_counts to the ColumnIndex whenever the null counts in the DataPageHeader(V2).Statistics have been set, and we would like to contribute that code so other people can benefit as well. Please let me know if you would like this contribution in your codebase so I can open a PR.
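As a rough illustration of the idea (not the actual patch), the sketch below gathers one null count per data page and sets the ColumnIndex list only when every page reports one. The types `pageStats` and `columnIndex` and the helpers `collectNullCounts` and `int64Ptr` are hypothetical stand-ins written for this example; they are not parquet-go's API. The all-or-nothing rule is an assumption based on null_counts being a per-page list that would be misleading if partially filled.

```go
package main

import "fmt"

// pageStats is a hypothetical stand-in for the Statistics carried by a
// DataPageHeader / DataPageHeaderV2. NullCount is a pointer because the
// field is optional and may be unset.
type pageStats struct {
	NullCount *int64
}

// columnIndex is a hypothetical stand-in holding only the ColumnIndex
// fields relevant to this example.
type columnIndex struct {
	NullPages  []bool
	NullCounts []int64 // left nil unless every page reports a null count
}

// collectNullCounts returns one null count per page, in page order.
// It reports ok=false when there are no pages, or when any page lacks
// statistics or a null count, so the caller can leave the list unset.
func collectNullCounts(pages []*pageStats) (counts []int64, ok bool) {
	if len(pages) == 0 {
		return nil, false
	}
	counts = make([]int64, 0, len(pages))
	for _, s := range pages {
		if s == nil || s.NullCount == nil {
			return nil, false
		}
		counts = append(counts, *s.NullCount)
	}
	return counts, true
}

// int64Ptr is a small helper for building optional values.
func int64Ptr(v int64) *int64 { return &v }

func main() {
	pages := []*pageStats{
		{NullCount: int64Ptr(0)},
		{NullCount: int64Ptr(3)},
		{NullCount: int64Ptr(1)},
	}
	idx := columnIndex{NullPages: []bool{false, false, false}}
	if counts, ok := collectNullCounts(pages); ok {
		idx.NullCounts = counts
	}
	fmt.Println(idx.NullCounts) // prints [0 3 1]
}
```

If any page had no null count set, `idx.NullCounts` would stay nil here, so the written file would simply omit the optional field as before.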

xitongsys commented 2 years ago

Yes, please. I've been focused on my new job for a long time, so sorry for the late response.

NoHomey commented 2 years ago

Hi @xitongsys ,

Sorry for the late response on my side as well; I missed the notification email for some reason. I've just opened a PR for the contribution.