xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Corrupted Parquet Statistics in Trino SQL #587

Open cain129 opened 2 months ago

cain129 commented 2 months ago

Hello,

We are using parquet-go v1.6.2 to convert files to Parquet. When we query the resulting files from our SQL database, Trino v380, we get this error:

2024-06-18T20:23:45.343Z ERROR stage-scheduler io.trino.execution.StageStateMachine Stage 20240618_202345_03674_xm9wc.1 failed
io.trino.spi.TrinoException: Corrupted statistics for column "filename" in Parquet file "s3a:///date_part=2024-06-18/.parquet". Corrupted column index: [Boudary order: UNORDERED null count min max page-0 page-1 page-2 page-3 page-4 page-5 page-6 page-7 ]
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:278)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:164)
    at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:290)
    at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:195)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:268)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:196)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
    at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
    at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
    at io.trino.operator.Driver.processInternal(Driver.java:388)
    at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
    at io.trino.operator.Driver.tryWithLock(Driver.java:693)
    at io.trino.operator.Driver.processFor(Driver.java:285)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
    at io.trino.$gen.Trino_380____20240612_170007_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

This could be a bug on Trino's side, but I'm opening the issue here because, compared with Parquet files converted elsewhere, the files written by parquet-go leave out some column statistics, namely the column order, which seems to be the problem here.
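For context on the `UNORDERED` value in the message: the Parquet format's column index carries a boundary order (UNORDERED, ASCENDING, or DESCENDING) that describes whether the per-page min and max values are sorted across pages. The sketch below shows one way such a value could be derived from page statistics. It is only an illustration: `PageStats` and `boundaryOrder` are hypothetical names, not part of parquet-go or Trino, it assumes plain byte-wise comparison of values, it ignores all-null pages, and it does not claim to reproduce the check Trino performs.

```go
package main

import (
	"bytes"
	"fmt"
)

// BoundaryOrder mirrors the three values the Parquet format defines
// for a column index.
type BoundaryOrder string

const (
	Unordered  BoundaryOrder = "UNORDERED"
	Ascending  BoundaryOrder = "ASCENDING"
	Descending BoundaryOrder = "DESCENDING"
)

// PageStats is a hypothetical holder for one page's min and max values.
type PageStats struct {
	Min, Max []byte
}

// boundaryOrder reports whether both the min values and the max values
// are non-decreasing (ASCENDING) or non-increasing (DESCENDING) across
// pages, using byte-wise comparison. Anything else is UNORDERED.
func boundaryOrder(pages []PageStats) BoundaryOrder {
	asc, desc := true, true
	for i := 1; i < len(pages); i++ {
		minCmp := bytes.Compare(pages[i-1].Min, pages[i].Min)
		maxCmp := bytes.Compare(pages[i-1].Max, pages[i].Max)
		if minCmp > 0 || maxCmp > 0 {
			asc = false // some value decreased
		}
		if minCmp < 0 || maxCmp < 0 {
			desc = false // some value increased
		}
	}
	switch {
	case asc:
		return Ascending
	case desc:
		return Descending
	default:
		return Unordered
	}
}

func main() {
	// Made-up page statistics, not taken from the failing file.
	pages := []PageStats{
		{Min: []byte("a"), Max: []byte("c")},
		{Min: []byte("d"), Max: []byte("f")},
		{Min: []byte("b"), Max: []byte("e")},
	}
	fmt.Println(boundaryOrder(pages))
}
```

Comparing the boundary order and page min/max values of a parquet-go file against a file written by another tool may help narrow down which part of the column index the reader rejects.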

robertino commented 2 months ago

hey, not sure 100%, but this could be linked to https://github.com/xitongsys/parquet-go/issues/547