xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

How to get bytearray from select columns of a row group? #535

Open hkpeaks opened 1 year ago

hkpeaks commented 1 year ago

Yesterday, I published the first pre-release version of my project; you can find it at https://github.com/hkpeaks/peaks-consolidation/releases. The project processes billion-row CSV files in streaming mode and runs all ETL steps in parallel. My in-memory dataset has no data schema, but one can be applied on demand, e.g. for filtering or aggregating real numbers (the user sets float for a filter and sum for an aggregate).

My next step is to support the Parquet format. I am exploring which Go library is the best fit for keeping my current processing speed. Currently, I keep my in-memory dataset as a byte array read from CSV and do byte-to-byte conversion to support ETL functions such as Distinct, GroupBy, JoinTable and Filter.

To implement Parquet support, I want to use the same processing model, which requires getting a byte array from a Parquet file directly, by selected columns of a row group. That way I can use goroutines to read each row group in parallel. For the first step of development, I plan to focus on reading. If I can achieve a processing speed no worse than CSV when reading Parquet files, I will proceed to writing the Parquet format from byte arrays. Currently I use DuckDB and Polars to convert CSV files to Parquet.
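Here is a minimal sketch of the kind of access I mean, assuming the row-group and column-chunk metadata exposed through this library's `Footer` (the file name and the selected column names are placeholders, and the bytes returned this way are still Parquet-encoded and possibly compressed, so they are not directly comparable to CSV bytes):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"sync"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/reader"
)

// chunkBytes holds the raw bytes of one column chunk of one row group.
type chunkBytes struct {
	rowGroup int
	column   string
	data     []byte
}

// rawColumnChunk seeks to a column chunk and returns its raw bytes.
func rawColumnChunk(fr io.ReadSeeker, col *parquet.ColumnChunk) ([]byte, error) {
	md := col.MetaData
	offset := md.DataPageOffset
	if md.DictionaryPageOffset != nil && *md.DictionaryPageOffset < offset {
		offset = *md.DictionaryPageOffset // chunk starts at the dictionary page if there is one
	}
	buf := make([]byte, md.TotalCompressedSize)
	if _, err := fr.Seek(offset, io.SeekStart); err != nil {
		return nil, err
	}
	if _, err := io.ReadFull(fr, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	const path = "data.parquet"           // placeholder input file
	want := map[string]bool{"Name": true} // placeholder column selection

	// Open once just to read the footer (row group / column chunk metadata).
	fr, err := local.NewLocalFileReader(path)
	if err != nil {
		log.Fatal(err)
	}
	pr, err := reader.NewParquetColumnReader(fr, 1)
	if err != nil {
		log.Fatal(err)
	}
	rowGroups := pr.Footer.RowGroups
	pr.ReadStop()
	fr.Close()

	out := make(chan chunkBytes)
	var wg sync.WaitGroup
	for i, rg := range rowGroups {
		wg.Add(1)
		go func(i int, rg *parquet.RowGroup) {
			defer wg.Done()
			// Each goroutine opens its own reader so Seek/Read calls do not race.
			f, err := local.NewLocalFileReader(path)
			if err != nil {
				log.Println(err)
				return
			}
			defer f.Close()
			for _, col := range rg.Columns {
				name := col.MetaData.PathInSchema[len(col.MetaData.PathInSchema)-1]
				if !want[name] {
					continue
				}
				buf, err := rawColumnChunk(f, col)
				if err != nil {
					log.Println(err)
					continue
				}
				out <- chunkBytes{rowGroup: i, column: name, data: buf}
			}
		}(i, rg)
	}
	go func() { wg.Wait(); close(out) }()

	for c := range out {
		fmt.Printf("row group %d, column %q: %d bytes\n", c.rowGroup, c.column, len(c.data))
	}
}
```

Is there a supported way to get at these column-chunk bytes per row group, or would I have to decode pages through the normal reader anyway?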

I have downloaded your code examples and done some testing, and they seem to work. (I also tried Apache Parquet-Go before this, but I could not get anything to work properly.)
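For reference, this is roughly how I am calling the column reader based on your examples; the file path, root name ("parquet_go_root") and column name ("name") are just placeholders from my test and depend on how the file was written:

```go
package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/common"
	"github.com/xitongsys/parquet-go/reader"
)

func main() {
	fr, err := local.NewLocalFileReader("output/flat.parquet") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer fr.Close()

	// 4 = number of goroutines used internally by the reader.
	pr, err := reader.NewParquetColumnReader(fr, 4)
	if err != nil {
		log.Fatal(err)
	}
	defer pr.ReadStop()

	num := pr.GetNumRows()

	// Read one column by its dotted path; values come back decoded, not as raw bytes.
	vals, _, _, err := pr.ReadColumnByPath(common.ReformPathStr("parquet_go_root.name"), num)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d values from the 'name' column", len(vals))
}
```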