xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.25k stars 294 forks

Is there a way to append data to an existing parquet file, or alternatively can I read and write data in a pipeline? #500

Open yorsita opened 2 years ago

yorsita commented 2 years ago

Hi team, I am writing data to a parquet file in several passes. In some cases I want to append data to an existing parquet file. I saw that someone asked a similar question 4 years ago, and I was wondering whether there is currently a way to do that. Alternatively, can I read the parquet file into a buffer, append data at the end, and flush it back to the same file?

Thanks!

FourSpaces commented 1 year ago

Parquet files are different from ordinary text files: you cannot simply append data to the end of an existing parquet file. You can work around this in one of the following ways:

  1. Read the data back from multiple small parquet files and write it out again as one large parquet file.

  2. Write the data to a text file first. When the text file reaches the size you need, convert its contents into a parquet file.

anjackson commented 10 months ago

The Parquet format does not directly support appending row groups, but fastparquet seems to manage it by patching/editing the end of the file before appending another row group. See https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write for details. I don't know the Parquet format well enough to know whether this is a nasty hack or a perfectly reasonable tactic.
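
For context on why it is the end of the file that has to be patched: an unencrypted Parquet file starts with the 4-byte magic `PAR1` and ends with the footer metadata, followed by a 4-byte little-endian footer length and `PAR1` again. Anything appended after that trailer would not be found by a reader, so the footer has to be rewritten after any new row group. The sketch below only locates the footer in a byte slice using the Go standard library; `footerSpan` and `buildFile` are hypothetical helper names, the test data is a fake file rather than real Parquet, and it does not decode the Thrift metadata or say anything about how fastparquet does its patching.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const magic = "PAR1"

// footerSpan returns the offset and length of the footer metadata, given
// the full contents of an unencrypted Parquet file.
func footerSpan(file []byte) (start, length int, err error) {
	// Smallest possible layout: leading magic + footer length + trailing magic.
	if len(file) < 12 {
		return 0, 0, errors.New("too short to be a parquet file")
	}
	if !bytes.HasPrefix(file, []byte(magic)) || !bytes.HasSuffix(file, []byte(magic)) {
		return 0, 0, errors.New("missing PAR1 magic")
	}
	tail := file[len(file)-8:]
	length = int(binary.LittleEndian.Uint32(tail[:4]))
	start = len(file) - 8 - length
	if start < 4 {
		return 0, 0, errors.New("footer length exceeds file size")
	}
	return start, length, nil
}

// buildFile assembles a fake file with the same outer layout, for demonstration only.
func buildFile(body, meta []byte) []byte {
	var n [4]byte
	binary.LittleEndian.PutUint32(n[:], uint32(len(meta)))
	out := append([]byte(magic), body...)
	out = append(out, meta...)
	out = append(out, n[:]...)
	return append(out, magic...)
}

func main() {
	file := buildFile([]byte("rowgroup-bytes"), []byte("fake-footer"))
	start, length, err := footerSpan(file)
	if err != nil {
		panic(err)
	}
	// New row groups would have to be written starting at `start`,
	// followed by a regenerated footer and trailer.
	fmt.Println("footer starts at", start, "and is", length, "bytes long")
}
```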