xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars 293 forks

Huge memory consumption when writing millions of entries #415

Open aminmir326 opened 3 years ago

aminmir326 commented 3 years ago

Hi, I'm trying to write a very big dataset (millions of entries) to a parquet file. However, huge memory allocations are happening within this library, which leads to server restarts. Using the pprof tool, I confirmed that these allocations are happening in the parquet-go package.

Here's the output from pprof after running for roughly an hour:

go tool pprof http://127.0.0.1:8080/debug/pprof/heap

[image: pprof1]

For example, the Flush function in ParquetWriter makes a lot of allocations:

list Flush:

[image: pprof2]

Do you have any tips on how to tune the configuration params in order to mitigate this issue?

hangxie commented 3 years ago

See if https://github.com/xitongsys/parquet-go/blob/0dd71c46430a98430d6430ae76f1f684ada788d5/README.md#tips-3 helps, though personally I don't think several hundred MB of RAM is "huge" when you work with parquet, which is mainly used for "big data".

aminmir326 commented 3 years ago

The actual memory consumption is far more than several hundred MB of RAM. pprof shows only a fraction of what is actually happening. In my case, memory consumption reaches 16GB, resulting in server restarts. It was around 10GB when I was profiling the code. What's more interesting is that even if I cancel the whole operation, the memory is not freed. For example, if it's at 8GB when the operation is canceled, it stays around that level even after WriteStop is called. But thanks for the tip, I'll try tuning those parameters to see if there's any effect.
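
One way to tell whether memory that "stays around" is still held by live objects or is merely retained by the Go runtime is to compare a few `runtime.MemStats` fields. The sketch below uses only the standard library and does not involve parquet-go. The `heapSnapshot` type, the `demo` function and the 64 MiB of throwaway buffers are illustrative assumptions, not code from this issue. `debug.FreeOSMemory` forces a garbage collection and asks the runtime to return as much memory to the OS as it can.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapSnapshot (hypothetical helper type) keeps the MemStats fields
// relevant to "is the memory live, or just not returned to the OS yet?".
type heapSnapshot struct {
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapIdle     uint64 // bytes in idle (unused) heap spans
	HeapReleased uint64 // bytes of idle heap already returned to the OS
	Sys          uint64 // total bytes obtained from the OS
}

func takeSnapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		HeapAlloc:    m.HeapAlloc,
		HeapIdle:     m.HeapIdle,
		HeapReleased: m.HeapReleased,
		Sys:          m.Sys,
	}
}

// retained estimates heap memory the runtime could return to the OS
// but is still holding on to.
func (s heapSnapshot) retained() uint64 { return s.HeapIdle - s.HeapReleased }

// demo allocates 64 MiB, snapshots the heap, drops the buffers,
// forces a GC plus release to the OS, and snapshots again.
func demo() (before, after heapSnapshot) {
	bufs := make([][]byte, 0, 64)
	for i := 0; i < 64; i++ {
		bufs = append(bufs, make([]byte, 1<<20))
	}
	before = takeSnapshot()
	runtime.KeepAlive(bufs) // bufs is unreachable after this line

	debug.FreeOSMemory()
	after = takeSnapshot()
	return before, after
}

func main() {
	before, after := demo()
	const mib = 1 << 20
	fmt.Printf("before: alloc=%d MiB retained=%d MiB sys=%d MiB\n",
		before.HeapAlloc/mib, before.retained()/mib, before.Sys/mib)
	fmt.Printf("after:  alloc=%d MiB retained=%d MiB sys=%d MiB\n",
		after.HeapAlloc/mib, after.retained()/mib, after.Sys/mib)
}
```

If `HeapAlloc` drops after the operation is canceled but the process size reported by the OS does not, the memory is most likely idle heap that the runtime has not released yet, rather than objects still referenced by the writer. If `HeapAlloc` itself stays high, something is still holding references.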

aminmir326 commented 3 years ago

Quick update: I tried other libraries and there were no memory issues running the same task, so I believe the issue above comes from this library. The good thing about this library, though, is that it supports parallelism, which is a huge factor when working with massive amounts of data. I've also seen other people complain about its memory issues online.

xitongsys commented 2 years ago

@aminmir326 could you provide more details? Example code that reproduces this would be even better. Thanks