Some memory comparisons from my code.

Before:

```
$ go tool pprof mem.prof
File: my_parquet_reader
Type: inuse_space
Time: Apr 26, 2022 at 11:21am (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 327.31MB, 95.60% of 342.38MB total
Dropped 38 nodes (cum <= 1.71MB)
Showing top 5 nodes out of 60
      flat  flat%   sum%        cum   cum%
  116.43MB 34.00% 34.00%   116.43MB 34.00%  bufio.NewReaderSize
  115.44MB 33.72% 67.72%   115.44MB 33.72%  bufio.NewWriterSize
   62.16MB 18.15% 85.88%    89.78MB 26.22%  github.com/xitongsys/parquet-go/layout.ReadPage
   27.62MB  8.07% 93.95%    27.62MB  8.07%  github.com/xitongsys/parquet-go/encoding.ReadRLEBitPackedHybrid
    5.66MB  1.65% 95.60%     5.66MB  1.65%  regexp/syntax.(*compiler).inst
```
After:

```
$ go tool pprof mem.prof
File: my_parquet_reader
Type: inuse_space
Time: Apr 25, 2022 at 10:16pm (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 67.47MB, 91.12% of 74.04MB total
Showing top 5 nodes out of 87
      flat  flat%   sum%        cum   cum%
   27.62MB 37.31% 37.31%    27.62MB 37.31%  github.com/xitongsys/parquet-go/encoding.ReadRLEBitPackedHybrid
   27.62MB 37.31% 74.62%    55.25MB 74.62%  github.com/xitongsys/parquet-go/layout.ReadPage
    5.51MB  7.44% 82.05%     6.01MB  8.11%  runtime.allocm
    4.70MB  6.34% 88.40%     4.70MB  6.34%  regexp/syntax.(*compiler).inst
    2.02MB  2.72% 91.12%     2.02MB  2.72%  github.com/aws/aws-sdk-go/aws/endpoints.init
```
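(For context, a `mem.prof` like the ones above can be captured with the standard `runtime/pprof` API; a minimal sketch, where the helper name is my own:)

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps an inuse_space heap profile to path.
// Call it after the parquet reads you want to measure.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // collect garbage first so inuse_space reflects live memory
	return pprof.WriteHeapProfile(f)
}
```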
@xitongsys
This change is going to wreck performance and cost for anyone using the S3 data source, since the library will now make a new S3 GET request for every 4K of data. Probably that should be fixed in the S3 data source, but currently it isn't. Furthermore, this change seems to make it hard to do this sizing intelligently in the data source, since it will only be told to fetch bufferSize bytes at a time.
IMO a lot of buffering should be moved out of the reader and into the ParquetFile implementations - this would move that decision into the abstraction that actually knows about the underlying storage.
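To sketch what I mean (hypothetical code, not anything in the library; `BufferedFile` and `NewBufferedFile` are made-up names built against the `source.ParquetFile` interface):

```go
package main

import (
	"io"

	"github.com/xitongsys/parquet-go/source"
)

// BufferedFile wraps any ParquetFile and keeps read-ahead buffering
// inside the storage abstraction, where the appropriate IO chunk
// size for the backend (local disk, S3, ...) is actually known.
type BufferedFile struct {
	source.ParquetFile
	buf    []byte // read-ahead buffer, sized for the storage backend
	start  int64  // file offset of buf[0]
	length int    // number of valid bytes in buf
	pos    int64  // current logical read offset
}

func NewBufferedFile(pf source.ParquetFile, size int) *BufferedFile {
	return &BufferedFile{ParquetFile: pf, buf: make([]byte, size)}
}

func (b *BufferedFile) Seek(offset int64, whence int) (int64, error) {
	abs, err := b.ParquetFile.Seek(offset, whence)
	if err != nil {
		return 0, err
	}
	b.pos = abs // buffered data stays valid; Read checks offsets
	return abs, nil
}

func (b *BufferedFile) Read(p []byte) (int, error) {
	// Serve from the read-ahead buffer when the request falls inside it.
	if off := b.pos - b.start; off >= 0 && off < int64(b.length) {
		n := copy(p, b.buf[off:b.length])
		b.pos += int64(n)
		return n, nil
	}
	if _, err := b.ParquetFile.Seek(b.pos, io.SeekStart); err != nil {
		return 0, err
	}
	// Let large reads (e.g. whole pages) bypass the buffer and go
	// straight to the underlying source.
	if len(p) >= len(b.buf) {
		n, err := b.ParquetFile.Read(p)
		b.pos += int64(n)
		return n, err
	}
	// Otherwise refill with one storage-sized request and copy out.
	n, err := b.ParquetFile.Read(b.buf)
	if n == 0 {
		return 0, err
	}
	b.start, b.length = b.pos, n
	m := copy(p, b.buf[:n])
	b.pos += int64(m)
	return m, nil
}
```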
I'm actually using the S3 data source myself (well, a slightly modified version) to read files and sample a small number of rows from each column.
I added a print to the code that downloads from S3 and compared this version and the latest release. The results were surprising:
https://gist.github.com/joey-laminar/b73a73a4fdadfb09adb1a893752b2fd9
As you can see, there is a much bigger issue: even without this commit there are a large number of 4096-byte reads, caused by the ReadFooter function creating a thrift.NewStreamTransportR, which creates a bufio.NewReader with the default size of 4096.
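One possible shape of a fix, assuming I'm reading the bufio semantics right: bufio.NewReader is just bufio.NewReaderSize(r, 4096), and NewReaderSize returns its argument unchanged when it is already a *bufio.Reader with a large enough buffer, so pre-wrapping the source makes thrift's 4K default a no-op. A sketch only (`newFooterTransport` is a hypothetical helper, not what the library does today):

```go
package main

import (
	"bufio"
	"io"

	"github.com/apache/thrift/lib/go/thrift"
)

// newFooterTransport wraps the source (e.g. the S3-backed reader)
// in a large bufio.Reader before handing it to thrift, so the
// bufio.NewReader inside NewStreamTransportR reuses this buffer
// instead of allocating its own 4096-byte one.
func newFooterTransport(r io.Reader, bufSize int) *thrift.StreamTransport {
	br := bufio.NewReaderSize(r, bufSize) // e.g. 1<<20 for 1MB read-ahead
	return thrift.NewStreamTransportR(br)
}
```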
After that, the difference in the number of reads is not huge, because layout.ReadPage creates a buffer of the page size and then calls io.ReadFull, and thriftReader makes sure to pass large Reads directly through to the underlying Read.
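(The pattern in question is roughly this; paraphrased, not the exact library code:)

```go
// Roughly what layout.ReadPage does: allocate a buffer sized to the
// compressed page and fill it in one call, so page-sized reads pass
// straight through the thriftReader to the underlying source.
buf := make([]byte, pageHeader.GetCompressedPageSize())
if _, err := io.ReadFull(thriftReader, buf); err != nil {
	return nil, err
}
```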
So the reason this PR improves things is that the thriftReader stays around until we finish handling the file, so any buffers it allocated stay allocated; after the PR, the thriftReader's buffer can stay small while large reads still go directly to S3.
I agree there is still much room for improvement, and that intelligently caching at the S3 layer would be useful (although deciding what to cache might still be hard).
Thanks for responding, and for double-checking my concerns!
I did some similar testing myself and reached similar conclusions. I posted a PR to fix the ReadFooter issue as well (https://github.com/xitongsys/parquet-go/pull/460).
In my testing, I didn't see the buffer size increase to the page size: my test data only showed 4K reads in the first 100 requests, and I incorrectly extrapolated that through the whole file. It's certainly better that it will read whole pages at a time when they exceed the buffer size.
However, this still seems bad, and outside of the behavior that a typical user wants or expects. To quote a section of the docs on configuring the page size:
> Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.
The old behavior of buffering the entire column chunk (one column of a row group) seems in line with the expectation of the format, buffering one page at a time is not. Taking their suggestion of using 8KB page sizes limits the amount that reading a full page at a time helps.
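(For anyone writing files with this library, that recommendation maps onto the writer's knobs roughly as below; a sketch, where `fw` and `MyRecord` stand in for a real ParquetFile and row struct:)

```go
// writer is github.com/xitongsys/parquet-go/writer.
// fw is a source.ParquetFile opened for writing; MyRecord is the
// row struct; 4 is the degree of write parallelism.
pw, err := writer.NewParquetWriter(fw, new(MyRecord), 4)
if err != nil {
	return err
}
pw.RowGroupSize = 128 * 1024 * 1024 // 128MB row groups
pw.PageSize = 8 * 1024              // 8KB pages, per the format docs
```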
Honestly, the thing that bothered me about the old behavior was not necessarily reading the whole row group (although again, in extreme cases that can be problematic) so much as the extra memory use in between reading row groups.
The underlying ThriftReader is allocated with enough memory for the whole RowGroup, and that memory is not freed until we read the next RowGroup (and allocate memory for that one as well). If we read a bit of data from each column, the allocations can add up to large numbers.
A hack I've started using is to manually delete the columnBuffer after reading each column, so my code looks like this:
```go
for i := 0; i < len(pr.SchemaHandler.ValueColumns); i++ {
	pathStr := pr.SchemaHandler.ValueColumns[i]
	values, _, _, err := pr.ReadColumnByPath(pathStr, SAMPLE_SIZE)
	if err != nil {
		return err
	}
	_ = values // sample the values here
	// A bit of a hack: manually delete the columnBuffer
	// so that the GC can free it. Otherwise we end up
	// keeping a huge amount of data in memory until we
	// finish parsing the file.
	delete(pr.ColumnBuffers, pathStr)
}
```
This keeps memory pressure down enough that I would be tentatively in favor of reverting this PR, for performance and to keep S3 costs down.
On large parquet files these buffers allocated hundreds of MB and caused OOMs on systems with limited memory (1GB). Decreasing the buffer size doesn't have a huge impact on performance and lets us parse larger files.