xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Appropriately size buffer when reading footer #460

Closed zolstein closed 2 years ago

zolstein commented 2 years ago

Using the default buffer size (4K) when reading the footer causes many small reads rather than one read of the correct size. This is especially bad with a data source like S3, which charges per request regardless of how much data is returned.
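To make the cost concrete, here is a rough simulation in Go. It is not parquet-go's code: `bufio.Reader` stands in for whatever buffered transport sits in front of the data source, and `countingReader` and `countUnderlyingReads` are hypothetical names invented for this sketch. The 441092-byte size is the footer size reported later in this thread. The consumer reads in small chunks, as a decoder typically would, and the sketch counts how many reads reach the underlying source.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// countingReader counts how many Read calls reach the underlying source.
// With S3, each such call would be a separate request.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// countUnderlyingReads consumes footerSize bytes in 64-byte chunks through
// a bufio.Reader with the given buffer size, and returns the number of
// reads issued against the underlying source.
func countUnderlyingReads(footerSize, bufSize int) int {
	src := &countingReader{r: bytes.NewReader(make([]byte, footerSize))}
	br := bufio.NewReaderSize(src, bufSize)

	chunk := make([]byte, 64)
	for remaining := footerSize; remaining > 0; {
		n := len(chunk)
		if remaining < n {
			n = remaining
		}
		if _, err := io.ReadFull(br, chunk[:n]); err != nil {
			panic(err)
		}
		remaining -= n
	}
	return src.calls
}

func main() {
	const footerSize = 441092 // footer size taken from the thread
	fmt.Println("4K buffer:          ", countUnderlyingReads(footerSize, 4096), "reads")
	fmt.Println("footer-sized buffer:", countUnderlyingReads(footerSize, footerSize), "reads")
}
```

Under these assumptions the 4K buffer needs ceil(441092 / 4096) = 108 reads of the source, while a buffer sized to the footer needs one.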

Ideally ReadFooter could estimate the footer size and attempt to read the footer size and the footer in a single read operation, only making a second request if it under-estimated, but doing that correctly is much harder.

hangxie commented 2 years ago

Came here to say thank you for the close to 10x performance improvement:

w/o this PR:

$ time parquet-tools size -q all -j s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
{"Raw":4632041101,"Uncompressed":14901963454,"Footer":441092}

real    0m18.169s
user    0m0.178s
sys     0m0.088s

$ time parquet-tools schema s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
...

real    0m20.906s
user    0m0.179s
sys     0m0.085s

w/ this PR:

$ time ./build/parquet-tools size -q all -j s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
{"Raw":4632041101,"Uncompressed":14901963454,"Footer":441092}

real    0m2.376s
user    0m0.086s
sys     0m0.041s

$ time ./build/parquet-tools schema s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
...

real    0m2.230s
user    0m0.105s
sys     0m0.057s