zolstein closed 2 years ago
Came here to say thank you for the close to 10x performance improvement:
w/o this PR:
$ time parquet-tools size -q all -j s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
{"Raw":4632041101,"Uncompressed":14901963454,"Footer":441092}
real 0m18.169s
user 0m0.178s
sys 0m0.088s
$ time parquet-tools schema s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
...
real 0m20.906s
user 0m0.179s
sys 0m0.085s
w/ this PR:
$ time ./build/parquet-tools size -q all -j s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
{"Raw":4632041101,"Uncompressed":14901963454,"Footer":441092}
real 0m2.376s
user 0m0.086s
sys 0m0.041s
$ time ./build/parquet-tools schema s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
...
real 0m2.230s
user 0m0.105s
sys 0m0.057s
Using the default buffer size (4K) when reading the footer causes many small reads rather than a single read of the correct size. This is especially costly with a data source like S3, which charges per request regardless of how much data is returned.
Ideally, ReadFooter would estimate the footer size and try to read both the footer length and the footer itself in a single read operation, making a second request only if it under-estimated. Doing that correctly is much harder, though.