Open Zarquan opened 3 years ago
I'm not sure I understand the purpose of benchmarking csv to parquet conversion. It's working fast enough for administrative purposes (rare, one-off, bulk catalogue imports) and is not something we expect users to be interested in. Optimizing the system configuration for something that is of no interest to our end users seems unjustifiable in the face of all the other things that need doing. Moreover, it's fraught with blind alleys and booby-traps in such a complex system. If we're worried about performance, we'd be better off benchmarking real-world science workflows, IMHO, because those directly impact the end-user experience. I suggest lowering the priority of this ticket to "very low priority, would be nice to know and understand" or perhaps even closing the ticket.
I agree we should be using real-world science workflows, but we don't have many of them yet. The aim of this task was to use a workflow that we already know well in order to develop the test infrastructure, stress test the system, and find weaknesses in it. It is definitely not to optimise the system for the data import task.
Use the 2MASS csv->parquet conversion as a basis for benchmarking the performance of different configurations.
This is a top level task that can be broken down into several smaller steps:
Install a copy of the 2MASS csv files on a server we own (so we don't spam IPAC with our tests).
Package the 2MASS csv->parquet conversion as a notebook that generates test metrics
Use this as an example to develop some data set validation checks
Run the test notebook on a range of different configurations, logging the performance on each
Identify the differences in performance and figure out what is causing them
Note - this task isn't about increasing performance; it's about using the benchmark tests to understand how the system works.