Open Zarquan opened 3 years ago
I'm not sure I understand the purpose of benchmarking csv to parquet conversion. It's working fast enough for administrative purposes (rare, one-off, bulk catalogue imports) and is not something we expect users to be interested in. Optimizing the system configuration for something that is of no interest to our end users seems unjustifiable in the face of all the other things that need doing. Moreover, it's fraught with blind alleys and booby-traps in such a complex system. If we're worried about performance, we'd be better off benchmarking real-world science workflows, IMHO, because those directly impact the end-user experience. I suggest lowering the priority of this ticket to "very low priority, would be nice to know and understand" or perhaps even closing the ticket.
I agree we should be using real-world science workflows, but we don't have many of them yet. The aim of this task was to use a workflow that we already know well in order to develop the test infrastructure, stress test the system, and find weaknesses in it. It is definitely not to optimise the system for the data import task.
Use the 2MASS csv->parquet conversion as a basis for benchmarking the performance of different configurations.
This is a top level task that can be broken down into several smaller steps:
Install a copy of the 2MASS csv files on a server we own (so we don't spam IPAC with our tests).
Package the 2MASS csv->parquet conversion as a notebook that generates test metrics
Use this as an example to develop some data set validation checks
Run the test notebook on a range of different configurations, logging the performance on each
Identify the differences in performance and figure out what is causing them
Note - this task isn't about increasing performance; it's about using the benchmark tests to understand how the system works.