wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Benchmark 2MASS csv->parquet conversion #259

Open Zarquan opened 3 years ago

Zarquan commented 3 years ago

Use the 2MASS csv->parquet conversion as a basis for benchmarking the performance of different configurations.

This is a top-level task that can be broken down into several smaller steps:

Note: this task isn't about increasing performance; it is about using the benchmark tests to understand how the system works.
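
As a rough illustration of the kind of benchmark harness this could start from, here is a minimal sketch in Python using only the standard library. The names `benchmark` and `compare` are hypothetical and not part of the gaia-dmp code. Each configuration is represented by a placeholder callable that would, in practice, launch the 2MASS csv->parquet conversion under that configuration.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(step: Callable[[], None], repeats: int = 3) -> Dict[str, float]:
    """Run `step` `repeats` times and summarise the wall-clock durations in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        step()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def compare(
    configs: Dict[str, Callable[[], None]], repeats: int = 3
) -> Dict[str, Dict[str, float]]:
    """Benchmark every named configuration with the same number of repeats."""
    return {name: benchmark(step, repeats) for name, step in configs.items()}
```

Calling `compare({"config-a": run_a, "config-b": run_b})`, where `run_a` and `run_b` are assumed wrappers around the conversion job, would return one timing summary per configuration. Reporting min, median and max rather than a single number makes run-to-run variation visible, which matters more here than the absolute speed.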

NigelHambly commented 3 years ago

I'm not sure I understand the purpose of benchmarking csv to parquet conversion. It's working fast enough for administrative purposes (rare, one-off, bulk catalogue imports) and is not something we expect users to be interested in. Optimizing the system configuration for something that is of no interest to our end users seems unjustifiable in the face of all the other things that need doing. Moreover, it's fraught with blind alleys and booby-traps in such a complex system. If we're worried about performance, we'd do better to benchmark real-world science workflows IMHO, because those directly affect the end-user experience. I suggest lowering the priority of this ticket to "very low priority, would be nice to know and understand" or perhaps even closing the ticket.

Zarquan commented 3 years ago

I agree we should be using real-world science workflows, but we don't have many of them yet. The aim of this task was to use a workflow that we know to develop the test infrastructure, stress test the system and find weaknesses in it. It is definitely not to optimise the system for the data import task.