What to measure during benchmarking?

The plan is to implement a benchmarking tool which automatically runs a suite of "Zarr workloads" across a range of compute platforms, storage media, chunk sizes, and Zarr implementations.

What would we like to measure for each workload?

Existing benchmarking tools only measure the runtime of each workload. That doesn't feel sufficient for Zarr because one of our main questions during benchmarking is whether the Zarr implementation is able to saturate the IO subsystem, and how much CPU and RAM is required to saturate the IO.

I'd propose that it'd be great to measure these parameters each time each workload is run:

Total execution time of the workload
Total bytes read / written for disk / network
Total IO operations
Total bytes in final numpy array
Average CPU utilization (per CPU).
Max RAM usage during the execution of the workload
CPU cache hit ratio

(Each run would also capture a bunch of metadata about the environment such as the compute environment, storage media, chunk sizes, Zarr implementation name and version, etc.)

I had previously gotten over-excited and starting thinking about capturing a full "trace" during the execution of each workload, e.g. capturing a timeseries of the IO utilization every 100 milliseconds. This might be useful, but makes the benchmarking code rather more complex, and maybe doesn't tell us much more than the "totals per workload" tell us. And some benchmark workloads might run for less than 100 ms. And psutil's documentation states that some of its counters aren't reliable when polled more frequently than 10 times a second.

What do you folks think? Do we need to record a full "trace" during each workload? Or is it sufficient to just capture totals per workload? Are there any changes you'd make to the list of parameters I proposed above?

zarr-developers / perfcapture

What to measure during benchmarking? #15