Very much to my surprise, I have yet to find any tools which record performance stats at regular time intervals as each workload is run. (Which perhaps isn't surprising, because most existing benchmarking tools assume the workload will take a tiny fraction of a second.)
I'm gonna put a little effort into exploring whether we could roll our own. Here's the draft design doc.
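For example, a minimal sketch of the idea (not a real implementation; the sampling interval, the metric set, and the placeholder workload are all just illustrative) could use psutil in a background thread:

```python
import threading
import time

import psutil


def sample_stats(stop_event, samples, interval_secs=0.1):
    """Append a dict of performance stats to `samples` every `interval_secs`."""
    proc = psutil.Process()
    proc.cpu_percent()  # First call establishes a baseline; later calls return a percentage.
    while not stop_event.is_set():
        disk = psutil.disk_io_counters()
        samples.append(
            {
                "timestamp": time.time(),
                "cpu_percent": proc.cpu_percent(),
                "rss_bytes": proc.memory_info().rss,
                "disk_read_bytes": disk.read_bytes,
                "disk_write_bytes": disk.write_bytes,
            }
        )
        time.sleep(interval_secs)


def run_workload_with_sampling(workload_fn):
    """Run `workload_fn` while sampling performance stats in a background thread."""
    samples = []
    stop_event = threading.Event()
    sampler = threading.Thread(target=sample_stats, args=(stop_event, samples))
    sampler.start()
    try:
        workload_fn()  # Placeholder: e.g. read a Zarr array from disk or object storage.
    finally:
        stop_event.set()
        sampler.join()
    return samples


if __name__ == "__main__":
    stats = run_workload_with_sampling(lambda: time.sleep(1))
    print(f"Captured {len(stats)} samples")
```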
I've started implementing a general-purpose IO-centric benchmarking tool: perfcapture.
I'm gonna mark this as "done" now because we're using perfcapture. Please shout if you have concerns!
List of benchmarking tools, as shared in today's meeting by @jakirkham:
Thanks!
In my limited understanding, there's a distinction to be made between benchmarking tools versus profiling tools. (Although I could be wrong!)
My understanding is that benchmarking tools (like perfcapture, pytest-benchmark, and ASV) provide a framework for defining and running a set of workloads. Benchmarking tools tend to measure only a few very basic metrics for each workload (often just the total runtime). Often, the aim is to detect performance regressions. I wrote perfcapture as a simple framework to allow us to define multiple workloads, where each workload can be associated with any number of datasets (because I couldn't find an existing benchmarking tool which would allow us to associate datasets with each workload).
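To illustrate the "runtime only" style of measurement, here's roughly what a pytest-benchmark test looks like (the workload function here is a placeholder, not a real Zarr call):

```python
import numpy as np


def decompress_and_sum():
    # Placeholder workload, standing in for e.g. reading and decoding a Zarr chunk.
    return np.random.default_rng(0).random((1_000, 1_000)).sum()


def test_decompress_and_sum(benchmark):
    # pytest-benchmark's `benchmark` fixture runs the function many times and
    # reports timing statistics (min / max / mean / stddev), i.e. runtime only.
    result = benchmark(decompress_and_sum)
    assert result > 0
```

Running `pytest` then prints a table of timing statistics, one row per benchmark.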
Profiling tools, in contrast, tend not to provide a "test harness" for defining and running workloads. Instead, they measure the behaviour of any given process, often in minute detail (CPU cache hits, memory bandwidth, IO bandwidth, etc.).
I'll copy-and-paste @MSanKeys963's wonderful list into a new issue, to remind me to try at least one of the profiling tools.
We'd like to benchmark the performance of existing Zarr implementations, starting with Zarr-Python.
We've identified 5 benchmarking frameworks:
My current sense is that none of these packages are a perfect fit. The task is to decide if any of these packages fit well enough to be useful. I plan to more rigorously compare these 5 packages against our requirements.
The first 4 benchmarking frameworks are very focused on measuring the execution time of CPU-bound tasks and detecting performance regressions. They answer questions like "does the latest release of the code reduce the runtime of the matrix multiplication function?". None of the first 4 benchmarking frameworks are particularly interested in IO behaviour, or in comparing the behaviour of different projects.
I'm probably biased, but I'd like our benchmarks to help answer questions like:
I'm almost certain that none of the 5 benchmarking frameworks listed above can help us answer these questions. So I'm wondering if we might be better off rolling our own benchmarking framework. (Which shouldn't be too hard: psutil can measure utilisation of CPU(s), IO, etc.; we could persist the benchmark results as JSON; and we could use something like streamlit to build a web UI.)

Or maybe I'm getting over-excited and we should just use an existing benchmarking framework and be happy with measuring execution time :slightly_smiling_face:. We almost certainly don't want to try to answer all my questions for every PR! Maybe the automated benchmark suite should just measure execution time, and then we can do one-off, detailed, manual analyses to answer the questions above. Although I do think it could be extremely powerful to be able to share detailed, interactive analysis of Zarr's performance across a wide range of compute platforms, storage media, and Zarr implementations :slightly_smiling_face:. And some of the questions above can only be answered after collecting a lot of performance data, so it might be nice to at least collect (but not analyse) lots of data on every benchmark run.
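To make the psutil + JSON + streamlit idea a bit more concrete, here's a rough sketch (the file name and column names are assumptions; the samples would come from whatever collection code we end up writing):

```python
import json

import pandas as pd
import streamlit as st

# "benchmark_samples.json" is a hypothetical output file: a JSON list of
# per-interval sample dicts (timestamp, cpu_percent, disk_read_bytes, ...).
with open("benchmark_samples.json") as f:
    samples = pd.DataFrame(json.load(f))

samples["elapsed_secs"] = samples["timestamp"] - samples["timestamp"].min()
samples = samples.set_index("elapsed_secs")

st.title("Per-interval performance stats for one workload")
st.line_chart(samples[["cpu_percent"]])
st.line_chart(samples[["disk_read_bytes", "disk_write_bytes"]])
```

Something like this could be launched with `streamlit run dashboard.py` to get an interactive view of a benchmark run.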