zarr-developers / zarr-benchmark

Benchmarking the performance of various Zarr implementations, using our perfcapture framework.
MIT License

Pick benchmarking tool #1

Closed JackKelly closed 1 year ago

JackKelly commented 1 year ago

We'd like to benchmark the performance of existing Zarr implementations, starting with Zarr-Python.

We've identified 5 benchmarking frameworks:

  1. airspeed velocity (also known as ASV. Hat tip to @MSanKeys963 for telling us about ASV :slightly_smiling_face:). ASV has been around since 2013. It's used by lots of Python projects such as astropy, numpy, scipy, and pandas. But it's built for one very specific use-case: tracking the performance of compute-bound code over multiple releases. Some of the lead developers of a range of highly influential Python packages have collated a long list of concerns with ASV. Their conclusion is that ASV is clunky or impossible to use for other use-cases (like checking whether local code modifications affect performance), and that the code is hard to modify.
  2. pytest-benchmark (Hat tip to @ap-- for telling us about pytest-benchmark :slightly_smiling_face:). A plugin for pytest that provides a benchmark fixture. Lots of stats. JSON export.
  3. pyperf. Used by the pyperformance project.
  4. codespeed. "a web application to monitor and analyze the performance of your code. Known to be used by CPython, PyPy, Twisted and others."
  5. conbench. "Language-independent Continuous Benchmarking (CB) Framework... The Apache Arrow project is using Conbench for Continuous Benchmarking."

My current sense is that none of these packages are a perfect fit. The task is to decide if any of these packages fit well enough to be useful. I plan to more rigorously compare these 5 packages against our requirements.

The first 4 benchmarking frameworks are very focused on measuring the execution time of CPU-bound tasks and detecting performance regressions. They answer questions like "does the latest release of the code reduce the runtime of the matrix multiplication function?". None of the first 4 benchmarking frameworks are particularly interested in IO behaviour, or in comparing the behaviour of different projects.

I'm probably biased but I'd like our benchmarks to help answer questions like:

I'm almost certain that none of the 5 benchmarking frameworks listed above can help us answer these questions. So I'm wondering if we might be better off rolling our own benchmarking framework. (Which shouldn't be too hard. psutil can measure utilisation of CPU(s), IO, etc. We could persist the benchmarks as JSON. And use something like streamlit to build a web UI.)
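To make that concrete, here's a minimal sketch (not a finished design; `benchmark` and `run_workload` are just hypothetical placeholders) of how psutil could wrap a workload, measure runtime plus CPU time and disk IO deltas, and persist the result as JSON:

```python
import json
import time

import psutil


def benchmark(run_workload, output_path="benchmark_result.json"):
    """Run one workload and record runtime, CPU time, and disk IO deltas."""
    disk_before = psutil.disk_io_counters()
    cpu_before = psutil.cpu_times()
    start = time.perf_counter()

    run_workload()  # e.g. read a chunked Zarr array from disk

    runtime_secs = time.perf_counter() - start
    disk_after = psutil.disk_io_counters()
    cpu_after = psutil.cpu_times()

    result = {
        "runtime_secs": runtime_secs,
        "disk_read_bytes": disk_after.read_bytes - disk_before.read_bytes,
        "disk_write_bytes": disk_after.write_bytes - disk_before.write_bytes,
        "cpu_user_secs": cpu_after.user - cpu_before.user,
        "cpu_system_secs": cpu_after.system - cpu_before.system,
    }
    with open(output_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```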

Or maybe I'm getting over-excited and we should just use an existing benchmarking framework and be happy with just measuring execution time :slightly_smiling_face:. We almost certainly don't want to try to answer all my questions for every PR! Maybe the automated benchmark suite should just measure execution time. And then we can do one-off, detailed, manual analyses to answer the questions above. Although I do think it could be extremely powerful to be able to share detailed, interactive analysis of Zarr's performance across a wide range of compute platforms, storage media, and Zarr implementations :slightly_smiling_face:. And some of the questions above can only be answered after collecting a lot of performance data. So it might be nice to at least collect (but not analyse) lots of data on every benchmark run.

JackKelly commented 1 year ago

Very much to my surprise, I'm yet to find any tools which record performance stats at regular time intervals as each workload is run (which perhaps isn't surprising, because most existing benchmarking tools assume the workload will take a tiny fraction of a second).
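Roughly the kind of thing I mean (names like `sample_while_running` are just illustrative, not an existing API): a background thread records psutil counters at a fixed interval while the workload runs.

```python
import threading
import time

import psutil


def sample_while_running(run_workload, interval_secs=0.1):
    """Record system-wide CPU and disk IO counters every `interval_secs`
    while `run_workload()` executes, and return the list of samples."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append({
                "timestamp": time.time(),
                # cpu_percent(interval=None) reports usage since the last call.
                "cpu_percent": psutil.cpu_percent(interval=None),
                "disk_io": psutil.disk_io_counters()._asdict(),
            })
            stop.wait(interval_secs)

    sampler_thread = threading.Thread(target=sampler, daemon=True)
    sampler_thread.start()
    try:
        run_workload()
    finally:
        stop.set()
        sampler_thread.join()
    return samples
```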

I'm gonna put a little effort into exploring whether we could roll our own. Here's the draft design doc.

JackKelly commented 1 year ago

I've started implementing a general-purpose IO-centric benchmarking tool: perfcapture.

JackKelly commented 1 year ago

I'm gonna mark this as "done" now because we're using perfcapture. Please shout if you have concerns!

MSanKeys963 commented 1 year ago

List of benchmarking tools as shared in today's meeting by @jakirkham:

JackKelly commented 1 year ago

Thanks!

In my limited understanding, there's a distinction to be made between benchmarking tools and profiling tools. (Although I could be wrong!)

My understanding is that benchmarking tools (like perfcapture, pytest-benchmark, and ASV) provide a framework for defining and running a set of workloads. Benchmarking tools tend to measure only a few very basic metrics for each workload (often just the total runtime). Often, the aim is to detect performance regressions. I wrote perfcapture as a simple framework to allow us to define multiple workloads, where each workload can be associated with any number of datasets (because I couldn't find an existing benchmarking tool which would allow us to associate datasets with each workload; see the rough sketch at the end of this comment).

Profiling tools, in contrast, tend not to provide a "test harness" for defining and running workloads. Instead, they measure the behaviour of any given process, often in minute detail (CPU cache hits, memory bandwidth, IO bandwidth, etc.).

I'll copy-and-paste @MSanKeys963's wonderful list into a new issue, to remind me to try at least one of the profiling tools.
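For illustration, here's roughly the shape of that "workload with associated datasets" pattern. This is just a hypothetical sketch, not perfcapture's actual API:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Dataset(ABC):
    """A dataset that is prepared once, ahead of any benchmark runs."""

    @abstractmethod
    def prepare(self, path: Path) -> None:
        """Create the dataset on disk (e.g. write a chunked Zarr array)."""


class Workload(ABC):
    """A benchmark workload, associated with any number of datasets."""

    datasets: list[Dataset] = []

    @abstractmethod
    def run(self, dataset_path: Path) -> None:
        """Run the workload against one prepared dataset.

        The harness times this method and records IO/CPU metrics around it.
        """
```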