zarr-developers / zarr-benchmark

Benchmarking the performance of various Zarr implementations, using our perfcapture framework.
MIT License

Pick benchmarking tool #1

Closed JackKelly closed 1 year ago

JackKelly commented 1 year ago

We'd like to benchmark the performance of existing Zarr implementations, starting with Zarr-Python.

We've identified 5 benchmarking frameworks:

  1. airspeed velocity (also known as ASV. Hat tip to @MSanKeys963 for telling us about ASV :slightly_smiling_face:). ASV has been around since 2013. It's used by lots of Python projects such as astropy, numpy, scipy, and pandas. But it's built for one very specific use-case: tracking the performance of compute-bound code over multiple releases. Some of the lead developers of a range of highly influential Python packages have collated a long list of concerns with ASV. Their conclusion is that ASV is clunky or impossible to use for other use-cases (like checking whether local code modifications affect performance), and that the code is hard to modify.
  2. pytest-benchmark (Hat tip to @ap-- for telling us about pytest-benchmark :slightly_smiling_face:). A plugin for pytest that provides a benchmark fixture. Lots of stats. JSON export.
  3. pyperf. Used by the pyperformance project.
  4. codespeed. "a web application to monitor and analyze the performance of your code. Known to be used by CPython, PyPy, Twisted and others."
  5. conbench. "Language-independent Continuous Benchmarking (CB) Framework... The Apache Arrow project is using Conbench for Continuous Benchmarking."

My current sense is that none of these packages are a perfect fit. The task is to decide if any of these packages fit well enough to be useful. I plan to more rigorously compare these 5 packages against our requirements.

The first 4 benchmarking frameworks are very focused on measuring the execution time of CPU-bound tasks and detecting performance regressions. They answer questions like "does the latest release of the code reduce the runtime of the matrix multiplication function?". None of the first 4 benchmarking frameworks are particularly interested in IO behaviour, or in comparing the behaviour of different projects.

I'm probably biased but I'd like our benchmarks to help answer questions like:

I'm almost certain that none of the 5 benchmarking frameworks listed above can help us answer these questions. So I'm wondering if we might be better off rolling our own benchmarking framework. (Which shouldn't be too hard. psutil can measure utilisation of CPU(s), IO, etc. We could persist the benchmarks as JSON. And use something like streamlit to build a web UI.)
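To make that concrete, here's a minimal sketch (not a finished design; `benchmark` and `run_workload` are just hypothetical placeholders) of how psutil could wrap a workload, measure runtime plus CPU time and disk IO deltas, and persist the result as JSON:

```python
import json
import time

import psutil


def benchmark(run_workload, output_path="benchmark_result.json"):
    """Run one workload and record runtime, CPU time, and disk IO deltas."""
    disk_before = psutil.disk_io_counters()
    cpu_before = psutil.cpu_times()
    start = time.perf_counter()

    run_workload()  # e.g. read a chunked Zarr array from disk

    runtime_secs = time.perf_counter() - start
    disk_after = psutil.disk_io_counters()
    cpu_after = psutil.cpu_times()

    result = {
        "runtime_secs": runtime_secs,
        "disk_read_bytes": disk_after.read_bytes - disk_before.read_bytes,
        "disk_write_bytes": disk_after.write_bytes - disk_before.write_bytes,
        "cpu_user_secs": cpu_after.user - cpu_before.user,
        "cpu_system_secs": cpu_after.system - cpu_before.system,
    }
    with open(output_path, "w") as f:
        json.dump(result, f, indent=2)
    return result
```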

Or maybe I'm getting over-excited and we should just use an existing benchmarking framework and be happy with just measuring execution time :slightly_smiling_face:. We almost certainly don't want to try to answer all my questions for every PR! Maybe the automated benchmark suite should just measure execution time. And then we can do one-off, detailed, manual analyses to answer the questions above. Although I do think it could be extremely powerful to be able to share detailed, interactive analysis of Zarr's performance across a wide range of compute platforms, storage media, and Zarr implementations :slightly_smiling_face:. And some of the questions above can only be answered after collecting a lot of performance data. So it might be nice to at least collect (but not analyse) lots of data on every benchmark run.

JackKelly commented 1 year ago

Very much to my surprise, I'm yet to find any tools which record performance stats at regular time intervals as each workload is run (which perhaps isn't surprising, because most existing benchmarking tools assume the workload will take a tiny fraction of a second).
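Roughly the kind of thing I mean (names like `sample_while_running` are just illustrative, not an existing API): a background thread records psutil counters at a fixed interval while the workload runs.

```python
import threading
import time

import psutil


def sample_while_running(run_workload, interval_secs=0.1):
    """Record system-wide CPU and disk IO counters every `interval_secs`
    while `run_workload()` executes, and return the list of samples."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append({
                "timestamp": time.time(),
                # cpu_percent(interval=None) reports usage since the last call.
                "cpu_percent": psutil.cpu_percent(interval=None),
                "disk_io": psutil.disk_io_counters()._asdict(),
            })
            stop.wait(interval_secs)

    sampler_thread = threading.Thread(target=sampler, daemon=True)
    sampler_thread.start()
    try:
        run_workload()
    finally:
        stop.set()
        sampler_thread.join()
    return samples
```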

I'm gonna put a little effort into exploring whether we could roll our own. Here's the draft design doc.

JackKelly commented 1 year ago

I've started implementing a general-purpose IO-centric benchmarking tool: perfcapture.

JackKelly commented 1 year ago

I'm gonna mark this as "done" now because we're using perfcapture. Please shout if you have concerns!

MSanKeys963 commented 1 year ago

List of benchmarking tools as shared in today's meeting by @jakirkham:

JackKelly commented 1 year ago

Thanks!

In my limited understanding, there's a distinction to be made between benchmarking tools and profiling tools. (Although I could be wrong!)

My understanding is that benchmarking tools (like perfcapture, pytest-benchmark, and ASV) provide a framework for defining and running a set of workloads. Benchmarking tools tend to measure only a few very basic metrics for each workload (often just the total runtime). Often, the aim is to detect performance regressions. I wrote perfcapture as a simple framework to allow us to define multiple workloads, where each workload can be associated with any number of datasets (because I couldn't find an existing benchmarking tool which would allow us to associate datasets with each workload; see the rough sketch at the end of this comment).

Profiling tools, in contrast, tend not to provide a "test harness" for defining and running workloads. Instead, they measure the behaviour of any given process, often in minute detail (CPU cache hits, memory bandwidth, IO bandwidth, etc.).

I'll copy-and-paste @MSanKeys963's wonderful list into a new issue, to remind me to try at least one of the profiling tools.
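For illustration, here's roughly the shape of that "workload with associated datasets" pattern. This is just a hypothetical sketch, not perfcapture's actual API:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Dataset(ABC):
    """A dataset that is prepared once, ahead of any benchmark runs."""

    @abstractmethod
    def prepare(self, path: Path) -> None:
        """Create the dataset on disk (e.g. write a chunked Zarr array)."""


class Workload(ABC):
    """A benchmark workload, associated with any number of datasets."""

    datasets: list[Dataset] = []

    @abstractmethod
    def run(self, dataset_path: Path) -> None:
        """Run the workload against one prepared dataset.

        The harness times this method and records IO/CPU metrics around it.
        """
```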