zarr-developers / zarr-paper

Repository for developing an initial article describing Zarr for peer-reviewed publication

Benchmarking #1

Open alimanfoo opened 5 years ago

alimanfoo commented 5 years ago

Following on from conversations with @rabernat, it could be beneficial and interesting to a relatively broad audience if the zarr paper included some benchmark results. These could include benchmarks based on use cases from more than one scientific domain, as well as benchmarks run on different types of distributed computing systems, to present performance characteristics in a range of settings.

I'm opening this issue as a place to discuss this idea, which benchmarks might be included, how they might be set up and run, what we might aim to compare, etc.

alimanfoo commented 5 years ago

From my scientific domain of genomics, there are a number of simple benchmarks that could demonstrate the utility of zarr (coupled with a distributed/parallel execution engine like dask) for the kinds of interactive, exploratory population genomic analyses that are regularly performed on a variety of organisms. These benchmarks could be relatively simple, but at least demonstrate some nice scaling properties as the size of the data grows.

I've been collaborating with @ebegoli and @eauel and colleagues at ORNL, who have been developing a package for running some simple but informative benchmarks on different computing environments. @ebegoli and @eauel may have their own publication plans for that work which we wouldn't want to clash with, but I'm raising it here to illustrate the type of benchmark that might be relevant to a paper on zarr.

alimanfoo commented 5 years ago

Just to elaborate a little on the potential genomics benchmarks: we've worked on two so far.

The first is a very simple data summarisation problem, where the input data is an array of genotype calls and the result is a row-wise summarisation of that data. It's a nice benchmark for zarr because the computation is very light, so it is typically bottlenecked on I/O, meaning you see something close to the I/O scaling potential of the stack being used.
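
To make this concrete, here is a minimal sketch of the kind of computation involved, using zarr together with dask. The store path, array name, layout and chunking below are placeholders rather than the actual benchmark code:

```python
import zarr
import dask.array as da

# Hypothetical zarr store of genotype calls with shape
# (variants, samples, ploidy); path and array name are assumptions.
calls = zarr.open("data/genotypes.zarr", mode="r")["calldata/GT"]

# Wrap the zarr array in dask, aligning dask chunks with zarr chunks
# so each task reads whole chunks from storage.
gt = da.from_array(calls, chunks=calls.chunks)

# Row-wise summarisation: count non-missing genotype calls per variant
# (missing calls are conventionally encoded as negative values).
# The arithmetic is trivial, so wall time is dominated by I/O.
n_called = (gt >= 0).all(axis=2).sum(axis=1)

result = n_called.compute()  # triggers the parallel read and reduction
```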

The second benchmark is a principal components analysis, which involves some data pre-processing followed by a singular value decomposition (SVD). Currently that benchmark uses the SVD implementation from numpy, and so requires data to be brought into memory, but it could potentially be modified to use dask's SVD implementation, in which case it would run as a distributed, out-of-core computation. That would provide an interesting illustration of how zarr can be combined with dask to get good scaling properties on a non-trivial linear algebra computation.
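
A rough sketch of what the dask-based variant might look like, reusing the genotype array `gt` from the previous sketch; the pre-processing and the number of components here are simplified placeholders, not the real pipeline:

```python
import dask.array as da

# Convert genotype calls to an alternate allele dosage matrix
# (variants x samples) and centre each variant across samples.
alt_count = (gt > 0).sum(axis=2).astype("f4")
centred = alt_count - alt_count.mean(axis=1, keepdims=True)

# Distributed, out-of-core approximate SVD. svd_compressed computes a
# truncated decomposition without materialising the matrix in memory;
# k=10 components is an arbitrary choice for illustration.
u, s, v = da.linalg.svd_compressed(centred.T, k=10)

# Project samples onto the leading principal components.
pcs = (u * s).compute()
```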

alimanfoo commented 5 years ago

One more comment on the genomics benchmarks: it could be interesting to run them both on an HPC-style system with data on a parallel file system and on Google Cloud with data on object storage, to provide a comparison between the two types of distributed system.
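
The nice thing is that only the store would need to change between the two environments; the benchmark code itself stays the same. Something along these lines, with the local path and bucket name as placeholders:

```python
import zarr
import gcsfs

# HPC: data on a parallel file system, opened directly from a path.
store_fs = zarr.DirectoryStore("/scratch/project/genotypes.zarr")

# Google Cloud: the same data in an object storage bucket, accessed
# through a gcsfs mapping.
gcs = gcsfs.GCSFileSystem(token="anon")
store_gcs = gcs.get_mapper("my-bucket/genotypes.zarr")

# Everything downstream is identical; only the store differs.
root = zarr.open_group(store_gcs, mode="r")
```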

alimanfoo commented 5 years ago

cc @mrocklin

rabernat commented 5 years ago

For now I'll post a link to my current MPI benchmarking effort: https://github.com/rabernat/zarr_hdf_benchmarks

I have a simple command-line utility: https://github.com/rabernat/zarr_hdf_benchmarks/blob/master/parallel_read_write.py

You can run it like this: https://github.com/rabernat/zarr_hdf_benchmarks/blob/master/PBS_run_script_cheyenne.sh

Will write more soon.