zarr-developers / community

An open community with an interest in developing and using new technologies for tensor data storage.

Projects using Zarr #19

Open alimanfoo opened 6 years ago

alimanfoo commented 6 years ago

If you are using Zarr for whatever purpose, we would love to hear about it. If you can, please leave a couple of sentences in a comment below describing who you are and what you are using Zarr for. This information helps us to get a sense of the different use cases that are important for Zarr users, and helps core developers to justify the time they spend on the project.

alimanfoo commented 6 years ago

Zarr is used by MalariaGEN within several large-scale collaborative scientific projects related to the genetic study of malaria. For example, Zarr is used by the Anopheles gambiae 1000 Genomes project to store and analyse data on genotype calls derived from whole genome sequencing of malaria mosquitoes, which led to this publication.

jhamman commented 6 years ago

Xarray has recently (https://github.com/pydata/xarray/pull/1528) developed a Zarr backend for reading and writing datasets to local and remote stores. In the Xarray use case, we hope to use Zarr to interface with Dask to support parallel reads/writes from cloud-based datastores. We hope to find that this approach outperforms traditional file-based workflows that use HDF5/NetCDF (https://github.com/pangeo-data/pangeo/issues/48). Xarray itself is used in a wide variety of fields, from climate science to finance.

@rabernat gets 99% of the credit for the zarr implementation in xarray.
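
A minimal sketch of the round trip this backend enables (the dataset contents and store path are invented for illustration):

```python
import numpy as np
import xarray as xr

# build a small in-memory dataset (contents are arbitrary)
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(4, 3, 2))},
    coords={"time": np.arange(4)},
)

ds.to_zarr("example.zarr", mode="w")      # write to a local Zarr store
roundtrip = xr.open_zarr("example.zarr")  # lazy, Dask-backed read
```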

alimanfoo commented 6 years ago

Thanks @jhamman.

MattKleinsmith commented 6 years ago

I'm participating in a Kaggle competition where the goal is to classify images by the camera models that took them, which is useful for forensics; for example, determining whether an image is a splice of multiple images.

I'm using bcolz right now (carrays only), but when I googled whether bcolz can handle simultaneous reads from two processes, one read per process, I came across Zarr. In a Dask issue in 2016, @alimanfoo said that bcolz was not thread-safe. If this is also true of processes, then I'll likely switch to Zarr.

I'm a beginner when it comes to concurrency and parallelism (which is why I used the term "simultaneous" above). Could someone tell me why thread-safety and process-safety come into play even when only reads will occur? Does it have something to do with decompression? Even the name of a concept I could google would be helpful. Thank you.

alimanfoo commented 6 years ago

Hi Matt, thanks for posting. AFAIK it should be safe to read a bcolz carray from multiple processes. IIRC previously some issues were encountered when reading a bcolz carray from multiple threads, which were related to the interface with the c-blosc library which handles decompression. These may have since been resolved, in which case it may also now be safe to read a bcolz carray from multiple threads, but I cannot say for sure. If you only need an array class (and not a table class) then you may be better off switching to zarr, as zarr has been designed for concurrency and provides some additional storage and chunking options. Hth.
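
For illustration, a minimal sketch of lock-free concurrent reads from a zarr array across processes (the shape, chunking, and path are invented); because each chunk is stored and decompressed independently, read-only workers need no coordination:

```python
import multiprocessing as mp
import numpy as np
import zarr

def read_block(i):
    # each worker opens the store independently and reads one row of chunks
    z = zarr.open("example.zarr", mode="r")
    return z[i * 100:(i + 1) * 100, :].sum()

if __name__ == "__main__":
    z = zarr.open("example.zarr", mode="w", shape=(1000, 1000),
                  chunks=(100, 100), dtype="f8")
    z[:] = np.random.rand(1000, 1000)
    with mp.Pool(4) as pool:
        print(sum(pool.map(read_block, range(10))))
```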

NickMortimer commented 6 years ago

Hi, I'm currently looking at processing Argo float data in the cloud. Argo float basic data comes as single NetCDF files; you could read that data into a big database, but I'm trying to come up with a database-less solution so that the current file processing systems see no change.

@jhamman I like zarr and find it much easier to understand than HDF5, and more flexible. Where I am struggling a little is with xarray: I tried just sending each cast to zarr with xarray's to_zarr, but this resulted in thousands of small files. Because of the filesystem's fixed cluster size, this resulted in poor disk utilisation (on Windows), since each attribute ended up in its own directory with a single float in a 1 KB file.

Instead I've been storing an xarray object as a pickle. I know this could cause problems (security, etc.), but it is working very nicely. I create large arrays of profiles (pickled objects) and it works great as a cache.
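
(For reference, one workaround for the many-small-files problem is to concatenate the casts along a new dimension and write a single store with large chunks, so many profiles share one chunk file. A minimal sketch, assuming the casts open cleanly with xarray; the paths, dimension name, and chunk size are invented:)

```python
import glob
import xarray as xr

# one small NetCDF file per cast (paths are illustrative)
casts = [xr.open_dataset(f) for f in sorted(glob.glob("casts/*.nc"))]

# stack casts along a new "profile" dimension with large chunks, so
# many profiles share one chunk file instead of one tiny file each
combined = xr.concat(casts, dim="profile").chunk({"profile": 1000})
combined.to_zarr("argo_profiles.zarr", mode="w")
```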

jhamman commented 6 years ago

@NickMortimer - if there are some xarray/zarr specific issues that would help your use case, I think we'd be keen to engage on the xarray issue tracker.

NickMortimer commented 6 years ago

@jhamman OK, I'll go there. I think it's mainly due to the file that I'm trying to process: there are lots of single-value attributes that get stored in their own files. I will talk more over at xarray.

jrbourbeau commented 4 years ago

Dask array has to_zarr/from_zarr functionality for interfacing with zarr
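
A quick sketch of that round trip (the store path is arbitrary):

```python
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
da.to_zarr(x, "dask_example.zarr")     # one Zarr chunk per Dask chunk
y = da.from_zarr("dask_example.zarr")  # lazy view over the same store
```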

> helps core developers to justify the time they spend on the project

This is a good point. It could be useful to have a webpage to point to that lists projects using zarr. For example, xarray has an "Xarray related projects" page in their Sphinx docs and dask has a "Powered by Dask" section on https://dask.org/. Perhaps zarr could have something similar.

QueensGambit commented 4 years ago

Hello @alimanfoo, the Zarr library is used to compress the training data for CrazyAra, a neural network that learns to play chess variants.

In 2019, Zarr was also used to compress the data generated in self-play games in a reinforcement learning setup. Here, Zarr was combined with the z5 library to export data in C++ and subsequently load it in Python for training. I would like to cite this awesome library in my master's thesis. Is there a preferred way of doing so? Otherwise I would rely on the default way of citing GitHub repositories:

@misc{miles2019zarr,
    title = {zarr-developers/zarr-python},
    author = {Miles, Alistair},
    copyright = {MIT},
    url = {https://github.com/zarr-developers/zarr-python},
    abstract = {An implementation of chunked, compressed, N-dimensional arrays for Python.},
    urldate = {2019-12-19},
    publisher = {Zarr Developers},
    month = dec,
    year = {2019},
    note = {original-date: 2015-12-15T14:49:40Z}
}

tiagoantao commented 4 years ago

I am writing a book for Manning, tentatively called "High performance Python for data analytics", and zarr is a big part of it, especially because a good chunk of the book is about persistence efficiency.

amcnicho commented 4 years ago

Working on a project that is currently exploring the adoption of a stack built around Dask and Zarr (interface via Xarray). This is envisioned as a replacement for a legacy persistent data model underlying the flagging, calibration, synthesis, visualization, and analysis of astronomical data, especially from radio interferometers (e.g., ALMA).

@alimanfoo, does the Zarr project have a preferred method of citation in academic publications?

alimanfoo commented 4 years ago

Just to say thanks everyone for adding projects here, very much appreciated.

Re citation, we don't have a preferred method, your suggestion @QueensGambit sounds good in the interim. Writing a short paper about zarr is very much on the wish list.

benbovy commented 4 years ago

I've recently added saving simulation data as Zarr datasets in xarray-simlab.

https://xarray-simlab.readthedocs.io/en/latest/io_storage.html#using-zarr

Everything went smoothly when working on this. Thanks @alimanfoo and all Zarr developers! (Also thanks @rabernat, @jhamman and others for Zarr integration with Xarray).

I'm now working on running batches of simulations in parallel using Dask and saving all that data in the same Zarr datasets (along a batch dimension), but I'm struggling with a few things, so I might ask for some help.
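
(For reference, a minimal sketch of one way to do this with the region= argument of xarray's to_zarr: lay out the full store once, then let each batch write only its own slab. The names, shapes, and chunking here are invented, not from xarray-simlab:)

```python
import numpy as np
import xarray as xr

n_batch, n_time = 8, 100

# write the store's metadata and layout once, deferring the data
template = xr.Dataset(
    {"out": (("batch", "time"), np.zeros((n_batch, n_time)))}
).chunk({"batch": 1})
template.to_zarr("batches.zarr", mode="w", compute=False)

def run_and_write(b):
    # each simulation writes only its own slab along the batch dimension
    result = xr.Dataset({"out": (("batch", "time"), np.random.rand(1, n_time))})
    result.to_zarr("batches.zarr", mode="r+", region={"batch": slice(b, b + 1)})

for b in range(n_batch):  # in practice these could run in parallel via Dask
    run_and_write(b)
```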

joshmoore commented 4 years ago

https://google.github.io/tensorstore/tensorstore/driver/index.html#chunked-storage-drivers

A TensorStore is an asynchronous view of a multi-dimensional array. Every TensorStore is backed by a driver, which connects the high-level TensorStore interface to an underlying data storage mechanism. Using an appropriate driver, a TensorStore may be used to access:

  • chunked storage formats like zarr, N5, and Neuroglancer precomputed, backed by a supported key-value storage system such as Google Cloud Storage or local and network filesystems
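
A minimal sketch of opening a new zarr-format array through TensorStore (the path and metadata here are invented):

```python
import numpy as np
import tensorstore as ts

# create (or overwrite) a zarr array backed by the local filesystem
arr = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "ts_example.zarr"},
        "metadata": {"shape": [100, 100], "chunks": [10, 10], "dtype": "<f8"},
    },
    create=True,
    delete_existing=True,
).result()

arr[0, :10].write(np.ones(10)).result()  # writes are asynchronous futures
print(arr[0, :10].read().result())
```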

olly-writes-code commented 4 years ago

I'm running parallel simulations using dask distributed. I'm using Zarr to create and persist input data to disk for fast reading for each simulation. The data is too large to pass between the parallel processes. I was using Feather before but I needed more complex data structures. Switching to Zarr improved simulation speed by roughly 30%. Thanks for the work on this ❤️

gzuidhof commented 4 years ago

Lyft Level 5 just released a dataset in zarr format; it consists of over 1000 hours of driving data and observations. Along with it, they have open-sourced a codebase relating to the prediction and planning task in autonomous vehicles. Zarr is a great fit for ML problems!

jakirkham commented 4 years ago

Do they happen to have a tweet? May be worth retweeting

gzuidhof commented 4 years ago

https://twitter.com/LyftLevel5

alimanfoo commented 4 years ago

Thanks @gzuidhof, tweeted from zarr_dev here: https://twitter.com/zarr_dev/status/1277515272270815232

jakirkham commented 4 years ago

Thanks both 🙂

ericgyounkin commented 3 years ago

Hi, I am building a distributed multibeam sonar processing software suite using dask/xarray/zarr. Big fan of zarr! The format is so easy to work with. Really appreciate the work of this community.

Kluster

PaulJWright commented 3 years ago

We are going to be hosting the Solar Dynamics Observatory (SDO) Machine Learning dataset (https://iopscience.iop.org/article/10.3847/1538-4365/ab1005) on a public-facing Google Cloud bucket, stored in Zarr format! Appreciate the work here, this is fantastic!

parashardhapola commented 3 years ago

Hi @alimanfoo and Zarr developers,

First of all, thank you for developing, maintaining and consistently improving this incredible software.

We have created a memory-efficient single-cell genomics data processing toolkit, called Scarf, using Zarr. There is a preprint describing Scarf here. A tweet introducing Scarf can be found here.

We chose Zarr over HDF5 because:

RichardScottOZ commented 3 years ago

Using it to enable country scale machine learning geology modelling from Sentinel data.

jhamman commented 2 years ago

We're using Zarr for a new visualization application, rendering chunks of data directly from cloud object storage using WebGL and Mapbox. A full writeup of our approach is here: https://carbonplan.org/blog/maps-library-release

thomcom commented 2 years ago

Hi there! I wanted to let you know that I'm working on a Python wrapper for NVIDIA's nvcomp that will implement the numcodecs API and hopefully work seamlessly with zarr. We're investigating using Zarr with cudf and dask.
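
For context, a rough sketch of the numcodecs Codec interface such a wrapper implements; this toy pass-through codec is a stand-in, not the actual nvcomp bindings:

```python
import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy

class ToyCodec(Codec):
    """Pass-through codec; a real wrapper would call into nvcomp here."""

    codec_id = "toy"  # name recorded in array metadata

    def encode(self, buf):
        # compress `buf` and return the result (a no-op in this sketch)
        return ensure_bytes(buf)

    def decode(self, buf, out=None):
        # decompress `buf`, optionally into a preallocated `out` buffer
        data = ensure_bytes(buf)
        if out is not None:
            return ndarray_copy(data, out)
        return data

numcodecs.register_codec(ToyCodec)
```

An instance can then be passed to zarr as a compressor, e.g. zarr.open(..., compressor=ToyCodec()).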

aladinor commented 2 years ago

Hi there! I am using Zarr to convert several HDF5 files coming from NASA P3 aircraft (planes that fly inside clouds, including hurricanes). This conversion allows us to manage and access large data sets for machine learning applied to cloud microphysics.
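
A minimal sketch of that kind of conversion, assuming the HDF5 files are netCDF4-compatible so xarray can open them (the glob pattern, engine, and combine strategy are assumptions):

```python
import xarray as xr

# combine the per-flight HDF5 files into one dataset, then write a
# single consolidated Zarr store for fast access from ML pipelines
ds = xr.open_mfdataset("p3_flights/*.h5", engine="h5netcdf", combine="by_coords")
ds.to_zarr("p3_flights.zarr", mode="w", consolidated=True)
```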

magnunor commented 2 years ago

Hey, we've added support for zarr in the HyperSpy library, a library for processing electron microscopy data. In recent years we've been getting faster and faster detectors, so our datasets have drastically increased in size: from 100 MB to 100+ GB. Previously we used HDF5 for this, but using zarr has greatly improved performance when working with these large datasets.

Kudos to @cssfrancis for adding this to HyperSpy: https://github.com/hyperspy/hyperspy/pull/2825

https://hyperspy.org/hyperspy-doc/current/user_guide/io.html#zspy-format

julioasotodv commented 12 months ago

Hi! At Sandoz Pharmaceuticals we use zarr to store large, compressed n-dimensional arrays for a wide variety of projects, including biosimilar structure optimization with high-dimensional (>1M-dimensional) Bayesian optimization, storage of large adjacency matrices for molecular graphs (some of these matrices have dimensions 30M x 30M), and medical imaging data.

Multi-threaded reads and writes using Dask have saved us hours of waiting!

Unfortunately I can't share any further details due to trade secrets, but I am really thankful for your great work with this storage format 😄

miccoli commented 10 months ago

Hi, zarr has become a cornerstone of all my data acquisition pipelines. The main “selling point” for me is how easy it is to build very robust systems, with limited data loss in case of hardware/software failures.

PowerChell commented 6 months ago

Hello everyone, I am a Developer Advocate from Radiant Earth, and I love seeing all of this Zarr work. If any of you are interested in hosting your Zarr data on Source Cooperative (currently in public beta at beta.source.coop), please email me at hello@source.coop. We would love to increase exposure to any Zarr projects that want to be seen/shared.

We are also interested in hosting other cloud-optimized data formats on Source. To read more about Source, please see this blog post.

jeromekelleher commented 6 months ago

I've used Zarr to store large-scale genetic variation data efficiently in tsinfer.

I'm also involved in the sgkit project which uses Zarr to work with genomics data more generally (and currently writing a paper about it). Zarr is awesome :+1: