zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Question: Streaming Data #134

Open jakirkham opened 7 years ago

jakirkham commented 7 years ago

I was curious whether anyone has tried streaming data to Zarr from an active source. I'd like to hear what people have tried and what works well. Hearing about any known limitations would also be very helpful.

alimanfoo commented 7 years ago

Could you give an example of an active source and the data generated?

JohannesBuchner commented 4 years ago

This is common in astronomy, where you receive data either from a telescope or from an archive. For example, you may register to receive alerts about supernovae and then receive structured data sets (e.g., images, spectra, position information). In some cases, the data sets are far too large to be stored and need to be processed (or down-selected) on the fly.

HDF5 and FITS files do not support this, AFAIK. VOTable, an XML-based format that can contain binary data tables, does this well. However, it is probably not optimal in terms of data storage. Some people have started working on ASDF to address this and other problems. You can find more information here: https://asdf-standard.readthedocs.io/en/latest/intro.html

jhamman commented 4 years ago

Cross-referencing my recent blog post on this subject: https://medium.com/pangeo/streaming-zarr-ccf0d518b1c0. I think that as long as we can write a MutableMapping interface to the streaming service (Redis in my case), Zarr will have no trouble handling this use case. The real work will be in managing the stream, where we'll likely need additional tools to subscribe to stream events.
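
To make the MutableMapping idea concrete, here is a minimal sketch, not the code from the blog post. The class name `NotifyingStore` and its `subscribe` method are hypothetical, and a plain dict stands in for the streaming service (Redis or otherwise). It only illustrates the shape of the interface: chunk keys map to bytes, and every write can trigger a callback, which is one possible way to surface "stream events".

```python
from collections.abc import MutableMapping


class NotifyingStore(MutableMapping):
    """Hypothetical key/value store that announces every write.

    Assumption: the backend is any dict-like object. A real streaming
    service would need a client library in its place.
    """

    def __init__(self, backend=None):
        # A plain dict stands in for the streaming service here.
        self._backend = {} if backend is None else backend
        self._listeners = []

    def subscribe(self, callback):
        # Register a function to be called with the key of each write.
        self._listeners.append(callback)

    def __getitem__(self, key):
        return self._backend[key]

    def __setitem__(self, key, value):
        self._backend[key] = value
        # Notify subscribers that a chunk (or metadata key) has changed.
        for callback in self._listeners:
            callback(key)

    def __delitem__(self, key):
        del self._backend[key]

    def __iter__(self):
        return iter(self._backend)

    def __len__(self):
        return len(self._backend)


# Usage: record which keys arrive, as a stand-in for a stream consumer.
store = NotifyingStore()
seen = []
store.subscribe(seen.append)
store["0.0"] = b"chunk bytes"
```

With zarr-python 2.x, an object like this could in principle be passed as the `store` argument when creating an array, since that version accepts any MutableMapping as a store. Whether this holds for other versions is something to check against the storage documentation.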