zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Links between groups #297

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

Culturally some groups like to organize datasets day-by-day. This makes it easy to append new data by just dropping it into a directory.

Culturally other groups like to organize datasets as large monoliths. This makes it easy to manage large logical collections simply.

Is there a way to do both by having separate metadata files that both point to the same collections of bytes?

Similarly, I might want a logical dataset that points to the most recent day of data. Ideally I could have a single metadata file in one location containing a relative path that could change day by day.

I suspect that the answer to these questions today is "no, you can not do this. Zarr expects blocks to be in a certain location". However, I suspect that this might be doable if we were to extend metadata entries with an optional relative path to prepend to data key locations.

alimanfoo commented 6 years ago

Hi Matt, I'm not quite grokking. Could you give an example?

ghost commented 6 years ago

Here is my understanding of the problem. Take some zarr store, a.zarr. Every day, some application writes some data to a.zarr. However, it groups the data together by the date on which it was written. We may have groups like /2018/08/30, for example. What @mrocklin seems to be proposing is having multiple metadata files that "transmute" the user-facing appearance of a.zarr. Suppose we also had b.zarr and c.zarr, two stores that refer to a.zarr for data. However, b.zarr specifies in its metadata that it shows the "latest" data entries (/2018/08/30, e.g.), while c.zarr "flattens" all of the data in a.zarr to appear as though everything were under the root group.

@mrocklin Please let me know if I have misunderstood your proposal.

mrocklin commented 6 years ago

Yes, I think that's more or less equivalent. I'll try summarizing it from the other direction.

Let's say that an automated process wants to dump a data file into a directory every day. They've chosen to store that data as Zarr, but in order to avoid mucking about with metadata they just dump a new file every day.

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr

However, some of our scientific users don't want to manage this as many small zarr datasets; they are willing to create a metadata file around this data after the fact to represent it as one giant dataset. They create a new logical zarr dataset that contains only metadata. That metadata points to the pre-existing data contained in the other files:

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr
all.zarr

$ ls all.zarr
.meta.json

$ cat all.zarr/.meta.json
{
  ...
  {
    ...
    "relative_path": "../2018-01-01.zarr/"
  },
  {
    ...
    "relative_path": "../2018-01-02.zarr/"
  },
  ...
}
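
For illustration only, here is a minimal Python sketch (not part of zarr; the DailyUnionStore name and the idea of synthesising a root .zgroup are made up for this example) of how an "all" view could be approximated today with a custom read-only store: a MutableMapping that exposes each per-day DirectoryStore as a sub-group. It assumes the zarr v2 MutableMapping store interface and that each daily .zarr directory is itself a valid zarr array or group.

# Hypothetical sketch: a read-only "union" store over several per-day stores.
# Nothing here is zarr API beyond DirectoryStore/open_group; DailyUnionStore
# and the key layout are assumptions for illustration.
from collections.abc import MutableMapping
import json

import zarr


class DailyUnionStore(MutableMapping):
    """Present each per-day store as a sub-group of one logical store."""

    def __init__(self, day_stores):
        # day_stores maps a group name to an underlying store, e.g.
        # {"2018-01-01": zarr.DirectoryStore("2018-01-01.zarr"), ...}
        self.day_stores = day_stores

    @staticmethod
    def _split(key):
        prefix, _, rest = key.partition("/")
        return prefix, rest

    def __getitem__(self, key):
        if key == ".zgroup":
            # Synthesise root group metadata so zarr sees a group at the root.
            return json.dumps({"zarr_format": 2}).encode()
        prefix, rest = self._split(key)
        if prefix in self.day_stores and rest:
            return self.day_stores[prefix][rest]
        raise KeyError(key)

    def __contains__(self, key):
        if key == ".zgroup":
            return True
        prefix, rest = self._split(key)
        return prefix in self.day_stores and rest in self.day_stores[prefix]

    def __iter__(self):
        yield ".zgroup"
        for name, store in self.day_stores.items():
            for key in store:
                yield name + "/" + key

    def __len__(self):
        return 1 + sum(len(s) for s in self.day_stores.values())

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only view")

    def __delitem__(self, key):
        raise NotImplementedError("read-only view")


# Usage: expose the existing per-day stores as one logical dataset.
days = ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"]
union = DailyUnionStore({d: zarr.DirectoryStore(d + ".zarr") for d in days})
root = zarr.open_group(store=union, mode="r")
# e.g. root["2018-01-04"] (assuming each day store holds an array) reads its
# chunks straight from 2018-01-04.zarr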

Similarly, a workload might want a zarr dataset that is also just a metadata file, but one whose metadata points to the latest day of data:

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr
all.zarr
latest.zarr

alimanfoo commented 6 years ago

Thanks Matt. Here's a gist with some possibilities. In a nutshell, to view all the data together, a user could either (option 1) open the parent directory via zarr, effectively turning the parent directory into a group, or (option 2) use file system (hard) links. To get a "latest" dataset you could also use hard links. You can't currently use symbolic links, as the zarr DirectoryStore does not dereference them, although this could probably be changed. Note that these solutions are specific to using a zarr DirectoryStore; they may not apply to other types of store.
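
For concreteness, here is a small sketch of the hard-link idea (option 2) in plain Python, assuming the per-day directories from the example above sit on a single local file system (hard links cannot cross file systems) and that 2018-01-04 is the most recent day:

import os

src = "2018-01-04.zarr"   # most recent day (illustrative path)
dst = "latest.zarr"

for dirpath, _, filenames in os.walk(src):
    rel = os.path.relpath(dirpath, src)
    target_dir = os.path.normpath(os.path.join(dst, rel))
    os.makedirs(target_dir, exist_ok=True)
    for name in filenames:
        link_path = os.path.join(target_dir, name)
        if os.path.lexists(link_path):
            os.remove(link_path)  # re-point "latest" when a new day arrives
        # Hard-link the chunk/metadata file so latest.zarr shares the same bytes
        os.link(os.path.join(dirpath, name), link_path)

Because hard links share the underlying bytes, latest.zarr adds almost no storage and can be re-pointed each day by re-running the loop against the newest directory.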

Re the suggestion to include links within the zarr metadata, this is probably harder to do, as it would need to be generalised to account for different types of store, i.e., it could not assume a DirectoryStore.

Note that these types of features sound very similar to what HDF5 provides via links. I believe there are "hard", "soft" and "external" links; see the h5py docs on links. I had been trying to avoid implementing links within zarr, just to keep things simple and because this requirement can be achieved at the file system level (if using a DirectoryStore). But I'm happy to discuss if the file system solution is not sufficient.
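
For reference, the three h5py link types look roughly like this (the file and dataset names below are made up):

import h5py
import numpy as np

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("data", data=np.arange(10))
    f["hard"] = f["data"]                               # hard link: a second name for the same object
    f["soft"] = h5py.SoftLink("/data")                  # soft link: resolved by path on access
    f["ext"] = h5py.ExternalLink("other.h5", "/data")   # external link: points into another file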

alimanfoo commented 6 years ago

Thanks @onalant. I've added some more examples to this gist to show how your example could be done with hard links. Again, I'm not saying this is a perfect solution, just illustrating a possibility.

alimanfoo commented 6 years ago

P.S. @mrocklin do you mind if I rename this issue to something like "links between groups"?

mrocklin commented 6 years ago

Fine by me