zarr-developers / zarr_implementations

MIT License

storage of data files in the repository #29

Status: Closed (grlee77 closed this issue 3 years ago)

grlee77 commented 3 years ago

I wanted to raise the question of whether we should be storing any of the generated binary files in this repository. I added them in #24 based on prior examples, but I see that the recent xtensor-zarr PR did not.

One annoyance is that running make data locally causes git to treat all of these files as modified (although we could fix that with .gitignore). Another is that if I run the tests locally using the files stored in the repo (generating only the missing xtensor-zarr data via make xtensor-zarr), I see the following failure:

FAILED test/test_read_all.py::test_correct_read[read z5py zarr using zarr, gzip] - OSError: Not a gzipped file (b'x^')

However, all tests pass if I regenerate the data with make data rather than using the ones stored in the repository.
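As a side note on the error message: the bytes b'x^' are 0x78 0x5E, which form a valid zlib stream header, whereas a gzip stream starts with 0x1F 0x8B. That suggests the stored chunk may be zlib-framed rather than gzip-framed, though this is only an inference from the two bytes shown. A minimal sketch for checking a chunk's leading bytes follows; the helper name sniff_compression is hypothetical and not part of zarr or z5py.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # fixed two-byte gzip identifier (RFC 1952)


def sniff_compression(header: bytes) -> str:
    """Guess the stream framing from the first two bytes of a chunk.

    Hypothetical helper for illustration only.
    """
    if header[:2] == GZIP_MAGIC:
        return "gzip"
    # A zlib header (RFC 1950) has compression method 8 (deflate) in the low
    # nibble of the first byte, and the two bytes read as a big-endian
    # integer must be a multiple of 31.
    if (
        len(header) >= 2
        and (header[0] & 0x0F) == 8
        and int.from_bytes(header[:2], "big") % 31 == 0
    ):
        return "zlib"
    return "unknown"


# The two bytes reported in the OSError above.
print(sniff_compression(b"x^"))  # → zlib
```

Running the same check on the first two bytes of a stored chunk file and of a freshly regenerated one could show whether the two were written with different framing.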

joshmoore commented 3 years ago

Not sure if @constantinpape has any thoughts on the original design, but my vote would also be for no data files in the repo. (And if the previous files are too large in the history, stripping them)

A half-way point might be storing checksums.
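One way the checksum idea could look, as a rough sketch only: walk a data directory, hash every file with SHA-256, and write a small text manifest that is committed in place of the binaries. The function names build_manifest and write_manifest and the manifest layout are assumptions for illustration, not anything that exists in this repository.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative POSIX path) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def write_manifest(root: Path, out: Path) -> None:
    """Write one '<digest>  <relative path>' line per file."""
    lines = [f"{digest}  {name}" for name, digest in build_manifest(root).items()]
    out.write_text("\n".join(lines) + "\n")
```

A test could then regenerate the data and compare build_manifest against the committed manifest. One caveat: compressed chunk bytes can differ between compressor library versions, so checksums of the decoded arrays may turn out to be more stable than checksums of the raw files.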

constantinpape commented 3 years ago

Thanks for raising this @grlee77.

> Not sure if @constantinpape has any thoughts on the original design, but my vote would also be for no data files in the repo. (And if the previous files are too large in the history, stripping them)

Initially, I added all the data just to have some reference files for the zarr / n5 data formats online. The repo has since evolved, and I agree that it now makes more sense not to store all the data.

However, I think it would still be useful to have some reference data here, which could be used by (external) tools for static checks or similar. So I would propose keeping existing data, or adding new data, only if it corresponds to a different spec version (with different compressors). For now this would be data.zr, data.n5, and data.z3. Instead of keeping them under data, we could add a new folder, example, for them to avoid the issues @grlee77 reported.