zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
http://zarr.readthedocs.io/
MIT License

How to detect missing chunks in a zarr array. #587

Open royerloic opened 3 years ago

royerloic commented 3 years ago

This is really a feature request hiding in a question:

I am facing the following issue: one of my large lightsheet microscopy datasets seems to have missing chunks. These could have been 'lost' at any stage. I am not blaming zarr here; we know that other causes, such as file transfer, could have introduced these problems. Ideally I would want to verify, after each processing step, that all chunks have been written correctly -- or at least that no chunks are missing. How could this be done? Also, checksumming and data integrity features would be important as we use zarr more and more for critical scientific data...

joshmoore commented 3 years ago

Hi @royerloic

I recently ran into a similar issue, but the chunks were intended to be missing. Is that possible in your case? i.e. do you have chunks where all pixels are the fill value (e.g. 0)? See https://github.com/intake/filesystem_spec/issues/342 for some background.

If that is the case, then it's going to be more difficult to distinguish genuinely lost chunks from intentionally absent ones. I don't know of a flag to enforce writing empty chunks, but you could try setting a different fill value.

If that's not an issue, then this would be roughly equivalent to https://github.com/zarr-developers/zarr-python/issues/392, though perhaps a workaround can be found for your particular use case.

jakirkham commented 3 years ago

Hey Loic, sorry to hear about the data loss. In addition to the longer term objectives Josh has highlighted, here are some things that could be tried today.

First, it's possible to look at the keys in the Array's chunk_store attribute. These keys correspond to all chunks containing non-trivial data. Comparing them between two Arrays should confirm that the right keys are filled out in the result. Alternatively, the ratio of the Array's nchunks_initialized to nchunks should be 1 if all chunks have been initialized. Admittedly this is a bit crude, but it does help identify what has and has not been written out.
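A minimal sketch of both checks, assuming the zarr v2 Array API (the store paths here are illustrative):

```python
import zarr

a = zarr.open("source.zarr", mode="r")  # illustrative paths
b = zarr.open("copy.zarr", mode="r")

# Keys present in the chunk store correspond to chunks holding non-fill data.
missing = set(a.chunk_store.keys()) - set(b.chunk_store.keys())
print(f"{len(missing)} keys present in source but absent from copy")

# Crude but quick: the ratio should be 1.0 if every chunk was written.
print(f"{b.nchunks_initialized}/{b.nchunks} chunks initialized")
```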

Second (building off of Josh's point above), it might be worth picking a fill value for Arrays that will stick out as obviously bogus. NaN could be good if floating point is an option, or a very negative value when using signed integers. This allows for simple checks on the copied result, perhaps with Dask (assert not da.any(da.isnan(a))). One could generalize this to a full Dask Array based comparison of the array before and after copying (this may be natural if you are using Dask to copy as well).
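For example, a sketch of the NaN approach (shape, chunks, and paths are made up):

```python
import dask.array as da
import numpy as np
import zarr

# Use NaN as the fill value so unwritten chunks stand out as NaN, not 0.
z = zarr.open("copy.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4", fill_value=np.nan)

# ... copy the data into `z` here ...

# Any NaNs remaining after the copy point at chunks that were never written.
a = da.from_zarr("copy.zarr")
assert not da.isnan(a).any().compute(), "found unwritten (NaN) regions"
```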

Third (again building off Josh's point 😄), it's possible to use a simple checksumming algorithm as a filter. These pack the checksum into the data of each chunk, so corruption is caught when the chunk is read back.
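One way to do this today is with the 32-bit checksum codecs in numcodecs, e.g. (parameters illustrative):

```python
import zarr
from numcodecs import Blosc
from numcodecs.checksum32 import CRC32

# Each stored chunk carries a CRC32 checksum; a corrupted chunk fails
# verification when it is decoded.
z = zarr.open("checked.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4",
              filters=[CRC32()], compressor=Blosc())
```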

Fourth, it's possible to compute a full checksum over a Zarr Array using hexdigest. Note this is sequential and uses cryptographic hashing algorithms, so it can be quite slow.
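A quick sketch comparing two arrays this way (paths illustrative; hexdigest defaults to SHA-1):

```python
import zarr

a = zarr.open("source.zarr", mode="r")
b = zarr.open("copy.zarr", mode="r")

# Hashes every chunk in order, so expect this to take a while on big arrays.
assert a.hexdigest() == b.hexdigest(), "arrays differ"
```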

Additionally, it may be worthwhile to look at Zarr's convenience functions copy and copy_all to perform these transfers. These have other options like logging, handling replacement of values (in the event a copy needs to be run again), etc. All of which can be useful for providing greater certainty around copies, better visibility into failures, and places for us to debug issues that occur.
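For instance, a sketch using copy_all with logging and replacement enabled (group paths are made up):

```python
import zarr

src = zarr.open_group("source.zarr", mode="r")
dst = zarr.open_group("dest.zarr", mode="a")

# Log each copied item to a file; replace existing data if the copy is re-run.
zarr.copy_all(src, dst, log="copy.log", if_exists="replace")
```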

Another thing worth exploring is using different storage backends for different purposes. For instance, writing to disk during acquisition, moving to a single-file format (maybe ZipStore or something else) during transfer, and archiving to a database (key-value stores work well and several are implemented) or the cloud (most providers are supported) for long-term storage. Admittedly this may be something you have given more thought than I have at this point, but figured it deserved a brief mention 😉
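As one example, packing a directory store into a single zip file for transfer might look like this (paths illustrative):

```python
import zarr

# Repack an on-disk store into a single zip file for easier transfer.
src = zarr.DirectoryStore("data.zarr")
dst = zarr.ZipStore("data.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()  # ZipStore must be closed to finalize the archive
```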

None of these is a perfect solution, though hopefully one or a few of them prove useful in practice to improve quality. Questions and feedback welcome 🙂

royerloic commented 3 years ago

Thanks @jakirkham and @joshmoore !

I think a solution along the lines of #392 would be great, to be able to verify the integrity of a store even if there is no other copy of the store to compare it to.

In the meantime, I will experiment with some of the ideas mentioned above to verify integrity in an ad-hoc manner.

Thanks!

royerloic commented 3 years ago

One question, here is the info for a freshly generated zarr file:

[screenshot: output of the array's info report]

Seems that not all chunks are initialised:

Chunks initialized : 115856/116064

Is the only reason that a chunk is not initialised that something went wrong? or can it happen just because the whole chunk can be zeroes?
After reading more, it seems that if you write a chunk containing all zeroes, the chunk stays uninitialised. This is unfortunate because it becomes difficult to disentangle whether data is missing or genuinely zero. It would be good to have the possibility to write a 'null' chunk that does not take much (any) space (an empty file? a dict of null chunks?) but makes it clear that something was explicitly set to zero versus just missing.
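A small sketch illustrating this behaviour, assuming the default of dropping chunks that equal the fill value (this can vary by zarr version):

```python
import zarr

z = zarr.zeros((4, 4), chunks=(2, 2), dtype="i4")  # fill_value defaults to 0
z[:2, :2] = 0   # all-zero write: chunk matches the fill value, not stored
z[2:, 2:] = 1   # non-zero write: chunk is actually written
print(f"{z.nchunks_initialized}/{z.nchunks}")  # expected: 1/4
```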

joshmoore commented 3 years ago

or can it happen just because the whole chunk can be zeroes?

Yes, definitely.

jakirkham commented 3 years ago

Yeah, this is why one of the suggestions above was to use a known problematic value (would 65535 work here? Or would using int32 and -1 work?). Alternatively, one could keep a mask in a separate corresponding bool array (similar to how NumPy's masked arrays work) to mask out problematic values.
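A rough sketch of the sentinel-plus-mask idea (array names, shapes, and paths are all illustrative):

```python
import numpy as np
import zarr

# Sentinel fill value: any 65535 that survives processing is suspect.
data = zarr.open("data.zarr", mode="w", shape=(8, 8), chunks=(4, 4),
                 dtype="u2", fill_value=65535)
# Parallel boolean mask records which regions were explicitly written.
mask = zarr.open("mask.zarr", mode="w", shape=(8, 8), chunks=(4, 4),
                 dtype="bool", fill_value=False)

block = np.zeros((4, 4), dtype="u2")  # genuinely-zero data
data[:4, :4] = block
mask[:4, :4] = True  # distinguishes "written as zero" from "never written"
```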