zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

S3 anonymous Zarr data examples... #385

Open DennisHeimbigner opened 5 years ago

DennisHeimbigner commented 5 years ago

I am in the process of constructing the initial netcdf-c library handler for the Zarr format. As part of this, I need to verify my assumptions about how the storage maps to S3. Are there any anonymously accessible Zarr datasets that I can access (read-only)?
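(Not part of netcdf-c; a stdlib-only sketch of the mapping being asked about, assuming the Zarr v2 spec, uncompressed chunks, and a made-up `layout_v2_array` helper. Each key below would become one object under the array's prefix in an S3 bucket.)

```python
import itertools
import json
import math

def layout_v2_array(shape, chunks, dtype="<f8"):
    """Return the flat key -> bytes mapping a Zarr v2 array produces."""
    itemsize = 8  # bytes per element for <f8
    store = {}
    # Array metadata lives in a single JSON object named ".zarray".
    store[".zarray"] = json.dumps({
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        "compressor": None,   # uncompressed chunks, for simplicity
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }).encode()
    # One object per chunk, keyed by dot-separated chunk-grid indices.
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(g) for g in grid)):
        store[".".join(map(str, idx))] = b"\x00" * (itemsize * math.prod(chunks))
    return store

store = layout_v2_array(shape=(4, 4), chunks=(2, 2))
print(sorted(store))  # ['.zarray', '0.0', '0.1', '1.0', '1.1']
```

So a 4x4 float64 array with 2x2 chunks is five S3 objects: the `.zarray` metadata plus four chunk objects.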

jhamman commented 5 years ago

@DennisHeimbigner - this dataset is on GCS but may work for you: https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/

DennisHeimbigner commented 5 years ago

The prefix https://storage.googleapis.com/pangeo-data is readable, so it may work. Thanks.

jakirkham commented 5 years ago

There's also a fixture directory in this repo with some dummy data used to validate that the format/spec is still met during testing. It may be useful for seeding your own S3 bucket, though it is quite small.

DennisHeimbigner commented 5 years ago

Unfortunately, these do not appear to reside on S3 itself.

jhamman commented 5 years ago

Right. This data is in GCS. Perhaps @jacobtomlinson knows of a public S3 Zarr out there?

jacobtomlinson commented 5 years ago

No, but I can make one if you like?

DennisHeimbigner commented 5 years ago

It would be helpful if you did. It does not have to be complex; I am just trying to get the basic access correct.

alimanfoo commented 5 years ago

The S3 example in the zarr tutorial uses a very small toy dataset that is publicly accessible. Bucket is here: http://zarr-demo.s3-eu-west-2.amazonaws.com/

joshmoore commented 5 years ago

Would there be any interest in having a https://www.minio.io/ -based setup using Docker within Travis so that S3 tests could be run? This would carry an s3fs requirement, at least at the testing scope.

Edit: Looks like gh-293 may either make this unnecessary or be a good template for adding this for an AWS clone.
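(A hypothetical CI fragment for the MinIO idea above: run MinIO in Docker as a local S3 stand-in on http://localhost:9000. The credentials are placeholders, not real keys.)

```shell
# Start MinIO as a throwaway S3-compatible endpoint for the test run.
docker run -d --name minio -p 9000:9000 \
  -e MINIO_ACCESS_KEY=testkey -e MINIO_SECRET_KEY=testsecret \
  minio/minio server /data
```

Tests could then point s3fs at it with `client_kwargs={"endpoint_url": "http://localhost:9000"}`.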

alimanfoo commented 5 years ago

> Would there be any interest in having a https://www.minio.io/ -based setup using Docker within Travis so that S3 tests could be run? This would carry an s3fs requirement, at least at the testing scope.
>
> Edit: Looks like gh-293 may either make this unnecessary or be a good template for adding this for an AWS clone.

Sorry for the slow follow-up here. I think this would be excellent. I had been concerned that the cloud storage class implementations that live outside the zarr code base were not being put through the test suite, but this would solve that very nicely. I think #293 provides a template, but it would need a new PR to add test coverage for AWS S3 via s3fs.S3Map.

Also I noticed recently that GCS has support now for local emulation, so it should be possible to get something for GCS too via gcsfs.GCSMap. That could be done separately from the open PR to implement a GCS storage class via the official Python SDK (#252), which would be nice to finish but is a parallel piece of work.

martindurant commented 5 years ago

> GCS has support now for local emulation

How? Where? I'd love to see it. I think I saw this mentioned elsewhere too.

To @joshmoore: you don't need minio; you can more easily use moto, which is what the s3fs tests use.

alimanfoo commented 5 years ago

Re emulation, sorry, I think I got confused: I had seen this page about emulation for Google Cloud Datastore, but of course that's something completely different from Google Cloud Storage.

joshmoore commented 5 years ago

> you don't need minio, you can more easily use moto, which is what the s3fs tests use.

Thanks, @martindurant. I hadn't seen moto before. Happy to have the tests use whatever's appropriate in this repo, especially if mocking is preferred to integration tests. For me, the minio setup is also useful for more production-like testing. Would you also suggest using moto in server mode for that?

martindurant commented 5 years ago

I don't see why not. Moto lacks some rather specific features such as file versioning, but is pretty complete. minio also isn't exactly S3...

meggart commented 5 years ago

> The S3 example in the zarr tutorial uses a very small toy dataset that is publicly accessible. Bucket is here: http://zarr-demo.s3-eu-west-2.amazonaws.com/

We are currently implementing an S3 backend for our Julia zarr package https://github.com/meggart/ZarrNative.jl/commits/S3storage . I wanted to ask if it is ok to use the dataset you mention here for our unit tests?

alimanfoo commented 5 years ago

Yes of course. Also happy to give you write access and/or put more test datasets there if it would be useful.


mhearne-usgs commented 4 years ago

@alimanfoo Regarding this S3 example, what is the file format of the zarr-demo data? I've tried placing a .zarr file (directory) on S3, and I am having issues accessing it.

joshmoore commented 4 years ago

@mhearne-usgs : see also https://github.com/martindurant/zarr/pull/1/files for an example of following @martindurant's moto suggestion.