zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

mime type / encoding format conventions #123

Open satra opened 2 years ago

satra commented 2 years ago

we are trying to include some type information in our jsonld descriptors of a zarr asset. i searched but could not find an established mime type for zarr. would application/x-zarr be appropriate?

jhamman commented 2 years ago

I don't recall how we got there but we've used application/vnd+zarr in the past.

ethanrd commented 2 years ago

Hi all - Unidata and the netCDF community are working on registering the application/netcdf media type with IANA (see netCDF GH Issue 42). Here are a few notes on the registration process in case they are useful.

The process for registering a media type with IANA (defined in RFC 6838) has an unregistered namespace that "may be used for [media] types intended exclusively for use in private, local environments". The subtype in the unregistered namespace/tree is prefixed with “x.”, which replaces the older “x-” prefix.

The vendor tree/namespace (prefixed with “vnd.”) is used for "media types associated with publicly available products". A suffix starting with “+” has a special meaning in IANA media type names. So, application/vnd.zarr would fit the IANA model better than application/vnd+zarr. Vendor tree media types need to be registered, but registration and review are lightweight compared to the standards tree.

The standards tree (no prefix) is intended for “[media] types of general interest to the Internet community”. Media types registered in the standards tree must be either:

  1. “in the case of registrations associated with IETF specifications, approved directly by the IESG”
  2. “registered by a recognized standards-related organization” (IESG makes a one-time decision on whether the submitter represents a recognized standards-related organization). This option also requires a well-defined specification for the media type.

Registration on the full standards tree registry can take some time and effort. However, there is a provisional registration process available to facilitate prototyping and testing. The main hurdle for provisional registration is getting recognized as a “standards-related organization”. There are a number of standards and steering committees that are recognized as such. So, if Zarr decides to register on the standards tree, the Zarr Steering Committee might be the entity to get recognized.

This is as far as we’ve gotten for netCDF (application/netcdf is listed on the provisional standard media type registry). So I don’t yet know the details of the review part of the full registration process.
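To make the tree prefixes described above concrete, here is a minimal Python sketch that guesses the registration tree from a media type name alone. The function name `registration_tree` is made up for illustration, the personal tree ("prs.") comes from RFC 6838 rather than from this thread, and the check is purely syntactic: it says nothing about whether a name is actually registered with IANA.

```python
def registration_tree(media_type: str) -> str:
    """Guess the RFC 6838 registration tree from the name alone (illustrative)."""
    # Drop any parameters (";charset=...") and keep only the subtype.
    _, _, subtype = media_type.split(";")[0].strip().lower().partition("/")
    # The older "x-" convention predates the "x." unregistered tree.
    if subtype.startswith("x-"):
        return "legacy x- prefix"
    # A tree prefix is the facet before the first "." in the subtype.
    facet, dot, _ = subtype.partition(".")
    trees = {"vnd": "vendor", "prs": "personal", "x": "unregistered"}
    if dot and facet in trees:
        return trees[facet]
    # No recognised prefix: the name is shaped like a standards-tree name.
    return "standards"


for name in (
    "application/x-zarr",
    "application/x.zarr",
    "application/vnd.zarr",
    "application/vnd+zarr",
    "application/netcdf",
):
    print(f"{name} -> {registration_tree(name)}")
```

Note that application/vnd+zarr has no "vnd." facet, so by name it is shaped like a standards-tree type called "vnd" with a "+zarr" suffix, which is why application/vnd.zarr fits the model better.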

joshmoore commented 2 years ago

@satra, for which files are you thinking of adding a mimetype? The fact that there are multiple makes this an interesting problem. e.g. if someone downloads a chunk and learns that it's "application/zarr" or whatever, what can they do with that without the rest of the fileset?

> I don't recall how we got there but we've used application/vnd+zarr in the past.

@jhamman, you use this for each .zgroup, .zarray and .zattrs file? Conceivably these could also have a prominent "json" in the mimetype.

jhamman commented 2 years ago

> @jhamman, you use this for each .zgroup, .zarray and .zattrs file? Conceivably these could also have a prominent "json" in the mimetype.

So, we're using application/vnd+zarr as the asset media type in the STAC context where an asset is represented as a path that points to a directory that contains a .zgroup. We are not using the media types to represent the types of metadata or data objects within a zarr dataset.
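As a rough illustration of that usage, the sketch below builds a STAC-style asset entry whose `type` field carries the media type for the whole store. The `href`, the asset key `zarr-store`, and the `roles` value are placeholders invented for this example; only the media type string comes from the comment above.

```python
import json

# Hypothetical asset entry: href, key name and roles are placeholders.
asset = {
    # Path to the directory that contains the top-level .zgroup.
    "href": "s3://example-bucket/path/to/store.zarr",
    # One media type for the store as a whole, not for the files inside it.
    "type": "application/vnd+zarr",
    "roles": ["data"],
}

item_fragment = {"assets": {"zarr-store": asset}}
print(json.dumps(item_fragment, indent=2))
```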

satra commented 2 years ago

> for which files are you thinking of adding a mimetype?

@joshmoore - same as @jhamman. in our archive we are using a NestedDirectoryStore hosted on s3 as an asset. only the top level path (e.g., /path/to/somename.ngff) in our database returns this mime-type within the metadata record, not the individual files underneath. for now we have left our implementation at application/x-zarr, with the possibility of converging on whatever consensus emerges.

rabernat commented 2 years ago

I just saw on a webinar from @bilts that NASA Harmony is using the mime type application/x-zarr for Zarr assets.

joshmoore commented 2 years ago

Quote: A media type consists of a type and a subtype, which is further structured into a tree. A media type can optionally define a suffix and parameters:

Excerpts from a partial read of https://www.rfc-editor.org/rfc/rfc6838.html:

Based on these, my general thoughts are:

satra commented 2 years ago

ping @yarikoptic

jbms commented 2 years ago

Is there any precedent for using mime types to refer to directory trees as opposed to individual files?

satra commented 2 years ago

there have been several efforts : https://www.w3.org/2002/12/cal/rfc2425.html and various vendor specific things including directories on android: vnd.android.cursor.dir

but nothing looking at the type of directory based stores that we are considering here.

yarikoptic commented 2 years ago

> Is there any precedent for using mime types to refer to directory trees as opposed to individual files?

FWIW I thought to check what http://github.com/file/file (libmagic) thinks -- looking at source and running (on linux) I think all directories are just inode/directory and I don't even see that one among iana.../...media-types.xhtml.

> • I could also see getting behind use of +zarr so that the main intent of the entity could be expressed with another mimetype, image+zarr or application/zip+zarr. The document for that is Structured Syntax Suffixes. Another current example is +sqlite, which is defined to match application/vnd.sqlite3.

I wonder if it shouldn't be the other way around, i.e. have /zarr and then possibly the +suffix (e.g., +zip assuming that +directory is like a default.) rfc6838 ref on suffixes

examples from media-types

```shell
$> curl --silent https://www.iana.org/assignments/media-types/media-types.xhtml | grep 'application.*+zip'
application/bacnet-xdd+zip
application/epub+zip
application/lpf+zip
application/p21+zip
application/prs.hpub+zip
application/vnd.comicbook+zip
application/vnd.d2l.coursepackage1p0+zip
application/vnd.espass-espass+zip
application/vnd.etsi.asic-s+zip
application/vnd.etsi.asic-e+zip
application/vnd.exstream-empower+zip
application/vnd.familysearch.gedcom+zip
application/vnd.ficlab.flb+zip
application/vnd.gov.sk.e-form+zip
application/vnd.imagemeter.folder+zip
application/vnd.imagemeter.image+zip
application/vnd.iso11783-10+zip
application/vnd.logipipe.circuit+zip
application/vnd.maxar.archive.3tz+zip
```

joshmoore commented 2 years ago

> I wonder if it shouldn't be the other way around, i.e. have /zarr and then possibly the +suffix (e.g., +zip assuming that +directory is like a default.) rfc6838 ref on suffixes

:+1: I could see that. Though I think the +zarr as with +sqlite3 or +zip could still be useful even if we want to target application/[vnd.]zarr for most cases. Though perhaps the fact that only one suffix is intended could come back to bite us.
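The single-suffix concern can be sketched in a few lines of Python. This assumes the structured syntax suffix is whatever follows the last "+" in the name; `split_suffix` is a hypothetical helper, not part of any library, and the candidate names are the ones floated in this thread rather than registered types.

```python
from typing import Optional, Tuple


def split_suffix(media_type: str) -> Tuple[str, Optional[str]]:
    """Split a media type into (base, suffix), taking text after the last '+'."""
    # Ignore parameters so a '+' inside them is never mistaken for a suffix.
    name = media_type.split(";")[0].strip().lower()
    base, plus, suffix = name.rpartition("+")
    if not plus:
        return name, None
    return base, suffix


for candidate in (
    "application/zip+zarr",   # zip is the main type, zarr the suffix
    "application/zarr+zip",   # zarr is the main type, zip the suffix
    "application/vnd.zarr",   # no suffix at all
    "application/vnd+zarr",   # parses as base "vnd" with suffix "zarr"
):
    print(candidate, "->", split_suffix(candidate))
```

Under this reading only one slot is available, so a name cannot say both "this is zarr" and "this is packaged as zip" through suffixes alone; one of the two has to live in the base subtype.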

jbms commented 2 years ago

@satra It appears that both of those examples, https://www.w3.org/2002/12/cal/rfc2425.html proposing a text/directory mime type, and vnd.android.cursor.dir, logically represent some sort of collection of items, but are in fact still represented as a single file or byte stream.

Note: application/zip+zarr would correspond to a single file (the zip file) so there is no issue there.

I can see the benefit of using a mime type if you have an existing database where things are identified by mime types. But my understanding is that so far mime types have been limited to identifying the format of a single file / byte stream. We may want to be careful in using mime types outside of their normal scope --- and perhaps at least see if this is something that has been done before.