Define the list of codecs in the v3 spec

d-v-b commented 2 months ago

Follow-up from #293

The way the v3 spec describes codecs today is inconsistent with the needs of a large number of users and implementers. Specifically, the spec says that there is no formal list of official codecs, but a number of people believe that a formal list of official codecs is either the true intention of the spec, or simply a much better specification than the alternative.

Changes in this PR:

removes the language stating that the spec does not list any codecs. Said language is replaced with a definition of a "core codec" (e.g., the codecs that are de facto part of the spec today), as well as a definition of a "community codec" (e.g., a codec published by a third party).
Zarr implementations ~MUST~ SHOULD support core codecs
Zarr implementations MAY support community codecs.
The requirement that codec identifiers be URIs that resolve to a human-readable specification has been removed, as we weren't even following that ourselves and it's not machine-checkable by definition.

Happy to take feedback, in particular on the styling. I'm bad at rst.

I think it's worth considering the application of the same "core" vs "community" distinction for other extension points, e.g. data types.

zoj613 commented 2 months ago

Does blosc really need to be a necessary codec? I feel as though Gzip is enough for a bytes->bytes core codec since its widely available. Blosc might not have bindings in many languages, making it difficult to be supported by some implementations. For example, zarr-ml cannot support blosc since as of today there aren't any OCaml bindings to the blosc C library. On the other hand, we can confidently assume nearly all modern languages have support for Gzip/Zlib given that it has been around for over 30 years.

d-v-b commented 2 months ago

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

zoj613 commented 2 months ago

Does blosc really need to be a necessary codec?

I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293.

I see. I wasn't aware of that discussion. I would like to also caution against defining a spec based on the features offered by the python implementations of a particular codec. For example, I noted here that the Zstd spec includes a checksum parameter that does not seem to be supported by many Zstd implementations across different languages. It appears to me that the parameter is influenced by the feature offered by the Zstd python implementation even though many other implementations do not seem to support it. This would make it difficult to full support that spec across Zarr implementations

d-v-b commented 2 months ago

@zoj613 would you mind bringing up your concerns in #293 for visibility?

d-v-b commented 2 months ago

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

LDeakin commented 2 months ago

(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support")

This is in line with my thoughts in #293:

A zarr implementation does not need to support every codec to be conformant

and @normanrz:

Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement it

d-v-b commented 2 months ago

I think this is ready for review. I removed the requirement that zarr implementations MUST support the core codecs. instead, they SHOULD support them.

d-v-b commented 2 months ago

@zarr-developers/steering-council could someone have a look at this?

d-v-b commented 1 month ago

@zarr-developers/steering-council I really would appreciate some feedback on this PR. It has been ready for review for 3 weeks with no response. What is the blocker here?

For context, in zarr-python there are efforts to implement v3 versions of zarr v2 compression routines. The spec as currently written does not offer clear guidance for this process, and I believe that the changes contained in this PR will offer much needed clarifications that will be of immediate use to at least one implementation (zarr-python). If the strategy for codecs outlined in this PR is viable, then we can apply the same approach to datatypes.

rabernat commented 1 month ago

@d-v-b did you decide to stick with "community" over "extension"? I weakly favor "extension," but I don't want to block this PR on that point.

d-v-b commented 1 month ago

i'm happy to switch over to "extension" but i haven't changed the text yet

d-v-b commented 1 month ago

"community" is out, "extension" is in.

joshmoore commented 1 month ago

Few comments:

Thanks to @d-v-b for all the clean up!
On the MUST/SHOULD front (link), it might be good to make sure folks from sign off if it's reversing a previous decision.
As I mentioned to @d-v-b at a recent meeting, I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

d-v-b commented 1 month ago

I'd would favor guidance for codec implementors wherever we have a preference on conventions moving forward, e.g., "codec names are cheap; prefer grabbing a new name (my-codec-2) over breaking the schema of your codec configuration."

This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors. If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid. In fact, the entire point of this PR is to provide a path for members of the community to publish extension codecs with as little work from spec maintainers as possible. This is the spirit of the following text from the PR:

Publication in the "extension codecs" section does not confer primacy or an official designation to a codec. The list of extension codecs exists expressly as a tool to enable coordinated codec development.

Also, your requested recommendation is not so simple. We cannot simply recommend that codec developers avoid breaking changes to their codecs -- a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes.

It is exactly because this matter is complicated that I do not wish to comment on it in this PR, hence my use of limited language above. If we are to make recommendations to codec developers (and managing versioning is just one of many possible recommendations), it should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one.

d-v-b commented 1 month ago

@normanrz @dstansby @LDeakin @zoj613 thoughts on the changes proposed in this PR? In particular the statement that implementations SHOULD support the core codecs, but also anything else.

joshmoore commented 1 month ago

This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors.

Agreed but ...

If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid.

I didn't say "enforce", but I do think it's our job to communicate preferences where possible since codec implementers can't read our minds.

We cannot simply recommend that codec developers avoid breaking changes to their codecs...

I disagree. We can definitely recommend; we can't require.

a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes.

That's an interesting point. If we think that's feasible, then I would include that in our discussion. An expert codec developer will likely have thought of the above and will not need our recommendation, but for the rest of the world, we should provide what guidance we can or at least a description of trade-offs that we are aware of. And I'd still say that codec breakage (without versioning) is something to be avoided for the community.

Tt should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one.

:+1: though I think non-normative statements can easily be kept in this document until we have such a best practices guide.

d-v-b commented 1 month ago

I didn't say "enforce", but I do think it's our job to communicate preferences where possible since codec implementers can't read our minds.

Here is the crux of the matter: it is simply not my preference that authors of extension codecs avoid breaking changes to their codecs. In my view, they can do what they want with their codecs, except use the name of a core codec. The whole point of this PR is that we are hosting the specification of their codecs, and nothing else. You are asking for us to request or recommend something else, i.e. your personal preference that those developers adopt certain development practices. That preference is, as we see here, not simple and the subject of debate.

In the interest of getting the core features of this PR through, I would request that we not attempt to resolve a potentially complicated discussion about best practices for codec development with this PR. Lets find a different venue for that content. We need the core elements of this PR to host specifications for string datatypes in zarr v3. Pending additional evidence, we do not need development recommendations for codec developers at this time.

joshmoore commented 1 month ago

A few last points for tonight:

it is simply not my preference

To be clear, I'm not talking about you or your preferences specifically anywhere though I value your insights in these matters. It's about communicating to new devs with a common voice what we believe will lead to the most valuable ecosystem; they can always ignore us.

The whole point of this PR is that we are hosting the specification of their codecs, and nothing else

As I mentioned on call, I think you and I differ here in that as soon as we accept in the PR to list the community extension I think there are expectations, and we can keep trying to communicate that we take no responsibility, but it's probably as easy to list some best practices to avoid the worst mistakes.

I would request that we not attempt to resolve a potentially complicated discussion about best practices for codec development with this PR...

Understood. I don't necessarily agree that we want to wait for the evidence of having broken data to make a recommendation nor that it warrants a separate page, but I accept that you don't want to take responsibility for those changes.

P.S. I have some inkling that this PR and #293 point to differing understandings of what codecs are and/or how they should be handled (naming, etc.), but I can't put my finger on it, certainly not at this hour...

LDeakin commented 1 month ago

@normanrz @dstansby @LDeakin @zoj613 thoughts on the changes proposed in this PR? In particular the statement that implementations SHOULD support the core codecs, but also anything else.

It all looks good to me 👍

Breaking changes/versioning

+1 for everything stated in https://github.com/zarr-developers/zarr-specs/pull/312#issuecomment-2403397054 by @d-v-b.

If I were to propose an extension codec for one of the experimental codecs in zarrs, I would seek feedback during a PR and provide stability guarantees on merge. But I don't think that should be mandatory.

I think it would be perfectly fine for extension codecs to include a disclaimer like the following (if needed):

This codec may receive breaking changes and is recommended for evaluation only. This revision is supported by XX implementations.

joshmoore commented 1 month ago

Thanks, @LDeakin, it's a good example of what a developer would realistically want to communicate.

I think what I'm struggling with is that there seem to be more classes of extensions and we're trying to map a subset. Trying to describe the classes without using any of terms we've used so far:

Class	A	B	C	D	E	F
Confidence	100%	80%	60%	40%	20%	0%
Naming	Unique and "special"	Unique and "special" (?)	Unique	Unique	Unique in "registry"	No guarantees
Schema	zarr-specs	zarr-specs	zarr-specs	zarr-specs	Developer's page (if at all)	Developer's page (if at all)
Dicovery	zarr-specs	zarr-specs	zarr-specs	zarr-specs	zarr-specs	PyPi, etc.
Versioning	ZEP-controlled	ZEP-controlled	Recommendations	Unknown	Unknown	Unknown
Support	All implementations	Hopefully most implementations	Unknown	Unknown	Unknown	Unknown
Comment/Like					"registry"	"entrypoint"

(Note: I'm not proposing these classes for inclusion in the spec just trying to assign names so we can discuss previous and current state without getting confused)

I think if you were to have asked me before this PRs conversation I would have said "core codecs" are (A) and "extensions" are (C), and I would say if there are no guarantees on a codec then for me it's closer to (E), i.e. grab your name and go, or the current situation of (F).

I think this proposal is closer to suggesting (B) and (D), but happy to take input on the table to see if we're thinking about these things in the same way.

A reason I think this is important is the relationship to other future extensions. Would we feel comfortable with a similar level of confidence for an "extension data type" that we list in zarr-specs?

d-v-b commented 1 month ago

The goal of this PR is not to carefully map the space of codecs. We should do that somewhere else. The goal of this PR is to clarify the language of the spec in a useful way for codec development, in particular solving two acute, practical problems:

defining the status of the codecs that are currently listed in the spec (the "core codecs")
defining a minimal framework for publishing codecs that are not listed in the spec ("extension codecs") that solves an active problem: over in zarr-python we are back-porting zarr v2 codecs into zarr v3, and we (the zarr community) needs a place to publish the specs for those codecs so that other implementations can work with them.

So, for the purposes of this PR, there are only two classes of codecs under consideration: those listed in the spec ("core codecs"), and those not listed in the spec ("extension codecs").

For the codecs currently listed in the spec (core codecs), this PR addresses the following question: SHOULD implementations support those codecs, or MUST they support those codecs?

The answer I have reached in this PR is a decisive SHOULD. Suppose someone wants to write a Zarr implementation in a language X that doesn't have any bindings to blosc. If the spec requires that all Zarr implementations MUST support the core codecs, then language X cannot have a valid Zarr implementation until blosc bindings are added, and this is an unfair burden on members of the language X ecosystem. Crucially, this is not a hypothetical scenario. Because the alternative would cause serious problems, this PR alters the spec to state that the core codecs SHOULD be supported by Zarr implementations.

For codecs not listed in the spec (extension codecs), this PR addresses the following question: what is the relationship between extension codecs and core codecs? (answer: they have to use a different name from the core codecs). This PR also defines a service for zarr community members, wherein the zarr spec developers will host their codec. But, as stated in the PR, being listed doesn't confer any special status to an extension codec.

This PR represents my attempt to make the minimal amount of changes to resolve serious ambiguity in the spec document which is a barrier for codec development. If there are proposed changes to this PR that can better solve the problems I am trying to solve then please suggest them.

A reason I think this is important is the relationship to other future extensions. Would we feel comfortable with a similar level of confidence for an "extension data type" that we list in zarr-specs?

I think the changes I am proposing in this PR could work for data types as well. I would use the same two classes, with the same rules for members of each class.

dstansby commented 1 month ago

Aside from my comments above, I wanted to say that I am 👍 this change overall, and thanks a bunch for opening a concrete proposal to push this forward.

joshmoore commented 1 month ago

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

d-v-b commented 1 month ago

The goal of this PR is not to carefully map the space of codecs....

@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here?

I don't know what the disagreement or conflict is -- what changes do you specifically want to see in this PR? We have discussed adding recommendations to codec developers, but since we disagree on the content of that recommendation I think it's out of scope for this PR, and should probably be contained in a separate PR; I'd be happy to continue that particular conversation in such a PR.

joshmoore commented 1 month ago

we disagree on the content of that recommendation I think it's out of scope for this PR

The fact that we disagree means that we will likely be having this conversation in any follow up PR, so unfortunately, I'd say it's in scope until we at least agree on the definition of what's in this PR.

Questions I have are:

core
- Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.
extensions
- what's the stability of "extension codecs"? (C vs D)
- should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

d-v-b commented 1 month ago

Are there any codecs required by all implementations? (A) - I think we're still waiting on some responses to your ping.

no, pending feedback of course.

what's the stability of "extension codecs"? (C vs D)

should codecs that are for "evaluation only" as @LDeakin said have their schema listed (D), just be named (E) or maybe not be part of this repo (F)? If not D, does that point to a different type (e.g. "community") again?

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs repo. I have labored to keep this PR simple, but it feels to me that you are broadly suggesting that I make this PR more complex. Given that nothing in this PR precludes additional efforts to refine our definition of extension codecs, I would ask that we put those efforts and that conversation in a separate venue, e.g. an issue or a PR.

joshmoore commented 1 month ago

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

d-v-b commented 1 month ago

These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs rep

I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.

"codecs in the spec" and "codecs not in the spec" are concepts that exist in the spec already -- I am merely giving these concepts names in my PR.

it feels to me that you are broadly suggesting that I make this PR more complex.

No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example).

Why does it make you uncomfortable?

joshmoore commented 1 month ago

Rather than reiterate my points from above, I'll add that @rabernat's suggestion of a maturity level from https://stac-extensions.github.io/ lessened the nervous. So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR? Perhaps as with @dstansby's suggestion, we try to say less rather than trying to work out the definitions at this point.

d-v-b commented 1 month ago

So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR?

What specifically in this PR closes the door on follow-up PRs?

joshmoore commented 1 month ago

In my reading, it's defining extension points in the most permissive way when I would define them more restrictively. If we can agree on text to basically say "there are codecs not in the spec" then we can decide on this maturity categorization later.

d-v-b commented 1 month ago

it's defining extension points in the most permissive way when I would define them more restrictively.

This is exactly the kind of thing that can be changed in a follow-up PR, no?

d-v-b commented 1 month ago

Maybe it's best if we focus on a real example, because that's what this PR is trying to solve: we are actively porting zarr-python v2 codecs into zarr-python v3. To avoid fragmentation of the zarr ecosystem, we should put the specifications for those codecs somewhere outside of zarr-python to ensure that non-python implementations can support them. This PR is an attempt to solve that problem. Do you have a better solution? Do you think this PR fails at solving that problem?

joshmoore commented 1 month ago

This may be too far in the other direction but this is something I feel more comfortable with:

diff --git a/docs/v3/core/v3.0.rst b/docs/v3/core/v3.0.rst
index 0ab3cf2c..e944d63e 100644
--- a/docs/v3/core/v3.0.rst
+++ b/docs/v3/core/v3.0.rst
@@ -1233,19 +1233,18 @@ An existing core codec SHOULD be removed from the list when a sufficient number
 developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
 algorithm underlying the codec.

-Extension codecs
-----------------
+Other codecs
+------------

 Zarr implementations MAY support a codec that is not in the list of core codecs
-(hereafter termed an "extension codec"), provided the extension codec does not use an identifier
+provided the extension codec does not use an identifier
 that is already used by a core codec; the identifiers of core codecs are reserved.
-
-This specification places no other constraints on extension codecs. It is possible, through discouraged,
+This specification places no other constraints on these codecs. It is possible, through discouraged,
 that separate developers may define distinct codecs that use the same identifier.
 To minimize the impact of such name collisions, codec developers are strongly encouraged
-to publish their codec specifications as additions to the "extension codecs" section of Zarr v3 specification.
-Publication in the "extension codecs" section does not confer primacy or an official designation to a codec.
-The list of extension codecs exists expressly as a tool to enable coordinated codec development.
+to register their codec names in the XXX section of Zarr v3 specification.
+Listing a codec in this way exists expressly as a tool to enable coordinated codec development
+and does not confer primacy or an official designation to a codec.

 Stores
 ======

https://github.com/d-v-b/zarr-specs/compare/fix/list-of-codecs...joshmoore:zarr-specs:fix/list-of-codecs?expand=1

I note that I was also not sure where these codecs should be listed and so left a XXX. One thing that occurs to me is how to prevent future collisions:

Do we try to manage incoming names?
Do we state that future core codecs may steal back the names?

d-v-b commented 1 month ago

thanks for the suggestions. I think the name "other codecs" is a step down from "extension codecs", but I'm happy to listen to what other people think, in particular those who preferred "extension codecs" over "community codecs" (@TomAugspurger @rabernat).

I note that I was also not sure where these codecs should be listed and so left a XXX.

I think whenever we decide on a concrete location for the non-core codecs we can use a hyperlink to point to that location. So IMO it's fine if the text says "the section for the spec".

One thing that occurs to me is how to prevent future collisions:

Do we try to manage incoming names? Do we state that future core codecs may steal back the names?

It would be helpful if you could explain in more detail what collisions you are referring to. Name collisions between core codecs and non-core codecs are impossible because the names of core codecs are reserved. Name collisions between non-core codecs are totally possible but discouraged (as per common sense).

Do we try to manage incoming names?

Beyond excluding names that violate github TOS, managing incoming names seems too open-ended. if we attempt to curate the list of non-core codecs in such a way that we try to avoid name collisions between non-core codecs, then we encourage name squatting, e.g. where one implementation of jpeg compression with 1 user reserves the name jpeg before a jpeg codec with 1000 users. This repo does not have the bandwidth to adjudicate these things at this time.

Do we state that future core codecs may steal back the names?

I am guessing here that you are describing a scenario in which a non-core codec is made a core codec, and this results in the name of that codec becoming exclusive, which thereby invalidates any other non-core codecs that were currently using that name. I see a few options:

simply allow this outcome
prevent this outcome by making the list of core codecs immutable
only accept a new core codec if accepting it would not create name collisions
avoid name collisions entirely by identifying codecs with UUIDs instead of human-readable names

But before we get into these issues, I want to ask: does this PR prevent us from solving these problems in a follow-up PR?. If so, what changes should we make to this PR so that we can deal with these more complex issues in a follow-up PR? If not, can we have this discussion in a follow-up PR? Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

joshmoore commented 1 month ago

I think the name "other codecs" is a step down from "extension codecs"...

Agreed, but I was trying to leave us open to introduce either or both in the follow up.

So IMO it's fine if the text says "the section for the spec".

Just that we might prepare that location in this PR if we're saying to use it.

It would be helpful if you could explain in more detail what collisions you are referring to.

One collision scenario:

the wider codec community develops a great new codec (e.g. blosc3)
someone is motivated to use it ASAP and registers "blosc3" on the list.
core devs decide that they would like to use the well-known name "blosc3"

This repo does not have the bandwidth to adjudicate these things at this time.

Trying to do that would worry me, too. (This is likely the reason for the URI rule in the previous version.)

I am guessing here that you are describing a scenario in which a non-core codec is made a core codec, and this results in the name of that codec becoming exclusive, which thereby invalidates any other non-core codecs that were currently using that name

Yes exactly, thanks.

I see a few options:

:100:

simply allow this outcome

This could invalidate data, no?

prevent this outcome by making the list of core codecs immutable

i.e. pinned to a major version. Certainly lowers the benefit of the list, and might put the SHOULD in question.

only accept a new core codec if accepting it would not create name collisions

This would work for the list, but there can still be codecs not in the list (back to data invalidation).

avoid name collisions entirely by identifying codecs with UUIDs instead of human-readable names

Ha! I imagine there will be reasons not to, but I like this as a thought experiment.

does this PR prevent us from solving these problems in a follow-up PR?

I think this diff introduces the issue

-Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec.

If so, what changes should we make to this PR so that we can deal with these more complex issues in a follow-up PR? If not, can we have this discussion in a follow-up PR?

If you want to roll back the URI change, then we can deal with that later as well. Another option would be to go to an integration branch and we can collaborate on the overall fix.

Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

I sincerely and wholeheartedly disagree and I will continue to say this widely. Spec development's impacts are too great to make changes to the main version of this document without considering those hypotheticals. It is exactly the job of zarr-spec core-devs.

d-v-b commented 1 month ago

-Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec.

I removed the codec URI statement because that text conflicts with the actual definitions of the codecs we have today, and the actual practice of codec development.

we are not using the codec URIs in array metadata documents; therefore those URIs cannot aid in referring to codecs in said documents.
it is not practical to check that a URI resolves to a human-readable document, especially for codecs that people are actively developing. Read strictly, this requirement means that a codec cannot start with an implementation, which means that this requirement is worthless for real codec development.

I am happy to consider changes to the use of URIs in the codec definition, but they have to be consistent with how we are doing things today, unless you are interested in breaking changes to all of our codecs.

d-v-b commented 1 month ago

I think the name "other codecs" is a step down from "extension codecs"...

Agreed, but I was trying to leave us open to introduce either or both in the follow up.

I'm confused -- if you agree that "other codecs" is a worse name than "extension codecs", then why propose it here?

d-v-b commented 1 month ago

Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient.

I sincerely and wholeheartedly disagree and I will continue to say this widely. Spec development's impacts are too great to make changes to the main version of this document without considering those hypotheticals. It is exactly the job of zarr-spec core-devs.

This quote omits a crucial part of my statement, which was that we should address the hypothetical issues in a separate PR. I will not say anything more about this.

joshmoore commented 1 month ago

they have to be consistent with how we are doing things today

I agree, but also with the spec. That's why there's no easy answer.

I'm confused -- if you agree that "other codecs" is a worse name than "extension codecs", then why propose it here?

It was an attempt to keep the names unused for following changes.

This quote omits a crucial part of my statement, which was that we should address the hypothetical issues in a separate PR.

Except that every commit to this repository is for all intents and purposes a release, so (again a hypothetical) someone could read the spec and start reading in that interim period.

rabernat commented 1 month ago

Maybe it's best if we focus on a real example, because that's what this PR is trying to solve: we are actively porting zarr-python v2 codecs into zarr-python v3. To avoid fragmentation of the zarr ecosystem, we should put the specifications for those codecs somewhere outside of zarr-python to ensure that non-python implementations can support them. This PR is an attempt to solve that problem.

This point of view really resonates with me. As we get closer to the Zarr Python 3 release, it's apparent to me that we currently have a choice between either:

Doing things that are out of spec for V3 but needed to support actual users and achieve feature parity with V2 (e.g. consolidated metadata, vlen-utf8 codec, etc.)
Evolving the spec somehow to formalize these features

This is forcing us to come face-to-face with problems and inconsistencies in the V3 spec that have been present for a long time. IMO, the root of these problems is the flawed notion that we could somehow write a good spec prior to actually attempting to implement it. But now that we are implementing it, the spec is already frozen and unable to accommodate changes needed to enable these features.

Codecs are the prime example: where do the they belong? How do we add new ones? Does this require ZEPs? The spec is inconsistent and incomplete on this topic, and I'm grateful to @d-v-b for trying to find solutions.

If we accept that the V3 spec is basically immutable, here's the alternative: rather than trying to list the codecs in the spec, we return to the original spirit of the V3 spec, which is to liberally enable extensibility. On the topic of codecs, it says:

To allow for flexibility to define and implement new codecs, this specification does not define any codecs, nor restrict the set of codecs that may be used. Each codec must be defined via a separate specification.

To me, this pairs extremely well with the idea of community-developed extensions. All codecs (except bytes I guess) will be defined via extensions, and those extensions can and should be maintained and evolved separately from the spec and the ZEP process. I think @TomAugspurger's suggestions in #316 are great.

The way we would move forward with this route would be as follows:

Make a ZEP (and possibly clarification update to the spec) to explain (in more precise detail than currently exists) how to use the extension mechanism and the process for creating and sharing extensions.
Define codecs via appropriate extensions, following the process described in 1
Adapt implementations to parse and understand the presence of extensions in Zarr data

This seems especially appropriate since we clearly can't find consensus about which codecs belong in a list of "core" codecs. We need to find a way to move forward with development without such a consensus.

Until something like that happens, Zarr Python will have to continue with the process of implementing things that are out of spec in order to reach feature parity with V2. (Perhaps we could issue warnings in these cases?)

The hope is that we can retroactively adopt the extension approach to bring these features in line with some kind of spec.

How does this sound to people?

d-v-b commented 1 month ago

I am going to close this PR. After #293 I believed that there was broad consensus on two points:

the current language of the spec on codecs is incoherent
we can make the spec coherent in a non-breaking way with a few clarifying changes to the spec.

Beyond the ongoing harm caused by a specification document that is half-baked, there is also an immediate need for guidance as we develop new codecs for zarr v3 (a process that is encouraged by language in the spec).

Thus I attempted to implement the conclusions of #293 in this PR, and with a few exceptions this PR received broad support. I took great care to ensure that my changes were as minimal as possible, and I adapted the text to helpful feedback. With one notable exception, these changes were approved by the members of this discussion. However, it has since become clear that there is no clear path forward for this PR, and perhaps for any PR against the spec. I don't think it's worth the effort to continue here.

Except that every commit to this repository is for all intents and purposes a release, so (again a hypothetical) someone could read the spec and start reading in that interim period.

@joshmoore if this is something you believe, then I think you should feel great urgency to fix the incoherent parts of the spec. The spec has been broken for over a year. That is the cause of much more harm than anything in this PR.

rabernat commented 1 month ago

Davis, I'm sorry for the frustration this conversation has caused. I really appreciate your efforts, I'm committed to working with you to find our way out of this situation so we can keep development moving forward.

joshmoore commented 1 month ago

then I think you should feel great urgency to fix the incoherent parts of the spec. The spec has been broken for over a year. That is the cause of much more harm than anything in this PR.

I appreciate and agree with your analysis and I like framing it as a call for urgency. For what it's worth I did invest substantial time in this PR because I support what it's trying to do. I found myself, however, running into root causes which I then tried to resolve, recursively.

First our common understanding of what should be required (MUST vs SHOULD)
and what our definition of an extension is.
Followed by how we will transition extensions to core (incl. namespacing)
and even how we version the spec.
And underlying it all, how we go about making these decisions.

I hope you can appreciate that I was also frustrated at trying to unravel a year-long ball of yarn with the pressure of this concrete change. That's not your fault, but I was also only realizing it in real-time, surprised to find the state of things.

My urgency, then will be directed at those root causes, largely in reverse order, and I'd encourage those who are still motivated to tackle those issues so we can more efficiently tackle the many remaining issues.

zarr-developers / zarr-specs

Define the list of codecs in the v3 spec #312

Breaking changes/versioning