Add Valid Formatted Content for each scalar tag

Thom1729 commented 3 years ago

For #53.

Add a “Valid formatted content” section for each scalar tag. Invalid content is already defined in 3.3.3, so this should suffice.

I had to modify the Perl script so that it wouldn't try to turn [eE] into a link. The modification is a hack, but doing it properly (not looking for links inside <code> tags) looks like it would be complicated.

Thom1729 commented 3 years ago

> The term "formatted content" only appears 4 times in the spec. It could easily be changed to a better term.

I think that changing that is outside the intended scope of this PR.

> I have some issues with this change. One thing is that "Valid Formatted Content" is not explained anywhere, or I've missed it.

Yeah, that's currently missing. If we merge #140, then there will be an obvious place to define that; otherwise, I'll have to find another place to put the definition. I've marked this PR as a draft in the meantime.

> The spec mentions this term a couple of times, but I don't see from this pull request, under which circumstances a plain ~ would be a valid null value under the JSON schema.

Per the table in section 10.2.2, only the plain scalar null will be resolved to the tag tag:yaml.org,2002:null. The plain scalar ~ would be left unresolved and “the YAML processor should consider [it] to be an error”.
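A minimal sketch of that resolution step, assuming the table in section 10.2.2 is transcribed as Python regular expressions. The function name `resolve_plain_scalar` and the list layout are illustrative, not anything the spec defines.

```python
import re

# Plain-scalar resolution under the JSON schema (spec section 10.2.2).
# The patterns are the spec's regular expressions with whitespace removed.
JSON_SCHEMA_RESOLUTION = [
    (re.compile(r"null"), "tag:yaml.org,2002:null"),
    (re.compile(r"true|false"), "tag:yaml.org,2002:bool"),
    (re.compile(r"-?(0|[1-9][0-9]*)"), "tag:yaml.org,2002:int"),
    (
        re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]*)?([eE][-+]?[0-9]+)?"),
        "tag:yaml.org,2002:float",
    ),
]


def resolve_plain_scalar(content):
    """Return the resolved tag, or None if the scalar stays unresolved."""
    for pattern, tag in JSON_SCHEMA_RESOLUTION:
        # fullmatch: the whole scalar must match, not just a prefix.
        if pattern.fullmatch(content):
            return tag
    return None  # the spec says this case should be treated as an error


print(resolve_plain_scalar("null"))
print(resolve_plain_scalar("~"))
```

Under this table `~` matches no row, so it is left unresolved rather than becoming a null.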

Will reply to the rest in longer form.

Thom1729 commented 3 years ago

(Sorry in advance for the wall of text. I wanted to be comprehensive. Hopefully the formatting will help.)

I think it helps to first break up the issue into several questions.

1. Is !!float .nan permitted under the failsafe schema?
2. Is !!float .nan permitted under the JSON schema?
3. Is !!bool TRUE permitted under the JSON schema?

Failure Points

Here I used the informal term “permitted”. What I mean specifically is whether any of the above examples triggers one of the failure points defined in the spec. There are four failure points relevant to tags:

1. A non-specific tag may be unresolved.
2. A tag may be unrecognized.
3. A node's content may be invalid.
4. A native type may be unavailable.

I'll go over each of these in brief before getting to the example questions.

For reference, I've created a diagram of the composition process in detail. GitHub won't let me attach an SVG here, so I posted it in #dev-spec.

1. A non-specific tag may be unresolved.

Each schema defines its own rules for resolving non-specific tags. However, this question concerns nodes with explicit tags, which are not subject to tag resolution. I think we can dismiss this as inapplicable — even if we decide that an explicit tag is not “permitted” in some context, I don't think we would say that it would be unresolved.

2. A tag may be unrecognized.

This is the most vaguely defined failure point. I'll quote section 3.3.3 (Recognized and Valid Tags) in its entirety:

To be valid, a node must have a tag which is recognized by the YAML processor and its content must satisfy the constraints imposed by this tag. If a document contains a scalar node with an unrecognized tag or invalid content, only a partial representation may be composed. In contrast, a YAML processor can always compose a complete representation for an unrecognized or an invalid collection, since collection equality does not depend upon knowledge of the collection’s data type. However, such a complete representation cannot be used to construct a native data structure.

What does it mean to be “unrecognized”? In the above paragraph, recognition is tied directly to the implementation's ability to validate the node's content and — in the case of scalar nodes — to produce the canonical form of the node's content. This implies that an implementation recognizes a tag if and only if it knows that tag's definition.

Could an implementation “play dumb” and refuse to recognize a tag that it does actually understand? I think so. There is no explicit requirement in the spec that an implementation recognize any tags in particular. (Version 1.3 would be an opportunity to set reasonable rules here.)

Can an implementation choose to reject all tags that are not part of the schema it's using? See the first sentence of 10.4 (Other Schemas):

None of the above recommended schemas preclude the use of arbitrary explicit tags.

At the very least, this means that an implementation may recognize tags outside the current schema.

3. A node's content may be invalid.

This is also defined vaguely, and in much the same manner as tag recognition. Section 3.3.3 does establish that a tag imposes constraints that the content must satisfy. The spec does not provide explicit examples of such constraints. However, it does require that the content of each node be of a certain kind (scalar, sequence, or mapping). This implies that a node with the wrong kind of content for its tag is invalid.

However, while the spec is nearly silent on what makes a node's content invalid, the above definition does establish that validity is a function of a node's tag and content. There is nothing to indicate that validity is or may be determined by any other factor. In particular, it isn't determined by the schema. A node's content is not valid for its tag under one schema and invalid under another schema; schemas do not participate in validation at all.

Another reason why validation can't depend on the schema is that the schema may not be aware of the tag at all. Implementations may recognize tags outside the current schema. If validation depended on the schema, then the implementation could not validate any such tag — but that's nearly the definition of an unrecognized tag.

4. A native type may be unavailable.

This failure point is included for completeness's sake, and also to slightly clarify tag recognition. See 3.3.4 (Available Tags):

In a given processing environment, there need not be an available native type corresponding to a given tag. If a node’s tag is unavailable, a YAML processor will not be able to construct a native data structure for it. In this case, a complete representation may still be composed and an application may wish to use this representation directly.

That is, the unavailability of a native type for a given tag does not mean that the implementation should not recognize that tag. An implementation that does not natively support floating-point arithmetic nevertheless can and should recognize the tag:yaml.org,2002:float tag and compose a complete representation.
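As a rough illustration of the difference between composing and constructing, here is a sketch of an environment with no native float type. `ScalarNode`, `CONSTRUCTORS`, and `construct` are hypothetical names, and the node is assumed to have already passed recognition and validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarNode:
    tag: str
    canonical: str  # canonical form of the scalar's content


# Hypothetical constructor table for an environment without float support.
CONSTRUCTORS = {"tag:yaml.org,2002:str": str}


def construct(node):
    """Build a native value, or fail if no native type is available."""
    constructor = CONSTRUCTORS.get(node.tag)
    if constructor is None:
        raise LookupError(f"no native type available for {node.tag}")
    return constructor(node.canonical)


# Composition still yields a complete representation for the float node;
# only the later construction step fails.
node = ScalarNode("tag:yaml.org,2002:float", ".nan")
```

An application could keep working with `node` directly, which is what 3.3.4 allows for.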

Examples

Now that we've looked at the possible failure points, let's look at our questions again. Since all of the examples use explicit tags, and we're concerned here about composition, the only relevant failure points are tag recognition and content validation.

1. Is `!!float .nan` permitted under the failsafe schema?

Will the tag be recognized? Section 10.4 establishes that, at the very least, it may be recognized.

Is the content valid? Yes. The spec is clear that .nan is valid content for the !!float tag.

2. Is `!!float .nan` permitted under the JSON schema?

Will the tag be recognized? Yes, unless something very silly is happening. The !!float tag is part of the schema, and it would make no sense for an implementation to support a schema but fail to recognize tags that are part of that schema.

Is the content valid? Still yes. Validity is a function of the tag and the content, not of the schema.

3. Is `!!bool TRUE` permitted under the JSON schema?

Will the tag be recognized? Yes, as above.

Is the content valid? Yes, as above.
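The two relevant checks can be sketched as a function of the tag and the content only. The patterns below are the ones the core schema uses to resolve plain scalars to each tag. Treating exactly those values as valid content is an assumption of this sketch, and `failure_point` is a hypothetical name.

```python
import re

# Validators keyed by tag alone; no schema is consulted anywhere.
VALIDATORS = {
    "tag:yaml.org,2002:float": re.compile(
        r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?"
        r"|[-+]?\.(inf|Inf|INF)"
        r"|\.(nan|NaN|NAN)"
    ),
    "tag:yaml.org,2002:bool": re.compile(r"true|True|TRUE|false|False|FALSE"),
}


def failure_point(tag, content, recognized=VALIDATORS):
    """Return the failure point hit by an explicitly tagged scalar, or None."""
    if tag not in recognized:
        return "unrecognized tag"
    if not recognized[tag].fullmatch(content):
        return "invalid content"
    return None


# The example questions: the schema in use never enters the call.
print(failure_point("tag:yaml.org,2002:float", ".nan"))
print(failure_point("tag:yaml.org,2002:bool", "TRUE"))
```

Because the signature has no schema parameter, the answer for `!!float .nan` cannot differ between the failsafe and JSON schemas unless the implementation chooses to recognize a different set of tags.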

Other points that deserve to be addressed

If tags are independent from schemas, then why are they defined in the schemas chapter?

This seems to be an arbitrary organizational choice. Although the !!float tag is located in the spec within the definition of the JSON schema, that definition does explicitly list .inf, -.inf, and .nan as canonical forms. If the location of the tag definition were taken to be significant, then it certainly shouldn't be interpreted to disallow those values.

If tag validity did depend on the schema, then there would need to be separate entries for each tag in each schema that it's a part of. That's not what we see in the spec.

On the other hand, suppose that !!float .nan were invalid under the JSON schema. Then the JSON schema would not be a superset of the failsafe schema, because a document valid under the failsafe schema might be invalid under the JSON schema.

Doesn't this mean that the JSON schema is not a semantic superset of JSON?

Yes. It's clear that this was an oversight, but there's no way to interpret around it.

It's worth mentioning, however, that there are other obstacles. For one thing, section 10.4 does establish that “None of the above recommended schemas preclude the use of arbitrary explicit tags.” A document composed using the JSON schema may contain any tag recognized by the implementation. And, of course, there is syntax. A YAML implementation that wanted to emit valid JSON would already have to do several other things in addition to restricting the tags that it permits, and checking for non-finite float values is merely one such item.

What about Oren's comments?

If Oren were to chime in here, I'd appreciate it, but the linked comment is very short and devoid of any detail. It doesn't cite the spec and it doesn't grapple with any of the issues this exposition discusses. If I saw a way to reconcile it with the 1.2 spec as written, I would give it a solid try. But as it stands, I think that Oren was simply mistaken.

Alternatives

Here I'll discuss some possible spec changes that would resolve the issue of the JSON schema. I don't necessarily advocate any of them for 1.3; they're here to contrast with the spec as it currently stands.

Tie formatted value validation to schemas

We could remove validity from the definition of the tags themselves and tie it directly to schemas. Then, any schema could define its own rules, and the JSON schema could ban !!float .nan and !!bool TRUE.

Downsides:

This would require tying canonicalization to schemas as well, because any valid value must be canonicalizable.
Because of the above, extensions could be incompatible. The same document interpreted under different schemas could result in different complete representations.
There would be no sensible way to recognize tags outside the current schema.

Allow schemas to restrict canonical forms

That is, leave the existing tag-based validation in place but also add an extra layer that's schema-specific and which only restricts which canonical forms are allowed. The JSON schema could ban !!float .nan, but not !!bool TRUE. Because values could only be narrowed, incompatible extensions would be impossible.

Downsides:

It's a whole extra moving part of questionable use.
The JSON schema would no longer be a superset of the failsafe schema (unless the failsafe schema were also narrowed).
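A sketch of how such a second layer might look, assuming the content has already passed tag-level validation. Everything here is hypothetical: the spec defines no such mechanism, the banned set is invented for illustration, and finite floats are passed through without real canonicalization.

```python
def canonical_form(tag, content):
    """Tag-level canonicalization, covering only the cases discussed here."""
    if tag == "!!bool":
        return content.lower()  # TRUE, True -> true
    if tag == "!!float":
        lowered = content.lower()
        if lowered in (".nan", ".inf", "+.inf", "-.inf"):
            return lowered.lstrip("+")  # +.INF -> .inf
    return content  # simplification: finite floats are left unchanged


# Canonical forms that a narrowed JSON schema might ban (an assumption).
JSON_BANNED_CANONICAL = {"!!float": {".nan", ".inf", "-.inf"}}


def allowed_under_json(tag, content):
    """Schema-level check that only ever looks at the canonical form."""
    banned = JSON_BANNED_CANONICAL.get(tag, set())
    return canonical_form(tag, content) not in banned


print(allowed_under_json("!!float", ".nan"))
print(allowed_under_json("!!bool", "TRUE"))
```

Since the schema layer only sees canonical forms, it could not reject `TRUE` without also rejecting `true`, which is why this alternative can ban `!!float .nan` but not `!!bool TRUE`.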

Split finite/nonfinite float tags

Split the existing !!float tag into e.g. !!finite and !!nonfinite tags, then only allow !!finite in the JSON schema.

Downsides:

Kind of clumsy.
Definitely poses backward-compatibility issues.

Allow tag subtyping

E.g. keep !!float, but also add e.g. !!float#finite and !!float#nonfinite. Specify !!float#finite in the JSON schema.

Downsides:

A whole new conceptual thing.

Thom1729 commented 3 years ago

Anyway, as pertains to this individual PR, I think that we can safely say that the values that already appear in the spec are valid, even though we can't necessarily say that other values are invalid. (E.g. we can say that true is valid content for the !!bool tag even if the spec doesn't actually say that maybe is invalid content.) What we'll want to do to avoid a spec change is to clarify that handling of other values is implementation-defined.

Thom1729 commented 3 years ago

This PR was meant to address the following problems:

Problem 1: The spec defines a notion of “invalid content”, but it never quite says what content is invalid.

3.3.3. Recognized and Valid Tags

To be valid, a node must have a tag which is recognized by the YAML processor and its content must satisfy the constraints imposed by this tag. If a document contains a scalar node with an unrecognized tag or invalid content, only a partial representation may be composed. In contrast, a YAML processor can always compose a complete representation for an unrecognized or an invalid collection, since collection equality does not depend upon knowledge of the collection’s data type. However, such a complete representation cannot be used to construct a native data structure.

Problem 2: It's not clear from the spec what an implementation should do with e.g. !!bool maybe.

I put forth for discussion the following propositions.

From the spec, we can conclude…

1. …that the canonical form produced by a tag is valid content for that tag. (e.g. !!bool true)
2. …that the values that are resolved to a tag by the recommended schemas are valid content for that tag. (e.g. !!bool TRUE)
3. …that values of the wrong kind for a tag are invalid content for that tag. (e.g. !!bool [])
4. …that other values not addressed above are invalid content. (e.g. !!bool maybe)

(My prior comment addresses the related question of whether content may be valid for a tag under one schema but invalid under another. For the purposes of this comment, I assume that the answer is no: that validity of content for a tag does not depend on the schema.)

I think we have to accept (1) and (2). I think that (3) is also true. I don't think there's consensus that (4) is true.
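To make the four cases concrete, here is a small sketch for the !!bool tag only. The canonical forms and resolved values are the ones the spec lists for that tag. The function name and the result strings are illustrative, and case (4) is reported as unsettled rather than invalid to reflect the lack of consensus.

```python
# Data for tag:yaml.org,2002:bool as listed in the spec.
BOOL_KIND = "scalar"
BOOL_CANONICAL = {"true", "false"}
BOOL_RESOLVED = {"true", "True", "TRUE", "false", "False", "FALSE"}


def classify_bool(kind, content):
    """Map a !!bool node onto propositions (1) to (4)."""
    if kind != BOOL_KIND:
        return "invalid (3): wrong kind"
    if content in BOOL_CANONICAL:
        return "valid (1): canonical form"
    if content in BOOL_RESOLVED:
        return "valid (2): resolved by a recommended schema"
    return "unsettled (4): not addressed by the spec"


for kind, content in [
    ("scalar", "true"),
    ("scalar", "TRUE"),
    ("sequence", None),
    ("scalar", "maybe"),
]:
    print(kind, content, "->", classify_bool(kind, content))
```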

The questions we need to answer for this PR (or a similar one) are:

What should we say about values that we agree are valid?
What should we say about values that we agree are invalid?
What should we say about values whose validity is not clearly established by the spec?
