yaml / yaml-spec

YAML Specification
http://yaml.org/spec/
347 stars 53 forks source link

What's the canonical value of `!!bool maybe`? #53

Open Thom1729 opened 4 years ago

Thom1729 commented 4 years ago

Silly question, I know, but what if someone writes the following document and processes it using the core schema?

- !!bool maybe
- !!int some text
- !!float some more text

All of the logic that knows what a boolean should look like is in the resolver, but explicit tags bypass that. These “wild” values must be assigned canonical forms; there doesn't seem to be a defined failure mode for rejecting a scalar as impossible to canonicalize.

I think that there are several reasonable answers here, but I don't think that the spec hints at any of them. In fact, in the definition of the JSON schema, the actual canonicalization algorithm is essentially left as an exercise to the reader — which is no problem for sane values, but leaves the question open for silly values.

perlpunk commented 4 years ago

Indeed this is not really specified, but I think the default behaviour should be to reject those values. The resolvers are not really bypassing the definitions when tags are used, they still have to look at the values anyway.

A processor can offer an option to silently ignore such values, though.

I recently setup a repo with test data for all of the four schemas: https://github.com/perlpunk/yaml-test-schema https://perlpunk.github.io/yaml-test-schema/

It does not have values for !!bool not-a-bool things yet, but I will add error as the expected result.

Hopefully we can make those tests more prominent, so that all YAML processors behave in the same way by default.

perlpunk commented 4 years ago

And actually I have to fix the test data, because unquoted strings should be an error in the 1.2 JSON schema. Currently the table suggests that it should be loaded as a string, which is wrong.

Thom1729 commented 3 years ago

In 1.2.1, the least intrusive way to address this would be by specifying that if a scalar node's (raw) content does not match any of the regular expressions that are used for resolution in the core schema, then the node's canonical content is implementation-defined.

Possible strengthenings:

eemeli commented 3 years ago

The question isn't silly at all, and generalises to "What happens when explicit tag resolution fails?"

!!bool maybe is a pretty good example to use here, as boolean values can only be represented as either true or false. I think the options are:

  1. return true, as "maybe" is a not-empty string which therefore casts to true
  2. return false, as that's the fallback
  3. return "maybe", as we can't treat the input as a boolean
  4. return all booleans with some wrapper, which can represent an error state
  5. throw an exception

For the first three options, there's also a question about whether a warning should be emitted.

At least for the JavaScript yaml, this is clearly under-defined behaviour:

const doc = YAML.parseDocument('!!bool maybe')

doc.toJS()
> 'maybe'

doc.toString()
> '!!bool true\n'

doc.warnings[0]
> YAMLWarning: Unresolved tag: tag:yaml.org,2002:bool at line 1, column 1:
> 
> !!bool maybe
> ^^^^^^
> 
>     at Composer.onError (/Users/eemeli/code/yaml/dist/compose/composer.js:69:36)
>     [... snip ...]
>     at Object.parseDocument (/Users/eemeli/code/yaml/dist/public-api.js:47:16) {
>   code: 'TAG_RESOLVE_FAILED',
>   pos: [ 0, 6 ],
>   linePos: [ { line: 1, col: 1 }, { line: 1, col: 7 } ]
> }

In other words, I'm currently following option 3 when producing a JS value, and option 1 when producing YAML.

For 1.2.1, I think the undefined-behaviour should be made explicit as "SHOULD canonicalize" and probably "SHOULD emit a warning". Doing anything more than that would change the current meaning of the spec.

For 1.3, I think we ought to either throw an error or go with "MUST canonicalize" and "MUST emit a warning", if we can be sure that the parser is working with a 1.3 document. Defaulting to false would be surprising, and we can't ask all implementation to wrap all values in non-native structures.

ingydotnet commented 3 years ago

A few points...

  1. We are assuming !!bool expands to tag:yaml.org,2002:bool. In 1.2 this can be changed with %TAG and in 1.3 also by a schema doc. Obvious, but worth mentioning.
  2. The term "tag resolution" should only be meant to mean the process of assigning a tag based on (1) the non-specific tag of the node, (2) the path leading from the root to the node, and (3) the content (and hence the kind) of the node. Stating things like "explicit tag resolution" makes no sense.
  3. It (in 1.2) is the job of the constructor to map node tags to functions, and then call those functions on each node. We could probably make this clearer in the spec. We should also have more specific terms for the processes, like "tag function binding" (or "tag binding") and "tag function application" (or "tag application" or "constructor function application").
  4. Given that !!bool is governed by yaml.org,2002 it follows that yaml.org,2002 is responsible for defining those functions (and their signatures) which should not allow maybe etc. In 1.3 speak we refer to these as "libraries" and "library interfaces (headers)". In 1.2 its really left to the constructor implementer to DTRT.
  5. The 1.2 spec schema definitions only define "tag resolution" (no tag -> explicit tag). They don't go further, so I don't think that's right to change in 1.2.1.

I think we can add a general section clarifying these points.

ingydotnet commented 3 years ago

I started a branch for spec changes addressing this https://github.com/yaml/yaml-spec/tree/tag-binding-failure

eemeli commented 3 years ago

While looking through @Thom1729's PR on this, I realised that we have an even greater imprecision we need to fix: Nothing in the Core Schema spec actually defines how !!bool TRUE should be parsed.

The problem here is that the regular expressions defined in the core's tag resolution section are only applied for "Scalars with the ? non-specific tag", and if there's an explicit !!bool tag, then the handling e.g. 'TRUE' is undefined.

Thom1729 commented 3 years ago

Just to clarify: the tag resolution rules only apply to scalar nodes with the ? non-specific tag, and only under a particular schema. However, the rules for producing the canonical form for a scalar node with a given tag apply to all nodes with that tag, whether that tag was specified explicitly or determined via tag resolution.

(The more these issues come up, the more I think that the tag definitions should be in a separate part of the spec from the schema definitions. I would absolutely advocate moving them for the 1.3 spec, and it might even make sense to move them for 1.2.1. This would help to make it clear that the canonical form rules for a tag are not schema-specific. I might write up a draft PR for this later.)

That said, it's true that the spec doesn't exactly specify how the formatted content of a !!bool node is turned into true or false. The spec does specify that the result of canonicalization is one of those two values, and the linked PR specifies that the only valid formatted content is any of the six strings listed. Given this, I think it's obvious (to the point of being unambiguous) that true, True, and TRUE should be canonicalized to true, and similarly for false. However, it never hurts to be explicit, and we could write out the canonicalization procedure if we want. Originally, I meant to do this, replacing the text of the existing “Canonical Form” section, but I decided that adding a new “Valid Formatted Content” section would be less intrusive. I could do it the other way if we think it's worth it.

(As an aside, I keep using “canonicalization” to refer to the process of producing the canonical form of a scalar node's formatted content. In 1.3, I would like to use this term in the spec. I think it would let us express this stuff more clearly, e.g. “The composition process consists of three steps: identification of anchors, resolution of non-specific tags, and canonicalization of scalar formatted content.” I am open to doing this in 1.2.1 if we think it appropriate.)