VladimirAlexiev opened this issue 2 years ago
A relevant discussion on one technical question from the linked issue:
no easy fix ... for aliases/anchors.
Alias names cannot be preserved. But c14n can generate predictable aliases to achieve identical serialization. https://json-ld.github.io/rdf-dataset-canonicalization/spec/ (URGNA) does that for blank nodes, which is a lot more difficult, since graph isomorphism is a hard problem with no known polynomial-time algorithm.
Given two semantically equivalent YAMLs...
What does "semantically equivalent" mean in this context? For example, are the following documents semantically equivalent?
1.
```yaml
- some_string
- some_string
```
2.
```yaml
- &anch some_string
- *anch
```
3.
```yaml
- some_string
- "some_string"
```
Technically, 1 is different from both 2 and 3, and their representation graphs differ. Still, a lot of apps/frameworks will treat them the same when going all the way to the native object level. For example, 1 and 2 are hard to distinguish if your language has immutable strings, and 1 and 3 will be the same if an unquoted but not special-cased scalar is interpreted as a string (which matches the YAML core schema).
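For instance, loading all three with PyYAML (which implements roughly the YAML 1.1 schema, though that makes no difference here) gives equal native values:

```python
import yaml  # PyYAML

one = yaml.safe_load('- some_string\n- some_string\n')
two = yaml.safe_load('- &anch some_string\n- *anch\n')
three = yaml.safe_load('- some_string\n- "some_string"\n')
print(one == two == three)  # True: all three load to ['some_string', 'some_string']
```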
Is "YAML representation graphs are identical" a useful definition of "semantically equivalent YAML documents" or do you want to go to some higher level?
...a parse-serialize operation results in the same c14n YAML
A parse-serialize round-trip of a c14n YAML results in the same string
What step(s) does "parse-serialize" include in terms of the YAML processing pipeline above (spec) - are we speaking about going to representation tree and back?
Given some internal representation, any c14n-conformant YAML tool will produce the same serialization
This requirement seems to only make sense within one programming language as the internal representation will necessarily be different for different languages. Is that expected?
Also, would it be okay for c14n to be more restrictive than normal YAML ("no, you cannot canonicalize an arbitrary document without specifying its schema and some information about non-standard tags it uses, if any")?
What does "semantically equivalent" mean in this context?
I presume that it must refer to the spec's notion of node equality. Under that definition, all three of the example documents (interpreted under the core schema) are equal to each other. A “canonical presentation” of those documents might look something like this:
```yaml
%YAML 1.2
---
!<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:str> "some_string"
- !<tag:yaml.org,2002:str> "some_string"
...
```
Also, would it be okay for c14n to be more restrictive than normal YAML ("no, you cannot canonicalize an arbitrary document without specifying its schema and some information about non-standard tags it uses, if any")?
I think that this would be unnecessary. A “canonical presentation” should present an explicit tag property for each node. Then, there are no non-specific tags to be resolved according to any schema. And it should present the canonical form for each scalar. Thus the “canonical presentation” should have identical semantics under any schema. (This is off the top of my head, so please let me know if I've missed something.)
Note that while the example canonical presentation is schema-independent, the three example documents are not. If you had a converter that took in an arbitrary YAML document and converted it to a canonical presentation, then you would have to supply a schema as input — just as you would when loading an arbitrary YAML document.
I see two potential issues with aliases. Two documents that differ only in whether certain nodes are aliased have equal representations, so (I presume) we should want them to have the same canonical presentation. This means that we would have to make consistent decisions about when to use aliases that are not strictly necessary. The simplest approach would be to never use unnecessary aliases. This seems like the most sensible option to me, except that it would make a canonical presentation converter vulnerable to a “billion laughs” attack. A converter could mitigate this by stopping at a certain number of serialization nodes, as sketched below. The next-simplest approach might be to use aliases whenever possible. This might be tricky or expensive to implement. In addition, it might produce surprising results for simple documents.
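A sketch of such a guard over PyYAML's composed node graph (`yaml.compose` and the node classes are real PyYAML; the `expanded_size` helper, the limit, and the traversal are illustrative only):

```python
import yaml  # PyYAML

def expanded_size(node, limit=10_000):
    """Number of nodes the document would have with every alias expanded
    in place. Illustrates the "stop at a certain number of serialization
    nodes" guard; assumes an acyclic document."""
    stack, count = [node], 0
    while stack:
        n = stack.pop()
        count += 1
        if count > limit:
            raise ValueError("too many nodes to present without aliases")
        if isinstance(n, yaml.SequenceNode):
            stack.extend(n.value)  # child nodes (aliased nodes are shared objects)
        elif isinstance(n, yaml.MappingNode):
            for key, value in n.value:
                stack.extend((key, value))
    return count

# Each line doubles the expanded size of the previous anchor:
doc = """\
a: &a [x, x]
b: &b [*a, *a]
c: &c [*b, *b]
d: &d [*c, *c]
"""
print(expanded_size(yaml.compose(doc)))  # 61, roughly doubling with every line added
```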
The other issue is node equality for mapping keys. In non-recursive cases, node equality is very simple. In most recursive cases, it's still pretty reasonable. But you can embed arbitrary directed graphs in YAML, so node equality requires solving the graph isomorphism problem, for which no polynomial-time algorithm is known. Currently, the 1.2.2 spec punts on equality for any recursive nodes (acknowledging that previous spec versions did not fully specify it).
Expecting YAML implementations to solve graph isomorphism would be a big ask. There are practical algorithms that would probably run well for most documents, but they are not simple, and there's no known way around super-polynomial runtimes in the worst case, which would be a security risk. Alternatively, I think that everything would be nice and tractable if we added a restriction that a mapping key may not contain its parent mapping as a descendant. This is obviously not ideal on its face, but it would probably negatively affect very few users, and it would allow us to properly specify node equality in a way that libraries could reasonably implement.
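To make that concrete, here is a hedged sketch of what implementable node comparison could look like over PyYAML's composed nodes (`nodes_equal` is hypothetical; it assumes scalar values are already in canonical form, and it terminates only on acyclic nodes, which is what the proposed restriction would buy us for keys):

```python
import yaml  # PyYAML

def nodes_equal(a, b):
    """Node comparison in the spirit of spec section 3.2.1.3: same tag,
    same content. A sketch only: assumes acyclic nodes and scalars
    already in canonical form."""
    if type(a) is not type(b) or a.tag != b.tag:
        return False
    if isinstance(a, yaml.ScalarNode):
        return a.value == b.value
    if isinstance(a, yaml.SequenceNode):
        return len(a.value) == len(b.value) and all(
            nodes_equal(x, y) for x, y in zip(a.value, b.value))
    # MappingNode: order-insensitive matching of entries (quadratic in size).
    remaining = list(b.value)
    if len(a.value) != len(remaining):
        return False
    for ka, va in a.value:
        for i, (kb, vb) in enumerate(remaining):
            if nodes_equal(ka, kb) and nodes_equal(va, vb):
                del remaining[i]
                break
        else:
            return False
    return True

root = yaml.compose("[{a: 1, b: 2}, {b: 2, a: 1}]")
print(nodes_equal(root.value[0], root.value[1]))  # True: key order is irrelevant
```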
All this aside, I'm not familiar with YAML-LD, but it might be more practical at this point to define a canonical presentation for YAML-LD documents. If YAML-LD is intended to express the same application semantics as a JSON-based format, then I suspect that YAML-LD canonicalization may be substantially simpler and easier than arbitrary YAML canonicalization.
I presume that it must refer to the spec's notion of node equality. Under that definition, all three of the example documents (interpreted under the core schema) are equal to each other.
So `[[], []]` and `[&arr [], *arr]` are also equal and canonicalized the same way? That sounds like a footgun in case anyone mutated the data structures deserialized from YAML.
Thinking more about this, I find the current YAML stance on node identity inconsistent. It both declares it unimportant (in terms of equality and in terms of scalar identity - I forgot there was a special case for that in the spec) and treats it as important by providing a way to represent (almost) arbitrary node graphs by identity. IMHO, the two "consistent" stances would be either "node identity is not important, your data structure has to be a tree, aliases are only shortcuts" or "node identity is important, the mapping key equality check should be schema (application) specific". However, that's a question for another day.
(interpreted under the core schema)
A “canonical presentation” should present an explicit tag property for each node. Then, there are no non-specific tags to be resolved according to any schema.
This means it's not possible to have a general tool or library to canonicalize any YAML file - you don't know how the tags are resolved in the correct context, so you need to either provide that information out of band or say "that's not supported, just emit the canonical YAML directly, that's the only way". Yes, that's a possible tradeoff, but it should be explicitly acknowledged.
Alternatively, I think that everything would be nice and tractable if we added a restriction that a mapping key may not contain its parent mapping as a descendent.
Hmm... looks like this should be enough to eliminate exponential blowup, yes. Quadratic will still be there, I think, but that's less of a problem.
So `[[], []]` and `[&arr [], *arr]` are also equal and canonicalized the same way? That sounds like a footgun in case anyone mutated the data structures deserialized from YAML.
They are equal according to the spec. There are much thornier cases, such as `&a [ *a ]` and `&b [ [ *b ] ]`, where the spec does not specify.
Thinking more about this, I find the current YAML stance on node identity inconsistent. It both declares it unimportant (in terms of equality and in terms of scalar identity - I forgot there was a special case for that in the spec) and treats it as important by providing a way to represent (almost) arbitrary node graphs by identity. IMHO, the two "consistent" stances would be either "node identity is not important, your data structure has to be a tree, aliases are only shortcuts" or "node identity is important, the mapping key equality check should be schema (application) specific". However, that's a question for another day.
I agree that the spec does not adequately nail this stuff down. Identity is mentioned, but it's not referred to anywhere and it's not clear whether applications are intended to take any notice of it. Strictly speaking, whether an implementation constructs equal-but-not-identical nodes to the same native value or different native values is a construction detail outside the scope of the spec (in the same way that e.g. integer precision of native types is), but to stop there would be a cop-out; I think that this is an area in which we can and should provide sensible advice, if not commandments. At the very least, we might provide “rigidly defined areas of doubt and uncertainty”.
An interpretation where “aliases are only shortcuts” is not consistent with the representation model. Although an implementation may keep track of aliases behind the scenes, the representation model does not contain alias nodes. If two different collections contain the same child node, then which one of them is the “original” reference and which is the alias is a serialization detail, and it wouldn't survive e.g. reordering keys.
“Identity is important” has pitfalls of its own, but it's probably closer to actual implementation behavior. I do think that there is also a self-consistent middle ground where node identity is not important, but a representation is a true graph.
This means it's not possible to have a general tool or library to canonicalize any YAML file - you don't know how the tags are resolved in the correct context, so you need to either provide that information out of band or say "that's not supported, just emit the canonical YAML directly, that's the only way". Yes, that's a possible tradeoff, but it should be explicitly acknowledged.
In the same sense, it is not possible to have a general tool or library to load any YAML file. Most YAML documents will be parsed to a serialization tree that has non-specific tags, which require a schema to resolve. Admittedly, most YAML documents will probably use the core schema, but it was a deliberate design decision to allow other schemas. The ship has sailed on that tradeoff.
The key point, I think, is that the fundamental “information content” of a YAML document is its representation graph. Once you have a representation graph, you don't need a schema to construct native types or to compare nodes. I suggest that, rather than thinking of a document as having a canonical presentation, you might think of a representation graph as having a canonical presentation. The loading process for an arbitrary document is schema-dependent, but the dumping process for an arbitrary representation can be schema-agnostic simply by being maximally explicit.
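A minimal sketch of that schema-agnostic dumping, using PyYAML only as a convenient source of a composed representation graph (the `explicit` helper is hypothetical: flow style only, acyclic documents, naive scalar quoting):

```python
import yaml  # PyYAML

def explicit(node):
    """Present a composed node with a verbatim tag on every node."""
    tag = f"!<{node.tag}>"
    if isinstance(node, yaml.ScalarNode):
        return f'{tag} "{node.value}"'  # real c14n would need proper escaping rules
    if isinstance(node, yaml.SequenceNode):
        return f"{tag} [{', '.join(explicit(v) for v in node.value)}]"
    entries = ", ".join(f"? {explicit(k)} : {explicit(v)}" for k, v in node.value)
    return f"{tag} {{{entries}}}"

# Loading is schema-dependent (PyYAML's resolver stands in for a schema here),
# but dumping maximally explicitly needs no schema at all:
print(explicit(yaml.compose('- some_string\n- "some_string"')))
# !<tag:yaml.org,2002:seq> [!<tag:yaml.org,2002:str> "some_string", !<tag:yaml.org,2002:str> "some_string"]
```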
Hmm... looks like this should be enough to eliminate exponential blowup, yes. Quadratic will still be there, I think, but that's less of a problem.
I think that in most cases, it should be linear or basically linear. There are cases where things could still get weird, though:
```yaml
- &a [ [ *a ] ]
- &b [ [ [ *b ] ] ]
- &c [ [ [ [ [ *c ] ] ] ] ]
- { *a, *b, *c }
```
Under one reasonable definition, all three nested sequences should be equal, but proving this could require a number of comparisons equal to the LCM of the depths of the lists. Definitely a question requiring serious thought.
Given two semantically equivalent YAMLs...
What does "semantically equivalent" mean in this context? For example, are the following documents semantically equivalent?
1.
```yaml
- some_string
- some_string
```
2.
```yaml
- &anch some_string
- *anch
```
3.
```yaml
- some_string
- "some_string"
```
I say they are all different.
A node which is an alias to another node, regardless of whether it's a scalar or a collection, is semantically different from two individual nodes. If you load the data and change the content of one node, the other aliased node will also change. I cannot see how the first and second example are semantically equal.
So I disagree with @Thom1729 here.
Example 3: assuming we see the "canonical" form here, then there must be a reason why `some_string` is plain in item one and quoted in item two. We must assume that they are semantically not equal, and that a plain `some_string` is something special in that YAML file, without knowing the used schema. That's only known by the app producing this YAML file. If we assume we see a canonical form, quoted doesn't equal plain; only the kind of quoting is irrelevant.
So `[[], []]` and `[&arr [], *arr]` are also equal and canonicalized the same way? That sounds like a footgun in case anyone mutated the data structures deserialized from YAML.
They are equal according to the spec.
I don't think so. Like I explained above, the data structures loaded from those are different.
I think we agree; I'm hair-splitting in my typical way.
Just to be maximally clear/explicit: in example 3, the quoted scalar will get the `!` non-specific tag while the plain scalars will get the `?` non-specific tag.
The question of whether the three examples are “semantically equivalent” is basically a question about what “semantic equivalence” means. It is likely that some implementations may construct identical nodes into a single native data structure, and non-identical nodes into different native data structures. This is not an unreasonable thing to do. If by saying that X and Y are “semantically equivalent”, we mean that implementations should treat them the same way, then we're saying that X and Y may be equal as representations, but not semantically equivalent.
Originally, I thought that representation equality should be a sufficient definition of semantic equivalence for the purposes of this discussion. But having considered the practical concerns of reference identity in construction, I agree that a strong notion of semantic equivalence has to respect identity in some way — that it must be a stronger condition than mere representation equality.
That said, I still think that a “canonical presentation” should be understood as pertaining to a representation graph. Perhaps one way of expressing this is that two representation graphs should have the same “canonical presentation” if and only if:

- they are equal (in the spec's node comparison sense), and
- the same collection nodes are shared (aliased) in the same way in both.
So the “canonical presentation” should not add or remove anchors/aliases (for collection nodes), but should leave the alias “structure” unchanged. However, it may relabel any set of anchors/aliases, and it may rearrange mapping keys in a way that changes which serialization node is the anchor and which serialization nodes are aliases. So the following two documents (under the core schema) are “semantically equivalent” and should produce the same “canonical presentation”:
```yaml
foo: &a [1, 2, 3]
bar: *a
```
and
```yaml
bar: &b [1, 2, 3]
foo: *b
```
But the following document is not equivalent to the above and should produce a different canonical presentation, despite having an equal representation graph:
```yaml
foo: [1, 2, 3]
bar: [1, 2, 3]
```
I don't have a strong opinion as to whether the spec should define a notion of canonical presentation, but the underlying principle would seem to apply to the serialization process in the general case: if identity of collection nodes is semantically significant, then we should require that the serialization process respect this and not arbitrarily add/remove anchors and aliases (to collection nodes). To the extent that this may already be implied by the spec, I would support making it explicit.
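For what it's worth, PyYAML's composer already keeps this distinction in its node graph, so an identity-respecting serializer has the information it needs. A quick check (assuming PyYAML):

```python
import yaml  # PyYAML

shared = yaml.compose("[&a [], *a]")
plain = yaml.compose("[[], []]")
print(shared.value[0] is shared.value[1])  # True: one node, referenced twice
print(plain.value[0] is plain.value[1])    # False: two equal but distinct nodes
```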
@Thom1729 then we have different definitions of the representation graph. If your two examples have the same representation graph, then when serializing again, no anchor would be emitted anymore for either example. The fact that the two nodes were aliased in your example 1 would get lost.
If two YAML streams are semantically equal, they can be serialized as the very same string.
`[ &a [], *a ]` and `[ [], [] ]` cannot be serialized as the same YAML string.
An interpretation where “aliases are only shortcuts” is not consistent with the representation model. Although an implementation may keep track of aliases behind the scenes, the representation model does not contain alias nodes. If two different collections contain the same child node, then which one of them is the “original” reference and which is the alias is a serialization detail, and it wouldn't survive e.g. reordering keys.
Sorry, I was a bit unclear. The nodes are "only shortcuts" in the sense that logically they're exactly equivalent to repeating the anchored part of the tree in the place of the alias. Yes, that would be a noticeable limitation compared to the YAML of today.
The key point, I think, is that the fundamental “information content” of a YAML document is its representation graph. Once you have a representation graph, you don't need a schema to construct native types or to compare nodes.
I thought it was the other way round - you can go from the representation graph to the character stream and back without a schema, but need a schema to map the graph to the proper native structures. Is the representation graph supposed to contain already resolved tags, not yet resolved tags or can that vary based on how you get the graph?
I think that in most cases, it should be linear or basically linear.
I've been thinking about worst cases... and I think that if you have N sequence nodes containing each other in various weird ways and have a mapping containing every one of them as a key (and thus have to compare whether the structure of this "knot" looks the same from any two different starting nodes), that's gonna be more like N^2.
@perlpunk Sorry, I think I'm describing what I mean confusingly. The representation graphs are “equal” according to the definition in 3.2.1.3. Node Comparison (which is used for mapping key uniqueness), but they are not the same, and the difference is significant for exactly the reasons you mention. It's unfortunate (and confusing) that two different, non-interchangeable representation graphs may be “equal”.
The existing definition of “equal” is needed for mapping key uniqueness (and possibly for other purposes), but we might want to rename it to something like “loose equivalence” to make it clear that it does not mean that two nodes are totally interchangeable.
I thought it was the other way round - you can go from the representation graph to the character stream and back without a schema, but need a schema to map the graph to the proper native structures. Is the representation graph supposed to contain already resolved tags, not yet resolved tags or can that vary based on how you get the graph?
Generally, dumping is a process where the implementation makes arbitrary choices (node style, indentation, anchor names, etc) and loading is a process where the implementation throws those choices away. Composition is a sort of exception where you need to know what schema the document author or serializer had in mind.
The parsing process takes a character stream and turns it into a serialization tree. In the serialization tree, some nodes may have the “non-specific tags” `?` and `!`, and there are aliases and anchors. The composition process takes a serialization tree and turns it into a representation graph, where every node has a specific tag and there are no anchors or aliases. In order to do this, the composer needs a schema that tells it how to resolve non-specific tags to specific tags.
The serialization stage is the reverse of this: a serializer may use a schema to turn some specific tags into non-specific tags. But it doesn't have to: it can leave every tag fully specified. If it does, then the resulting serialization could be composed without a schema, because there are no questions for a schema to answer.
So you can take any arbitrary representation graph and serialize it without a schema (that is, leaving all tags specific), present it as a character stream, and then parse it and compose it without a schema (because there are no non-specific tags to resolve). But you can't take an arbitrary character stream and parse and compose it without a schema, because its serialization tree may contain non-specific tags that need to be resolved using a schema.
You never need a schema to construct native types from a representation graph. The constructor sees only the representation graph, with its fully resolved tags; it doesn't know or care what schema was used to compose that representation graph.
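A quick PyYAML illustration of composition doing the schema's work (PyYAML's default resolver stands in for the schema here; it approximates YAML 1.1 rather than the core schema, but strings behave the same):

```python
import yaml  # PyYAML

# At the parse stage the plain scalar carries the non-specific tag `?` and the
# quoted scalar carries `!`; composition resolves both to the same specific tag.
root = yaml.compose('- some_string\n- "some_string"\n')
print([child.tag for child in root.value])
# ['tag:yaml.org,2002:str', 'tag:yaml.org,2002:str']
```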
You never need a schema to construct native types from a representation graph. The constructor sees only the representation graph, with its fully resolved tags; it doesn't know or care what schema was used to compose that representation graph.
Does this mean that if I load the same document into two different native data structures, from the YAML point of view that looks like building two different representation graphs (with the same node contents, but different tags) and then constructing native types for both of them?
if I load the same document into two different native data structures
I'm not sure what you mean by this, can you clarify?
I'm not sure what you mean by this, can you clarify?
In Rust with the `serde` serialization library it's typical to say "please deserialize this document into the following native type":

```rust
let just_vectors: Vec<Vec<String>> = yaml_lib_name::deserialize(data)?; // any list of lists of strings will do
let list_of_triples: Vec<(String, String, String)> = yaml_lib_name::deserialize(data)?; // the nested vecs have to be length 3
let list_of_structs: Vec<MyStructName> = yaml_lib_name::deserialize(data)?; // a "tuple struct" serializes as a list too
```
Do these three deserialize calls logically build different representation graphs (with different tags) from the YAML point of view? Coming back to our topic: would canonicalization support for this API be expected to invent some names for these implicit tags and include them in the serialized version when serializing these structures?
As an author of data structures in YAML, I consider the change from non-aliased to aliased nodes (or back) a significant edit.
I'd like to point out that even (changed) anchor/alias identifiers can convey meaning.
For example, `&tuple [1, 2, 3]` is obviously wrong, thus different from `&triple [1, 2, 3]`.
Maybe step back a little from the problem and come back to the motivating cause?
This compares to easier code review when the code is always (automatically) formatted the same way.
Maybe try another analogy for anchor/aliases: symbolic file links (symlinks)? If I rename the symlink itself -- that's a change. If I change the content of the target file that the symlink points to -- that's a change. If I only rename the target file and change the symlink accordingly -- that's still a change (at the target side), although one might argue this isn't a change at the source side. But if I were to version this, e.g. in `git`, I would get a commit for the "content" of the symlink. But this analogy fails in the aspect that the anchor/alias identifier is neither the symlink name, nor its target name -- both being vertices in the file system graph -- but rather a name for the edge.
Do these three deserialize calls logically build different representation graphs (with different tags) from the YAML point of view?
In a word, no.
Construction is the process that turns a representation graph into native types. This process varies widely among implementations, partly because different implementation environments have different native types. But even a particular implementation may support different construction behaviors. For example, a JavaScript YAML library may allow you to specify whether scalar nodes with the `tag:yaml.org,2002:int` tag are constructed as the JavaScript `number` type (which has broad support, but is a double-precision floating point type under the hood) or as the `bigint` type (which is more faithful, but not available in some older environments). These sorts of decisions, and the options to control them, are completely up to the implementation.
So if I understand the code sample correctly, then each line will cause the YAML library to parse the raw YAML text into a serialization tree, compose it using whatever schema is the default (presumably the core schema) into a representation graph, and then construct the representation graph into the indicated native types for each line. In all three lines, I would expect the parsing and composition to be the same, for the implementation to use the same schema, and for the representation graphs to be fully equivalent.
Note that the above is a conceptual description. It is possible that the internal implementation of the library may not strictly follow all of the steps in order. This is fine as long as it ends up with a correct result in the end.
Coming back to our topic: would canonicalization support for this API be expected to invent some names for these implicit tags and include them in the serialized version when serializing these structures?
I'm not sure exactly what you mean by “implicit tags”.
Suppose we have the following document:
```yaml
- [a, b, c]
- [1, 2, 3]
```
The document has “implicit tags” in the sense that the presentation does not have explicit tag properties in it (like `!!int`). It will be parsed into a serialization tree where each node is assigned the non-specific tag `?`. During composition, these nodes' tags will be resolved using a schema to a specific tag. Under the core schema, the sequence nodes will all get the specific tag `tag:yaml.org,2002:seq` and the scalars will get `tag:yaml.org,2002:str` or `tag:yaml.org,2002:int`. This produces a representation graph where every node has a specific tag. That representation graph is equivalent to the representation graph that would be produced for the following document:
```yaml
!<tag:yaml.org,2002:seq>
- !<tag:yaml.org,2002:seq>
  - !<tag:yaml.org,2002:str> a
  - !<tag:yaml.org,2002:str> b
  - !<tag:yaml.org,2002:str> c
- !<tag:yaml.org,2002:seq>
  - !<tag:yaml.org,2002:int> 1
  - !<tag:yaml.org,2002:int> 2
  - !<tag:yaml.org,2002:int> 3
```
The above is what a “canonical presentation” of that representation graph might look like — which is to say, it is a “canonical presentation” of the first document as interpreted under the core schema. If the first document were interpreted under a different schema, then it might result in a different representation graph, which would have a different “canonical presentation”. But in the “canonical presentation”, all tags are completely explicitly specified, so it would produce the same representation graph under any schema.
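For what it's worth, PyYAML can demonstrate something close to this: compose the document (its default resolver stands in for a schema), then re-emit it in the library's “canonical” style, which is an implementation-defined explicit-tag style rather than a spec-defined c14n:

```python
import yaml  # PyYAML

node = yaml.compose("- [a, b, c]\n- [1, 2, 3]\n")  # resolver plays the schema role
print(yaml.serialize(node, canonical=True))
# Every node gets an explicit tag and every scalar is double-quoted, e.g.
# !!seq [ !!seq [ !!str "a", ... ], !!seq [ !!int "1", ... ] ]
```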
All of this is separate from the question of what native data structures might be constructed from the representation graph. The representation graph doesn't know anything about Rust's tuples or vectors. All it knows is that the sequence nodes all have the tag `tag:yaml.org,2002:seq`. It's the construction process that looks at a sequence node with tag `tag:yaml.org,2002:seq` and decides that it should construct a tuple or a vector, and it may make its decision based on other information it knows, such as what Rust return type is expected.
I consider the change from non-aliased to aliased nodes (or back) a significant edit.
Yeah, I think we have a consensus on that.
I'd like to point out that even (changed) anchor/alias identifiers can convey meaning. For example, `&tuple [1, 2, 3]` is obviously wrong, thus different from `&triple [1, 2, 3]`.
It may be the intent of the author to convey meaning that way, but the YAML data model treats alias/anchor identifiers as a serialization detail. These identifiers are not part of the representation graph and a YAML processor may freely add anchors to nodes, remove anchors from nodes, or rename anchors and aliases, as long as this does not change the structure of the document (including which nodes are aliased to which other nodes).
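PyYAML, for example, behaves exactly this way (the exact output varies by version and options, and `id001` is just the library's generated name):

```python
import yaml  # PyYAML

data = yaml.safe_load("foo: &triple [1, 2, 3]\nbar: *triple\n")
print(yaml.safe_dump(data))
# The sharing survives, but the dumper picks its own anchor name and,
# because it sorts keys, `bar` now carries the anchor:
#   bar: &id001
#   - 1
#   - 2
#   - 3
#   foo: *id001
```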
Maybe try another analogy for anchor/aliases: symbolic file links (symlinks)? If I rename the symlink itself -- that's a change. If I change the content of the target file that the symlink points to -- that's a change. If I only rename the target file and change the symlink accordingly -- that's still a change (at the target side), although one might argue this isn't a change at the source side. But if I were to version this, e.g. in git, I would get a commit for the "content" of the symlink. But this analogy fails in the aspect that the anchor/alias identifier is neither the symlink name, nor its target name -- both being vertices in the file system graph -- but rather a name for the edge.
In a file system, link names are significant, but in YAML, anchor and alias names are not significant (except insofar as they encode the structure of the document). Also, YAML aliases are like hard links, not symlinks — the serialization node with the anchor isn't the “real” one, it's just the one that happens to appear first in the serialization tree.
If I had to encode a filesystem in YAML, it might look something like this:
```yaml
aFile: &a !file Hello, World!
aDirectory: !directory
  anotherFile: !file Goodbye, World!
  hardLinkToAFile: *a
  symlinkToAFile: !symlink /aFile
```
The paths `/aFile` and `/aDirectory/hardLinkToAFile` refer to the same file. Neither one of them is any more or less the “real” file location than the other. The YAML document could be rewritten to list `aDirectory` before `aFile`, and then `/aDirectory/hardLinkToAFile` would be presented as an anchored scalar and `/aFile` would be presented as an alias, but the semantics of the document would be unchanged. Similarly, the anchor name `a` is completely arbitrary and could be anything, as long as it matched up properly.
Does this make sense?
Does this make sense?
Yes, thanks for the clarification.
As to the anchor/alias names not being significant, I understand that this is according to the current YAML specification. From an author's point of view, I disagree. But I have a similar view on comments in (YAML) code: source code without any comments may not be considered of equal quality compared to well-documented code -- although both produce the same binary. When reviewing such files, I wouldn't let the undocumented version pass.
Coming back onto the original motivation:
use cases like cryptographic signing, reliable diff, Verifiable Credentials
If (part of) the documentation / specification of a data structure is transported in the comments and/or anchor names, these details become unchangeable regarding signing and verification[1]. This problem is solvable through signing the textual representation.
If only the behavior of the system that picks up the YAML data is significant, the problem is highly dependent on the application(s) -- and cannot be solved at YAML specification level.
What exactly shall be solved by c14n on an intermediate level? Whom does it help?
I think that goal ought to be better defined. I reckon that we would then gain answers to the pending decisions.
Or maybe, there is a spectrum of grades of YAML equality, and we need a multivalued (if not multidimensional) c14n.
[1] https://en.wikipedia.org/wiki/Verification_and_validation
For use cases like cryptographic signing, reliable diff, Verifiable Credentials: it would be useful to have a defined algorithm for YAML canonicalization/normalization (c14n).
I.e. when c14n is enabled:
c14n is addressed for various information standards:
Links:
#274 makes that more specific