yaml / yaml-spec

YAML Specification
http://yaml.org/spec/
348 stars 54 forks source link

Expose Graph Representation as NetworkX DiGraph #312

Open colltoaction opened 1 year ago

colltoaction commented 1 year ago

Hi! I'm interested in contributing a categorical definition of the representation graph.

I want to foster collaboration between some Python projects like this one to be the foundation of a programming environment based in YAML.

I believe YAML is meant to be the carrier because it is widely available. Every new language introduces a new syntax, which I would like to avoid. YAML support is mature in all languages and programmers are familiar with it. Human-friendliness, in my opinion, was a great achievement.

So with regards to categorification my intuition is YAML forms a symmetric monoidal category.

I started the following conversations which you might find interesting:

Imagine using YAML in projects like this one, and potentially tons of papers! https://youtu.be/IBESZlBRBzY

Thanks for your time!

Thom1729 commented 1 year ago

Disclaimer: I know the basics of category theory but am far from an expert.

Are you talking about a categorical description of general YAML representation graphs, or about an encoding into (a subset of) YAML representation graphs of some other thing that has an existing categorical description?

If the latter, then I imagine that YAML should work fine. Its major advantage over e.g. JSON for representing graphs is that a YAML representation graph is a true graph, not merely a tree, so you can represent information with a graph structure “natively” without needing to create an artificial system of references on top of the YAML data.

If you mean the former, then I don't quite understand the motivation. If we consider a symmetric monoidal category of representation graphs, then what would be the tensor product?

colltoaction commented 1 year ago

Disclaimer: I know the basics of category theory but am far from an expert.

I guess me as well :)

If you mean the former, then I don't quite understand the motivation. If we consider a symmetric monoidal category of representation graphs, then what would be the tensor product?

I think the representation graph can be "lifted" to a more general category for free. Researchers and programmers could use YAML as a bridge.

The tensor product could be defined by the mapping, where each pair is a parallel arrow (mappings are unordered).

This only considers the basic features of YAML. E.g I haven't considered tags.

colltoaction commented 1 year ago

Hi again @Thom1729! As I continued researching and implementing, I discovered that in a YAML document these a items below are considered different nodes.

b: a
c: a

I haven't been able to implement the representation graph in NetworkX and I think it has to do with this definition. I think a directed graph can't represent this structure because it would lose some information.

b ---> a
c -/

Do you think this is a reasonable implementation concern / limitation? I'm hoping I missed something to make this work! WIP at:

Thanks for your help.

UnePierre commented 1 year ago

Please consider that YAML distinguishes the following two examples -- by choice:

b: a
c: a

b: &1 a
c: *1

If maps are vertices in a directed graph, then keys can be associated with outgoing edges that point to other vertices.

In that understanding, the first is a tree of three vertices (root, a, a') and two edges (root-b->a, root-c->a') and the two "a" vertices are distinct.

The second is a diamond-shaped DAG of two vertices (root, a) and two edges (root-b->a, root-c->a) that share start and end, but have different labels (b, c).

But, keeping in mind that also keys may be aliases, things get weird quiet quickly.

Thom1729 commented 1 year ago

From 3.2.1.3 (emphasis added):

A YAML processor may treat equal scalars as if they were identical.

In general, a YAML document may contain nodes that are equal to each other, but not identical. For instance, in the following document the two mapping nodes are equal but not identical:

- {}
- {}

The same applies to scalars, with the caveat that pursuant to the above citation an implementation may consider equal scalars to be identical. So if you're modeling the representation graph of the document in your comment using some other graph modeling system, then that other system should have one node for each scalar, even scalars that are equal to each other.

On the other hand, in the following document there is only one scalar node whose content is a:

b: &x a
c: *x

If you wanted to model a graph structure in YAML for some ordinary reason, you'd probably use something like the example in your comment, and link things together on the application end. But if I understand correctly, you're trying to embed arbitrary YAML representation graphs in NetworkX. I don't know anything about NetworkX or its semantics, but in order to model a representation graph in full generality, you would need to allow for multiple equal-but-not-identical nodes. If you're currently modeling YAML nodes with some NetworkX type that lacks identity semantics, then you might need to modify that model to ensure that equal nodes can be distinguished. If it's only a problem for scalar nodes (e.g. you're using something like a string type) then under 3.2.1.3 you are allowed to treat equal nodes as identical.

colltoaction commented 1 year ago

Thank you both!

you're trying to embed arbitrary YAML representation graphs in NetworkX.

Yes. The NetworkX library is a widely used Python library with Graphs, DiGraphs, MultiGraphs.

I want to use the Representation Graph but the problem I see is that equal-but-not-identical nodes don't have any difference in the token stream.

under 3.2.1.3 you are allowed to treat equal nodes as identical

I would certainly want to do this, but I think this intermediate representation is still losing information. It seems there are more things to encode than the spec would suggest in order to make this "an implementation detail" in the load/dump processes.

Thom1729 commented 1 year ago

equal-but-not-identical nodes don't have any difference in the token stream.

What do you mean by “the token stream”? Is this a NetworkX concept?

colltoaction commented 1 year ago

https://github.com/yaml/pyyaml/blob/main/lib/yaml/tokens.py

I think this is a better example than the first one:

- a: b
- a: c

Since a has edges to both b and c it doesn't seem possible to recover the original connectivity in a graph wih equal nodes as identical.

An extended data structure can manage this — but is that still a graph? or is it better modeled as hypergraph, paths graph, etc.?

Thom1729 commented 1 year ago

https://github.com/yaml/pyyaml/blob/main/lib/yaml/tokens.py

It's hard to talk about nodes in terms of tokens because a node in the representation graph may correspond to zero, one, or many tokens. But a node that is identical to another node that occurs earlier will correspond to an alias token.

- a: b
- a: c

This is a document with seven nodes: a sequence node (with two children), two mapping nodes (each with one pair of children), and four scalar nodes. Two of the scalar nodes are equal to each other; all other pairs of nodes are inequal. So if you consider this document as a representation graph, and encode it into NetworkX, then you will presumably have seven nodes in NetworkX (unless the encoding creates additional nodes).

If you instead interpret it as encoding information about a graph with three nodes named a, b, and c, then that is a perfectly sensible encoding that you could use in an application. However, the structure of that graph with a, b, and c is not the same as the inherent structure of the YAML document.

I don't quite understand what you mean about recovering the individual connectivity.

colltoaction commented 1 year ago

I see what you mean. My goal is to have a representation in terms of the three scalar, with mapping and sequencing encoded within a data structure.

I want to do this since the seven-node structure doesn't suit my application needs. This made me wonder if the spec is really defining graph structure. Shouldn't vertices and edges be sets?

I'm looking into the categorical framework to extend the representation with a higher level description, preserving the inherent structure of existing documents (i.e preserving connectivity).

Thom1729 commented 1 year ago

This made me wonder if the spec is really defining graph structure. Shouldn't vertices and edges be sets?

Eh. It's traditional to define formal directed graphs in as something like a 2-tuple, with a set of nodes and a set of edges that are each an ordered pair of nodes. I think of this as an implementation detail; in my mind, the graph is not that 2-tuple of sets, but the abstract thing that we could represent in that fashion.

The essential difference between a general directed graph and a YAML representation graph is that YAML nodes each have a kind and a tag, and rather than each node just having a set of edges to other nodes, it has either scalar content, an ordered list of nodes, or an unordered set of key-value pairs (where keys are unique under equality). You could represent a YAML representation graph as a formal directed graph by adding extra structure — coloring nodes by kind, etc, and probably using systems of extra nodes to represent collection contents.

I may have lost track of exactly what you're looking to accomplish. I think that there are good solutions for encoding mathematical graphs as YAML representation graphs, and also good solutions for encoding YAML representation graphs as mathematical graphs, but due to the different data models I don't think you're going to find a reasonable bijection, or an encoding either way that never requires adds nodes/vertices — and an encoding that minimizes extra nodes/vertices may be far from the most ergonomic.

colltoaction commented 1 year ago

I may have lost track of exactly what you're looking to accomplish. I think that there are good solutions for encoding mathematical graphs as YAML representation graphs, and also good solutions for encoding YAML representation graphs as mathematical graphs

The representation in the spec is one of my favorite parts, but I haven't found such mathematical encodings (at least in Python). Working at the representation level with a graph library like NetworkX would be awesome, since it would be closer to the mathematical description than an object oriented interface.

but due to the different data models I don't think you're going to find a reasonable bijection, or an encoding either way tha never requires adds nodes/vertices — and an encoding that minimizes extra nodes/vertices may be far from the most ergonomic.

I think the spatial layout can be modeled as a hypergraph or string diagrams. I'm discovering challenges as I go but I'm doing this with enthusiasm. I have no intention of rushing a contribution.

colltoaction commented 12 months ago

I think this is a useful combinatorial model. A bipartite graph could work as well.

IMG_20231127_122352_589

Thom1729 commented 12 months ago

Hypergraphs seem like overkill to me. Off the top of my head, a reasonably simple encoding might use labelled directed graphs:

This requires one vertex per YAML node, plus one vertex per mapping entry, and resembles the representation graph closely. You could eliminate the vertex labelling or the edge labelling, or both, by adding auxiliary vertices.