yaml / pyyaml

Canonical source repository for PyYAML
MIT License
2.54k stars 515 forks source link

End of document delimiter (`...`) #793

Closed philpagel closed 5 months ago

philpagel commented 5 months ago

First of all, thanks for a very useful package!

I have a question:

According to the YAML specification, "Three dots ('...') indicate the end of a document without starting a new one". So shouldn't the parser be ignoring anything after that, perhaps until another --- is encountered?

And the documentation for safe_load says "Parse the first YAML document in a stream and produce the corresponding Python object."

However:

>>> doc = '---\ntime: 20:03:20\nplayer: Sammy Sosa\naction: strike (miss)\n...\nSome other stuff'
>>> yaml.safe_load(doc)
[...]
yaml.parser.ParserError: expected '<document start>', but found '<scalar>'
  in "<unicode string>", line 6, column 1:
    Some other stuff
    ^

Am I misinterpreting the specification, or is this a bug? Or is there a way to tell safe_load (or load) to really only parse the first doc and then stop consuming the stream?

nitzmahone commented 5 months ago

IMO this is the desired behavior, not a bug. I'm not a spec expert (maybe @ingydotnet, @perlpunk, or someone else can chime in), but IIUC the primary purpose of doc markers is to delimit multiple documents in a single YAML-only stream, not to arbitrary mixed-content document streams. Doing so reliably in a parser-friendly fashion would be difficult, especially since you can mix and match styles with doc/directive end markers (the latter also being ---) around otherwise bare documents. If the spec allowed arbitrary non-YAML content between documents, there'd need to be a way to escape non-YAML lines that might be "interesting" to the parser, and it would make the future introduction of any new top-level markers a breaking change (since the escaping mechanism would also need to be updated).

That's not saying you can't trivially roll your own framing mechanism to support such a thing, but I wouldn't anticipate any direct support for such a thing from the spec/tooling.

ingydotnet commented 5 months ago

https://play.yaml.io/main/parser?input=LS0tCnRpbWU6IDIwOjAzOjIwCnBsYXllcjogU2FtbXkgU29zYQphY3Rpb246IHN0cmlrZSAobWlzcykKLi4uClNvbWUgb3RoZXIgc3R1ZmYK

---
time: 20:03:20
player: Sammy Sosa
action: strike (miss)
...
Some other stuff

Is valid YAML according to the spec, but pyyaml and libyaml error on it.

They want to see: https://play.yaml.io/main/parser?input=LS0tCnRpbWU6IDIwOjAzOjIwCnBsYXllcjogU2FtbXkgU29zYQphY3Rpb246IHN0cmlrZSAobWlzcykKLi4uCi0tLQpTb21lIG90aGVyIHN0dWZmCg==

---
time: 20:03:20
player: Sammy Sosa
action: strike (miss)
...
---
Some other stuff

or:


---
time: 20:03:20
player: Sammy Sosa
action: strike (miss)
---
Some other stuff
philpagel commented 5 months ago

Thanks for your clarifications! So, for now, I'll work around that by extracting the YAML block before feeding it to the parser. My application is parsing the YAML header of Markdown documents before transforming them with Pandoc – so I can't easily change the document format because I need to keep Pandoc happy. I do have control over the document generation process so extracting the YAML header myself, reliably, is not a big deal. It just would have been very elegant to let the YAML parser do that, too.

nitzmahone commented 5 months ago

It does seem like there are some inconsistencies in PyYAML's implementation WRT the spec here, but the long-standing default behavior of "anything trailing the document is an error" does appear to have been intentional with the implementation of get_single_node(). Putting my user hat on, it's also exactly the behavior we want for Ansible (and I'd argue most users that are not explicitly supporting multi-doc scenarios)- if that were to suddenly change, it could cause a lot of subtle problems. That part is arguably an API choice (rather than a spec deviation)- basically if you want to allow/ignore trailing stuff, use the multi-doc APIs and just ignore all but the first document stream (and of course ensure that PyYAML's API is fully spec-adherent in multi-doc scenarios).

It's also not clear to me from the spec what the actual intended behavior was around non-document content between explicit documents, and how that might interact with a bare document following a doc end marker. There's mention of "communication channels", and the spec says that ... does not start a new document- fair enough, but this test seems to imply that any content following a doc end marker that's not a marker or directive starts a new bare document. That certainly makes it a simple rule if that's in fact the case, because it precludes the possibility of non-document "junk" on any line following a doc-end marker, but the spec's references to the c-forbidden production around bare documents confuses me there (quite possibly PEBCAK on my part :wink:).

nitzmahone commented 5 months ago

I'll work around that by extracting the YAML block before feeding it to the parser.

@philpagel Since this appears to be more an API implementation choice of get_single_node() (used by the non-all versions of the top-level API functions) explicitly failing on extra document content, assuming you're using PyYAML directly, something like the following should do what you want without any extra machinations:

next(yaml.safe_load_all(doc))
philpagel commented 5 months ago

@nitzmahone: Thanks! That works like a charm.

nitzmahone commented 5 months ago

@philpagel nice- thanks for verifying the workaround gets you the behavior you need. I'm going to close this issue, since we've verified that the underlying bits are (more or less) spec-compliant and that the correct behavior is already visible via the multi-doc APIs.