Infinite loop when matching a zero-width production under a quantifier.

Consider:

[211] l-yaml-stream ::=
  l-document-prefix*
  l-any-document?
  (
      (
        l-document-suffix+
        l-document-prefix*
        l-any-document?
      )
    | c-byte-order-mark
    | l-comment
    | l-explicit-document
  )*

[202] l-document-prefix ::=
  c-byte-order-mark?
  l-comment*

The l-document-prefix can clearly match the empty string. In l-yaml-stream, it is always referenced under an unbounded quantifier.

l-document-prefix* clearly matches the empty string. Considered as a CFG, it is ambiguous — the resulting parse tree could contain zero or any number of l-document-prefix nodes. In the spec, we define a PEG-like model to avoid ambiguity. In that model, quantification is both greedy (it always matches as many repetitions as possible) and possessive (once it has matched those repetitions, it will never try again with fewer). So the “correct” parse tree here would be an infinite number of l-document-prefix nodes. Or, phrased differently, a naive implementation of a parser would infinite-loop when parsing this. (I discovered the issue myself while writing a parser intended to consume the YAML grammar as directly and literally as possible).

Obviously, actual YAML implementations do not have trouble with this (or they wouldn't work at all). However, I think it's worth addressing the issue for several reasons:

It's arguably a spec bug.
If someone does try to implement YAML parsing by directly implementing the grammar in a literal way, then I think that should work and we should try to facilitate this.
An implementation that implements the grammar very literally might be helpful for demonstration purposes, and any deviation from the grammar as described in the spec would make that implementation less effective as a demonstration.
Although the YAML grammar is not a true CFG, many people have intuition for CFGs, and making the grammar less ambiguous when viewed through that lens might make it easier to understand.

One way to address this issue in a general way would be to modify the definition of the star quantifier so that it will decline to match the empty string as a repetition. This is easy to implement in code, but might be a bit more complicated to specify. I think there's value to keeping the production grammar definition as simple as possible.

Alternatively, we could address the issue by changing the grammar. In the case of l-document-prefix, we could redefine it from c-byte-order-mark? l-comment* to c-byte-order-mark | l-comment. This should leave the behavior of l-yaml-stream unchanged.

The other case I'm aware of involves the special <end-of-file> production. There are various ways that the grammar might match a line break or end of input, and some of them are indirectly under a quantifier. Because <end-of-input> matches are always empty, this leads to the same infinite matching situation as above. I haven't looked into modifying the grammar to avoid the issue here. In this case, an alternative approach would be to ensure that the input ends with a line break (by appending one before parsing if needed). This would allow us to eliminate the special <end-of-input> production entirely.

In principle, I think that if a grammar does not have this issue, then that should be provable. That would definitely be true for a CFG, but I'm not absolutely certain that this is true for the YAML grammar (or that it would be easy to do).

yaml / yaml-spec

Infinite loop when matching a zero-width production under a quantifier. #296