yaml / yaml-spec

YAML Specification
http://yaml.org/spec/
346 stars 53 forks source link

Infinite loop when matching a zero-width production under a quantifier. #296

Open Thom1729 opened 1 year ago

Thom1729 commented 1 year ago

Consider:

[211] l-yaml-stream ::=
  l-document-prefix*
  l-any-document?
  (
      (
        l-document-suffix+
        l-document-prefix*
        l-any-document?
      )
    | c-byte-order-mark
    | l-comment
    | l-explicit-document
  )*

[202] l-document-prefix ::=
  c-byte-order-mark?
  l-comment*

The l-document-prefix can clearly match the empty string. In l-yaml-stream, it is always referenced under an unbounded quantifier.

l-document-prefix* clearly matches the empty string. Considered as a CFG, it is ambiguous — the resulting parse tree could contain zero or any number of l-document-prefix nodes. In the spec, we define a PEG-like model to avoid ambiguity. In that model, quantification is both greedy (it always matches as many repetitions as possible) and possessive (once it has matched those repetitions, it will never try again with fewer). So the “correct” parse tree here would be an infinite number of l-document-prefix nodes. Or, phrased differently, a naive implementation of a parser would infinite-loop when parsing this. (I discovered the issue myself while writing a parser intended to consume the YAML grammar as directly and literally as possible).

Obviously, actual YAML implementations do not have trouble with this (or they wouldn't work at all). However, I think it's worth addressing the issue for several reasons:

One way to address this issue in a general way would be to modify the definition of the star quantifier so that it will decline to match the empty string as a repetition. This is easy to implement in code, but might be a bit more complicated to specify. I think there's value to keeping the production grammar definition as simple as possible.

Alternatively, we could address the issue by changing the grammar. In the case of l-document-prefix, we could redefine it from c-byte-order-mark? l-comment* to c-byte-order-mark | l-comment. This should leave the behavior of l-yaml-stream unchanged.

The other case I'm aware of involves the special <end-of-file> production. There are various ways that the grammar might match a line break or end of input, and some of them are indirectly under a quantifier. Because <end-of-input> matches are always empty, this leads to the same infinite matching situation as above. I haven't looked into modifying the grammar to avoid the issue here. In this case, an alternative approach would be to ensure that the input ends with a line break (by appending one before parsing if needed). This would allow us to eliminate the special <end-of-input> production entirely.

In principle, I think that if a grammar does not have this issue, then that should be provable. That would definitely be true for a CFG, but I'm not absolutely certain that this is true for the YAML grammar (or that it would be easy to do).

UnePierre commented 1 year ago

Another possibility might be: empty-string | c-byte-order-mark | l-comment+

... although I don't know whether my pseudo-code is within the rules grammar.