yaml / yaml-test-suite

Comprehensive, language independent Test Suite for YAML
MIT License
172 stars 58 forks source link

Should L24T/01/in.yaml and JEF9/02/in.yaml end with \n? #118

Open camilstaps opened 2 years ago

camilstaps commented 2 years ago

I'm looking at an issue of my parser with L24T/01, added in #105. My parser gives back a string without trailing newline.

If I look at L24T/01/in.yaml in the data-2022-01-17 tag, the file does not have a trailing newline:

$ hexdump -C yaml-test-suite/L24T/01/in.yaml 
00000000  66 6f 6f 3a 20 7c 0a 20  20 78 0a 20 20 20        |foo: |.  x.   |
0000000e

If my understanding of the spec is correct, b-chomped-last would then use the alternative b-as-line-feed rather than <end-of-input>, so there should be no trailing newline parsed. Is this correct? Should the file be generated with a trailing newline, or should out.yaml be adapted? Or is my understanding of the spec incorrect?

I have the same issue with JEF9/02, added in #90.

Pinging @ingydotnet & @perlpunk because you were involved in the PR that added this test.

ingydotnet commented 2 years ago

I recall looking into something like this where the presence of a final newline might affect the meaning.

I am of the firm opinion that the presence or absence of a final line break should not ever change the meaning of a yaml stream.

That said, I'm not 100% sure the 1.2.2 spec enforces that.

All the implementations we regularly test agree on the meaning as seen here: https://play.yaml.io/main/parser?input=Zm9vOiB8CiAgeAogIA==

This should probably be addressed as 1.2.2 errata (if it shown to be misrepresented in the productions) and fixed in 1.2.3.

Looking at the spec link you provided:

b-chomped-last(CLIP)  ::= [b-as-line-feed](https://yaml.org/spec/1.2.2/#rule-b-as-line-feed) | <end-of-input>

would match here and CLIP is explained as

Clipping is the default behavior used if no explicit chomping indicator is specified. In this case, the final line break character is preserved in the scalar’s content. However, any trailing empty lines are excluded from the scalar’s content.

In my opinion, this explanation is assuming that all lines end with a line break. This should be explicitly made clear in that explanation.

I'd go forward with handling all lines as if they end with a break.

Looking forward to your implementation, Ingy

camilstaps commented 2 years ago

Thanks a lot!

In my opinion, this explanation is assuming that all lines end with a line break.

This makes sense to me.

Could you clarify a bit where you plan to fix this? At first I thought I could just interpret end-of-line as \n in b-chomped-last(KEEP|CLIP). But this is not enough for JEF9/02. Trying to follow the spec by hand, I end up in l-literal-content(3,KEEP). Then the first optional part of l-literal-content does not apply, because l-nb-literal-text requires nb-char+ and the spaces are all indentation. So l-chomped-empty(3,KEEP) should match ␣␣␣∎. This boils down to l-empty(3,BLOCK_IN)*. But this rule ends with a b-as-line-feed which is not in the input.

Alternatively I suppose you could let b-as-line-feed and perhaps b-non-content accept end-of-input. In my implementation this leads to infinite recursion. I assume that this is because there is an environment in which the end-of-input is then quantified over with * or + so that the parser does not move forward. Then this may just be a problem of my implementation and could possibly be fixed by requiring the parser to move forward in */+.

Or yet another way could be to accept end-of-input next to b-as-line-feed in l-empty(n,KEEP).

As my implementation uses parser combinators it follows the spec very closely, so knowing which way you're going would help to keep the two in sync. But there is no urgency on my side so I'm also perfectly happy to wait for an erratum / change for 1.2.3.