yaml / libyaml

Canonical source repository for LibYAML
http://pyyaml.org/wiki/LibYAML
MIT License
963 stars 323 forks source link

Sequence start token missing when sequence is not indented #91

Open mkaluza opened 6 years ago

mkaluza commented 6 years ago

Hi! I've encountered the following error. When sequence items are not indented, they don't generate YAML_BLOCK_SEQUENCE_START_TOKEN before first item, but immediately YAML_BLOCK_ENTRY_TOKEN. Event parser behaves equaly regardless of indentation. Is this a bug or a feature?

indented:

seq:
  - item1
  - item2

output from token parser:

STREAM START
[Block mapping]
(Key token)   scalar seq 
(Value token) <b>Start Block (Sequence)</b>
<b>Start Block (Entry)</b>
scalar item1 
<b>Start Block (Entry)</b>
scalar item2 
<b>End block</b>
<b>End block</b>
STREAM END

not idented sequence:

seq:
- item1
- item2

parser output:

STREAM START
[Block mapping]
(Key token)   scalar seq 
(Value token) <b>Start Block (Entry)</b>
scalar item1 
<b>Start Block (Entry)</b>
scalar item2 
<b>End block</b>
STREAM END

test parser used:

yaml_event_parser.c.txt yaml_token_parser.c.txt

ingydotnet commented 6 years ago

@mkaluza Thanks for bringing this up. To be honest, we haven't really been considering the token set in our testing (available at https://github.com/yaml/yaml-test-suite).

The parser test output would look like this:

+STR
+DOC
+MAP
=VAL :seq
+SEQ
=VAL :item1
=VAL :item2
-SEQ
-MAP
-DOC
-STR

and is the same for both cases (as you said).

I suspect this this should be considered a bug, but I can't say for sure at this time. We may add token output to some of our tests in the suite.

Are you using the YAML token stream for anything in particular? Knowing your use case might be useful. It would certainly be interesting.

@perlpunk @flyx any opinions here?

flyx commented 6 years ago

I think that lexical analysis is an implementation detail. It is certainly not covered by the specification. Therefore, as long as the right event stream is generated, the implementation is conformant to the specification regardless of how the lexer token stream looks like.

However, conformance to the spec is not the only property that should be of concern for an implementation. If the implementation allows to publicly query the lexer's token stream, its layout should be documented. I can understand why the lexer provides different outputs for the two inputs; it recognises a change of the indentation level as block start and emits the token, while without the indentation change, the parser will need to correctly analyse the semantics. From a parser implementor's perspective, I think this is not a bad thing to do. There simply isn't a visible indentation change in the input, why should the lexer emit a token?

Now as I said, the problem here is that public API should be properly documented. I don't think it is in this case. Since the lexer API is not something proposed by the YAML spec, it is a bad thing that there is no description about how libyaml's lexer generates the tokens (correct me if I'm wrong).

I would advise against changing this in the code because

Instead, documentation about how the lexer's output looks would be great so that people know that unindented sequences generate their tokens in the way they do.

mkaluza commented 6 years ago

@ingydotnet are you asking why I'm using tokens and not events or what I'm doing with yaml? As for the first (based on this tutorial https://www.wpsoftware.net/andrew/pages/libyaml.html), token based seemed simpler because it explicitly gave me KEY/VALUE info, while event based doesn't do that and at that point it was one-thing-less-to-worry-about. But I could probably move to event based, when I become more familiar with everything. In that tutorial both parsers were presented as equivalend, but from what @flyx says it seems that token based one is more low-level and I might be misusing it.

As for the second - I'm writing a flexible config parser, so nothing fancy - at least yaml-wise (but the parser is a bit fancy as it is fully declarative).

So based on what @flyx says it's probably not a bug, but it would be nice if event parser reported if current event is a key or value.

mkaluza commented 6 years ago

@flyx I've been thinking about the things you pointed out. As for the spec: What is the exact meaning of YAML_BLOCK_SEQUENCE_START_TOKEN? I had a glance (just briefly as I'm not a language parsing specialist and I don't intend to become one) at YAML spec and I've seen things about the hyphen being treated as part of indentation (and therefore allowing only one space after it). That to me suggests that even with unindented hyphens the list itself is indented and therefore a block starts, so the token should appear. In fact, can the sequence like this (I'm not talking about flow style) start without a new block? I mean, does it make 'common' sense?

As for unforseen consequences - why do you think so? Right now everything using token parser must take into account that YAML_BLOCK_SEQUENCE_START_TOKEN may or may not appear and therefore detect sequences when first item appears, effectively ignoring that token. So if it instead always appeared, it shouldn't make the difference... And after all - you have your test suite to verify if things are ok or not.

As for it not being easy... this I don't know, I haven't seen the code.

I must point out again that I'm no specialist here so I might be missing something subtle, but I'm just thinking along the line of "what that token is for if it can be omitted and no harm is done".

Have a good evening. MK

flyx commented 6 years ago

I am not an author of libyaml, so treat the following statements as informed guesses.

When writing a lexer for a language that uses indentation for structural layout, you have two choices:

  1. Produce a lexer token for every new line which carries the number of how many spaces of indentation this line has. Then let the parser figure out where changes in indentation occur and handle them appropriately.
  2. Make the lexer intelligent enough to be able to recognize the start and end of a structure imposed by indentation. In this case, your lexer generates a token when a new structure at some indentation depth is started and another one when that structure ends. The end token always spans zero characters, which is strange for a lexer token, but not a problem in general.

So I assume libyaml's lexer took the second approach. Even more, the lexer not only recognizes that a new structure starts at the indentation change, it also recognizes that the structure that starts is a sequence. Hence the name YAML_BLOCK_START_SEQUENCE_TOKEN. block is the usual name for such structures in programming languages (like Python).

As for the mention of the hyphen being „treated as part of indentation“: The specification defines YAML's grammar, but is not implementation advice. That statement only explains why the grammar allows block sequences which are not more indented than the parent mapping key. So as far as the semantics of YAML are concerned, a hyphen which is not more indented indeed starts a new block. However, implementation-wise someone seems to have found it easier to handle that case in the parser, not in the lexer. Therefore, the lexer does not generate that token. Consider that when writing lexers or parsers, everything is allowed as long as it leads to a proper result. Do not try to figure out what the lexer does based on the specification, because it is an intermediate result never mentioned there (in the process diagram, parsing is just one step in the YAML loading process).

As for unforseen consequences - why do you think so?

Because I developed software professionally for more than 5 years. The likeliness of unforeseen consequence is especially high when you think there shouldn't be any. That's why they are unforeseen ;).

I'm just thinking along the line of "what that token is for if it can be omitted and no harm is done".

I cannot tell you without looking at the code. It may be left out without harm or it may not.

As I said, the problem is that there is no documentation about how libyaml's lexer output should look. I would say it is unwise to change the working system just because its output may be unexpected. Instead, it should simply be documented.

token based seemed simpler because it explicitly gave me KEY/VALUE info

Did you consider using the DOM API instead? That does give you key/value info. Your actual problem seems to be that you looked into the event API and found it to be too low-level (as it misses some info that you need to infer yourself from the event stream). And your solution is to go even more low-level, which makes you encounter the problem you describe here. Instead, using the higher-level API (i.e. DOM) may be a more appropriate solution. I would generally assume that if you use the lexer API for anything but syntax highlighting, you should probably use another API level.

ingydotnet commented 6 years ago

To support @flyx, I seem to remember @xitology not being interested in an API layer that could turn tokens back into YAML because the tokens were an implementation detail.

Then again, it was long ago and my memory is not what it was...

mkaluza commented 6 years ago

@flyx thanks for the answer. I looked at some examples of using DOM API (http://codepad.org/W7StVSkV) and it seems easy, but unless I'm missing something, it seems not to support anchors/references :( I mean - it can read them, but I get referenced object parsed one more time instead of the information that "hey! it's that one!". is there a way to get info about anchors/references this way? A simple test case:

---
foo: &anchor
  K1: "One"
  K2: "Two"

bar: *anchor
flyx commented 6 years ago

In the DOM API, anchors have already been resolved. Therefore, you just get to the same node regardless of whether you query the value of foo or the value of bar. That is exactly how YAML is designed to work (you should not care which is the first occurrence, or even that two values point to the same node).

You can still recognize values pointing to the same nodes, since in the DOM API, you get the contained values as yaml_node_item_t which is an int. Just compare two integers to determine whether they point to the same yaml_node_t. However, the only scenario I can imagine where this matters with a configuration file is when you create native objects from parts of the YAML document. If that is the case, simply store pointers to the already generated objects in a hashmap with the yaml_node_item_t as key. You can then check for any new index whether you already generated a native object from it.

mkaluza commented 6 years ago

2018-01-31 10:27 GMT+01:00 flyx notifications@github.com:

In the DOM API, anchors have already been resolved. Therefore, you just get to the same node regardless of whether you query the value of foo or the value of bar. That is exactly how YAML is designed to work (you should not care which is the first occurrence, or even that two values point to the same node).

You can still recognize values pointing to the same nodes, since in the DOM API, you get the contained values as yaml_node_item_t which is an int. Just compare two integers to determine whether they point to the same yaml_node_t.

Ok, having looked closer it's indeed the same node (pointer), not duplicated one. Thanks for the tip :)

However, the only scenario I can imagine where this matters with a configuration file is when you create native objects from parts of the YAML document. If that is the case, simply store pointers to the already generated objects in a hashmap with the yaml_node_item_t as key. You can then check for any new index whether you already generated a native object from it.

well that's exactly that scenario :) I needed to reuse the results produced from the configuration, not the configuration itself and I figured references would be perfect for that as they would save me adding some id field of my own.

FanchC35 commented 3 years ago

Dear All, mkaluza, I am facing the same "YAML_BLOCK_SEQUENCE_START_TOKEN not generated when no indentation" problem. As I unerstand this is rather a feature than a bug (see comments in scanner.c and below). But I am using libyaml to convert yaml to json and can't see how to support blocks with no indentation without the YAML_BLOCK_SEQUENCE_START_TOKEN (and its YAML_BLOCK_END_TOKEN counterpart). Any idea about how to solve this? mkaluza, did you manage to solve this? Many thanks in advance, Francois

YAML also permits non-indented sequences if they are included into a block mapping. In this case, the token BLOCK-SEQUENCE-START is not produced:

  key:
  - item 1    # BLOCK-SEQUENCE-START is NOT produced here.
  - item 2
jasimmons1973 commented 1 year ago

When I use the yaml_emitter_set_indent() to some other value (I used 6). I see the following format:

the spacing doesn't seem right. Is this a output bug related to this issue?

perlpunk commented 1 year ago

... the spacing doesn't seem right. Is this a output bug related to this issue?

@jasimmons1973 That's not a bug, just a matter of taste and unrelated