Control whitespace manually

xsawyerx commented 4 years ago

Some places shouldn't have spaces. For example, $foo->@[$foo] cannot be written as $oo -> @ [ $ foo ].

By default, our BNF ignores all spaces and we play around t make it not ignore spaces in critical places.

We should remove the automatic spaces in our BNF and then add spaces where we think appropriate.

gonzus commented 4 years ago

I just wanted to leave a comment here, that this will bring you into the classical territory of the two separate questions about a grammar / parser:

Does it recognize all syntactically valid strings in your language?
Does it recognize only syntactically valid strings in your language?

(1) is easier to achieve and convince yourself of the case. (2) can be much more difficult, and will usually require meddling with the "pure" grammar definition.

So, at some point (and I am not saying this is the case for the particular white space issue you are discussing here), it is better to leave the grammar alone and look for other means of improving it.

xsawyerx commented 4 years ago

This is a very useful reflection on this problem. While writing a response, I've looked into a few things to have some more context for a decision here.

Maybe what we're asking ourselves is, "Do we express syntax that works, or do we express the syntax we want you to use?"

3.4      # a correct number
3 .    4 # also a correct number, but a crappy way to write it

Unlike PPI, we don't provide whitespace information, so we'll provide the elements either way. The question is, as you phrased, what do we want to support or what do we want to break?

The problem I foresee is:

Places that are not ops (which don't have spaces in them) but still cannot have spaces. For example, -> is an op so no space. But @{ ... ] is not an op but it doesn't allow spaces.
This turns back into providing too much flexibility to the user, allowing many mistakes, making code less readable, less consistent, and awkward.

Are we providing a standard or consistency? I think we should try to aim for both.

Maybe we need to review these places first, though:

All special variables: $5, $#, etc. Interestingly, $ 5 is perfecrtly fine as $5 by Perl. This is an example of "yes, it works, but I think it's a really bad syntax definition to allow a space there."
Postfix dereference. @ [ is not acceptable, but it's not an op so we use some basic lexemes along with more complex lexemes to make it not support a whitespace.
q-like values (q/qq/qw/qx/qr/m/s/tr/y) can have spaces between delimiters, but we don't allow it. Otherwise, you have code like s { }{} which is - again - the kind of crappy syntax we're trying to avoid. (There is a different problem because of it though, since m () is a legitimate subroutine name but not a legitimate q-like m() value because of the space.)
Regex modifiers: /foo/ x - Perl does not accept it, despite the modifier not being an op. // is an op, but //x isn't. It's a modifier of the op. (Unless I'm wrong.)
Version numbers: We don't want to allow v 0.4.6 Perl doesn't support space between v and the number.
Version numbers: We don't want to allow v4 . 6. Perl is weird about this: v4. 5 .6 works but v4. 5. 6 fails.
Quoted values. Any space in them should be captured as part of the string and not removed by the parser.
Prefixed dereferencing meaning can change with spaces: $# { $ var } means ${#}{$var}, but $#{$var} is $#{$var}. Here ${#} means the same as ${a} which means $a (or $# in the example) which will trigger a die on 5.30 and above because $# is no longer supported.
Prototypes: sub foo :bar(baz) works, but sub foo :bar (baz) fails.

Maybe just not allowing ambiguous stuff is the big value that we are providing, rather than also control how things are being written, like with spacing. However, the examples above (especially the last two) show that spacing does matter not only to how good is the code being written, but what Perl will do with it.

This moves me more toward "We should control some spacing in a strict manner."

gonzus commented 4 years ago

In a traditional lexer / parser setup, some of these things would just be tokens, so no optional space between them (for example, 4.5, v0.1.2, @{, etc). I don't know how marpa handles this.

Keep in mind that for some of the tooling, you do want to keep track of white space. For example, an editor plugin will probably need exact line / column information to do its job.

If I were to be wrong about this, I would rather be wrong on the side of being too strict.

xsawyerx commented 4 years ago

Could you give me of an example where you want to keep track of whitespace?

gonzus commented 4 years ago

If you want to show a caret below the starting position of an error.

If the editor works in terms of operations like "line L, column C, print sub in red".

xsawyerx commented 4 years ago

This can be achieved in Marpa. I have a branch that gives the character start position and length, but Marpa can also help me get the line and column numbers.

xsawyerx commented 4 years ago

Available now at #75. (It's still a WIP due to the AST cleanup we do - I didn't update that.)

xsawyerx commented 4 years ago

@gonzus I would really appreciate any additional thoughts from you on this. I have merged code that now produces the start position, length of elements, line number, and column number for each token.

gonzus commented 4 years ago

I would say that's all you need for things like rust-like error messages, so \o/

xsawyerx commented 4 years ago

I have a branch for whitespace-specifics.

It didn't resolve too many issues and until I figure out a good rule for which elements define spacing (before and/or after), there are tons of ambiguities because the parser doesn't know who owns the space between the elements.

sub foo { ... }

If sub has a space after, and foo has a space before - who owns that space?
If foo has a space after, and { has a space before - who owns that space?

In short, many small expressions had numerous ways to parse them.

I'll leave this open and look into it in the future. If anyone can suggest how to specify the spaces for elements, we'll look into it again.

vickenty commented 4 years ago

It might be possible to avoid ambiguities by explicitly inserting whitespace tokens between items in the rules, like this:

SubDecl ::= KeywordSub _ Ident _ Block

_ ~ [ \t\n\r]*

xsawyerx commented 4 years ago

This is what I've done. It's just that these whitespace tokens were defined before and after of different rules, causing each one of them to have the space.

Expression ::= SubDecl WS_Any
SubDecl ::= KeywordSub WS_Many Ident WS_Any Block WS_Any

In this example, it's not clear if WS_Any at the end of a block is by Expression or by SubDecl. So I think the best course of action would be a clear definition of which elements receive spaces - the top level rules or the low-level rules/lexemes.

vickenty commented 4 years ago

I don't think we need to add trailing whitespace anywhere except Program rule. If you look at the source text as a sequence of tokens, whitespace is always either between two tokens, before the first token, or after the last. So my guess is that if whitespace rules are consistently added throughout the grammar, any trailing whitespace should be handled as separating whitespace somewhere in the rule above:

# This is the only place where we should need leading and trailing whitespace
Program ::= WS_Any StatementSeq WS_Any
StatementSeq ::= Statement WS_Any Semicolon | ...
#                          ^^^^^^ this is where trailing whitespace after SubDecl would go.
Statement ::= Expression | ...
Expression ::= SubDecl | ...
SubDecl ::= KeywordSub WS_Many Ident WS_Any Block

xsawyerx commented 4 years ago

I don't understand the consistency you're expressing here...

vickenty commented 4 years ago

Maybe consistently was the wrong word to use. What I meant is that if we modify all grammar rules to include WS between the terms in the definition, it should handle all possible whitespace in the program.

chromatic commented 4 years ago

Maybe what we're asking ourselves is, "Do we express syntax that works, or do we express the syntax we want you to use?"

tl; dr - you can express a grammatical opinion about the standardness of text, but have you expressed a semantic opinion about the programness of a program?

Does it help to redefine the question by thinking of a multi-stage approach? Lexing and tokenizing into a parse tree gets you valid syntax, but it doesn't ensure that you have the desired higher-level semantics.

Maybe the example of whitespace between lexemes such as a variable's sigil and name doesn't really make this concrete, but if you were to write a pretty-printer, you'd see that creating a semantically useful tree representation for even a simple use statement such as use strict 'vars'; can be a lot more work that it seems like it should be.

In other words, whitespace is one thing, but associating error messages with any construct larger than a lexeme feels like a question better asked of an AST than a parse tree.

xsawyerx / guacamole

Control whitespace manually #71