zaach / jison

Bison in JavaScript.
http://jison.org

Jison's advanced grouping options not working #340

Closed tuliogomesbarbosa closed 7 years ago

tuliogomesbarbosa commented 7 years ago

Hi there, I'm trying to parse an input with a regex that uses a non-capturing group (?:). However, Jison's lexer takes the whole match when it should take only the group match. Take a look at the following code:

this.yytext += match[0];

The length of the match array is 2 and the group match is at position 1. The input I'm trying to parse is

: request --color #ABCDEF'

and I need to get only the 'request' portion.

My regex is :([^\r\n]+)(?:\-\-color\s+\#[0-9A-Za-z]+) and the regex generated by Jison is /^(?:([^\r\n]+)(?:--color\s+#[0-9A-Za-z]+))/

Am I missing something here? I've already read the Deviations From Flex Bison topic.
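To illustrate, here is a minimal plain-JavaScript reproduction of what I'm seeing with the generated regex:

```javascript
// The regex jison generated for the rule (as quoted above),
// run against the input from my report:
const re = /^(?:([^\r\n]+)(?:--color\s+#[0-9A-Za-z]+))/;
const input = ": request --color #ABCDEF'";

const m = input.match(re);

// m[0] is the ENTIRE match -- this is what ends up in yytext.
// m[1] is the first capturing group.
console.log(m[0]); // ": request --color #ABCDEF"
console.log(m[1]); // ": request "
```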

GerHobbelt commented 7 years ago

yytext always contains the entire regex match string; if you want to look at groups within a matched lexer regex, you should use the this.matches[] array, where every capturing regex group gets an index, just as with the standard JavaScript String.prototype.match() API.

Side note: macros add capturing groups of their own, so the index number is not 'immediately obvious' when you employ lex macros in your lexer rule. E.g. the rule {ALPHA}([0-9]+) with macro ALPHA [a-zA-Z] will produce 2 capturing groups in the generated regex, rather than only the one for ([0-9]+).
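A plain-JavaScript sketch of that index shift; the expanded regex below is a hand-written approximation of what jison generates, not its literal output:

```javascript
// Lexer rule:  {ALPHA}([0-9]+)   with macro  ALPHA  [a-zA-Z]
// Vanilla jison wraps the macro body in a capturing group of its own,
// roughly like this (illustrative expansion):
const expanded = /^(?:([a-zA-Z])([0-9]+))/;

const m = "a42".match(expanded);

// m[1] is the macro's injected group, so the group YOU wrote is at m[2]:
console.log(m[1]); // "a"
console.log(m[2]); // "42"
```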

tuliogomesbarbosa commented 7 years ago

Thanks @GerHobbelt. But how can I use this.matches[] to force the yytext variable to be assigned from my capturing regex group? Is there some way to achieve this inside my grammar.jison file?

I also noticed that this._input.slice(match[0].length) is slicing my input in such a way that the next call to the next() method throws an error because of a wrong regex match.

GerHobbelt commented 7 years ago

Certainly. In the action of the lexer rule you want to process, you can re-assign yytext. I don't know about the _input.slice() error, but the fundamental rule here is that you SHOULD NOT access lexer-internal variables directly; use the lexer API instead.


The next few bits are valid for my jison fork; it has an extended lexer API and a few bug fixes here and there, compared to vanilla (see #338).

Here are a few examples extracted from a production grammar, which showcase several lexer rule action code chunks using the lexer APIs; this stuff should give you plenty of ammo to tackle your challenge, I expect:

Showcasing overwriting yytext with this.matches[index]

/*
 * String Handling
 * ---------------
 */

"\u2039"([^\u203a]*)"\u203a"
        %{                                                  /* ‹string› */
            s = this.matches[1];
            yytext = s;
            return 'STRING';
        %}

"\u201c"([^\u201d]*)"\u201d"
        %{                                                  /* “string” */
            s = this.matches[1];
            yytext = s;
            return 'STRING';
        %}

"\u00ab"([^\u00bb]*)"\u00bb"
        %{                                                  /* «string» */
            s = this.matches[1];
            yytext = s;
            return 'STRING';
        %}

Showcasing overwriting yytext, using only part of the match and unput()-ting the tail for the next lexer invocation to work on

What happens here should be obvious:

"'"([^']*(?:"''"[^']*)*)"'"{TOKEN_SENTINEL}
        %{
            this.unput(this.matches[2]);

            s = this.matches[1];
            s2 = parser.dedupQuotedString(s, "'");
            yytext = s2;
            return 'STRING';
        %}
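To visualize what unput() accomplishes here, a toy model of the mechanism (this is NOT the real jison lexer implementation, just the concept):

```javascript
// Conceptual sketch: the un-consumed tail is pushed back onto the input
// so the NEXT lexer invocation sees it again.
function makeToyLexer(input) {
  return {
    _input: input,
    // consume `n` characters from the front and return them
    take(n) {
      const s = this._input.slice(0, n);
      this._input = this._input.slice(n);
      return s;
    },
    // push text back onto the front of the remaining input
    unput(s) {
      this._input = s + this._input;
    },
  };
}

const lx = makeToyLexer("'abc' rest");
const matched = lx.take(6); // suppose the rule matched "'abc' " (incl. the sentinel space)
lx.unput(" ");              // give the trailing sentinel back for the next rule
console.log(lx._input);     // " rest"
```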

NOTE: jison places macros, such as {TOKEN_SENTINEL}, within a capturing regex group. Vanilla jison is really wicked as it puts a capturing group around *every* macro expansion, hence with vanilla jison you MUST know the internal (nesting) structure of your macros for your matches[...] indexes to line up properly! GerHobbelt/jison only places a capturing group around the outer expansion, so matches[...] entries are easily counted and the internals of the macro (how it was constructed) are irrelevant (unless, of course, you added your own capturing groups in any of your macros! 😈).

Why does this matter when you use jison lexer macros? Well, does anyone care to keep the 'internal structure' of this tree of macros at the front of their memory while writing a production lexer/grammar? I don't, that's for sure! 😄 (Note the use of NON-CAPTURING GROUPS in the macros below; as GerHobbelt/jison also uses NON-capturing groups for internal expansions of macros, you can thus be sure that every macro in your lexer rules takes up exactly 1 (one) capturing group. No exceptions to that 'rule' to remember, so you can KISS and work on complex grammars without developing a headache due to lexer peculiarities.)

ASCII_LETTER                        [a-zA-Z]

// Unicode literal chars set (only supported by GerHobbelt/jison as of this writing):
UNICODE_LETTER_RANGE                [\p{Alphabetic}]

// NOTE: macro expansion **within a regex [...] set** is also a GerHobbelt/jison feature!
IDENTIFIER_START                    [{UNICODE_LETTER_RANGE}_]
LABEL_START                         [{UNICODE_LETTER_RANGE}\p{Number}]
IDENTIFIER_LAST                     [{LABEL_START}_]
IDENTIFIER_MIDDLE                   [{IDENTIFIER_LAST}.]
LABEL_MIDDLE                        [{IDENTIFIER_LAST} ]
DOLLAR                              [\u0024]
WHITESPACE                          [\s\r\n]

NON_OPERATOR_CHAR                   [{WHITESPACE}{IDENTIFIER_LAST}]

ID                                  [{IDENTIFIER_START}][{IDENTIFIER_LAST}]*
DOTTED_ID                           [{IDENTIFIER_START}](?:[{IDENTIFIER_MIDDLE}]*[{IDENTIFIER_LAST}])?
WORD                                [{IDENTIFIER_LAST}]+
WORDS                               [{IDENTIFIER_LAST}](?:[\s{IDENTIFIER_LAST}]*[{IDENTIFIER_LAST}])?
DOTTED_WORDS                        [{IDENTIFIER_LAST}](?:[\s{IDENTIFIER_MIDDLE}]*[{IDENTIFIER_LAST}])?
JSON_WORD                           [{IDENTIFIER_LAST}](?:[{IDENTIFIER_LAST}\-]*[{IDENTIFIER_LAST}])?

OPERATOR                            [^{NON_OPERATOR_CHAR}]{1,3}

// Match simple floating point values, for example `1.0`, but also `9.`, `.05` or just `7`:
BASIC_FLOATING_POINT_NUMBER         (?:[0-9]+(?:"."[0-9]*)?|"."[0-9]+)

// This marks the end of an elemental token which is not itself an operator, row or column reference or a function.
TOKEN_SENTINEL                      \s*(?:$|[^\s\.{IDENTIFIER_LAST}\(\[\{\$\@\!\'\"])
DUALIC_OPERATOR_MUST_FOLLOW         \s*(?:$|[^{NON_OPERATOR_CHAR}\.\(\[\{\$\@\!\'\"])
OPERATOR_SENTINEL                   \s*(?:$|[^{NON_OPERATOR_CHAR}])
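To illustrate why that one-group-per-macro counting rule is so convenient, here is a hand-rolled expansion of a rule using two of the macros above. This is an ASCII-simplified approximation, NOT the real generated output, but it shows each macro contributing exactly one capturing group:

```javascript
// Hypothetical rule:  {BASIC_FLOATING_POINT_NUMBER}{OPERATOR_SENTINEL}
// Simplified expansion: each macro body uses only non-capturing groups,
// wrapped in a single outer capturing group per macro:
const re = /^(?:((?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+))(\s*(?:$|[^\s0-9a-zA-Z_])))/;

const m = "3.14+".match(re);

console.log(m[1]); // "3.14"  -- first macro  = group 1
console.log(m[2]); // "+"     -- second macro = group 2
```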

Showcasing another way to consume only a part of a lexer regex rule match, using JavaScript Regex Lookahead Assertions

// Recognize any function ID, with optional dotted sections, as a string which is followed by a `(` open brace, e.g. `Z.DIST(`
{DOTTED_ID}(?=\s*\()
        %{
            /*
             * lookup this blurb: it MAY be a (possibly namespaced) function identifier
             * (e.g. `SUM`, `namespace.user_defined_function42`).
             *
             * [...]
             *
             * Note that this is really another kind of lexical hack as here we include
             * a part of the GRAMMAR KNOWLEDGE in the lexer itself:
             *
             * since we 'know' now that the blurb `\1` is followed by an open brace `(`, we
             * can be certain that this is a function identifier and nothing else
             * that may have the same 'name', e.g. constant `E` or `PI` (or for very wide
             * spreadsheets: column `ABS`).
             *
             * > ### Note
             * >
             * > instead of using `matches[]` and the `.unput()` API, we employ native
             * > regex Lookahead Assertions. Hence `yytext` will cover the entire regex
             * > EXCEPT the trailing lookahead assertion. That the macro has its own
             * > capture group is cute but not needed/used here!
             */
            // console.log("looking up function identifier token (+ look-ahead) in symbol table: ", yytext, this, this.matches);
            s = yytext;
            rv = parser.getSymbol4Function(s);
            if (rv) {
                yytext = {
                    opcode: rv,
                    text: s
                };

                return 'FUNCTION';
            }

            // DRY: and go test the next rule(s) on the same content 
            // (this API is another GerHobbelt/jison only feature):
            this.reject();
        %}
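A quick plain-JavaScript demonstration of the lookahead behaviour used in that rule (using a simplified, ASCII-only stand-in for {DOTTED_ID}):

```javascript
// The (?=...) lookahead must be satisfied for the regex to match at all,
// but it is NOT consumed: it never shows up in match[0], so it would not
// land in yytext either.
const re = /^[a-zA-Z_][a-zA-Z0-9_.]*(?=\s*\()/; // simplified DOTTED_ID + lookahead

const m = "Z.DIST(A1:A9)".match(re);
console.log(m[0]); // "Z.DIST" -- the "(" stays in the input

const noMatch = "Z.DIST+1".match(re);
console.log(noMatch); // null   -- no "(" follows, so no FUNCTION match
```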
tuliogomesbarbosa commented 7 years ago

Wow! You're a lifesaver! I'll try these features. Since this isn't an error but a missing feature in vanilla jison, I'll close this issue. Thanks again @GerHobbelt!! Keep up this excellent work!