zaach / jison

Bison in JavaScript.
http://jison.org

Changing lexer mode from parser fails #165

Open srathbun opened 11 years ago

srathbun commented 11 years ago

If your lexer has multiple states available to it, these can be changed from within the parser rules by calling yy.lexer.begin('state'). However, the generated parser pulls the next token before running the code associated with a parse rule.

This offsets the state change by one token: the parser may expect the correct token, but the lexer cannot return it, because it has not switched state yet. The same applies when switching out of a state.

In bison, parse rules are run before the next token is pulled, which allows this to work. I've attached an example grammar to demonstrate the problem.

%lex
%x numMode
%%

\s+                            /* skip whitespace */
@@                             return 'START'
<numMode>[0-9]+("."[0-9]+)?\b  return 'NUMBER'
<numMode>@@                    return 'END'
<numMode>.                     return 'TEXT'
<<EOF>>                        return 'EOF'
.                              return 'INVALID'

/lex

%start expressions

%% /* language grammar */

expressions
    : e EOF
        { typeof console !== 'undefined' ? console.log($1) : print($1);
          return $1; }
    ;

e
    : START numbers
        { yy.lexer.begin('numMode'); $$ = $2; }
    ;

numbers
    : numbers END
        { yy.lexer.popState(); $$ = $1; }
    | numbers NUMBER
        { $$ = $1 + $2; }
    | numbers TEXT
        { $$ = $1 + $2; }
    | NUMBER
        { $$ = $1; }
    | TEXT
        { $$ = $1; }
    ;
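
For reference, a minimal harness to reproduce this could look like the following (untested sketch; the file name lexer-mode.jison and the test input are assumptions, not part of the attached grammar):

var fs = require('fs');
var Parser = require('jison').Parser;

// Build a parser from the grammar above; 'lexer-mode.jison' is a hypothetical file name.
var parser = new Parser(fs.readFileSync('lexer-mode.jison', 'utf8'));

try {
    // The yy.lexer.begin('numMode') call only runs when the whole `e` rule
    // reduces, so the token after the opening '@@' is still scanned in the
    // INITIAL state and comes back as INVALID, which makes the parse fail.
    console.log(parser.parse('@@ 1 a @@'));
} catch (e) {
    console.log('parse error: ' + e.message);
}
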
GerHobbelt commented 11 years ago

TL;DR: not a jison bug but a grammar bug.


> In bison, parse rules are run before the next token is pulled, which allows this to work.

Unless I'm completely mistaken, this is very strange, as it would mean that bison would perform deep semantic analysis of the action code -- last time I looked, it didn't.

What you probably mean is this: if you want this classic 'lexer hack' (better known as a 'lexical tie-in', not to be confused with 'the lexer hack') to work in the bison/yacc family, you can place an action code block inside the rule (a.k.a. a 'mid-rule action') at a place where you know that no look-ahead is required to reduce the rule up to that point, e.g.

e
    : START { yy.lexer.begin('numMode'); } numbers
        { $$ = $2; }
    ;

This way of coding the grammar would/should work: the generated parser can now run the action, which switches the lexer mode, as soon as the rule has been matched up to that point, without the need for any 'look-ahead'.

Actions, by definition, can only be run once the matching rule has been completely matched (reduced) up to that point, which in this grammar's case means you require the lexer+parser to 'look ahead' past START in order to match the numbers rule. That's how LR(k) / LALR(k) / etc. works, for any k.

Unfortunately, jison doesn't support mid-rule actions like the example above, so you need to rewrite the grammar manually to provide a rule set that is LA(0) (i.e. 'no look-ahead required to reduce the rule') at the appropriate places. (You'll need to be careful anyway, because what you're doing is the equivalent of a lexical tie-in, which is also documented in the bison manual here: http://www.gnu.org/software/bison/manual/html_node/Lexical-Tie_002dins.html#Lexical-Tie_002dins -- and don't forget to read the next section of that manual as well: it should make you realize that even if jison had mid-rule action support, that wouldn't necessarily make this any easier for you.)

Here's a potential rewrite (untested, hence 'potential'), where the important rules have been marked LA(0).

Also note that your sample grammar is 'odd' in the sense that a human reader can't easily discern what's legal from the grammar definition alone. Reading only the grammar rules, it /seems/ legal to feed it the input @@ 1 @@ 1, which should produce the token stream START NUMBER END NUMBER; it can't do that, though, because the END token in the current grammar 'pops' the lexer state, so the real token stream is START NUMBER END INVALID. Another seemingly viable input is @@ 1 @@ @@, which would cause some very curious activity: the recursion in the numbers rule suggests that this is legal, while two END tokens will surely nuke your lexer state stack, particularly when you do something like that in a more complex grammar.

If your intent was to produce a grammar which munches tokens (numbers and text) delimited by @@, you'd be best served by pulling the END outside the recursive rule. This addition/change is also reflected in the grammar below. If you want a grammar where the sentinel is optional, then you will have to 'tolerate' a delayed lexer hack, due to the LA(1) situation when a sentinel is missing to clearly delimit the section which requires the lexer hack, or come up with another 'tweak' to ensure proper lexing/parsing operation. (OT: This is why I augmented the jison lexer with minimal 'backtracking' support.)

%lex
%x numMode
%%

\s+                            /* skip whitespace */
@@                             return 'START';
<numMode>[0-9]+("."[0-9]+)?\b  return 'NUMBER';
<numMode>@@                    return 'END';
<numMode>.                     return 'TEXT';
<<EOF>>                        return 'EOF';
.                              return 'INVALID';

/lex

%start expressions

%% /* language grammar */

expressions
    : e EOF
        { typeof console !== 'undefined' ? console.log($1) : print($1);
          return $1; }
    ;

/* The way the `e` rule and its subrules are written ensures that
 * the `start` and `sentinel/end` rules will reduce without the need
 * for any look-ahead: that's what is needed to make any
 * lexer hack work, anywhere. */
e
    : start_number_detection numbers sentinel /* !!! create rules so you can use them for action code exec */
        { $$ = $2; }
    ;

start_number_detection /* LA(0) --> action code executes before next token is fetched */
    : START
        { yy.lexer.begin('numMode'); }
    ;

sentinel /* LA(0) */
    : END
        { yy.lexer.popState(); }
    ;

numbers
    : numbers NUMBER
        { $$ = $1 + $2; }
    | numbers TEXT
        { $$ = $1 + $2; }
    | NUMBER
        { $$ = $1; }
    | TEXT
        { $$ = $1; }
    ;
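
The same kind of harness as in the first comment (again untested; lexer-mode-la0.jison is a hypothetical file name for the rewrite above) can be used to check whether the LA(0) restructuring suffices on a given jison version:

var fs = require('fs');
var Parser = require('jison').Parser;

var parser = new Parser(fs.readFileSync('lexer-mode-la0.jison', 'utf8'));

// If start_number_detection and sentinel really reduce without look-ahead,
// the lexer is already in numMode when the token after the opening '@@' is
// scanned, and this should parse -- provided the generated parser actually
// defers fetching the look-ahead in those states.
console.log(parser.parse('@@ 1 a @@'));
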
srathbun commented 11 years ago

Hmm, I was referring to this section of the manual, which states that the look-ahead does not happen every time before a reduction. That's what was throwing me the most about my debugging session with jison: it always grabbed the next token before reducing, and I thought it shouldn't. (This could be me misunderstanding what my grammar requires.)

Last time I did this in Bison and ran the resulting parser in debug mode, I got output kind of like this:

get token
stack: START
get token
stack: START NUMBER
get token
stack: START NUMBER TEXT
reduce
stack: START numbers
get token
stack: START numbers END
reduce
stack: START numbers
get token
stack: START numbers EOF
reduce
stack: e EOF
reduce
stack: expressions

I agree entirely with you, though: this is a nasty little edge case of these grammars, so it would not surprise me if my example grammar is wrong.

In my actual grammar, not included due to its proprietary nature, I ended up shifting this complexity into the lexer. It used a temp var to append the contents of yytext until it had found the ending delimiter and could return a single token.
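
For future readers, here is a sketch of that lexer-side workaround (untested and only an approximation, since the real grammar isn't available; the state name rawMode, the buffer yy.buf and the token name SECTION are made up for illustration, and it assumes lexer actions can use yy, assign to yytext and call this.begin()/this.popState(), as in standard jison lexers). The opening @@ switches the lexer state and starts a buffer, everything up to the closing @@ is appended to that buffer, and the closing @@ pops the state and hands the whole section to the parser as a single SECTION token:

%lex
%x rawMode
%%

\s+              /* skip whitespace (INITIAL only; rawMode is exclusive) */
"@@"             { this.begin('rawMode'); yy.buf = ''; /* start collecting, emit no token yet */ }
<rawMode>"@@"    { this.popState(); yytext = yy.buf; return 'SECTION'; }
<rawMode>[^@]+   { yy.buf += yytext; /* accumulate content up to the closing delimiter */ }
<rawMode>"@"     { yy.buf += yytext; /* a lone '@' is still content */ }
<<EOF>>          return 'EOF';
.                return 'INVALID';

/lex

%start expressions

%%

expressions
    : SECTION EOF
        { return $1; }
    ;

With this approach the parser never needs to know about lexer states at all, so the late look-ahead fetch no longer matters.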

Therefore, this ticket is more of a documentation note for myself and others in the future. As such, your examples and explanations are much appreciated.

GerHobbelt commented 7 years ago

Also relevant is this issue: https://github.com/GerHobbelt/jison/issues/3. There, a bugfix for 'default action' state handling is discussed as part of another behaviour which requires the same parser generator ability as the yacc-style 'lexer hack' in this issue: the parser must be able to fetch the look-ahead from the lexer as late as possible. For that, 'default action' parser table rows are crucial, as these describe states where the parser does not need any look-ahead to know what to do next after reducing the already matched rule.

WARNING: the material in the referenced issue https://github.com/GerHobbelt/jison/issues/3 is focused on the GerHobbelt fork; 'vanilla jison' there refers to the original zaach jison (i.e. this very repository)!

GerHobbelt commented 7 years ago

Hence, given https://github.com/GerHobbelt/jison/issues/3, my earlier comment dated 2013-03-28 is almost certainly WRONG: you have, with high probability, run into the jison bug discussed in that issue!