zaach / jison

Bison in JavaScript.
http://jison.org

EOF token is returned only once in recursive grammar #405

Open dsogari opened 10 months ago

I have the following grammar:

// test.jison
%lex

%%

\s+ // skip whitespace
\w+ return 'IDENTIFIER';
\:  return 'BEGIN_BLOCK';
$   { console.log('EOF'); return 'EOF'; }
.   return 'INVALID';

/lex

%start DOCUMENT

%%

DOCUMENT: STATEMENT EOF;

STATEMENT: IDENTIFIER STMT_BLOCK;

STMT_BLOCK: /* empty */ | BEGIN_BLOCK DOCUMENT;

This is the test script:

// test.js
import jison from 'jison';
import fs from 'fs';

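// Build a parser from the grammar file at runtime.
const grammar = fs.readFileSync('test.jison', 'utf8');
const parser = new jison.Parser(grammar);
try {
    // The input to parse is the first command-line argument.
    parser.parse(process.argv[2]);
} catch (err) {
    console.log(err.message);
}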

The command node test.js 'level1' runs without errors and prints EOF.

I would expect node test.js 'level1: level2' to print EOF twice, once for each nested DOCUMENT, but it prints this instead:

EOF
Parse error on line 1:
level1: level2
--------------^
Expecting 'EOF', got '1'

The reason is that the lexer returns the EOF token only once, at the nested level. After that, it returns the token 1 (the parser's internal value for end-of-file). Unfortunately, this special token cannot be referenced from the grammar, which makes it impossible to parse this particular language. :(
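To make the token sequence concrete, here is a simplified model of the behaviour described above. It is not jison's actual source: ToyLexer, END_MARKER and firstTokens are illustrative names, and the model assumes the lexer stops consulting user rules once the end of input has been matched.

```javascript
// Simplified model of the reported behaviour; NOT jison's real lexer.
// ToyLexer, END_MARKER and firstTokens are hypothetical names.
const END_MARKER = 1; // numeric end-of-input value, shown as '1' in the error

class ToyLexer {
  constructor(input) {
    this.input = input;
    this.pos = 0;
    this.done = false; // set once the end of input has been seen
  }

  lex() {
    // Assumption: after the end was seen once, user rules are skipped.
    if (this.done) return END_MARKER;

    // Skip whitespace (the `\s+` rule).
    while (this.pos < this.input.length && /\s/.test(this.input[this.pos])) {
      this.pos++;
    }

    if (this.pos >= this.input.length) {
      this.done = true;
      return 'EOF'; // the `$` rule fires exactly once
    }

    const rest = this.input.slice(this.pos);
    const word = /^\w+/.exec(rest);
    if (word) {
      this.pos += word[0].length;
      return 'IDENTIFIER';
    }
    this.pos++;
    return rest[0] === ':' ? 'BEGIN_BLOCK' : 'INVALID';
  }
}

// Collect the first `count` tokens the toy lexer produces for `input`.
function firstTokens(input, count) {
  const lexer = new ToyLexer(input);
  return Array.from({ length: count }, () => lexer.lex());
}

console.log(firstTokens('level1: level2', 5));
```

In this model the first four tokens are IDENTIFIER, BEGIN_BLOCK, IDENTIFIER and EOF, and the fifth is the numeric 1. The grammar needs a second EOF at that point to close the outer DOCUMENT.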

To fix it, I believe the $ (or equivalent <<EOF>>) rule should keep matching for as long as the lexer is at the end of the file. Alternatively, jison could provide a way to reference the 1 token directly in the grammar.
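As a rough illustration of the first option, the sketch below is a hand-written recursive-descent parser for the same grammar whose token source answers 'EOF' every time it is asked at the end of input. It does not use jison; tokenize, parse and the inner helpers are hypothetical names chosen for this example.

```javascript
// Sketch of the proposed "EOF matches indefinitely" behaviour.
// Hand-written for illustration; none of these helpers exist in jison.

// Split the input into token names, skipping whitespace.
function tokenize(input) {
  return Array.from(input.matchAll(/\w+|\S/g), ([text]) => {
    if (/^\w+$/.test(text)) return 'IDENTIFIER';
    return text === ':' ? 'BEGIN_BLOCK' : 'INVALID';
  });
}

// Parse `input` and return how many times EOF was consumed.
function parse(input) {
  const tokens = tokenize(input);
  let pos = 0;
  let eofCount = 0;

  // At the end of input, keep answering 'EOF' however often we are asked.
  const peek = () => (pos < tokens.length ? tokens[pos] : 'EOF');

  const expect = (type) => {
    const got = peek();
    if (got !== type) throw new Error(`Expecting '${type}', got '${got}'`);
    if (type === 'EOF') eofCount++; // EOF is never "used up"
    else pos++;
  };

  // DOCUMENT: STATEMENT EOF;
  function parseDocument() {
    parseStatement();
    expect('EOF');
  }

  // STATEMENT: IDENTIFIER STMT_BLOCK;
  function parseStatement() {
    expect('IDENTIFIER');
    parseStmtBlock();
  }

  // STMT_BLOCK: /* empty */ | BEGIN_BLOCK DOCUMENT;
  function parseStmtBlock() {
    if (peek() === 'BEGIN_BLOCK') {
      expect('BEGIN_BLOCK');
      parseDocument();
    }
  }

  parseDocument();
  return eofCount;
}

console.log(parse('level1'));         // 1
console.log(parse('level1: level2')); // 2
```

With a persistent EOF, each nested DOCUMENT can consume its own EOF, so the two-level input is accepted. The sketch sidesteps the question of how a generated LR parser would then recognise the final end of input, which any real fix would still have to settle.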