zaach / jison

Bison in JavaScript.
http://jison.org

Lexical error with "_" #204

Closed fred-wang closed 10 years ago

fred-wang commented 10 years ago

I've been trying to write a LaTeX-to-MathML converter with Jison and found a weird lexical error. Below is a reduced test case. If I copy the grammar into http://zaach.github.io/jison/try/ and try to parse "a_ b" (with a space between "_" and "b") then I get the expected MathML output. However, if I try "a_b" (without the space) I get a lexical error. The grammar skips whitespace, so both inputs should give the same result.

/* lexical grammar */
%lex

%%

\s+ /* skip whitespace */
"_" return "_";
[a-zA-Z] return "LETTER";
<<EOF>> return "EOF";

/lex

%start math

%% /* language grammar */

math
  : compoundTerm EOF { $$ = "<math>" + $1 + "</math>"; return $$; }
  ;

compoundTerm
  : closedTerm "_" closedTerm {
    $$ = "<msub>" + $1 + $3 + "</msub>";
  }
  ;

closedTerm
  : LETTER { $$ = "<mi>" + $1 + "</mi>"; }
  ;
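
For reference, tracing the semantic actions above by hand, the expected output for the working input "a_ b" is:

<math><msub><mi>a</mi><mi>b</mi></msub></math>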
GerHobbelt commented 10 years ago

Quick test shows that this rule

"_" return "_";

replaced by this

[_] return "_";

makes the grammar behave correctly.

Inspection of the generated output reveals why: string-delimited lexer rules get a \b regex postfix (apparently to ensure such strings only match 'entire words', but that is me guessing). Anyway, the lexer ruleset:

%%

\s+ /* skip whitespace */
[_] return "_";
"_" return "_";
"a" return "a";
[a-zA-Z] return "LETTER";
<<EOF>> return "EOF";

/lex

produces this generated set of (regex) lexer rules in JavaScript:

rules: [/^(?:\s+)/,/^(?:[_])/,/^(?:_\b)/,/^(?:a\b)/,/^(?:[a-zA-Z])/,/^(?:$)/],

where the \b's added by jison are clearly visible.

Of course this analysis does not answer if and where to fix this behaviour. With the \b's in there, the lexer rules for literal words look clean and concise; without them, you need more regex 'savviness' to get whole-word matches. That is of course not a desired feature in your example grammar, but I bet a lot of parsers depend on this behaviour for matching their 'reserved words' in the lexer (I know I do ;-) -- I just hadn't looked this hard at the detail of why it works that way until now).
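
You can verify the effect of that \b directly in a JavaScript console (a quick sketch, using the generated rules above):

// "_" is a word character, so the generated _\b only matches when the
// character AFTER the underscore is a non-word character (or end of input):
/^(?:_\b)/.test("_b");    // false: "_" and "b" are both word chars, no boundary
/^(?:_\b)/.test("_ b");   // true:  "_" followed by a space, boundary present
/^(?:[_])/.test("_b");    // true:  character-class rules get no \b appended

This is exactly why "a_ b" lexes fine while "a_b" raises a lexical error.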

fred-wang commented 10 years ago

Thank you for your response. I plan to generate the lexer from the (6.6 MB!) http://www.w3.org/2003/entities/2007xml/unicode.xml file by extracting the relevant math characters and TeX commands, so I'd like a general way to work around this bug, since it is likely to happen in other cases. For example, I get it for the \sum command below (it works if I replace the rule with e.g. "\su"[m]):

/* lexical grammar */
%lex

%%

\s+ /* skip whitespace */
[_] return "_";
"\\sum" return "SUM";
[a-zA-Z] return "LETTER";
<<EOF>> return "EOF";

/lex

%start math

%% /* language grammar */

math
  : compoundTerm EOF { $$ = "<math>" + $1 + "</math>"; return $$; }
  ;

compoundTerm
  : closedTerm "_" closedTerm {
    $$ = "<msub>" + $1 + $3 + "</msub>";
  }
  ;

closedTerm
  : LETTER { $$ = "<mi>" + $1 + "</mi>"; }
  | SUM { $$ = "<mo>\u2211</mo>"; }
  ;
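
The same \b postfix explains the \sum failure and why the "\su"[m] trick works; a quick sketch (assuming, as the generated rules shown above suggest, that jison only appends \b to rules that are pure string literals):

/^(?:\\sum\b)/.test("\\sum_a");   // false: "m" and "_" are both word characters
/^(?:\\su[m])/.test("\\sum_a");   // true:  mixed string/class rule, no trailing \b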
GerHobbelt commented 10 years ago

Whoops; scanned the wiki for other work and ran into #63 being listed as a known difference between flex and jison.

I ran afoul of that one myself today as well, and the clash between #63 and this issue means I should make the \b regex patching jison does for 'simple keywords' an option-driven activity. Hence this will receive a bit more attention in another issue/pull request, due within a week (holidays and all that jazz).

Side Note about large symbol tables being lexed

BTW @fred-wang: if you intend to lex a large set of symbols, you may be faster (and tighter in your code as well) if you do not offload such a huge Unicode symbol table onto the jison lexer, but instead use one more generic regex, a custom symbol-table lookup (which can easily be an O(1) hash table), plus a .reject() to skip the lexer rule when the symbol table does not deliver a match.

Example of one lexer rule and large custom symbol table

Example snippet (edited) from my own work, to give a hint of how a production lexer might look. It beats having jison match a huge set of miniature regexes, one per 'symbol':

%lex

// very important: this enables .reject() to work (instead of throwing a parseError() tantrum)
%options backtrack_lexer

// [... lots of unrelated lexer stuff ...]

// wicked regex: match 1 to 3 non-whitespace characters.
// of course basic single-symbol lookup doesn't need this sophistication,
// but I'm sure you can reduce this to a simpler version when you don't need this.
\S{1,3}
        %{
            /*
             * Check if the matched string STARTS WITH an operator in the list below.
             * 
             * On the first pass, a hash table is created (and cached) to speed up matching.
             */
            if (!this.__operator_hash_table) {
                var definition_table = [
                    {
                        name: "&&", 
                        opcode: FKW_BOOLEAN_AND_OPERATOR | FT_BOOLEAN | FU_DERIVED, 
                        produce: function() {
                            // we use this function -> return approach as that makes jison recognize
                            // the tokens and replace them by (oh, yummy!) *numbers*, further speeding
                            // up the lexer/parser process (a peephole optimization, alas)
                            return 'BOOLEAN_AND_OPERATOR';
                        }
                    },
                    {
                        name: "||",
                        opcode: FKW_BOOLEAN_OR_OPERATOR | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'BOOLEAN_OR_OPERATOR';
                        }
                    },
                    {
                        name: "<=",
                        opcode: FKW_LESS_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'LESS_OR_EQUAL';
                        }
                    },
                    {
                        name: ">=",
                        opcode: FKW_GREATER_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'GREATER_OR_EQUAL';
                        }
                    },
                    {
                        name: "<",
                        opcode: FKW_LESS_THAN | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return '<';
                        }
                    },
                    {
                        name: ">",
                        opcode: FKW_GREATER_THAN | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return '>';
                        }
                    },
                    {
                        name: "===",
                        opcode: FKW_IS_IDENTICAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'IS_IDENTICAL';
                        }
                    },
                    {
                        name: "==",
                        opcode: FKW_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'IS_EQUAL';
                        }
                    },
                    {
                        name: "=",
                        opcode: FKW_ASSIGN | FT_ANY | FU_DERIVED,
                        produce: function() {
                            return '=';
                        }
                    },
                    {
                        name: "\u221a",
                        opcode: FKW_SQRT_OPERATOR | FT_NUMBER | FU_ANY,
                        produce: function() {
                            return 'SQRT_OPERATOR';                     /* √ */
                        }
                    },
                    {
                        name: "\u2248",
                        opcode: FKW_ALMOST_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'ALMOST_EQUAL';                      /* ≈ */
                        }
                    },
                    {
                        name: "\u2260",
                        opcode: FKW_NOT_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'NOT_EQUAL';                         /* ≠ */
                        }
                    },
                    {
                        name: "\u2264",
                        opcode: FKW_LESS_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'LESS_OR_EQUAL';                     /* ≤ */
                        }
                    },
                    {
                        name: "\u2265",
                        opcode: FKW_GREATER_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'GREATER_OR_EQUAL';                  /* ≥ */
                        }
                    },
                    {
                        name: "\u2261",
                        opcode: FKW_IS_IDENTICAL | FT_BOOLEAN | FU_DERIVED,
                        produce: function() {
                            return 'IS_IDENTICAL';                      /* ≡ */
                        }
                    },
                    {
                        name: "\u00d7",
                        opcode: FKW_MULTIPLY | FT_NUMBER | FU_DERIVED,
                        produce: function() {
                            return '*';                                 /* × */
                        }
                    },
                    // ... a lot more ... anyway, you get the drift ;-) ...
                    // ... included the one entry below so you can observe this.xyz work ...
                    {
                        name: ".",
                        opcode: FKW_DOT,
                        produce: function() {
                            // switch lexer modes RIGHT NOW: next up is the `json_filter_expression` rule!
                            assert(this.topState() !== 'JSON_FILTERING');
                            this.pushState('JSON_FILTERING');

                            return '.';
                        }
                    },
                ];
                var k, d, ht;

                ht = [{}, {}, {}, {}];  // indexed by operator name length (1..3); slot 0 stays unused
                for (k in definition_table) {
                    d = definition_table[k];
                    ht[d.name.length][d.name] = d;
                }
                this.__operator_hash_table = ht;
            }

            var s1 = false, s2 = false, s3 = false;

            var s = yytext;
            switch (s.length) {
            case 3:
                s3 = s;
                s = s.substr(0, 2);
                // fall through
            case 2:
                s2 = s;
                s = s.substr(0, 1);
                // fall through
            case 1:
                s1 = s;
                break;
            default:
                assert(0, "should never get here");
                break;
            }

            // now find matches in the operator lookup table, largest match first:
            var rv = this.__operator_hash_table[3][s3] || this.__operator_hash_table[2][s2] || this.__operator_hash_table[1][s1];
            if (rv) {
                // push the remainder of the matched text back into the buffer
                // before we continue (`s` was truncated above, so use `yytext`):
                if (yytext.length > rv.name.length) {
                    this.unput(yytext.substr(rv.name.length));
                }

                yytext = (new V.FormulaParser.ASTopcode(rv.opcode))
                    .setLocationInfo(yylloc)
                    .setCommentsIndex(parser.getNextCommentIndex())
                    .setLexedText(rv.name);
                return rv.produce.call(this);
            }

            // when we don't have a match at all, we leave it to the other rules to hit something:
            this.reject();
        %}

/lex
fred-wang commented 10 years ago

Thanks. Yes, adding an option to choose whether we want a \b will work for me. At the moment, I just use sed to remove all these \b's.

The parser is relatively big, something like ~750 kB (although ~118 kB when gzipped). I plan to try to optimize the lexer, since most actions just remap a TeX command to a Unicode character. However, it seems that what takes the most time to generate is the grammar, with its various operators and priorities.

FYI, I have a preliminary version here: http://fred-wang.github.io/TeXZilla/. However, there are some things to improve (including fixing errors in the unicode.xml file) and I have not done much testing yet...

GerHobbelt commented 10 years ago

You might want to try https://github.com/GerHobbelt/jison which has the 'no \b added' behaviour built in as the default. If you want the auto-\b behaviour, you must specify the lexer option %options easy_keyword_rules. (commit 64759c4362af04f63e980764e322ddca279737a5 et al)
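
For illustration, a minimal sketch of opting back in on that fork (the option name comes from the commit above; the "if"/IDENTIFIER rules are made-up examples):

%lex
%options easy_keyword_rules

%%

\s+           /* skip whitespace */
"if"          return "IF";          // with the option: compiled as /^(?:if\b)/
[a-zA-Z]+     return "IDENTIFIER";
<<EOF>>       return "EOF";

/lex

Without the option, the "if" rule compiles to plain /^(?:if)/, so whole-word matching of reserved words becomes the grammar author's responsibility.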

This work will be presented in a pull request (or a bunch of those) here, as there's more useful stuff in there that I'd like @zaach to have.

PS: will be on the road in a few hours after some Z so expect response in the new year only. ;-)

zaach commented 10 years ago

I missed this thread and have to read through it, but just a quick note that "%options flex" in the lexical grammar will prevent \bs from being inserted, as well as make lexer matching behave more like flex.
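
In the grammars above that would look like this (a minimal sketch; per the note above, the option also shifts the matching semantics closer to flex):

%lex
%options flex

%%

\s+           /* skip whitespace */
"_"           return "_";           // no \b appended: "a_b" now lexes
"\\sum"       return "SUM";         // likewise for \sum
[a-zA-Z]      return "LETTER";
<<EOF>>       return "EOF";

/lex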