A quick test shows that replacing this rule
"_" return "_";
with this one
[_] return "_";
makes the grammar behave correctly.
Inspection of the generated output reveals why: string-delimited lexer regexes get a \b regex postfix, apparently to ensure such strings only match 'entire words' (but that is me guessing). Anyway, this lexer ruleset:
%%
\s+ /* skip whitespace */
[_] return "_";
"_" return "_";
"a" return "a";
[a-zA-Z] return "LETTER";
<<EOF>> return "EOF";
/lex
produces this generated set of (regex) lexer rules in JavaScript:
rules: [/^(?:\s+)/,/^(?:[_])/,/^(?:_\b)/,/^(?:a\b)/,/^(?:[a-zA-Z])/,/^(?:$)/],
where the \b's added by jison are clearly visible.
Of course this analysis does not answer if and where to fix this behaviour. With the \b's in there, the lexer rules for literal words look clean and concise; without them, you need more regex 'savviness' to get whole-word matches. That is not a desired feature in your example grammar, of course, but I bet a lot of parsers depend on this behaviour for their 'reserved words' matching in the lexer (I know I do ;-) -- I just hadn't looked this hard at why it works that way until now).
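To make the tradeoff concrete, here is a quick check you can run in any JS console, using the generated rules above (the "if" rule is a made-up keyword example, not part of the grammar):
/^(?:_\b)/.exec("_b"); // null: '_' and 'b' are both word characters, so there is no boundary between them
/^(?:[_])/.exec("_b"); // ["_"]: the character class gets no \b appended, so it matches
/^(?:_\b)/.exec("_ b"); // ["_"]: a following space does provide a word boundary
/^(?:if\b)/.exec("iffy"); // null: the same \b is what keeps a keyword rule from eating the prefix of a longer identifier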
With kind regards / Best regards,
Ger Hobbelt
web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ger@hobbelt.com
Thank you for your response. I plan to generate the lexer from the (6.6 MB!) http://www.w3.org/2003/entities/2007xml/unicode.xml file by extracting the relevant math characters and TeX commands, so I'd like a general way to work around this bug, since it is likely to happen in other cases. For example, I get it for the \sum command in the grammar below (it works if I replace the rule with e.g. "\su"[m]; see the quick check after the grammar):
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[_] return "_";
"\\sum" return "SUM";
[a-zA-Z] return "LETTER";
<<EOF>> return "EOF";
/lex
%start math
%% /* language grammar */
math
: compoundTerm EOF { $$ = "<math>" + $1 + "</math>"; return $$; }
;
compoundTerm
: closedTerm "_" closedTerm {
$$ = "<msub>" + $1 + $3 + "</msub>";
}
;
closedTerm
: LETTER { $$ = "<mi>" + $1 + "</mi>"; }
| SUM { $$ = "<mo>\u2211</mo>"; }
;
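For reference, the rule jison generates for the "\\sum" literal would be along the lines of /^(?:\\sum\b)/ (inferred from the rules shown earlier, not quoted from the generated file), and it fails for exactly the same reason:
/^(?:\\sum\b)/.exec("\\sum_n"); // null: 'm' and '_' are both word characters, so no boundary
/^(?:\\sum\b)/.exec("\\sum n"); // matches: the space supplies the word boundary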
Whoops; scanned the wiki for other work and ran into #63, which is listed as a known difference between flex and jison.
I ran afoul of that one myself today as well; the clash between #63 and this issue means I should make the \b regex patching that jison does for 'simple keywords' an option-driven activity. Hence this will receive a bit more attention in another issue/pull request, due within a week (holidays and all that jazz).
BTW @fred-wang: if you intend to lex a large set of symbols, you may be faster (and tighter in your code as well) if you do not offload such a huge Unicode symbol table to the jison lexer, but instead use a more generic regex plus a custom symbol-table lookup (which can easily be an O(1) hash table) and a .reject() to skip the lexer rule when the symbol table does not deliver a match.
Example snippet (edited) from my own work, to give a hint of how a production lexer might look. It beats having jison match a huge set of miniature regexes, one for each 'symbol':
%lex
// very important: this enables .reject() to work (instead of throwing a parseError() tantrum)
%option backtrack_lexer
[... lots of unrelated lexer stuff...]
// wicked regex: match 1 to 3 non-whitespace characters.
// of course basic single-symbol lookup doesn't need this sophistication,
// but I'm sure you can reduce this to a simpler version when you don't need this.
\S{1,3}
%{
/*
* Check if the matched string STARTS WITH an operator in the list below.
*
* On the first pass, a hash table is created (and cached) to speed up matching.
*/
if (!this.__operator_hash_table) {
var definition_table = [
{
name: "&&",
opcode: FKW_BOOLEAN_AND_OPERATOR | FT_BOOLEAN | FU_DERIVED,
produce: function() {
// we use this function -> return approach as that makes jison recognize
// the tokens and replace them by (oh, yummy!) *numbers*, further speeding
// up the lexer/parser process (a peephole optimization, alas)
return 'BOOLEAN_AND_OPERATOR';
}
},
{
name: "||",
opcode: FKW_BOOLEAN_OR_OPERATOR | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'BOOLEAN_OR_OPERATOR';
}
},
{
name: "<=",
opcode: FKW_LESS_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'LESS_OR_EQUAL';
}
},
{
name: ">=",
opcode: FKW_GREATER_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'GREATER_OR_EQUAL';
}
},
{
name: "<",
opcode: FKW_LESS_THAN | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return '<';
}
},
{
name: ">",
opcode: FKW_GREATER_THAN | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return '>';
}
},
{
name: "===",
opcode: FKW_IS_IDENTICAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'IS_IDENTICAL';
}
},
{
name: "==",
opcode: FKW_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'IS_EQUAL';
}
},
{
name: "=",
opcode: FKW_ASSIGN | FT_ANY | FU_DERIVED,
produce: function() {
return '=';
}
},
{
name: "\u221a",
opcode: FKW_SQRT_OPERATOR | FT_NUMBER | FU_ANY,
produce: function() {
return 'SQRT_OPERATOR'; /* √ */
}
},
{
name: "\u2248",
opcode: FKW_ALMOST_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'ALMOST_EQUAL'; /* ≈ */
}
},
{
name: "\u2260",
opcode: FKW_NOT_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'NOT_EQUAL'; /* ≠ */
}
},
{
name: "\u2264",
opcode: FKW_LESS_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'LESS_OR_EQUAL'; /* ≤ */
}
},
{
name: "\u2265",
opcode: FKW_GREATER_OR_EQUAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'GREATER_OR_EQUAL'; /* ≥ */
}
},
{
name: "\u2261",
opcode: FKW_IS_IDENTICAL | FT_BOOLEAN | FU_DERIVED,
produce: function() {
return 'IS_IDENTICAL'; /* ≡ */
}
},
{
name: "\u00d7",
opcode: FKW_MULTIPLY | FT_NUMBER | FU_DERIVED,
produce: function() {
return '*'; /* × */
}
},
// ... a lot more entries ... anyway, you get the drift ;-) ...
// ... the one entry below is included so you can observe this.xyz work ...
{
name: ".",
opcode: FKW_DOT,
produce: function() {
// switch lexer modes RIGHT NOW: next up is the `json_filter_expression` rule!
assert(this.topState() !== 'JSON_FILTERING');
this.pushState('JSON_FILTERING');
return '.';
}
},
];
var k, d, ht;
// index 0 is never used: entries are keyed by operator name length (1..3)
ht = [{}, {}, {}, {}];
for (k in definition_table) {
d = definition_table[k];
ht[d.name.length][d.name] = d;
}
this.__operator_hash_table = ht;
}
var s1 = false, s2 = false, s3 = false;
var s = yytext;
switch (s.length) {
case 3:
s3 = s;
s = s.substr(0, 2);
// fall through
case 2:
s2 = s;
s = s.substr(0, 1);
// fall through
case 1:
s1 = s;
break;
default:
assert(0, "should never get here");
break;
}
// now find matches in the operator lookup table, largest match first:
var rv = this.__operator_hash_table[3][s3] || this.__operator_hash_table[2][s2] || this.__operator_hash_table[1][s1];
if (rv) {
// push the remainder back into the buffer before we continue:
if (yytext.length > rv.name.length) {
// compare against yytext here: `s` was truncated to a single character by the switch above
this.unput(yytext.substr(rv.name.length));
}
yytext = (new V.FormulaParser.ASTopcode(rv.opcode))
.setLocationInfo(yylloc)
.setCommentsIndex(parser.getNextCommentIndex())
.setLexedText(rv.name);
return rv.produce.call(this);
}
// when we don't have a match at all, we leave it to the other rules to hit something:
this.reject();
%}
/lex
Thanks. Yes, adding an option to choose whether we want a \b will work for me. At the moment, I just use sed to remove all these \b's.
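(The exact sed invocation is not given here; a hypothetical one-liner, assuming every \b sits just before the closing parenthesis of a rule regex as in the rules array shown earlier, would be:)
sed -i 's/\\b)/)/g' parser.js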
The parser is relatively big, something like ~750 kB (although ~118 kB when gzipped). I plan to try to optimize the lexer, since most actions just remap a TeX command to a Unicode character. However, it seems that what takes the most time to generate is the grammar, with its various operators and priorities.
FYI, I have a preliminary version here: http://fred-wang.github.io/TeXZilla/. However, there are some things to improve (including fixing errors in the unicode.xml file) and I have not done much testing yet...
You might want to try https://github.com/GerHobbelt/jison which has the 'no \b added' behaviour built in as the default.
If you want the auto-\b behaviour, you must specify the lexer option %options easy_keyword_rules. (commit 64759c4362af04f63e980764e322ddca279737a5 et al.)
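A minimal sketch of opting back in (the option name is taken from that commit; the keyword rule itself is a made-up example):
%lex
%options easy_keyword_rules
%%
\s+ /* skip whitespace */
"if" return 'IF'; /* compiled as /^(?:if\b)/ with the option, /^(?:if)/ without */
[a-zA-Z]+ return 'LETTER';
/lex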
This work will be presented in a pull request (or a bunch of those) here, as there's more useful stuff in there that I'd like @zaach to have.
PS: will be on the road in a few hours after some Z so expect response in the new year only. ;-)
I missed this thread and have to read through it, but just a quick note that "%options flex" in the lexical grammar will prevent \b's from being inserted, as well as make lexer matching behave more like flex.
I've been trying to write a LaTeX-to-MathML converter with Jison and found a weird lexical error. Below is a reduced testcase. If I copy the grammar into http://zaach.github.io/jison/try/ and try to parse "a_ b" (with a space between "_" and "b") then I get the expected MathML output. However, if I try "a_b" (without the space) I get a lexical error. The grammar skips whitespace, so both should give the same result.