zaach / jison

Bison in JavaScript.
http://jison.org

Line beginning identifier (^) does not work. #67

Open neizod opened 12 years ago

neizod commented 12 years ago

In the lexer part:

"a"     return 'BODY';
^"a"    return 'HEAD';

Test case: the input a a returns the tokens BODY BODY, while

^"a"    return 'HEAD';
"a"     return 'BODY';

returns the tokens HEAD HEAD (expected: HEAD BODY).

jklmli commented 12 years ago

Have you tried using:

 "^a"

neizod commented 12 years ago

I just tried it, and it does not behave as I expected.

zaach commented 12 years ago

This is tricky because the lexer uses JavaScript regular expressions, which don't allow you to start matching from an arbitrary position in a string. This means a new string is created each time, starting at the end of the last match, so ^ is technically always true.
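
To make the effect concrete, here is a minimal sketch of a tokenizer that slices the input after every match. It is not Jison's actual code: `tokenizeBySlicing` and the rule table are hypothetical, and the sketch assumes every rule is anchored with ^ and tested against the remaining input.

```javascript
// Hypothetical rule table. Every pattern is anchored to the start of the
// remaining input, so the inner ^ in the first rule adds no constraint.
const rules = [
  { pattern: /^(?:^a)/, token: "HEAD" }, // stands in for ^"a"
  { pattern: /^(?:a)/, token: "BODY" },  // stands in for "a"
  { pattern: /^(?:\s+)/, token: null },  // skip whitespace
];

// Simplified tokenizer: after each match the consumed prefix is sliced off,
// so the next match always starts at index 0 of a fresh string.
function tokenizeBySlicing(input, ruleList) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    let matched = false;
    for (const rule of ruleList) {
      const m = rule.pattern.exec(rest);
      if (m && m[0].length > 0) {
        if (rule.token !== null) tokens.push(rule.token);
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unrecognized input: " + JSON.stringify(rest));
  }
  return tokens;
}

console.log(tokenizeBySlicing("a a", rules)); // [ 'HEAD', 'HEAD' ]
```

Listing the BODY rule first yields BODY BODY instead, which matches both results reported at the top of this issue.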

A possible workaround would be to prepend the input with a unique character and replace ^ with that character in the rules.
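
A rough sketch of that workaround, outside of Jison, might look like the following. The sentinel character, `markLineStarts`, and `tokenize` are all assumptions made for illustration. The sketch also marks the position after every newline, which goes slightly beyond prepending a single character to the input.

```javascript
// Hypothetical sentinel: any character that cannot occur in real input.
const LINE_START = "\u0001";

// Mark the start of the input and the start of every following line.
function markLineStarts(input) {
  return LINE_START + input.replace(/\n/g, "\n" + LINE_START);
}

// The ^ in the HEAD rule is replaced by the sentinel; the outer ^ only
// anchors each pattern to the start of the remaining (sliced) input.
const rules = [
  { pattern: /^\u0001a/, token: "HEAD" }, // stands in for ^"a"
  { pattern: /^a/, token: "BODY" },       // stands in for "a"
  { pattern: /^\s+/, token: null },       // skip whitespace
  { pattern: /^\u0001/, token: null },    // drop a sentinel no rule used
];

function tokenize(input) {
  const tokens = [];
  let rest = markLineStarts(input);
  while (rest.length > 0) {
    const rule = rules.find((r) => r.pattern.test(rest));
    if (!rule) throw new Error("Unrecognized input: " + JSON.stringify(rest));
    const length = rule.pattern.exec(rest)[0].length;
    if (rule.token !== null) tokens.push(rule.token);
    rest = rest.slice(length);
  }
  return tokens;
}

console.log(tokenize("a a"));    // [ 'HEAD', 'BODY' ]
console.log(tokenize("a a\na")); // [ 'HEAD', 'BODY', 'HEAD' ]
```

The cost is that every rule set needs an extra rule to discard sentinels that no line-start rule consumed, and reported column numbers would be off by one unless corrected.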

victorporof commented 12 years ago

@zaach The y flag [0] may help with this; however, I don't know how well it is supported in browsers other than Gecko-based ones.

[0] https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp

alvaro-cuesta commented 10 years ago

Quick and dirty hack to solve this:

"a" %{
  this.yy_ = this;
  return (this.yylloc.first_column === 0) ? 'HEAD' : 'BODY';
%}

aaditmshah commented 10 years ago

What about using custom scanners? I have written a library called Lexer in the spirit of Flex which allows you to match arbitrary expressions as follows:

var Parser = require("jison").Parser;
var Lexer = require("lex");

var grammar = {
    "bnf": {
        // ...
    }
};

var parser = new Parser(grammar);
var lexer = parser.lexer = new Lexer;

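// "a" at the beginning of a line is the HEAD token, as in the original report.
lexer.addRule(/^a/, function (lexeme) {
    this.yytext = lexeme;
    return "HEAD";
});

// Any other "a" is a BODY token.
lexer.addRule(/a/, function (lexeme) {
    this.yytext = lexeme;
    return "BODY";
});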

Perhaps we could integrate it into Jison to be the default scanner? Advantages:

  1. It's easier to use regular expressions themselves instead of string descriptions of regular expressions.
  2. It's easier to use functions themselves instead of string descriptions of function bodies.
  3. Lexer currently supports some very powerful features such as start conditions, global patterns, optional case insensitive matching, optionally matching beginning and end of lines, etc.

I've also wanted to improve the performance of Lexer for quite a while by using Finite State Automata instead of native regular expressions. Perhaps we could work on that collaboratively?

zaach commented 10 years ago

@aaditmshah A more JavaScript-friendly lexer is definitely a nice thing to have, but one of the qualifications for the default lexer is that it can be expressed in a way that's familiar to Flex users.

I've thought about implementing a regex engine in JS, but building one with enough features and speed to be useful is more than I have time for. Another option I believe others have explored is compiling a C/C++ regex engine using emscripten.

aaditmshah commented 10 years ago

I have enough time to implement a regex engine in pure JavaScript. What is the interface required to integrate a regex engine with jison? Is it the same interface that's exposed by jison-lex?

amobiz commented 8 years ago

Now that we have the "sticky" flag, could we make every rule's regex sticky and multiline (/my) and, before testing each regex, manually set its lastIndex to the lastIndex of the last matched regex?

var match, rule, lastIndex, i;
lastIndex = lastMatchRegex.lastIndex;
for (i = 0; i < rules.length; i++) {
    rule = rules[i];
    rule.regex.lastIndex = lastIndex;
    match = input.match(rule.regex);
    if (match) {
        return match[0];
    }
}
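
A self-contained version of this idea could look like the sketch below. The rule table and `tokenize` are hypothetical and do not reflect Jison's internals. The sketch assumes a runtime that supports the sticky (y) flag and tracks the position in a plain variable rather than reading it from the last matched regex.

```javascript
// Hypothetical rule table. With the sticky flag a regex matches only at
// lastIndex, and with the multiline flag ^ matches only at the start of
// the input or right after a line terminator.
const rules = [
  { regex: /^a/my, token: "HEAD" },
  { regex: /a/y, token: "BODY" },
  { regex: /\s+/y, token: null }, // skip whitespace
];

function tokenize(input) {
  const tokens = [];
  let pos = 0; // current offset into the unmodified input string
  while (pos < input.length) {
    let matched = false;
    for (const rule of rules) {
      rule.regex.lastIndex = pos; // try this rule at the current offset only
      const match = rule.regex.exec(input);
      if (match && match[0].length > 0) {
        if (rule.token !== null) tokens.push(rule.token);
        pos = rule.regex.lastIndex; // exec advanced it past the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unrecognized input at offset " + pos);
  }
  return tokens;
}

console.log(tokenize("a a"));    // [ 'HEAD', 'BODY' ]
console.log(tokenize("a a\na")); // [ 'HEAD', 'BODY', 'HEAD' ]
```

Because the input string is never sliced, ^ keeps its usual meaning and no sentinel character is needed.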

yosbelms commented 8 years ago

@amobiz hey, are you DDOSing?

amobiz commented 8 years ago

Sorry, I thought no one was here. I was just trying to update the information.

yosbelms commented 8 years ago

@amobiz, I think the solution to this issue is the one @alvaro-cuesta pointed out above. Because the lexer "eats" the input, ^"a" and "a" are effectively the same rule, so handling such special cases in the %{%} block is a very straightforward and explicit approach.