yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Case Insensitive Literals #66

Closed ThomasKrenn closed 5 years ago

ThomasKrenn commented 5 years ago

Hello Yuji! Is there a feature or another solution for using case-insensitive literals?

E.g.:

START <- i'CALL' _ i'BEGIN' _?
~_    <- [ \t\n]

The grammar should accept "CALL BEGIN", "call begin", and "Call Begin".

Best regards Thomas

mqnc commented 5 years ago
START <- (C A L L) _ (B E G I N) _?
~_ <- [ \t\n]
A <- 'A' / 'a'
B <- 'B' / 'b'
...

:P
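A variation on the per-letter idea above is to generate the grammar text instead of writing it by hand, expanding each keyword into a sequence of character classes inside a single rule. The sketch below is only an illustration: `case_insensitive_expr` is a hypothetical helper, not part of peglib, and it assumes keywords made of ASCII letters, digits, and underscores.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a peglib API): expands a keyword such as "call"
// into the PEG expression "[cC][aA][lL][lL]".
// Assumption: the keyword contains only ASCII letters, digits, or underscores,
// so nothing needs escaping inside the character classes.
std::string case_insensitive_expr(const std::string& keyword)
{
    std::string expr;
    for (unsigned char c : keyword) {
        expr += '[';
        if (std::isalpha(c)) {
            // Letters: accept both the lower-case and upper-case form.
            expr += static_cast<char>(std::tolower(c));
            expr += static_cast<char>(std::toupper(c));
        } else {
            // Digits and underscores have no case; emit a one-character class.
            expr += static_cast<char>(c);
        }
        expr += ']';
    }
    return expr;
}
```

The returned string can be pasted or concatenated into the right-hand side of a rule, for example `"CALL <- " + case_insensitive_expr("call")`. Because the classes sit inside one rule rather than in one rule per letter, this should avoid the extra per-letter rules, though the grammar text itself becomes harder to read.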

ThomasKrenn commented 5 years ago

Thanks for your response! Unfortunately this makes the AST quite large and difficult to process. Here is a tiny example of what I want to parse:

DEFINE NAME1 COLOR RGB RVAL 11 GVAL 22 BVAL 33;
DEFINE NAME2 COLOR CMYK CVAL 11 MVAL 22 YVAL 33 KVAL 44;
define NAME3 color rgb rval 11 gval 22 bval 33;
define NAME4 color cmyk cval 11 mval 22 yval 33 kval 44;

There are a lot of named values that I cannot avoid.

For Lex there is a known solution: https://stackoverflow.com/questions/22686117/lex-case-insensitive-word-detection

zomgrolf commented 5 years ago

I could use that feature as well.

I had a quick look, and it seems that the easiest way to do it at the moment (other than doing what @mqnc has suggested) is to add a rule (see the "Adjust definitions" section in the docs) that uses a user-defined parser, usr, to do what the parse_literal function does, but with case-insensitive comparisons.

@ThomasKrenn -- how complex is your grammar? The snippet you've posted looks quite straightforward, so maybe you can simply pre-process the input before parsing, to normalize the case?

mqnc commented 5 years ago

Yes I agree it's not the most elegant solution. I think if you want to hack peglib to be case-insensitive in general, the key is line 2112: if (i >= n || s[i] != lit[i]) { That's where the literal string comparison happens char by char if I understand it correctly. Otherwise you could just transform your complete input text to lower case and then parse it.

I think if you need selective case-insensitivity (which I think is a pretty useful general feature), you have to wait for Yuji to implement it. Usually takes less than a day ;)
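The "lower-case the whole input first" workaround mentioned above can be sketched as follows. `to_lower_copy` is a hypothetical helper name, and the sketch assumes ASCII input in the default "C" locale. Note that it also lower-cases identifiers such as NAME1 and any quoted text, so it only fits grammars where that loss is acceptable.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a lower-cased copy of `input`.
// Assumption: ASCII text in the default "C" locale.
std::string to_lower_copy(std::string input)
{
    std::transform(input.begin(), input.end(), input.begin(),
                   [](unsigned char c) {
                       // Taking unsigned char avoids undefined behavior for negative char values.
                       return static_cast<char>(std::tolower(c));
                   });
    return input;
}
```

The normalized string would then be handed to the parser in place of the original text, with the grammar's literals written in lower case.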

zomgrolf commented 5 years ago

So I've tried prototyping a solution:

#define PEGLIB_NO_UNICODE_CHARS
#include <cctype>   // tolower
#include <iostream>
#include <string>
#include <peglib.h>

using namespace std;

auto grammar = R"(
    START       <- BEGIN END
    BEGIN       <- <'begin'>
    END         <- <'end'>
    %whitespace <- [ \t\r\n]*
)";

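// Builds a user-defined parser that matches `str` case-insensitively.
auto ilit(const char* str)
{
    return peg::usr([match_str = string(str)](const char* s, size_t n, peg::SemanticValues& sv, peg::any& dt) -> size_t
    {
        for (size_t i = 0; i < match_str.length(); ++i)
        {
            // Cast to unsigned char: passing a negative char value to tolower is undefined behavior.
            if (i >= n || (tolower(static_cast<unsigned char>(match_str[i])) != tolower(static_cast<unsigned char>(s[i]))))
                return static_cast<size_t>(-1); // explicit form of the original `return -1` (match failed)
        }
        return match_str.length(); // number of characters consumed
    });
}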

int main()
{
    peg::Rules rules = {
        {"BEGIN", peg::tok(ilit("begin"))},
        {"END", peg::tok(ilit("end"))},
    };

    peg::parser parser(grammar, rules);

    string input_string = "BeGiN   enD ";

    if (parser.parse(input_string.c_str())) {
        cout << "Success!\n";
    }

    return 0;
}

Seems to work reasonably well, although I agree, having the ability to express this directly in the grammar would be much better.

ThomasKrenn commented 5 years ago

@zomgrolf Great! Thanks for posting the example. I have 220 keywords (excluding color and pattern names) and would also prefer to be able to express it directly in the grammar.

I looked at the source. The Literal rule is implemented like in the PEG spec.

Literal        <- ['] (!['] Char)* ['] Spacing
                / ["] (!["] Char)* ["] Spacing

Maybe something like this would work:

ILiteral        <- [i'] (!['] Char)* ['] Spacing

and an ILiteralString class.

yhirose commented 5 years ago

Thanks for the nice suggestions. I'll try to support the extra rule when I have time. The only open question is how to express a case-insensitive literal in the PEG grammar. The prefix i notation (i'abcde') is not as easy to support as it looks. Here is the grammar one would naturally come up with:

Primary
   <- LiteralCaseInsensitive
    / Identifier !LEFTARROW
    / OPEN Expression CLOSE
    / Literal / Class / DOT

Literal
   <- ['] (!['] Char)* ['] Spacing
    / ["] (!["] Char)* ["] Spacing

LiteralCaseInsensitive
   <- 'i' ['] (!['] Char)* ['] Spacing
    / 'i' ["] (!["] Char)* ["] Spacing

There is a subtle problem. The LiteralCaseInsensitive rule has to come before Identifier !LEFTARROW, because otherwise i would be parsed as an identifier. But with that ordering, i can no longer be used as an identifier when a literal string follows it...

Anyway, I'll try to find a better syntax that has no side effects. Thanks for your great contribution!

yhirose commented 5 years ago

https://github.com/PhilippeSigaud/Pegged/wiki/Extended-PEG-Syntax#case-insensitive-literals

Some languages, such as HTML and Pascal, have case insensitive keywords. Appending an i to the literal causes the input and the literal to be compared case-insensitively, using std.uni.icmp. Thus, the rule

Keyword <- "writeln"i

https://pegjs.org/documentation

Match exact literal string and return it. The string syntax is the same as in JavaScript. Appending i right after the literal makes the match case-insensitive.

ThomasKrenn commented 5 years ago

@yhirose Thanks for the link. This is a great solution.

zomgrolf commented 5 years ago

Fantastic! The suffix syntax is pretty nice -- not only matches JS, but is also similar to UDLs in C++

yhirose commented 5 years ago

@ThomasKrenn, @zomgrolf, I have just implemented 'case-insensitive literal'i

ThomasKrenn commented 5 years ago

@yhirose

Thanks for implementing the feature!

std::tolower requires (tested with VS2015):

#include <cctype>

Best regards Thomas

yhirose commented 5 years ago

@ThomasKrenn, thanks for the report, I added <cctype>!