Closed ThomasKrenn closed 5 years ago
START <- (C A L L) _ (B E G I N) _?
~_ <- [ \t\n]
A <- 'A' / 'a'
B <- 'B' / 'b'
...
:P
Thanks for your response! Unfortunately this makes the AST quite large and difficult to process. Here is a tiny example of what I want to parse:
DEFINE NAME1 COLOR RGB RVAL 11 GVAL 22 BVAL 33;
DEFINE NAME2 COLOR CMYK CVAL 11 MVAL 22 YVAL 33 KVAL 44;
define NAME3 color rgb rval 11 gval 22 bval 33;
define NAME4 color cmyk cval 11 mval 22 yval 33 kval 44;
There are a lot of named values that I cannot avoid.
For Lex there is a known solution: https://stackoverflow.com/questions/22686117/lex-case-insensitive-word-detection
I could use that feature as well.
I had a quick look, and it seems that the easiest way to do it at the moment (other than doing what @mqnc has suggested) is to add a rule (see the "Adjust definitions" section in the docs) that uses a user-defined parser, usr, to do what the parse_literal function does, but using case-insensitive comparisons.
@ThomasKrenn -- how complex is your grammar? The snippet you've posted looks quite straightforward, so maybe you can simply pre-process the input before parsing, to normalize the case?
Yes I agree it's not the most elegant solution. I think if you want to hack peglib to be case-insensitive in general, the key is line 2112: if (i >= n || s[i] != lit[i]) {
That's where the literal string comparison happens char by char if I understand it correctly.
Otherwise you could just transform your complete input text to lower case and then parse it.
I think if you need selective case-insensitivity (which I think is a pretty useful general feature), you have to wait for Yuji to implement it. Usually takes less than a day ;)
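The "lower-case everything first" workaround mentioned above can be sketched as follows. This is only an illustration: to_lower_copy is a hypothetical helper, not part of peglib, and it assumes the input is plain ASCII and that the grammar's literals are all written in lower case.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a peglib API): returns an ASCII-lowercased copy
// of the input so that a grammar with lower-case literals can match it.
std::string to_lower_copy(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       // Go through unsigned char: std::tolower on a negative
                       // char value is undefined behaviour.
                       return static_cast<char>(std::tolower(c));
                   });
    return text;
}
```

The normalized string would then be handed to the parser in place of the original input. Note that this also lower-cases identifiers such as NAME1 and anything inside quoted strings, so it only fits inputs where case carries no meaning anywhere.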
So I've tried prototyping a solution:
#define PEGLIB_NO_UNICODE_CHARS
#include <cctype>
#include <iostream>
#include <string>
#include <peglib.h>
using namespace std;

auto grammar = R"(
    START <- BEGIN END
    BEGIN <- <'begin'>
    END <- <'end'>
    %whitespace <- [ \t\r\n]*
)";

// Builds a user-defined parser that matches `str` case-insensitively.
auto ilit(const char* str)
{
    return peg::usr([match_str = string(str)](const char* s, size_t n, peg::SemanticValues& sv, peg::any& dt) -> size_t
    {
        for (size_t i = 0; i < match_str.length(); ++i)
        {
            // Cast to unsigned char: tolower on a negative char is undefined.
            if (i >= n || tolower(static_cast<unsigned char>(match_str[i])) != tolower(static_cast<unsigned char>(s[i])))
                return static_cast<size_t>(-1); // -1 signals a failed match
        }
        return match_str.length();
    });
}

int main()
{
    // Override the BEGIN and END rules from the grammar with the custom parsers.
    peg::Rules rules = {
        {"BEGIN", peg::tok(ilit("begin"))},
        {"END", peg::tok(ilit("end"))},
    };
    peg::parser parser(grammar, rules);
    string input_string = "BeGiN enD ";
    if (parser.parse(input_string.c_str())) {
        cout << "Success!\n";
    }
    return 0;
}
Seems to work reasonably well, although I agree, having the ability to express this directly in the grammar would be much better.
@zomgrolf Great! Thanks for posting the example. I have 220 keywords (excluding color and pattern names) and would also prefer to be able to express it directly in the grammar.
I looked at the source. The Literal rule is implemented like in the PEG spec.
Literal <- ['] (!['] Char)* ['] Spacing
/ ["] (!["] Char)* ["] Spacing
Maybe something like this would work:
ILiteral <- 'i' ['] (!['] Char)* ['] Spacing
and an ILiteralString class.
Thanks for the nice suggestions. I'll try to support the extra rule when I have time. The only open question is how to describe the case-insensitive literal in PEG grammar. The prefix i notation (i'abcde') is not as easy to support as we think. Here is the grammar that we could easily come up with.
Primary
    <- LiteralCaseInsensitive
     / Identifier !LEFTARROW
     / OPEN Expression CLOSE
     / Literal / Class / DOT
Literal
    <- ['] (!['] Char)* ['] Spacing
     / ["] (!["] Char)* ["] Spacing
LiteralCaseInsensitive
    <- 'i' ['] (!['] Char)* ['] Spacing
     / 'i' ["] (!["] Char)* ["] Spacing
There is a subtle problem. The LiteralCaseInsensitive rule should come before Identifier !LEFTARROW because i shouldn't be treated as an identifier string. But it doesn't allow us to use i as an identifier string any more when a literal string follows...
Anyway, I'll try to find a better syntax that doesn't have any side effects. Thanks for your great contribution!
https://github.com/PhilippeSigaud/Pegged/wiki/Extended-PEG-Syntax#case-insensitive-literals
Some languages, such as HTML and Pascal, have case insensitive keywords. Appending an i to the literal causes the input and the literal to be compared case-insensitively, using std.uni.icmp. Thus, the rule
Keyword <- "writeln"i
https://pegjs.org/documentation
Match exact literal string and return it. The string syntax is the same as in JavaScript. Appending i right after the literal makes the match case-insensitive.
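To illustrate why the suffix form sidesteps the identifier conflict described earlier, here is a minimal standalone sketch. It is not peglib's implementation: LiteralToken, read_literal and matches_prefix are hypothetical names, escape sequences are ignored, and only ASCII case folding is assumed. Because the i is looked for only after a closing quote has been consumed, a bare i in the grammar can still be read as an identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical types and helpers for illustration; none of these are peglib APIs.
struct LiteralToken {
    std::string text;    // characters between the quotes
    bool ignore_case;    // true when the closing quote is followed by 'i'
    std::size_t length;  // number of grammar characters consumed
};

// Reads 'text' or "text" at the start of `src`, with an optional trailing i.
// Escape sequences are deliberately not handled in this sketch.
bool read_literal(const std::string& src, LiteralToken& out) {
    if (src.empty() || (src[0] != '\'' && src[0] != '"')) return false;
    const char quote = src[0];
    const std::size_t close = src.find(quote, 1);
    if (close == std::string::npos) return false;
    out = LiteralToken{src.substr(1, close - 1), false, close + 1};
    // The suffix is only checked once the closing quote has been consumed.
    if (out.length < src.size() && src[out.length] == 'i') {
        out.ignore_case = true;
        ++out.length;
    }
    return true;
}

// Returns true when `input` starts with the literal, honouring the flag.
bool matches_prefix(const LiteralToken& tok, const std::string& input) {
    if (input.size() < tok.text.size()) return false;
    for (std::size_t k = 0; k < tok.text.size(); ++k) {
        const unsigned char a = static_cast<unsigned char>(tok.text[k]);
        const unsigned char b = static_cast<unsigned char>(input[k]);
        if (tok.ignore_case ? std::tolower(a) != std::tolower(b) : a != b) return false;
    }
    return true;
}
```

With this shape, "writeln"i yields a token with ignore_case set, while i'abc' is not recognised as a literal at all, so the leading i remains available to the Identifier rule.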
@yhirose Thanks for the link. This is a great solution.
Fantastic! The suffix syntax is pretty nice -- not only matches JS, but is also similar to UDLs in C++
@ThomasKrenn, @zomgrolf, I have just implemented 'case-insensitive literal'i
@yhirose Thanks for implementing the feature!
std::tolower requires #include <cctype> (tested with VS2015).
Best regards Thomas
@ThomasKrenn, thanks for the report, I added <cctype>!
Hello Yuji! Is there a feature or another solution for using case-insensitive literals?
E.g.:
START <- i'CALL' _ i'BEGIN' _?
~_ <- [ \t\n]
The grammar should accept:
CALL BEGIN
call begin
Call Begin
Best regards Thomas