Open ethindp opened 5 months ago
Tokens like NAME or NUMBER come from the tokenizer, and the parser has no control over them. To change what constitutes an identifier, the tokenizer would have to be changed to handle NAME tokens differently. pegen uses the Python tokenizer by default, which has a strict definition of what an identifier is, but you could pass a different tokenizer when instantiating a parser object if you really want to change that.
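To illustrate where NAME and NUMBER come from, here is a minimal sketch that runs the standard library `tokenize` module over a line of source and prints the token type assigned to each piece. The helper name `python_tokens` is made up for this example; only `tokenize.generate_tokens` and `tokenize.tok_name` are real standard library APIs.

```python
import io
import tokenize


def python_tokens(source):
    """Return (token type name, token text) pairs from the standard Python tokenizer."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# The tokenizer, not the grammar, decides which text becomes NAME or NUMBER.
for kind, text in python_tokens("answer = 42\n"):
    print(kind, repr(text))
```

The grammar only refers to these token types by name, which is why changing what counts as an identifier means changing the token stream rather than the grammar rules.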
@lysnikolaou I might need a different tokenizer, because the language I'm trying to parse has some unique lexical rules regarding strings and such. The language is fully Unicode aware, so I have that to deal with. Are there any examples of overriding or replacing the tokenizer, or should I just look at the default implementation?
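One possible shape for a replacement, sketched under assumptions: a generator that yields `tokenize.TokenInfo` tuples using its own lexical rules. The rules below (Unicode letters and hyphens allowed in names) and the names `TOKEN_RE` and `custom_tokens` are hypothetical and only for illustration. The closing comment assumes that pegen's `Tokenizer` class accepts an iterator of `TokenInfo` objects; check that against the pegen source before relying on it.

```python
import re
import tokenize
from tokenize import TokenInfo

# Hypothetical lexical rules: names may contain Unicode letters and hyphens.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<NAME>[^\W\d][\w\-]*)"
    r"|(?P<OP>[^\s\w])"
    r"|(?P<WS>\s+)"
)


def custom_tokens(source):
    """Yield TokenInfo tuples for source using the hypothetical rules above."""
    lineno = 0
    for lineno, line in enumerate(source.splitlines(keepends=True), start=1):
        text = line.rstrip("\r\n")
        pos = 0
        while pos < len(text):
            match = TOKEN_RE.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character at {lineno}:{pos}")
            kind = match.lastgroup
            if kind != "WS":  # whitespace separates tokens but is not emitted
                yield TokenInfo(
                    getattr(tokenize, kind),
                    match.group(),
                    (lineno, match.start()),
                    (lineno, match.end()),
                    line,
                )
            pos = match.end()
        # Close each line with a NEWLINE token, as the Python tokenizer does.
        yield TokenInfo(
            tokenize.NEWLINE, line[len(text):],
            (lineno, len(text)), (lineno, len(line)), line,
        )
    yield TokenInfo(tokenize.ENDMARKER, "", (lineno + 1, 0), (lineno + 1, 0), "")


# Assumption: with pegen installed, this stream could be wrapped roughly as
# pegen.tokenizer.Tokenizer(custom_tokens(src)) and passed to the parser.
```

Reusing the standard token type constants keeps grammar references such as NAME and NUMBER meaningful, while the definition of each token lives entirely in the generator.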
I can't seem to find any documentation on this, so I thought I'd try here.
In many grammar specifications, rules like NAME or NUMBER are used. I can see these defined in the file Tokens, but how do I define these? Is it safe to do:
Or are there better ways of doing this? I ask because different languages define what an "identifier" is differently, so I'd like to know how this is handled and where these rules/tokens are actually defined.