we-like-parsers / pegen

PEG parser generator for Python
https://we-like-parsers.github.io/pegen/
MIT License

Defining things like what an identifier is #100

Open ethindp opened 5 months ago

ethindp commented 5 months ago

I can't seem to find any documentation on this, so I thought I'd try here.

Many grammar specifications use rules like NAME or NUMBER. I can see these listed in the Tokens file, but how do I define them myself? Is it safe to write:

identifier: characters_for_an_identifier

Or are there better ways of doing this? I ask because different languages define an "identifier" differently, so I'd like to know how this is handled and where these rules/tokens are actually defined.

lysnikolaou commented 5 months ago

Tokens like NAME or NUMBER come from the tokenizer and the parser has no control over them. In order to change what constitutes an identifier, the tokenizer would have to be changed to handle NAME tokens differently.
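To see that split in practice, here is a minimal sketch that uses only the standard library `tokenize` module (no pegen involved). The helper name `token_types` is made up for illustration. It shows that the text has already been classified as NAME or NUMBER before any grammar rule sees it:

```python
import io
import tokenize


def token_types(source: str) -> list[tuple[str, str]]:
    """Return (token type name, token text) pairs for NAME and NUMBER tokens."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        # Keep only the two token kinds discussed here; skip OP, NEWLINE, etc.
        if tok.type in (tokenize.NAME, tokenize.NUMBER)
    ]


print(token_types("total = price * 42\n"))
```

A grammar rule that mentions NAME simply matches whichever tokens the tokenizer has labelled this way.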

pegen uses the Python tokenizer by default, which has a strict definition of what an identifier is, but you could pass a different tokenizer when instantiating a parser object if you really want to change that.
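As a rough sketch of what a replacement token source could look like, the generator below yields standard `tokenize.TokenInfo` tuples but uses its own, looser identifier rule. Everything here is an assumption for illustration: the function name `custom_tokens`, the regular expression, and the choice to allow `-` and `?` inside identifiers are all hypothetical, and the lexer ignores strings, comments, and indentation entirely.

```python
import re
import token
from tokenize import TokenInfo
from typing import Iterator

# Hypothetical lexical rules: an identifier starts with a (Unicode) letter or
# underscore and may continue with word characters, '-' or '?'.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<NAME>[^\W\d][\w\-?]*)"
    r"|(?P<OP>[^\s\w])"
)


def custom_tokens(source: str) -> Iterator[TokenInfo]:
    """Yield TokenInfo tuples for `source` using the rules in TOKEN_RE."""
    lineno = 0
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in TOKEN_RE.finditer(line):
            # The name of the matching group doubles as the token type name.
            kind = getattr(token, match.lastgroup)
            yield TokenInfo(
                kind, match.group(), (lineno, match.start()), (lineno, match.end()), line
            )
        end = len(line)
        yield TokenInfo(token.NEWLINE, "\n", (lineno, end), (lineno, end + 1), line)
    yield TokenInfo(token.ENDMARKER, "", (lineno + 1, 0), (lineno + 1, 0), "")


for tok in custom_tokens("is-empty? = 42"):
    print(token.tok_name[tok.type], repr(tok.string))
```

If pegen's tokenizer wrapper accepts any iterator of `TokenInfo` objects in place of `tokenize.generate_tokens` (an assumption worth checking against the pegen source), a generator like this is the piece that would be swapped in, and NAME in the grammar would then match `is-empty?` as a single token.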

ethindp commented 5 months ago

@lysnikolaou I might need a different tokenizer, because the language I'm trying to parse has some unique lexical rules for strings and similar constructs. The language is fully Unicode-aware, so I have that to deal with as well. Are there any examples of overriding or replacing the tokenizer, or should I just look at the default implementation?