Closed h-vetinari closed 4 years ago
You must write your own main. Check build_parser() in build.py. You can supply your own tokenizer, as long as it duck-types Tokenizer.
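Roughly, such a main could look like the sketch below (the module name and the start-rule name are placeholders for whatever your grammar defines; check build_parser() in build.py and the scripts in this repo for the exact entry points):

```python
import tokenize

from pegen.tokenizer import Tokenizer
# Placeholder: the module you generated from your grammar. The Python
# generator names the emitted class GeneratedParser.
from toml_parser import GeneratedParser

with open("example.toml") as f:
    # Wrap the stdlib token stream in pegen's Tokenizer. Anything that
    # duck-types it (mark/reset/peek/getnext, at minimum) works here too.
    tokenizer = Tokenizer(tokenize.generate_tokens(f.readline))
    parser = GeneratedParser(tokenizer)
    # Call the rule you want as the entry point; "start" assumes your
    # grammar has a rule with that name.
    tree = parser.start()
    print(tree)
```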
Thanks for the info, will try!
Hi, I was also using pegen for my own project. I moved all the matching logic out of Parser.expect into the Tokenizer, and then implemented my own tokenizer without depending on tokenize.py and the Python tokens:
@memoize
def expect(self, type: str) -> Optional[tokenize.TokenInfo]:
return self._tokenizer.expect(type)
Looking at toml.gram, it seems like it would be easier to implement some of those rules in a tokenizer and keep the grammar simpler.
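For reference, the tokenizer side of that delegation could look roughly like this (a simplified sketch; the token class and the token kinds are made-up stand-ins, not pegen's own types):

```python
from typing import List, NamedTuple, Optional


class Tok(NamedTuple):
    # Stand-in for tokenize.TokenInfo: a kind such as "BASIC_STRING", "=",
    # or "NEWLINE", plus the matched text.
    type: str
    string: str


class TomlTokenizer:
    """Minimal tokenizer that a Parser.expect like the one above can delegate to."""

    def __init__(self, tokens: List[Tok]) -> None:
        self._tokens = tokens
        self._index = 0

    def mark(self) -> int:
        return self._index

    def reset(self, index: int) -> None:
        self._index = index

    def expect(self, type: str) -> Optional[Tok]:
        # Consume and return the next token only if it has the requested
        # kind; otherwise leave the position alone so the parser can backtrack.
        if self._index < len(self._tokens) and self._tokens[self._index].type == type:
            tok = self._tokens[self._index]
            self._index += 1
            return tok
        return None
```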
I feel that this has stalled and can thus be closed. In case there are more questions about this in the future, feel free to reopen it.
I've been following this project with interest since the original discussion on discuss.python.org; it's great to see that this will most likely make it into Python 3.9!

I wanted to try using pegen to generate a parser for toml, which has recently reached v1.0.0-rc.1 (background: pytoml is no longer maintained, and I dislike many things about toml-the-python-library, hence wanting to write my own library). The thing that made using pegen quite an obvious idea is that toml itself is defined through a grammar, even if the format is slightly different. I thought: why hand-code a parser if there's all this machinery already?! My first pass was translating this into the particular flavour of PEG-notation that pegen supports, but I'm running into problems with defining the basic building blocks (I'd like to replicate toml.abnf exactly if at all possible; assuming no bugs in the parser generator, compliance would then be basically for free).

The first problem is that several rules are defined in terms of unicode ranges, and it looks to me like this is not supported in pegen at all. Example:
But even for an MVP without unicode ranges, I ran into troubles. Using a first draft of the translated toml.gram (hidden behind the fold), I get

This is somewhat surprising because metagrammar.gram uses some strings directly (where the exact same ones, like '!', fail for me). I haven't managed to find out how this works for metagrammar.gram (even playing around with the `from ast import literal_eval` import).
import).After some more sleuthing, I then understood how pegen basically reuses
cpython/Lib/token.py
, and I tried to hack something together using that.Still, I can't get to the end of it, because I need to define
"
|'
, and those seem to have no direct correspondence intoken.py
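As a sanity check, the token module itself seems to confirm this (assuming CPython 3.8+, where the table of exact token strings is exposed as token.EXACT_TOKEN_TYPES):

```python
import token

# Literal strings that the stock tokenizer maps to dedicated token types
# (exposed as token.EXACT_TOKEN_TYPES since Python 3.8).
print(sorted(token.EXACT_TOKEN_TYPES)[:8])   # e.g. ['!=', '%', '%=', ...]

# The quote characters are not in the table: tokenize.py only ever treats
# them as the start of a STRING literal, never as standalone tokens.
print('"' in token.EXACT_TOKEN_TYPES, "'" in token.EXACT_TOKEN_TYPES)  # False False
```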
I understand that this project is heavily driven by replacing the old Python parser. However, the general machinery you have built seems very close to being able to generate much more general PEG parsers, and that could be a worthwhile goal too? But even barring larger generalisations, could you maybe give me some pointers on how to get literal strings working?