Closed SeanDS closed 2 years ago
It's not a hard requirement; it's just how it was coded (everything is a prototype, that was simpler, and we didn't have a need for other parsers). If you want to submit a PR to allow the tokenizer to be swapped out, that would be great!
There are still some things that couple to the Python tokenizer.

I had a look through the code, and I think implementing what I want will take a lot of work and require expanding the scope of several interfaces. For example, tokenize.TokenInfo is used in many type hints, and code expects the tokens it receives to have that same interface. That class also uses token IDs from Python's token module, so it can't currently be reused for other languages. The same goes for Tokenizer, which compares internal token numbers in inequalities. Adding a custom Tokenizer class also seems to require implementing methods like get_last_non_whitespace_token, since calls to it are generated as part of the parser.
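Given that coupling to Python's numeric token IDs, one pragmatic workaround is for a custom lexer to emit tokenize.TokenInfo tuples that reuse IDs from the stdlib token module, so downstream code that compares token numbers still works. A minimal sketch; si_lex and its regex are illustrative assumptions, not part of pegen:

```python
import re
import token
import tokenize

# Illustrative pattern: an integer or float optionally followed by an SI prefix.
SI_QUANTITY = re.compile(r"\d+(?:\.\d+)?[kMGT]?")

def si_lex(line, lineno=1):
    """Yield TokenInfo tuples for SI-prefixed quantities on one line,
    reusing Python's NUMBER token ID so code that checks stdlib token
    numbers keeps working."""
    for m in SI_QUANTITY.finditer(line):
        yield tokenize.TokenInfo(
            token.NUMBER, m.group(),
            (lineno, m.start()), (lineno, m.end()), line,
        )

tokens = list(si_lex("1.23k 47M"))
```

The trade-off is that the grammar then sees `1.23k` as an ordinary NUMBER and any SI-specific handling has to happen in an action or a later pass.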
Given the complexity of all these changes I think it would be best to leave this to someone more familiar with the code, if and when anyone decides to implement this. For the particular problem I'm working on I think I'll just write the parser manually.
FWIW, I used pegen with my own tokenizer/lexer. Here is how I interfaced with pegen:
import token
import tokenize

import pegen.tokenizer

class Tokenizer(pegen.tokenizer.Tokenizer):
    def __init__(self, file, filename):
        def tokengen():
            lexer = MyLexer(file, filename)
            yield from iter(lexer)
            # Once the lexer is exhausted, yield ENDMARKER forever so the
            # parser's lookahead never runs the generator dry.
            while True:
                yield tokenize.TokenInfo(
                    token.ENDMARKER, '',
                    (lexer.line_num, 0), (lexer.line_num, len(lexer.line)),
                    lexer.line,
                )
        super().__init__(tokengen(), path=filename)
This tokenizer can then be passed into the built parser.
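The endless-ENDMARKER loop in the snippet above matters because the parser may look ahead past the last real token; yielding ENDMARKER forever keeps that lookahead safe. The pattern can be exercised with the stdlib alone (the toy token stream here is an assumption for illustration, not pegen's API):

```python
import itertools
import token
import tokenize

def with_endmarker(tokens, line_num=1, line=""):
    """Yield the given tokens, then ENDMARKER forever, so that a
    consumer reading past the end always gets a valid token."""
    yield from tokens
    while True:
        yield tokenize.TokenInfo(
            token.ENDMARKER, "",
            (line_num, 0), (line_num, len(line)), line,
        )

# A toy one-token stream; deliberately read past its end.
name = tokenize.TokenInfo(token.NAME, "x", (1, 0), (1, 1), "x")
first_four = list(itertools.islice(with_endmarker([name]), 4))
```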
However, I didn't modify the token set; I still used tokenize.TokenInfo.
This isn't likely to happen then. So I'm closing the issue.
I note the docs say:
Is there a reason this is a hard requirement? I'd like to use my own tokenizer because I want to match non-Python syntax (for example, SI-prefixed quantities such as 1.23k, which isn't matched by any of the built-in tokenize module tokens as far as I am aware). I notice in pegen.parser.Parser the productions for Python types are hard-coded to check for collisions with keywords etc., but perhaps the parser generator could also allow developers to handle this themselves if they wish to use their own tokenizer.
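A quick way to see the limitation: the stdlib tokenizer splits an SI-prefixed quantity into two tokens rather than matching it as one. A minimal demonstration using only the standard library:

```python
import io
import token
import tokenize

# Tokenize the non-Python literal "1.23k" with the stdlib tokenizer.
toks = list(tokenize.generate_tokens(io.StringIO("1.23k").readline))

# The quantity comes back as a NUMBER followed by a NAME,
# not as a single token.
kinds = [(token.tok_name[t.type], t.string) for t in toks]
print(kinds)
```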