we-like-parsers / pegen

PEG parser generator for Python
https://we-like-parsers.github.io/pegen/
MIT License

Providing a different tokenizer #55

Closed: SeanDS closed this issue 2 years ago

SeanDS commented 2 years ago

I note the docs say:

Tokens are restricted to the ones available in the tokenize module of the Python interpreter that is used to generate the parser. This means that tokenization of any parser generated by pegen must be a subset of the tokenization that Python itself uses.

Is there a reason this is a hard requirement? I'd like to use my own tokenizer because I want to match non-Python syntax (for example, SI-prefixed quantities such as 1.23k, which isn't matched by any single token from the built-in tokenize module, as far as I'm aware). I notice that in pegen.parser.Parser the productions for Python types are hard-coded to check for collisions with keywords etc., but perhaps the parser generator could also allow developers to handle this themselves if they wish to use their own tokenizer.
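
(Editorial note, not part of the original comment: a quick standard-library check illustrates the mismatch. Python's tokenizer splits an SI-prefixed quantity such as 1.23k into a NUMBER followed by a NAME, so a grammar restricted to the tokenize module never sees it as a single token.)

import io
import tokenize

# "1.23k" is emitted as NUMBER '1.23' followed by NAME 'k',
# then NEWLINE and ENDMARKER.
for tok in tokenize.generate_tokens(io.StringIO("1.23k\n").readline):
  print(tokenize.tok_name[tok.type], repr(tok.string))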

gvanrossum commented 2 years ago

It's not a hard requirement; it's just how it was coded (everything's a prototype, it was simpler that way, and we didn't have a need for other parsers). If you want to submit a PR to allow the tokenizer to be swapped out, that would be great!

pablogsal commented 2 years ago

There are still some things that are coupled to the Python tokenizer. For example:

https://github.com/we-like-parsers/pegen/blob/6178ec6e0afd23e33db3de8882fb369e4606cc53/src/pegen/python_generator.py#L206

SeanDS commented 2 years ago

I had a look through the code and I think it's going to take a lot of work to implement what I want, and it would require updating interfaces to expand their scope. For example, tokenize.TokenInfo is used in many type hints, and the code expects tokens passed to it to have the same interface. That class even uses token IDs from Python's token module, so it can't currently be reused for other languages. The same goes for Tokenizer, which compares internal token numbers in inequalities:

https://github.com/we-like-parsers/pegen/blob/29600deb0cb430d549b9be13bb7ece2fd712f898/src/pegen/tokenizer.py#L67

Adding a custom Tokenizer class also seems to require implementing methods like get_last_non_whitespace_token, since calls to it are generated as part of the parser.

Given the complexity of all these changes, I think it would be best to leave this to someone more familiar with the code, if and when anyone decides to implement it. For the particular problem I'm working on, I think I'll just write the parser manually.

edemaine commented 2 years ago

FWIW, I used pegen with my own tokenizer/lexer. Here is how I interfaced with pegen:

import token
import tokenize

import pegen.tokenizer

class Tokenizer(pegen.tokenizer.Tokenizer):
  def __init__(self, file, filename):
    def tokengen():
      lexer = MyLexer(file, filename)  # my own lexer; yields tokenize.TokenInfo objects
      yield from iter(lexer)
      # Keep yielding ENDMARKER so the parser always has an end-of-input token.
      while True:
        yield tokenize.TokenInfo(token.ENDMARKER, '',
          (lexer.line_num, 0), (lexer.line_num, len(lexer.line)), lexer.line)
    super().__init__(tokengen(), path=filename)

This tokenizer can then be passed into the generated parser.
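
(Editorial note, not part of the original comment: a minimal usage sketch. The module name my_parser, the class name GeneratedParser, and the start rule are assumptions about how the grammar was generated, e.g. with python -m pegen grammar.gram -o my_parser.py.)

from my_parser import GeneratedParser  # assumed names for the generated parser module and class

with open("input.txt") as file:
  tokenizer = Tokenizer(file, "input.txt")
  parser = GeneratedParser(tokenizer)
  tree = parser.start()  # entry point named after the grammar's start rule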

However, I didn't modify the token set; I still used tokenize.TokenInfo.
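
(Editorial note, not part of the original comment: MyLexer is not shown above. As a hypothetical sketch of the same approach, a custom lexer only needs to yield tokenize.TokenInfo objects that reuse Python's token IDs, for example tagging SI-prefixed quantities as NUMBER so they stay a single token.)

import re
import token
import tokenize

class MyLexer:
  """Hypothetical line-by-line lexer that reuses Python's token IDs."""

  # Numbers with an optional SI suffix, identifiers, or any other single character.
  TOKEN_RE = re.compile(r"\d+(?:\.\d+)?[pnumkMGT]?|[A-Za-z_]\w*|\S")

  def __init__(self, file, filename):
    self.file = file
    self.filename = filename
    self.line_num = 0
    self.line = ""

  def __iter__(self):
    for self.line_num, self.line in enumerate(self.file, start=1):
      for match in self.TOKEN_RE.finditer(self.line):
        text = match.group()
        if text[0].isdigit():
          kind = token.NUMBER  # "1.23k" is kept as one NUMBER token
        elif text[0].isalpha() or text[0] == "_":
          kind = token.NAME
        else:
          kind = token.OP
        yield tokenize.TokenInfo(kind, text,
          (self.line_num, match.start()), (self.line_num, match.end()), self.line)
      yield tokenize.TokenInfo(token.NEWLINE, "\n",
        (self.line_num, len(self.line)), (self.line_num, len(self.line) + 1), self.line)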

gvanrossum commented 2 years ago

This isn't likely to happen then, so I'm closing the issue.