we-like-parsers / cpython

Here we work on integrating pegen into CPython; use branch 'pegen'
https://github.com/gvanrossum/pegen

Adjust tokenize module to handle new logic #207

Closed lysnikolaou closed 1 year ago

lysnikolaou commented 1 year ago

As @pablogsal mentioned in his email, we need to specify how the tokenize module is going to work in light of the changes in the C tokenizer. We probably need to change it to reflect the new behaviour. Some questions we need to answer:

  1. The tokenize module doesn't have a tokenizer state. Should we create one now that we need to keep a stack of tokenizer modes? Can we get around that?
  2. Will all of the new tokens be part of the tokenize.py specification? My feeling is that we should implement all of FSTRING_START, FSTRING_MIDDLE and FSTRING_END for sure.
  3. What's the backwards compatibility policy on this? Should we maybe include a parameter that would turn f-strings into regular STRING tokens, so that code that relies on that continues to work?

Looking forward to your thoughts. Will probably start working on the code as soon as we've got some answers.

CC @pablogsal @isidentical

pablogsal commented 1 year ago
  1. The tokenize module doesn't have a tokenizer state. Should we create one now that we need to keep a stack of tokenizer modes? Can we get around that?

We should add it (or some form of it that allows us to keep track of the mode stack). Notice that this state needs to be local to the tokenizing functions, because several tokenizations can be going on at the same time and they should not interfere.
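
As a small illustration of why that matters (this uses only today's tokenize API, nothing new): generate_tokens() keeps all of its state inside the generator, so several tokenizations can already be interleaved without interfering, and the new mode stack would have to live in that same per-call scope:

import io
import tokenize

# Two independent tokenizations running at the same time; neither sees the other's
# state because everything is local to its own generator.
gen_a = tokenize.generate_tokens(io.StringIO("x = 1\n").readline)
gen_b = tokenize.generate_tokens(io.StringIO("y = 2\n").readline)
print(next(gen_a))  # first token of stream A
print(next(gen_b))  # first token of stream B, unaffected by A
print(next(gen_a))  # stream A continues from where it left off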

  2. Will all of the new tokens be part of the tokenize.py specification? My feeling is that we should implement all of FSTRING_START, FSTRING_MIDDLE and FSTRING_END for sure.

Yes, the new tokens will be part of the specification (that's also what people on the discourse thread want 👍).
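
To make that concrete, here is a rough sketch of what the token stream could look like once tokenize emits the new tokens (illustrative only: it needs an interpreter with the new tokenizer, and the exact token boundaries depend on the final spec):

import io
import tokenize

source = 'f"hello {name}"\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Roughly expected (illustrative, not the final spec):
#   FSTRING_START 'f"'
#   FSTRING_MIDDLE 'hello '
#   OP '{'
#   NAME 'name'
#   OP '}'
#   FSTRING_END '"'
#   NEWLINE '\n'
#   ENDMARKER ''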

  3. What's the backwards compatibility policy on this? Should we maybe include a parameter that would turn f-strings into regular STRING tokens, so that code that relies on that continues to work?

The same as with the AST: this reflects internal details of Python, and therefore the "breakage" is justified. These things are a bit fuzzy, so we could also decide to add a flag if we want, although I don't think it's necessary and it would complicate everything.

Thoughts?

isidentical commented 1 year ago

I agree with the first two points @pablogsal, but for the third it would've been nice if we could have claimed full backwards compatibility. The AST example makes sense, but if we can avoid the breakage (or provide an alternative), that would be really nice. I thought about leveraging the exact_type system (a quick illustration of what it does today follows the snippet below), but it still has problems (e.g. incomplete strings with the STRING type that have an exact_type of MIDDLE/START/END, etc.), so it's probably not so nice/clear either. I'll look into a few usage sites and see whether this is really required, or whether they can just adjust their code to handle the new tokens with something like the following (copy-pasting it into their own project):

def my_project_tokenizer(input):
    f_string_tokens = []  # stack of token lists, one per (possibly nested) f-string
    for token in real_tokenize(input):
        if token.type == FSTRING_START:
            # start collecting the tokens of this f-string (keep the start token too)
            f_string_tokens.append([token])
        elif token.type == FSTRING_END:
            parts = f_string_tokens.pop()
            parts.append(token)
            value = untokenize(parts)
            new_token = Token(value, STRING)
            if f_string_tokens:  # nested f-string, add it inside the enclosing one
                f_string_tokens[-1].append(new_token)
            else:  # top-level f-string, return the final token
                yield new_token
        elif f_string_tokens:
            f_string_tokens[-1].append(token)
        else:
            yield token
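
For reference, the existing exact_type mechanism mentioned above behaves like this today (standard tokenize behaviour, shown only to clarify the analogy; nothing new is assumed):

import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("a + b\n").readline):
    print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))

# The '+' is reported with type OP but exact_type PLUS; the idea above would
# similarly report f-string parts as type STRING with more specific exact_types.
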
pablogsal commented 1 year ago

Heads up! I have written the new specification here:

https://github.com/python/peps/pull/2974