Tokenizer returns empty string for the last line

shumbo commented 1 year ago

Hi 👋 I'm using pegen for parsing a Python-like language, encountered something unexpected, and wondering if anyone can take a look.

Problem

It seems that pegen.tokenizer.Tokenizer stores the empty string for the last line of the source if the source does not end with a NEWLINE.

import io
import tokenize

from pegen.tokenizer import Tokenizer

source = "line 1\nline 2\nline 3"

print(source)

tok_stream = tokenize.generate_tokens(io.StringIO(source).readline)
tokenizer = Tokenizer(tok_stream)

# load all tokens
while True:
    tok = tokenizer.getnext()
    print(tok)
    if tok.type == tokenize.ENDMARKER:
        break

print(tokenizer._lines)

print("last line: ", tokenizer.get_lines([3]))

I expect the last line to print ["line 3"] but it prints [''].

After some investigation, I found that the empty line is coming from NEWLINE tokens generated by Python's tokenize module.

If the source does not end with a NEWLINE, tokenize injects one. For example, tokenizing the above input would produce

TokenInfo(type=1 (NAME), string='line', start=(1, 0), end=(1, 4), line='line 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 5), end=(1, 6), line='line 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 6), end=(1, 7), line='line 1\n')
TokenInfo(type=1 (NAME), string='line', start=(2, 0), end=(2, 4), line='line 2\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 5), end=(2, 6), line='line 2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 6), end=(2, 7), line='line 2\n')
TokenInfo(type=1 (NAME), string='line', start=(3, 0), end=(3, 4), line='line 3')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 5), end=(3, 6), line='line 3')
TokenInfo(type=4 (NEWLINE), string='', start=(3, 6), end=(3, 7), line='') 👈
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The second-to-last token does not exist in the original source but was added by the tokenizer. The problem is that this injected token does not have line (which on its own is reasonable).

https://github.com/we-like-parsers/pegen/blob/fab0c5b012836fcba07c1c1c828874745c8a4bfd/src/pegen/tokenizer.py#L59

Pegen's tokenizer uses this to cache the text on that line. So after execution, tokenizer._lines is set to {1: 'line 1\n', 2: 'line 2\n', 3: '', 4: ''}.

Possible Fixes

If this is an unintentional behavior, a possible fix I can think of is to update _lines only if the same line has not been set.

I've added a test case, implemented this change, and it seems to be working.

https://github.com/we-like-parsers/pegen/compare/main...shumbo:pegen:fix-last-line?expand=1

Is there anything else to consider, especially around the condition of the if statement? I can create a PR if this may benefit someone else.

MatthieuDartiailh commented 1 year ago

That does sound like a bug. Could you open a PR with your fixes ?

shumbo commented 1 year ago

@MatthieuDartiailh Thanks for your reply. I opened a PR with my fix. #77

I'd appreciate it if you can take a look.

we-like-parsers / pegen

Tokenizer returns empty string for the last line #76

Problem

Possible Fixes