Hi 👋 I'm using pegen to parse a Python-like language and ran into something unexpected. I'm wondering if anyone could take a look.
Problem
It seems that `pegen.tokenizer.Tokenizer` stores an empty string for the last line of the source if the source does not end with a newline. I expect the last line to print `["line 3"]`, but it prints `['']`.
After some investigation, I found that the empty line comes from NEWLINE tokens generated by Python's `tokenize` module.
If the source does not end with a newline, `tokenize` injects a NEWLINE token. For example, tokenizing the above input produces a token stream that ends with a NEWLINE token followed by ENDMARKER.
The second-to-last token (the NEWLINE) does not exist in the original source but was added by the tokenizer. The problem is that this injected token has an empty `line` attribute (which on its own is reasonable).
https://github.com/we-like-parsers/pegen/blob/fab0c5b012836fcba07c1c1c828874745c8a4bfd/src/pegen/tokenizer.py#L59
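For reference, the injected token can be observed with the standard library alone. The snippet below is a minimal sketch: the three-line source string is an assumption reconstructed from the values quoted in this issue, and the exact contents of the `line` attribute on the injected token may differ between Python versions.

```python
import io
import tokenize

# Assumed input: three lines, with no trailing newline after the last one.
source = "line 1\nline 2\nline 3"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # Print the token type name, its text, its start position, and its `line` attribute.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, repr(tok.line))

# The last two tokens are the injected NEWLINE and the ENDMARKER.
injected, end = tokens[-2], tokens[-1]
```

Comparing `injected.line` with the `line` of the tokens just before it shows whether the running interpreter reports an empty string for the injected token.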
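Pegen's tokenizer uses this attribute to cache the text of each line. Because the injected NEWLINE is reported on line 3 and arrives after that line's real tokens, its empty `line` overwrites the text that was already cached. So after execution, `tokenizer._lines` is set to `{1: 'line 1\n', 2: 'line 2\n', 3: '', 4: ''}`.
Possible Fixes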
If this behavior is unintentional, one possible fix is to update `_lines` only if that line has not already been set. I've added a test case and implemented this change, and it seems to work.
https://github.com/we-like-parsers/pegen/compare/main...shumbo:pegen:fix-last-line?expand=1
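The idea behind this fix can be sketched independently of pegen. The snippet below is a hypothetical stand-in, not the actual patch: `cache_lines` and its `overwrite` flag are names invented for illustration, and the input string is assumed from the values quoted above.

```python
import io
import tokenize
from typing import Dict


def cache_lines(source: str, overwrite: bool) -> Dict[int, str]:
    """Map line numbers to source text, as reported by each token's `line`.

    Hypothetical helper for illustration; it is not part of pegen.
    """
    lines: Dict[int, str] = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        lineno = tok.start[0]
        # With overwrite=False, the first token seen on a line wins, so a
        # later injected NEWLINE cannot replace text that is already cached.
        if overwrite or lineno not in lines:
            lines[lineno] = tok.line
    return lines


source = "line 1\nline 2\nline 3"  # assumed input, no trailing newline
always = cache_lines(source, overwrite=True)       # every token rewrites its line
first_wins = cache_lines(source, overwrite=False)  # only the first token per line is kept
print(always)
print(first_wins)
```

On the Python version described in this issue, the first dictionary would hold `''` for line 3, while the second keeps the original text. Tokens that span several lines, such as triple-quoted strings, are not handled by this sketch and may need a different condition.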
Is there anything else to consider, especially around the condition of the if statement? I can create a PR if this may benefit someone else.