Closed AndreasArvidsson closed 5 months ago
This is another scanner bug, unfortunately.
Could you elaborate on that? I'm not quite sure what the scanner means in this context.
This is not the first bug where we've had leading or trailing whitespace on a node. Would it be worth adding a unit test that checks for leading and/or trailing whitespace?
Scanner means the code that does the lexing; see scanner.cc. It's a bunch of C++ code that implements a custom lexer for TalonScript, and it's where you need to handle any features of the language that are tricky to express as grammars, e.g., indentation sensitivity or lookahead.
I'd be happy to accept a PR with such tests.
Can you not just tweak the comment regex? https://github.com/wenkokke/tree-sitter-talon/blob/fd202684c693d1b893fe34575209452424cc9909/grammar.js#L44
I'm not sure what purpose that serves, because afaik comment tokens are lexed by the scanner. I guess you could try replacing . by [^\r\n]?
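To make the difference between the two patterns concrete, here is a small sketch using Python's re module as a stand-in. It assumes tree-sitter's regex engine treats . the same way (everything except \n), which I haven't verified, and the pattern shapes below are illustrative rather than copied from grammar.js.

```python
import re

# A Talon-like source with CRLF line endings (made-up content).
source = "# hello\r\nkey(a)\r\n"

# In Python's re, '.' matches everything except '\n', so '\r' is swallowed.
loose = re.compile(r"#.*")
# Excluding both '\r' and '\n' stops the match before the CRLF.
strict = re.compile(r"#[^\r\n]*")

loose_text = loose.match(source).group()
strict_text = strict.match(source).group()

print(repr(loose_text))   # '# hello\r'
print(repr(strict_text))  # '# hello'
```

If the grammar's engine behaves like this, the stricter character class would keep the \r out of the comment node.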
Yeah I was thinking something like that
So is this fixed by #42?
@pokey I'm not sure ... @AndreasArvidsson can you retest this?
I considered adding a unit test for this but it's not easy to capture using the built-in tree-sitter testing system, which doesn't include tests for node contents. I think we'd need to set up a separate unit test, e.g. using the Node.js API -- I'm sure this is easy but I'm just not very familiar with Node.js so it wasn't trivial for me.
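For what it's worth, the check itself could be quite small once node contents are accessible. The sketch below is in Python rather than Node.js and runs against a hypothetical FakeNode class instead of a real parse tree; the type/text/children attribute names are assumptions and would need to be adapted to whatever binding is actually used.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in for a parsed syntax node (not a tree-sitter class)."""
    type: str
    text: str
    children: list = field(default_factory=list)


def nodes_with_stray_whitespace(root):
    """Return (type, text) for every node whose text starts or ends with whitespace."""
    offenders = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.text and node.text != node.text.strip():
            offenders.append((node.type, node.text))
        stack.extend(node.children)
    return offenders


# Hand-built tree mimicking the bug: the comment node carries a trailing '\r'.
tree = FakeNode(
    "source_file",
    "# hello\r\nfoo",
    [FakeNode("comment", "# hello\r"), FakeNode("identifier", "foo")],
)
offenders = nodes_with_stray_whitespace(tree)
```

A real test would parse a CRLF fixture and assert that the offender list is empty, possibly skipping node types that legitimately contain whitespace.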
@wolfmanstout The problem is still there, but slightly changed. node.text is now "# hello\r\n"
Should definitely be doable with Node.js.
FWIW @wenkokke's suggestion above would probably work. Even though comments are declared as an external, they are still parsed by that regex. This follows the Python implementation pattern. I guess there is some subtle difference, assuming Python doesn't have the same behavior.
Okay, I have a draft of a fix out: https://github.com/wenkokke/tree-sitter-talon/pull/45
Before this is merged, I want to point out that the Python tree-sitter grammar has the exact same behavior (I tested it). Should Cursorless just be robust to this instead?
I based the scanner and grammar on the Python grammar, so they might actually welcome your changes there as well.
Might be less "this behavior is endemic" and more "that's where I copied it from".
That said, probably makes sense for Cursorless to be robust to this.
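As a rough illustration of what "robust to this" could mean on the consumer side, here is a minimal sketch in Python. It is not Cursorless code; the function name and the (text, start, end) representation of a node are hypothetical.

```python
def trim_trailing_line_ending(text: str, start: int, end: int):
    """Shrink the half-open range [start, end) so it excludes trailing CR/LF characters."""
    trimmed = text.rstrip("\r\n")
    # Move the end offset back by however many characters were stripped.
    return trimmed, start, end - (len(text) - len(trimmed))


# A comment node as reported in this issue: text "# hello\r" covering offsets 0..8.
text, start, end = trim_trailing_line_ending("# hello\r", 0, 8)
print(repr(text), start, end)  # '# hello' 0 7
```

Applying this to comment nodes before using their ranges would work with both the patched and unpatched grammar, and with the Python grammar if it has the same behavior.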
When using CRLF line endings, comments will include a trailing \r.
node.text: # hello\r