Comment node includes trailing `\r`

wenkokke / tree-sitter-talon

Tree Sitter parser for Talon files.

MIT License

Comment node includes trailing `\r` #36

Closed AndreasArvidsson closed 5 months ago

AndreasArvidsson commented 1 year ago

When using CRLF line endings, comment nodes include a trailing `\r`:

# hello
foo: "bar"

node.text: # hello\r

wenkokke commented 1 year ago

This is another scanner bug, unfortunately.

AndreasArvidsson commented 1 year ago

Could you elaborate on that? I'm not quite sure what the scanner means in this context.

This is not the first bug where we've had leading or trailing whitespace on a node. Would it be worth adding a unit test that checks for leading and/or trailing whitespace?

wenkokke commented 1 year ago

Scanner means the code that does the lexing; see scanner.cc. It's a bunch of C++ code that implements a custom lexer for TalonScript, and it's where you need to handle any features of the language that are tricky to express as grammars—e.g., indentation sensitivity or lookahead.

wenkokke commented 1 year ago

> Could you elaborate on that? I'm not quite sure what the scanner means in this context.

I'd be happy to accept a PR with such tests?

pokey commented 1 year ago

Can you not just tweak the comment regex? https://github.com/wenkokke/tree-sitter-talon/blob/fd202684c693d1b893fe34575209452424cc9909/grammar.js#L44

wenkokke commented 1 year ago

I'm not sure what purpose that serves, because afaik comment tokens are lexed by the scanner. I guess you could try replacing `.` by `[^\r\n]`?
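A rough sketch of what that replacement would change, assuming the comment rule is `#` followed by `.*` and that tree-sitter's `.` matches everything except `\n`. JavaScript's own `.` also excludes `\r`, so `[^\n]` stands in for tree-sitter's `.` here. The names `currentComment`, `proposedComment`, and `matchComment` are made up for illustration and are not taken from grammar.js.

```javascript
// Emulation only: tree-sitter compiles grammar regexes with its own engine.
// Assumption: there, `.` matches any character except "\n". JavaScript's `.`
// also excludes "\r", so `[^\n]` is used to mimic the tree-sitter behavior.
const currentComment = /#[^\n]*/; // stands in for a rule like /#.*/
const proposedComment = /#[^\r\n]*/; // the suggested replacement

// The example from the issue, written out with CRLF line endings.
const source = '# hello\r\nfoo: "bar"\r\n';

// Returns the first match of `regex` in `text`, or null if there is none.
function matchComment(regex, text) {
  const match = regex.exec(text);
  return match === null ? null : match[0];
}

console.log(JSON.stringify(matchComment(currentComment, source))); // "# hello\r"
console.log(JSON.stringify(matchComment(proposedComment, source))); // "# hello"
```

Whether the real rule behaves this way depends on which part of the lexer actually produces the comment token.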

pokey commented 1 year ago

Yeah I was thinking something like that

pokey commented 1 year ago

So is this fixed by #42?

wolfmanstout commented 1 year ago

@pokey I'm not sure ... @AndreasArvidsson can you retest this?

I considered adding a unit test for this but it's not easy to capture using the built-in tree-sitter testing system, which doesn't include tests for node contents. I think we'd need to set up a separate unit test, e.g. using the Node.js API -- I'm sure this is easy but I'm just not very familiar with Node.js so it wasn't trivial for me.
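One possible shape for such a test, as a sketch only: a helper that walks node-like objects and reports any descendant whose text has leading or trailing whitespace. To keep it runnable without installing anything, it uses a hand-built mock tree. With the Node.js bindings, the root would presumably come from `parser.parse(source).rootNode` instead. The helper name `findPaddedNodes` and the node type names in the mock are hypothetical.

```javascript
// Walks a tree of node-like objects ({ type, text, children }) and returns
// every descendant whose text starts or ends with whitespace. The root is
// skipped because it usually spans the whole file, line endings included.
function findPaddedNodes(root) {
  const offenders = [];
  const stack = [...root.children];
  while (stack.length > 0) {
    const node = stack.pop();
    // String.prototype.trim strips spaces, tabs, "\r" and "\n" on both ends.
    if (node.text !== node.text.trim()) {
      offenders.push({ type: node.type, text: node.text });
    }
    stack.push(...node.children);
  }
  return offenders;
}

// Hypothetical mock standing in for a parse of '# hello\r\nfoo: "bar"\r\n'.
// The type names are illustrative, not the grammar's real node types.
const mockRoot = {
  type: "source_file",
  text: '# hello\r\nfoo: "bar"\r\n',
  children: [
    { type: "comment", text: "# hello\r", children: [] },
    { type: "command", text: 'foo: "bar"', children: [] },
  ],
};

console.log(findPaddedNodes(mockRoot));
```

A real test would fail whenever the returned list is non-empty, which would also catch the earlier leading-whitespace bugs mentioned above.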

AndreasArvidsson commented 1 year ago

@wolfmanstout The problem is still there, but slightly changed. node.text is now "# hello\r\n"

Should definitely be doable with node

wolfmanstout commented 1 year ago

FWIW @wenkokke's suggestion above would probably work. Despite the fact that comments are declared as an external, they are still parsed by that regex. This follows the Python implementation pattern. I guess there is some subtle difference, assuming Python doesn't have the same behavior.

wolfmanstout commented 1 year ago

Okay, I have a draft of a fix out: https://github.com/wenkokke/tree-sitter-talon/pull/45

Before this is merged, I want to point out that the Python tree-sitter grammar has the exact same behavior (I tested it). Should Cursorless just be robust to this instead?

wenkokke commented 1 year ago

> Before this is merged, I want to point out that the Python tree-sitter grammar has the exact same behavior (I tested it). Should Cursorless just be robust to this instead?

I based the scanner and grammar on the Python grammar, so they might actually welcome your changes there as well.

Might be less "this behavior is endemic" and more "that's where I copied it from".

That said, probably makes sense for Cursorless to be robust to this.
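On the consumer side, being robust to this could be as small as stripping trailing line-ending characters from a node's text before using it. A minimal sketch, not Cursorless's actual code; the function name `stripTrailingLineEnding` is made up for illustration.

```javascript
// Removes any run of "\r" and "\n" characters at the end of a node's text,
// leaving the rest of the text (including interior whitespace) untouched.
function stripTrailingLineEnding(nodeText) {
  return nodeText.replace(/[\r\n]+$/, "");
}

// Both shapes reported in this thread reduce to the same comment text.
console.log(JSON.stringify(stripTrailingLineEnding("# hello\r"))); // "# hello"
console.log(JSON.stringify(stripTrailingLineEnding("# hello\r\n"))); // "# hello"
```

This only cleans up the text. A consumer that relies on the node's range would have to shorten the end position by the same number of characters.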