wobscale / EuIrcBot

A featureful nodejs irc bot
MIT License

!wiki doesn't handle parentheses in URLs well #210

Closed LinuxMercedes closed 5 years ago

LinuxMercedes commented 5 years ago

Seems there's a problem with parentheses in URLs?

ilianaw | interesting  https://en.wikipedia.org/wiki/Galileo_(satellite_navigation)
      ^ | No Wikipedia page found for "Galileo_(satellite_navigation"
euank commented 5 years ago

This problem is not specific to !wiki; it comes from the onUrl module.

The code in question is this:

https://github.com/euank/EuIrcBot/blob/2e4e06e5f02229a8f23b42d637e845fd35324435/modules/onUrl.js#L4-L7

The \\pP in trim start and trim end will trim anything in the Unicode punctuation class. This is because people sometimes write things like: Blah blah (see http://example.com).
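To make the failure concrete, here is a minimal sketch of that kind of trimming. It is not the code from onUrl.js: `trimPunctuation` is a hypothetical name, and it assumes native Unicode property escapes (`\p{P}` with the `u` flag, Node 10+) rather than whatever regex setup the module actually uses.

```javascript
// Hypothetical stand-in for the trimming described above; not the module's real code.
// Strips every leading and trailing character in the Unicode punctuation class.
function trimPunctuation(candidate) {
  return candidate
    .replace(/^\p{P}+/u, '')  // leading punctuation, e.g. "(" before a URL
    .replace(/\p{P}+$/u, ''); // trailing punctuation, e.g. ")." after a URL
}

// Intended case: the ")." after the URL is sentence punctuation and gets dropped.
console.log(trimPunctuation('http://example.com).'));

// Problem case: the final ")" is part of the URL, but it is dropped too.
console.log(trimPunctuation('https://en.wikipedia.org/wiki/Galileo_(satellite_navigation)'));
```

The second call loses the closing parenthesis, which matches the "Galileo_(satellite_navigation" lookup in the report.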

One fix is to always percent-encode parentheses when pasting. That doesn't really seem to solve this problem, though.

A more proper fix might be a slightly better heuristic: assume that well-formed URLs have balanced parentheses. That is, we could change our URL matching so that both Foo (see http://url?with(parenthesis)) and Foo http://url?with(parenthesis) match http://url?with(parenthesis), by recognizing that only one of the closing parentheses in the first is trailing punctuation. This could still be wrong (what about URLs whose parentheses aren't balanced?), but it will probably be more accurate than what we do now.
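A rough sketch of that heuristic, under some assumptions: the candidate is a whitespace-delimited token whose leading punctuation has already been removed, `trimTrailingPunctuation` is a hypothetical name rather than anything in the bot, and native `\p{P}` support (Node 10+) is available. A trailing ")" is kept only while the candidate has at least as many "(" as ")"; any other trailing punctuation is still dropped unconditionally.

```javascript
// Hypothetical helper illustrating the balanced-parentheses heuristic;
// not part of EuIrcBot.
function trimTrailingPunctuation(candidate) {
  let url = candidate;
  while (url.length > 0) {
    const last = url[url.length - 1];
    if (last === ')') {
      const opens = (url.match(/\(/g) || []).length;
      const closes = (url.match(/\)/g) || []).length;
      // Balanced (or more opens than closes): treat this ")" as part of the URL.
      if (closes <= opens) break;
      url = url.slice(0, -1); // unmatched ")": treat it as surrounding prose
    } else if (/\p{P}/u.test(last)) {
      url = url.slice(0, -1); // any other trailing punctuation is trimmed as before
    } else {
      break;
    }
  }
  return url;
}

console.log(trimTrailingPunctuation('http://url?with(parenthesis))')); // from "Foo (see ...)"
console.log(trimTrailingPunctuation('http://url?with(parenthesis)'));  // from "Foo ..."
```

Both calls yield http://url?with(parenthesis). Counting over the whole candidate is the simplest possible check; it does not verify nesting order, and it will still guess wrong for URLs whose parentheses are genuinely unbalanced.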

Unfortunately, that only works for punctuation which can be balanced. For . the problem will remain, and I can't think of any better heuristic than always including or omitting it.