default hyperlink_rules is wrong for Markdown link format

yutkat commented 1 year ago

What Operating System(s) are you seeing this problem on?

Linux Wayland

Which Wayland compositor or X11 Window manager(s) are you using?

Sway

WezTerm version

20230529-192946-62c40118

Did you try the latest nightly build to see if the issue is better (or worse!) than your current version?

Yes, and I updated the version box above to show the version of the nightly that I tried

Describe the bug

If the link is Markdown format, I cannot jump to the site by hyperlink. Why is a ) included here?

https://github.com/wez/wezterm/blob/c6f4ff3626f66f5bc1c64978faf3dbf1e3b6106c/config/src/config.rs#L1619

Or the following rule should be applied first.

hyperlink::Rule::with_highlight(r"\((\w+://\S+)\)", "$1", 1).unwrap(),

To Reproduce

No response

Configuration

no config

Expected Behavior

This format will correctly link to https://github.com.

[github](https://github.com)

Now it is linked to https://github.com%29.

Logs

No response

Anything else?

No response

nobeans commented 1 year ago

I've just experiencing the same issue. It seems that the regular expression in [2] of the default settings is causing the problem:

  -- Matches: a URL in parens: (URL)
  {
    regex = '\\((\\w+://\\S+)\\)', -- [1]
    format = '$1',
    highlight = 1,
  },
  -- ...(snipped)...
  -- Then handle URLs not wrapped in brackets
  {
    regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+', -- [2]
    format = '$0',
  },

The closing brace is included in the pattern, causing it to take precedence over the [1] pattern. I fixed it in my .wezterm.lua file as follows, and it seems to be working correctly:

  -- Then handle URLs not wrapped in brackets
  {
    regex = '[^(]\\b(\\w+://\\S+[)/a-zA-Z0-9-]+)',
    format = '$1',
    highlight = 1,
  },

I tested it with the following text to verify:

- (http://localhost/foo)
- [http://localhost/foo]
- {http://localhost/foo}
- <http://localhost/foo>
- http://localhost/foo
- http://localhost/foo)

(though I don't know why a closing brace is including the [2] pattern.)

nobeans commented 1 year ago

.wezterm.lua for workaround:

local wezterm = require 'wezterm'
local config = {}

config.hyperlink_rules = {
  -- Matches: a URL in parens: (URL)
  {
    regex = '\\((\\w+://\\S+)\\)',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in brackets: [URL]
  {
    regex = '\\[(\\w+://\\S+)\\]',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in curly braces: {URL}
  {
    regex = '\\{(\\w+://\\S+)\\}',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in angle brackets: <URL>
  {
    regex = '<(\\w+://\\S+)>',
    format = '$1',
    highlight = 1,
  },
  -- Then handle URLs not wrapped in brackets
  {
    -- Before
    --regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
    --format = '$0',
    -- After
    regex = '[^(]\\b(\\w+://\\S+[)/a-zA-Z0-9-]+)',
    format = '$1',
    highlight = 1,
  },
  -- implicit mailto link
  {
    regex = '\\b\\w+@[\\w-]+(\\.[\\w-]+)+\\b',
    format = 'mailto:$0',
  },
}

return config

woojiq commented 1 year ago

@nobeans have you thought about creating a PR to fix default hyperlink_rules? Or is there a situation where this is not desirable behavior?

nobeans commented 1 year ago

I don't understand why a closing parenthesis is included in the original regular expression. Therefore, I'm not confident whether the fix is correct. For instance, an alternative solution like the following could also be considered:

  ...
  -- Then handle URLs not wrapped in brackets
  {
    -- Before
    --regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
    --format = '$0',
    -- After(2)
    regex = '\\b\\w+://\\S+[/a-zA-Z0-9-]+', -- just removed ')'
    format = '$0',
  },
  ...

Because of this, I'm hesitating to send the pull request.

ninjalj commented 1 year ago

This may be just a little bit harder than it seems at first sight, consider the following:

[unix-history-repo](https://github.com/dspinellis/unix-history-repo)
[*Das verrückte Labyrinth*](https://en.wikipedia.org/wiki/Labyrinth_(board_game))
https://en.wikipedia.org/wiki/C_(programming_language)
http://en.wikipedia.org/w/index.php?title=Thresholding_(image_processing)&oldid=132306976
https://en.wikipedia.org/wiki/ANSI_escape_code#CSI_(Control_Sequence_Introducer)_sequences

nobeans commented 1 year ago

I see. There are patterns where parentheses are included within a URL. Anyway, both of the suggested fixes, commented as "After" or "After(2)," seem to work well for the URLs.

nobeans commented 1 year ago

Oops. The "After(2)" fix doesn't work for this:

https://en.wikipedia.org/wiki/C_(programming_language)

The fix "After" works well.

kenchou commented 1 month ago

The default rules seem to need to use non-greedy mode. Please refer to the example in this repository's README: an image with a link:

* [![Sponsor WezTerm](https://img.shields.io/github/sponsors/wez?label=Sponsor%20WezTerm&logo=github&style=for-the-badge)](https://github.com/sponsors/wez)

expect two links: https://img.shields.io/github/sponsors/wez?label=Sponsor%20WezTerm&logo=github&style=for-the-badge and https://github.com/sponsors/wez

actual one link: https://img.shields.io/github/sponsors/wez?label=Sponsor%20WezTerm&logo=github&style=for-the-badge)](https://github.com/sponsors/wez)

non-greedy mode: +?, *?:

config.hyperlink_rules = {
  -- Matches: a URL in parens: (URL)
  -- Markdown: [text](URL title)
  {
    regex = '\\((\\w+://\\S+?)(\\s+.+)?\\)',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in brackets: [URL]
  {
    regex = '\\[(\\w+://\\S+?)\\]',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in curly braces: {URL}
  {
    regex = '\\{(\\w+://\\S+?)\\}',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in angle brackets: <URL>
  {
    regex = '<(\\w+://\\S+?)>',
    format = '$1',
    highlight = 1,
  },
  -- Then handle URLs not wrapped in brackets
  -- regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
  {
    regex = '(?<![\\(\\{\\[])\\b\\w+://\\S+\\b',
    format = '$0',
  },
  -- implicit mailto link
  {
    regex = '\\b\\w+@[\\w-]+(\\.[\\w-]+)+\\b',
    format = 'mailto:$0',
  },
}

The above rules are not perfect, and the following test cases are not perfectly recognized.

1. [![Sponsor WezTerm](https://img.shields.io/github/sponsors/wez?label=Sponsor%20WezTerm&logo=github&style=for-the-badge)](https://github.com/sponsors/wez)
2. [![C_(programming_language)](https://img.shields.io/github/sponsors/wez?label=Sponsor%20WezTerm&logo=github&style=for-the-badge)](https://en.wikipedia.org/wiki/C_(programming_language))
3. [C_(programming_language)](https://en.wikipedia.org/wiki/C_(programming_language))
4. (https://en.wikipedia.org/wiki/C_(programming_language))
5. [https://en.wikipedia.org/wiki/C_(programming_language)]
6. {https://en.wikipedia.org/wiki/C_(programming_language)}
7. <https://en.wikipedia.org/wiki/C_(programming_language)>
8. [Duck Duck Go](https://duckduckgo.com "search engine")
9. https://en.wikipedia.org/wiki/C_(programming_language)

item 1 needs non-greedy mode, but item 3 needs greedy mode.

When multiple rules match, WezTerm chooses the longest result (greedy). Under the current behavior, it seems irreconcilable.

[IMO] I think it should match according to the order of the rules, giving users the opportunity to adjust the matching rules. More precise rules should take precedence over more general ones.

@wez

update:

https://en.wikipedia.org/wiki/C_(programming_language) is also incorrectly recognized as https://en.wikipedia.org/wiki/C_(programming_language in iTem2.

kenchou commented 1 month ago

Unless the parentheses in the URL are URL-encoded (%hex), it is difficult to accurately recognize the links whether using greedy or non-greedy mode. Given that other terminals also cannot recognize parentheses in links, I believe we should use non-greedy mode, as it produces more accurate links.

This is my final configuration:

config.hyperlink_rules = {
  -- Matches: a URL in parens: (URL)
  -- Markdown: [text](URL title)
  {
    regex = '\\((\\w+://\\S+?)(?:\\s+.+)?\\)',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in brackets: [URL]
  {
    regex = '\\[(\\w+://\\S+?)\\]',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in curly braces: {URL}
  {
    regex = '\\{(\\w+://\\S+?)\\}',
    format = '$1',
    highlight = 1,
  },
  -- Matches: a URL in angle brackets: <URL>
  {
    regex = '<(\\w+://\\S+?)>',
    format = '$1',
    highlight = 1,
  },
  -- Then handle URLs not wrapped in brackets
  -- regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
  {
    regex = '(?<![\\(\\{\\[<])\\b\\w+://\\S+',
    format = '$0',
  },
  -- implicit mailto link
  {
    regex = '\\b\\w+@[\\w-]+(\\.[\\w-]+)+\\b',
    format = 'mailto:$0',
  },
}

hankertrix commented 2 weeks ago

@kenchou your (?<![\\(\\{\\[<])\\b\\w+://\\S+ regex contains a lookbehind, which is not supported by Wezterm's regular expression engine, which is also Rust's regular expression engine.

kenchou commented 2 weeks ago

(?<![\\(\\{\\[<])\\b\\w+://\\S+ regex contains a lookbehind, which is not supported by Wezterm's regular expression engine, which is also Rust's regular expression engine.

See the configuration in my comment, these are the rules I am currently using. WezTerm uses fancy-regex, which supports lookaround. I am using the macOS version, and this rule works as I expected. Maybe the behavior is different on the Windows platform?

@hankertrix

hankertrix commented 2 weeks ago

WezTerm uses fancy-regex, which supports lookaround.

Ah I see, looks like I didn't read the docs closely enough. I looked at the top of the hyperlink_rules page and just clicked on the Regex syntax link, which didn't include lookarounds.

In that case, I think using your regex for hyperlinks in https://github.com/wez/wezterm/pull/4212 should work well enough.

wez / wezterm