Capture groups misbehave

HolyBlackCat commented 3 years ago

Sorry to bother you again, but I found more regressions. :P

Describe the issue

Folding ranges are determined incorrectly, when capture groups are present in beginRegex and are referenced in endRegex.

To reproduce

VSCode: 1.57.1
Explicit Folding: 0.14.8
Language: cpp

Code Example

```
"(

)" "(

)"
```
```
"foo(

)foo" "bar(

)bar"
```

Settings

    "explicitFolding.rules": {
        "cpp": [
            {
                "beginRegex": "\"([^\\(]{0,16})\\(",
                "endRegex": "\\)\\1\"",
                "foldLastLine": false,
            },
        ],
    },

Expected behavior

In both examples, I expected to get two folding ranges: lines 1..3 and 3..5.

Instead, in the first example there's one large range, with lines 1..5. And the second example doesn't fold at all.

daiyam commented 3 years ago

Temporary, it's working with the following rule by excluding " in the group name:

{
    "beginRegex": "\"([^\\(\"]{0,16})\\(",
    "endRegex": "\\)\\1\"",
    "foldLastLine": false,
}

I need to find a way so to fix the core of the issue. (without the need of ")

daiyam commented 3 years ago

Not sure how to fix the issue.

The regex is build by replacing the \1 reference in endRegex with its source in beginRegex. It's building the regex (?<_0_0>"([^\(]{0,16})\()|(?<_2_0>\)[^\(]{0,16}"). The issue is that [^\(] isn't excluding " so )" "( will be matched as )" ".

The best would be to update the documentation so others users are aware of this limitation. What do you think? Do you have another solution?

HolyBlackCat commented 3 years ago

@daiyam I could definitely live with this workaround, but I agree that if you don't fix it, it should be documented.

I think this has worked in previous versions though, so there was some solution to this?

C++ in its "raw string" syntax (that I tried to explain in #14) does allow using "s in this place. While it's unlikely to happen in real code, this means that we're one step back from being able to perfectly fold C++.

I do have an idea for a solution.

I didn't look at your source (because I don't know Typescript), but from what you said, it seems that you're using one large regex in every context. But what if you generated the regex on the fly, depending on context?

E.g. when you start parsing a document, you compose a regex from all 'begin' regexes from the config: (?<_0>BEGIN0)|(?<_1>BEGIN1)|(?<_2>BEGIN2).
Then, when you get a match, you check if the range has "nested":false or not.
- If it does, you only look for the single corresponding 'end' regex. After you find it and close the range, you go back to the original long regex. If capture groups were present in the 'begin' regex, you copypaste the matched text into the 'end' regex (instead of copying a part of the original regex).
- If it doesn't, you prepend the corresponding 'end' regex to the original one, getting (?<_e0>ENDi)|(?<_0>BEGIN0)|(?<_1>BEGIN1)|(?<_2>BEGIN2). Again, if 'begin' had capture groups, the added 'end' regex should have it verbatim. When you go deeper into folding ranges, you only need to keep the single last 'end' prepended. So the regex doesn't grow with the amount of folding ranges.

daiyam commented 3 years ago

I will go for the documentation.

I've checked, the way how the capture groups are working hasn't been changed since they have been introduced. Maybe it's because you just found some edge cases???

HolyBlackCat commented 3 years ago

Ok. Yeah, I might be wrong on this being a regression.

I guess I'll have to learn TS sooner or later if I want a perfect C++ folding. :P

Any insights on why it's not implemented like I described? (I mean copying the text matching capture group into the end regex.) That would look more straightforward to me, but since I didn't look into the code, I might be missing why doing it the other way is necessary.

daiyam commented 3 years ago

Performance wise, in javascript, a big regex is better than several small regexes. Also a regex is costly to build. So to dynamically build regexes when parsing big files, it can be bad for the performance.

The document is parsed line by line to keep track of the line number (used by the folding range). Then the line is parsed by a regex which is like BEGIN1|MIDDLE1|END1|SEPARATOR2|WHILE3|DOCSTRING4|BEGIN5 so it can match for nested folding regions. When "nested": false, only BEGIN is present in the regex. And when matched, it will look directly for the END since nothing can be nested.

daiyam commented 3 years ago

I've updated the doc https://github.com/zokugun/vscode-explicit-folding/blob/master/docs/rules/begin-end.md. What do you think?

HolyBlackCat commented 3 years ago

Got it, thanks. The updated doc looks ok to me.

Could be interesting to see how VSC handles it in TM grammars. They have this exact same feature, but without the limitation.

daiyam commented 3 years ago

From https://github.com/microsoft/vscode/blob/main/extensions/cpp/syntaxes/cpp.tmLanguage.json#L15418, here the syntax they are using for raw strings:

{
    "begin": "((?:u|u8|U|L)?R)\"(?:([^ ()\\\\\\t]{0,16})|([^ ()\\\\\\t]*))\\(",
    "beginCaptures": {
        "0": {
            "name": "punctuation.definition.string.begin"
        },
        "1": {
            "name": "meta.encoding"
        },
        "3": {
            "name": "invalid.illegal.delimiter-too-long"
        }
    },
    "end": "\\)\\2(\\3)\"",
    "endCaptures": {
        "0": {
            "name": "punctuation.definition.string.end"
        },
        "1": {
            "name": "invalid.illegal.delimiter-too-long"
        }
    },
    "name": "string.quoted.double.raw"
}

Excluding space would also fix your issue.

HolyBlackCat commented 3 years ago

@daiyam

Excluding space would also fix your issue.

Not really, because this code is valid:

R"(

)",1,R"(

)"

The TM grammar parses it correctly somehow, so I'm thinking they handle the capture groups the "honest" way, by substituting the matching text into the end regex.

daiyam commented 3 years ago

@HolyBlackCat You were right. It might have been a regression since the fix for ^. It's fixed. Also, since v0.18.0, the endRegex are generated on the run, like you wanted ;)

HolyBlackCat commented 3 years ago

@daiyam Thank you!

zokugun / vscode-explicit-folding