Add support for "ignore case", "multilines" and "lookbehind"

FALLAI-Denis commented 3 years ago

Hi,

From my (recent) experience, the extension does not support case-insensitive searches, multi-line searches, or lookbehind.

I think this is linked to the use of the regexp2 library, which follows the ECMAScript standard.

The use of a PCRE2 library should provide support for these three functionalities.

Thanks.

daiyam commented 3 years ago

I'm sorry, but I don't think the extension will ever support multi-line searches. Case-insensitive searches would be easy to add. For look-behind/look-ahead, I will have to look into it.

FALLAI-Denis commented 3 years ago

Hi @daiyam

Thanks for your reply.

PCRE2 allows multiline expressions with the syntax (?m:expression). Likewise it allows case insensitive expressions with the syntax (?i:expression). Likewise it allows lookbehind expressions with the syntax (?<=expression), positive lookbehind, and (?<!expression), negative lookbehind.

daiyam commented 3 years ago

Internally, the file is split into lines and then each line is matched against the regexes. If there is a match, it immediately knows which line has been matched. Adding multi-line support would require matching the regexes against the whole text and then calculating the line position of each match, which can be slow. Maybe by testing the regex and, if it is a multi-line regex, using the slower algorithm... ???
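
As an illustration of the line position calculation mentioned above, here is a minimal sketch, not the extension's actual code. It assumes the whole document is matched as one string; the helper names `computeLineStarts` and `lineAt` are hypothetical. The line starts are recorded once, and each match offset is then mapped to a line with a binary search.

```javascript
// Hypothetical helpers, not taken from the extension's source.

// Record the offset at which each line starts (line 0 starts at offset 0).
function computeLineStarts(text) {
  const starts = [0];
  for (let i = 0; i < text.length; i++) {
    if (text[i] === '\n') {
      starts.push(i + 1);
    }
  }
  return starts;
}

// Binary search: index of the last line start that is <= offset.
function lineAt(starts, offset) {
  let low = 0;
  let high = starts.length - 1;
  while (low < high) {
    const mid = (low + high + 1) >> 1;
    if (starts[mid] <= offset) {
      low = mid;
    } else {
      high = mid - 1;
    }
  }
  return low;
}

const text = 'first\nsecond BEGIN\nthird\nEND';
const starts = computeLineStarts(text);
const regex = /BEGIN[\s\S]*?END/g; // a pattern that spans several lines

const spans = [];
for (const match of text.matchAll(regex)) {
  spans.push({
    startLine: lineAt(starts, match.index),
    endLine: lineAt(starts, match.index + match[0].length - 1),
  });
}
console.log(spans);
```

The extra cost per match is one logarithmic lookup, on top of a single pass over the text to collect the line starts.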

In JavaScript, look-behind expressions are the same as in PCRE2. It's just that I've not tested them.

FALLAI-Denis commented 3 years ago

I think that the multilines search may not be necessary, if the following conditions are met:

lookbehind pattern availability
possibility that an end marker may be common to several start markers
possibility that an end marker can also be a start marker for another folding range

FALLAI-Denis commented 3 years ago

@daiyam

Testing "lookbehind":

Initial configuration:

       {"beginRegex": "\\s{1,4}\\S+\\s+DIVISION"
       ,"endRegex": ".(?=(\\s{1,4}\\S+\\s+DIVISION))"
       ,"foldLastLine": false
       ,"foldEOF": true
       }

lang: cobol, regex: /(?<_0_0>\s{1,4}\S+\s+DIVISION)|(?<_2_0>.(?=(\s{1,4}\S+\s+DIVISION)))/
line: 1, offset: 0, type: END, match:  , regex: 0
line: 1, offset: 0, type: BEGIN, match:     IDENTIFICATION DIVISION, regex: 0
line: 5, offset: 0, type: END, match:  , regex: 0
line: 5, offset: 0, type: BEGIN, match:     ENVIRONMENT DIVISION, regex: 0
line: 12, offset: 0, type: END, match:  , regex: 0
line: 12, offset: 0, type: BEGIN, match:     DATA DIVISION, regex: 0
line: 23, offset: 0, type: END, match:  , regex: 0
line: 23, offset: 0, type: BEGIN, match:     PROCEDURE DIVISION, regex: 0

Adding non capturing prefix "first 6 columns":

       {"beginRegex": "(?:^.{6})\\s{1,4}\\S+\\s+DIVISION"
       ,"endRegex": ".(?=(\\s{1,4}\\S+\\s+DIVISION))"
       ,"foldLastLine": false
       ,"foldEOF": true
       }

lang: cobol, regex: /(?<_0_0>(?:^.{6})\s{1,4}\S+\s+DIVISION)|(?<_2_0>.(?=(\s{1,4}\S+\s+DIVISION)))/
line: 1, offset: 0, type: BEGIN, match:        IDENTIFICATION DIVISION, regex: 0
line: 5, offset: 0, type: BEGIN, match:        ENVIRONMENT DIVISION, regex: 0
line: 12, offset: 0, type: BEGIN, match:        DATA DIVISION, regex: 0
line: 23, offset: 0, type: BEGIN, match:        PROCEDURE DIVISION, regex: 0

Note: no end match because of overlapping "begin" and "end"
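
The overlap can be reproduced outside the extension. The sketch below is only an approximation: it assumes a master regex built as a simple alternation of named groups (the group names `begin` and `end` and the helper `scan` are hypothetical), and a sample line with seven leading spaces. It compares a consuming prefix with a lookbehind prefix.

```javascript
// Sample line: six blank sequence columns, one space, then the header.
const line = '       IDENTIFICATION DIVISION.';

const END = '.(?=(\\s{1,4}\\S+\\s+DIVISION))';

// Collect begin/end matches from an alternation similar to the debug output.
function scan(beginSource) {
  const master = new RegExp(`(?<begin>${beginSource})|(?<end>${END})`, 'g');
  const found = [];
  for (const match of line.matchAll(master)) {
    const type = match.groups.begin !== undefined ? 'BEGIN' : 'END';
    found.push({ type, index: match.index });
  }
  return found;
}

// The non-capturing prefix still consumes columns 1 to 6.
const consuming = scan('(?:^.{6})\\s{1,4}\\S+\\s+DIVISION');

// The lookbehind only asserts those columns, so the match starts after them.
const lookbehind = scan('(?<=^.{6})\\s{1,4}\\S+\\s+DIVISION');

console.log(consuming);
console.log(lookbehind);
```

With the consuming prefix the begin alternative wins at offset 0 and swallows the characters the end marker would need. With the lookbehind the begin match starts at offset 6, which leaves the earlier columns available to the end alternative.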

Replacing non capturing by lookbehind:

       {"beginRegex": "(?<=^.{6})\\s{1,4}\\S+\\s+DIVISION"
       ,"endRegex": ".(?=(\\s{1,4}\\S+\\s+DIVISION))"
       ,"foldLastLine": false
       ,"foldEOF": true
       }

lang: cobol, regex: /a^/

Something wrong... but what is wrong ?

daiyam commented 3 years ago

The parser that I'm using to detect groups doesn't support lookbehinds... (I'm getting the error SyntaxError: Expected "?!", "?:", "?=" or "^" but "?" found. I will add it to the debug output.)

The /a^/ is a valid regex but it won't match anything.
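
A small sketch of why `/a^/` works as a "match nothing" placeholder: it requires an `a` followed by the start of the input, which cannot happen once a character has been consumed. The `buildOrFallback` helper is hypothetical and only illustrates one way such a fallback could be wired; it is not the extension's code.

```javascript
const neverMatches = /a^/;

// None of these inputs can satisfy "an 'a' followed by the start of input".
const samples = ['', 'a', 'a^', 'aaa', '\na', 'a\na'];
const anyMatch = samples.some((s) => neverMatches.test(s));
console.log(anyMatch);

// Hypothetical fallback: if building the regex throws, return a regex
// that is valid but matches nothing, so the caller keeps working.
function buildOrFallback(source) {
  try {
    return new RegExp(source, 'g');
  } catch (error) {
    return /a^/;
  }
}

const broken = buildOrFallback('(?<='); // unterminated group, throws SyntaxError
const valid = buildOrFallback('DIVISION');
console.log(broken.source, valid.source);
```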

FALLAI-Denis commented 3 years ago

@daiyam

"regexp2" doesn't, but "PCRE2" does... ;-)

FALLAI-Denis commented 3 years ago

Hi @daiyam,

The extension is already great with the latest evolutions, and it would be perfect with these features : case insensitive and lookbehind.

(?i)expression and (?i:expression) for case insensitive are supported by PCRE2, apparently not supported by regexp2
(?<=expression) and (?<!expression) for positive and negative lookbehind are supported by PCRE2, apparently not supported by regexp2

Is it possible to use PCRE2 instead of regexp2?

Does the ^ character match the start of the line for regular expressions?

Thanks.

daiyam commented 3 years ago

regexp2 is only used to parse an expression and identify the groups so I can build the master expression correctly.

V8, the JavaScript engine, supports lookbehind expressions but not inline case-insensitive expressions.

VSCode is able to use PCRE2 expressions (ref) when doing a file search, but:

As a reminder, VS Code only supports regexes that are valid in JavaScript, because open editors are still searched using the editor's JavaScript-based search.

I've already tried to fix the bug in regexp2 but I can't compile the project due to bad dependencies... The best solution will be to search for another library.

daiyam commented 3 years ago

In the last update, yes, the ^ character is matching the start of the line.

FALLAI-Denis commented 3 years ago

Hi @daiyam

If it is not possible to use the inline modifier (?i) or (?i:expression) for each regexp marker, would it be possible to have an option to activate the global modifier i at the level of the full expression, per language, like /full_regexp/gi ?

something like:

    "folding":
      {"cobol": [
        {// COBOL is case insensitive
         "sensitive": false
          // Page eject
         , "separatorRegex":"^.{6}\\/"
         ,"strict": false
         ,"descendants": [
           {// Division
           "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?=\\s+DIVISION)"
           ,"strict": false
           ,"descendants": [
              {// Section
               "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?=\\s+SECTION)"
              ,"strict": false
              ,"descendants": [
                 {// Paragraph
                 "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?!\\s+(SECTION|DIVISION))"
                 } ]
              } ]
           } ]
        } ]
      }

To produce:

lang: cobol, regex: /(?<_5_0>^.{6}\/)|(?<_5_1>^.{6}\s{1,4}[A-Z0-9-_:]+(?=\s+DIVISION))|(?<_5_2>^.{6}\s{1,4}[A-Z0-9-_:]+(?=\s+SECTION))|(?<_5_3>^.{6}\s{1,4}[A-Z0-9-_:]+(?!\s+(SECTION|DIVISION)))/gi

Notice the i as a global modifier on the constructed regular expression.

daiyam commented 3 years ago

(?i:expression) can be emulated by transforming a or A to [aA] and [A-Z] to [A-Za-z].
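
The transformation can be sketched for the simplest case, a plain literal word. This is only an illustration under that assumption: the helper name `expandCaseInsensitive` is hypothetical, and a real translator also has to handle ranges, escapes and nested groups.

```javascript
// Hypothetical helper: expands a plain literal (not a full regex)
// into a sequence of two-letter character classes.
function expandCaseInsensitive(literal) {
  let out = '';
  for (const ch of literal) {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    if (lower !== upper && lower.length === 1 && upper.length === 1) {
      // Cased letter: accept both forms.
      out += `[${upper}${lower}]`;
    } else {
      // Anything else is kept, escaped if it is special in a regex.
      out += ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return out;
}

const expanded = expandCaseInsensitive('DIVISION');
console.log(expanded);

const word = new RegExp(`^${expanded}$`);
console.log(word.test('Division'), word.test('division'), word.test('DIVISIONS'));
```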

FALLAI-Denis commented 3 years ago

I may not have been very clear in my request.

A sequence like "[A-Za-z]" allows for a case-independent search, but I was thinking about full words like "DIVISION" and "SECTION" which participate in searches, but which in COBOL can be written in any case: "DIVISION", "division", "Division"...

I just made a live edit in the "foldingProvider.js" file in line 84 (or line 152 in "foldingProvider.ts"):

         `this.masterRegex = new RegExp (source, 'gi');`

And this provides the expected result: no more sensitivity to case for words "DIVISION" and "SECTION".
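
The effect of that flag can be checked in isolation. The snippet below is a standalone sketch with a made-up sample line; it only shows how the `i` flag changes the outcome for one of the separator expressions, not how the extension builds its master regex.

```javascript
// One of the separator expressions from the settings, as a JavaScript string.
const source = '^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?=\\s+DIVISION)';

// Hypothetical sample line: six blank columns, one space, lower-case header.
const line = '       procedure division.';

const sensitiveHit = new RegExp(source, 'g').test(line);
const insensitiveHit = new RegExp(source, 'gi').test(line);

console.log(sensitiveHit, insensitiveHit);
```

Note that the flag applies to the whole expression, including the `[A-Z0-9\-_:]` class, so every rule of the language becomes case insensitive at once.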

FALLAI-Denis commented 3 years ago

I am no TypeScript expert, but by parsing the source code I understood that what was expected after the language id was:

either an object containing the expression of a single folding rule,
or an array of objects, each containing the expression of a folding rule.

To implement case sensitivity management, globally, in relation to a language, it would have to be an object that is associated with the language id, this object being composed of different properties:

"insensitive": boolean, false by default, true if the language is case insensitive
"rules": array of folding rules, possibly composed of a single folding rule

Something like:

"folding":
  {"cobol": 
    {// COBOL is case insensitive
     "insensitive": true
     // COBOL is organized hierarchically :  page break, division, section, paragraph, sentence, statement
    ,"rules": [
       {// Page eject
        "separatorRegex":"^.{6}\\/"
        ,"strict": false
        ,"descendants": [
           {// Division
            "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?=\\s+DIVISION)"
            ,"strict": false
            ,"descendants": [
               {// Section
                "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?=\\s+SECTION)"
                ,"strict": false
                ,"descendants": [
                   {// Paragraph
                    "separatorRegex": "^.{6}\\s{1,4}[A-Z0-9\\-_:]+(?!\\s+(SECTION|DIVISION))"
                   } ]
               } ]
           } ]
       } ]
    }
  }
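
To make the proposal concrete, here is a hedged sketch of how the three shapes (single rule, array of rules, object with "insensitive" and "rules") could be normalized into one form. The function names `normalizeLanguageConfig` and `flagsFor` are hypothetical and do not come from the extension.

```javascript
// Hypothetical normalizer for the per-language value under "folding".
function normalizeLanguageConfig(value) {
  if (Array.isArray(value)) {
    // Existing form: an array of folding rules.
    return { insensitive: false, rules: value };
  }
  if (value !== null && typeof value === 'object' && Array.isArray(value.rules)) {
    // Proposed form: options plus a "rules" array.
    return { insensitive: value.insensitive === true, rules: value.rules };
  }
  // Existing form: a single folding rule.
  return { insensitive: false, rules: [value] };
}

// The flags the master regex would be built with.
function flagsFor(config) {
  return config.insensitive ? 'gi' : 'g';
}

const single = normalizeLanguageConfig({ separatorRegex: '^.{6}\\/' });
const list = normalizeLanguageConfig([{ separatorRegex: '^.{6}\\/' }]);
const proposed = normalizeLanguageConfig({
  insensitive: true,
  rules: [{ separatorRegex: '^.{6}\\/' }],
});

console.log(flagsFor(single), flagsFor(list), flagsFor(proposed));
```

Keeping the two existing shapes valid means current settings would not need to change.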

FALLAI-Denis commented 3 years ago

A simpler solution, which does not break with the existing one, is to consider that all languages are case insensitive, and to generalize the global modifier "i". If a language is case sensitive, then the language parser is the one that potentially signals a problem. Systematic use of the global modifier "i" can impact performance.

daiyam commented 3 years ago

I have a solution to support (?i), (?i:expression) and (?<=expression) but I will have time only next week 😉

FALLAI-Denis commented 3 years ago

Very nice. You have already invested a great deal in my requests and I thank you for that. I will wait until you have time to implement these latest developments. Have a good week.

daiyam commented 3 years ago

explicit-folding-0.11.0.vsix is supporting (?i), (?i:expression) and (?<=expression). Happy testing 😉

FALLAI-Denis commented 3 years ago

Hi @daiyam,

I did a quick test and it looks great to me.

My new version of settings:

    "folding":
      {
      "cobol": [
        {// Page eject
         "separatorRegex":"(?<=^.{6})\\/"
         ,"strict": false
         ,"descendants": [
           {// Division
           "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?=\\s+(?i:DIVISION))"
           ,"strict": false
           ,"descendants": [
              {// Section
               "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?=\\s+(?i:SECTION))"
              ,"strict": false
              ,"descendants": [
                 {// Paragraph
                 "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?!\\s+(?i:SECTION|DIVISION))"
                 } ]
              } ]
           } ]
        } ]
      }

Which, for the following source code (ExplicitFolding-20210506-01), gives the following analysis:

lang: cobol, regex: /(?<_5_0>(?<=^.{6})\/)|(?<_5_1>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?=\s+[Dd][Ii][Vv][Ii][Ss][Ii][Oo][Nn]))|(?<_5_2>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?=\s+[Ss][Ee][Cc][Tt][Ii][Oo][Nn]))|(?<_5_3>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?!\s+[Ss][Ee][Cc][Tt][Ii][Oo][Nn]|[Dd][Ii][Vv][Ii][Ss][Ii][Oo][Nn]))/g
line: 1, offset: 0, type: SEPARATOR, match: IDENTIFICATION, regex: 1
line: 2, offset: 0, type: SEPARATOR, match: PROGRAM-ID, regex: 3
line: 3, offset: 0, type: SEPARATOR, match: DATE-COMPILED, regex: 3
line: 4, offset: 0, type: SEPARATOR, match: ENVIRONMENT, regex: 1
line: 5, offset: 0, type: SEPARATOR, match: CONFIGURATION, regex: 2
line: 6, offset: 0, type: SEPARATOR, match: SOURCE-COMPUTER, regex: 3
line: 7, offset: 0, type: SEPARATOR, match: OBJECT-COMPUTER, regex: 3
line: 8, offset: 0, type: SEPARATOR, match: INPUT-OUTPUT, regex: 2
line: 9, offset: 0, type: SEPARATOR, match: FILE-CONTROL, regex: 3
line: 11, offset: 0, type: SEPARATOR, match: /, regex: 0
line: 12, offset: 0, type: SEPARATOR, match: DATA, regex: 1
line: 13, offset: 0, type: SEPARATOR, match: FILE, regex: 2
line: 14, offset: 0, type: SEPARATOR, match: FD, regex: 3
line: 17, offset: 0, type: SEPARATOR, match: 01, regex: 3
line: 20, offset: 0, type: SEPARATOR, match: WORKING-STORAGE, regex: 2
line: 21, offset: 0, type: SEPARATOR, match: 01, regex: 3
line: 24, offset: 0, type: SEPARATOR, match: 01, regex: 3
line: 27, offset: 0, type: SEPARATOR, match: LINKAGE, regex: 2
line: 28, offset: 0, type: SEPARATOR, match: 01, regex: 3
line: 30, offset: 0, type: SEPARATOR, match: PROCEDURE, regex: 1
line: 31, offset: 0, type: SEPARATOR, match: MAIN, regex: 2
line: 32, offset: 0, type: SEPARATOR, match: START-OF-RUN, regex: 3
line: 38, offset: 0, type: SEPARATOR, match: END-OF-RUN, regex: 3
line: 40, offset: 0, type: SEPARATOR, match: END, regex: 3
foldings: [{"start":0,"end":2,"kind":3},{"start":4,"end":6,"kind":3},{"start":8,"end":9,"kind":3},{"start":7,"end":9,"kind":3},{"start":3,"end":9,"kind":3},{"start":13,"end":15,"kind":3},{"start":16,"end":18,"kind":3},{"start":12,"end":18,"kind":3},{"start":20,"end":22,"kind":3},{"start":23,"end":25,"kind":3},{"start":19,"end":25,"kind":3},{"start":27,"end":28,"kind":3},{"start":26,"end":28,"kind":3},{"start":11,"end":28,"kind":3},{"start":31,"end":36,"kind":3},{"start":37,"end":38,"kind":3},{"start":30,"end":39,"kind":3},{"start":29,"end":39,"kind":3},{"start":10,"end":39,"kind":3}]

I also tested the sequence (?i), but although it is accepted in regular expressions, it does not work: no patterns are found anymore. I haven't had time to figure out where the problem is yet.

But I am fully satisfied with the result obtained with (?i:expression).

daiyam commented 3 years ago

Yep, I'm having issues with my translation of the PCRE2 syntax (?i)[A-Za-z0-9\-_:]. I get the JavaScript regexp [A-Za-zA-Za-z0-9\-[__][::]]. Aah, some stupid issues... The regexp parser/translator (https://github.com/daiyam/node-regexp) was done in 1 day. So it's expected. I will fix those.

daiyam commented 3 years ago

For (?!\s+(?i:SECTION|DIVISION)), I believe \s+ should be applied for both alternatives (SECTION and DIVISION). This is another fix to do...
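
The scoping problem can be shown on the lookahead body alone. In this sketch the two bodies are anchored and tested as whole strings; the names `flattened`, `grouped` and `matchesWhole` are hypothetical. Dropping the `(?i:...)` wrapper without replacing it by a group lets the `|` split the whole body, so `\s+` only applies to the first alternative.

```javascript
const SECTION = '[Ss][Ee][Cc][Tt][Ii][Oo][Nn]';
const DIVISION = '[Dd][Ii][Vv][Ii][Ss][Ii][Oo][Nn]';

// Translation without a group: "\s+SECTION" OR "DIVISION".
const flattened = `\\s+${SECTION}|${DIVISION}`;

// Translation with a non-capturing group: "\s+" then "SECTION" OR "DIVISION".
const grouped = `\\s+(?:${SECTION}|${DIVISION})`;

// Test a body against a complete string.
function matchesWhole(body, text) {
  return new RegExp(`^(?:${body})$`).test(text);
}

console.log(matchesWhole(flattened, ' division'), matchesWhole(flattened, 'division'));
console.log(matchesWhole(grouped, ' division'), matchesWhole(grouped, 'division'));
```

Wrapping the translated alternatives in `(?:...)` keeps the original precedence, which is the form visible in the later debug output.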

FALLAI-Denis commented 3 years ago

@daiyam

Yes, there is a problem with \s+(?i:SECTION|DIVISION). Perhaps I should code \s+(?i:(SECTION|DIVISION)) or \s+((?i:SECTION|DIVISION))?

daiyam commented 3 years ago

explicit-folding-0.11.0.vsix fixes the issues mentioned above. (?i) and (?i:expression) should work as expected. So you shouldn't have to change your syntax.

FALLAI-Denis commented 3 years ago

New test with last version (I moved the processing of the page break separator at the end of the hierarchy):

    "folding":
      {"cobol": [
         {// Division
          "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?=\\s+(?i:DIVISION))"
         ,"strict": false
         ,"descendants": [
            {// Section
             "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?=\\s+(?i:SECTION))"
            ,"strict": false
            ,"descendants": [
               {// Paragraph
                "separatorRegex": "(?<=^.{6}\\s{1,4})[A-Za-z0-9\\-_:]+(?!\\s+(?i:SECTION|DIVISION))"
                ,"strict": false
                ,"descendants": [
                   {// Page eject
                    "separatorRegex":"(?<=^.{6})\\/"
                   }
                 ]
               }
             ]
            }
          ]
         }
       ]
      }

lang: cobol, regex: /(?<_5_0>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?=\s+(?:[Dd][Ii][Vv][Ii][Ss][Ii][Oo][Nn])))|(?<_5_1>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?=\s+(?:[Ss][Ee][Cc][Tt][Ii][Oo][Nn])))|(?<_5_2>(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?!\s+(?:[Ss][Ee][Cc][Tt][Ii][Oo][Nn]|[Dd][Ii][Vv][Ii][Ss][Ii][Oo][Nn])))|(?<_5_3>(?<=^.{6})\/)/g
line: 1, offset: 0, type: SEPARATOR, match: IDENTIFICATION, regex: 0
line: 2, offset: 0, type: SEPARATOR, match: PROGRAM-ID, regex: 2
line: 3, offset: 0, type: SEPARATOR, match: DATE-COMPILED, regex: 2
line: 4, offset: 0, type: SEPARATOR, match: ENVIRONMENT, regex: 0
line: 5, offset: 0, type: SEPARATOR, match: CONFIGURATION, regex: 1
line: 6, offset: 0, type: SEPARATOR, match: SOURCE-COMPUTER, regex: 2
line: 7, offset: 0, type: SEPARATOR, match: OBJECT-COMPUTER, regex: 2
line: 8, offset: 0, type: SEPARATOR, match: INPUT-OUTPUT, regex: 1
line: 9, offset: 0, type: SEPARATOR, match: FILE-CONTROL, regex: 2
line: 11, offset: 0, type: SEPARATOR, match: /, regex: 3
line: 12, offset: 0, type: SEPARATOR, match: DATA, regex: 0
line: 13, offset: 0, type: SEPARATOR, match: FILE, regex: 1
line: 14, offset: 0, type: SEPARATOR, match: FD, regex: 2
line: 17, offset: 0, type: SEPARATOR, match: 01, regex: 2
line: 20, offset: 0, type: SEPARATOR, match: WORKING-STORAGE, regex: 1
line: 21, offset: 0, type: SEPARATOR, match: 01, regex: 2
line: 24, offset: 0, type: SEPARATOR, match: 01, regex: 2
line: 27, offset: 0, type: SEPARATOR, match: LINKAGE, regex: 1
line: 28, offset: 0, type: SEPARATOR, match: 01, regex: 2
line: 30, offset: 0, type: SEPARATOR, match: PROCEDURE, regex: 0
line: 31, offset: 0, type: SEPARATOR, match: MAIN, regex: 1
line: 32, offset: 0, type: SEPARATOR, match: START-OF-RUN, regex: 2
line: 38, offset: 0, type: SEPARATOR, match: END-OF-RUN, regex: 2
line: 40, offset: 0, type: SEPARATOR, match: END, regex: 2
foldings: [{"start":0,"end":2,"kind":3},{"start":4,"end":6,"kind":3},{"start":8,"end":10,"kind":3},{"start":7,"end":10,"kind":3},{"start":3,"end":10,"kind":3},{"start":13,"end":15,"kind":3},{"start":16,"end":18,"kind":3},{"start":12,"end":18,"kind":3},{"start":20,"end":22,"kind":3},{"start":23,"end":25,"kind":3},{"start":19,"end":25,"kind":3},{"start":27,"end":28,"kind":3},{"start":26,"end":28,"kind":3},{"start":11,"end":28,"kind":3},{"start":31,"end":36,"kind":3},{"start":37,"end":38,"kind":3},{"start":30,"end":39,"kind":3},{"start":29,"end":39,"kind":3}]

FALLAI-Denis commented 3 years ago

I think a new release of the extension could now be delivered.

FALLAI-Denis commented 3 years ago

For information, lookbehind works with variable length expressions, as in the case (?<=^.{6}\\s{1,4}), which PCRE2 does not allow. It's a useful feature.
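
A short check of that behaviour in plain JavaScript, using made-up sample lines with a six-character sequence area. The lookbehind accepts one to four spaces after column 6, so its length varies from line to line.

```javascript
// Lookbehind of variable length: 6 columns plus 1 to 4 whitespace characters.
const division = /(?<=^.{6}\s{1,4})[A-Za-z0-9\-_:]+(?=\s+DIVISION)/;

// Hypothetical sample lines.
const oneSpace = '000100 DATA DIVISION.';
const fourSpaces = '000100    DATA DIVISION.';
const fiveSpaces = '000100     DATA DIVISION.';

const first = division.exec(oneSpace);
const second = division.exec(fourSpaces);
const third = division.exec(fiveSpaces);

console.log(first && first.index, second && second.index, third);
```

Only the name is part of the match; the columns and the spaces before it are asserted but not consumed.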

FALLAI-Denis commented 3 years ago

Release 0.12.x released !

zokugun / vscode-explicit-folding

Add support for "ignore case", "multilines" and "lookbehind" #40