zufuliu / notepad4

Notepad4 (Notepad2⨯2, Notepad2++) is a light-weight Scintilla based text editor for Windows with syntax highlighting, code folding, auto-completion and API list for many programming languages and documents, bundled with file browser plugin matepath.

Port boost::regex as regex search engine. #722

Closed atauzki closed 10 months ago

atauzki commented 10 months ago

The code is mainly from other open source projects. Tested with mingw-gcc and clang-msvc, but there is a problem compiling with MSVC: the __vectorcall calling convention is not compatible with boost::regex.

zufuliu commented 10 months ago

Notepad++'s GPL code can't be used here. Are AnsiDocumentIterator and UTF8DocumentIterator the same as the other two iterators inside Document.cxx for NO_CXX11_REGEX?

atauzki commented 10 months ago

They should have the same function as those two classes.

zufuliu commented 10 months ago

I have not yet decided how to handle this PR (external regex engine); the builtin engine is currently used for auto-completion (finding words in the document):

https://github.com/zufuliu/notepad2/blob/7cfbd60490c0edd2d1cb7199a3004fbc5a669b7f/src/EditAutoC.c#L804-L818

Though I don't know how, extending the builtin engine to add the missing syntax would give a smaller binary than using C++, Boost or other engines.

zufuliu commented 10 months ago

The experimental regex syntax (3~6 times slower than plain find) was removed by 631662f39d7c55b33f545cf48bb15897dc71db00. We can use SCI_OWNREGEX to make a custom regex engine (based on the BuiltinRegex at the end of Document.cxx), which should result in clean changes (e.g. implementing the custom engine in separate files).

It may be better to examine and compare other regex libraries for few dependencies, small binary size, etc.

atauzki commented 10 months ago

PCRE2 or Google RE2? PCRE2's library seems to be slightly lighter and faster; RE2 is the fastest among these libraries but may lack functionality.

zufuliu commented 10 months ago

PCRE2 or Google RE2?

Not sure. There are others listed on https://handwiki.org/wiki/Software:Comparison_of_regular_expression_engines. Or just make one with std::regex (JavaScript syntax) without code from RESearch.cxx.

atauzki commented 10 months ago

I've seen a post on Zhihu and Boost's benchmark data. They both show that the std::regex implementation is very slow.

Oniguruma may be slower than boost::regex. I used it before in EmEditor, but that uses an old version.

zufuliu commented 10 months ago

With BOOST_REGEX_STANDALONE, only about 40 header files inside https://github.com/boostorg/regex/tree/develop/include/boost are required. I think Boost regex can be integrated with the following rough steps:

atauzki commented 10 months ago

Okay. In addition, there's a VC project for PCRE2. But I don't have enough time on weekdays to apply it to another branch and test it.

zufuliu commented 10 months ago

But I don't have enough time on weekdays to apply it to another branch and test it.

Integrating other engines is more complicated; most of them only support a (UTF-8 encoded) string parameter instead of a custom iterator.
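To illustrate what "custom iterator" support means here, the following is a minimal sketch using std::regex rather than Boost or Scintilla code. SplitDocument, DocIterator and FindFirst are hypothetical names invented for this example; the two-piece document only stands in for a gap buffer. An engine that accepts any bidirectional iterator can search such storage in place, whereas a string-only engine would first need the text copied into one contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical document stored in two pieces, standing in for a gap buffer.
struct SplitDocument {
    std::string head;
    std::string tail;
    std::size_t Length() const { return head.size() + tail.size(); }
    char CharAt(std::size_t pos) const {
        assert(pos < Length());
        return pos < head.size() ? head[pos] : tail[pos - head.size()];
    }
};

// Minimal bidirectional iterator over the document bytes.
class DocIterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char *;
    using reference = char;

    DocIterator() = default;
    DocIterator(const SplitDocument *doc_, std::size_t pos_) : doc(doc_), pos(pos_) {}

    char operator*() const { return doc->CharAt(pos); }
    DocIterator &operator++() { ++pos; return *this; }
    DocIterator operator++(int) { DocIterator old(*this); ++pos; return old; }
    DocIterator &operator--() { --pos; return *this; }
    DocIterator operator--(int) { DocIterator old(*this); --pos; return old; }
    bool operator==(const DocIterator &other) const { return doc == other.doc && pos == other.pos; }
    bool operator!=(const DocIterator &other) const { return !(*this == other); }
    std::size_t Pos() const { return pos; }

private:
    const SplitDocument *doc = nullptr;
    std::size_t pos = 0;
};

// Returns the start position of the first match, or -1 when nothing matches.
std::ptrdiff_t FindFirst(const SplitDocument &doc, const std::string &pattern) {
    const std::regex re(pattern);
    std::match_results<DocIterator> match;
    const DocIterator first(&doc, 0);
    const DocIterator last(&doc, doc.Length());
    if (std::regex_search(first, last, match, re)) {
        return static_cast<std::ptrdiff_t>(match[0].first.Pos());
    }
    return -1;
}
```

With a document split as "hello wo" and "rld 42", FindFirst(doc, "wor") finds a match that spans both pieces without any copying.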

zufuliu commented 10 months ago

Related change for speed: https://sourceforge.net/p/scintilla/feature-requests/1500/

atauzki commented 10 months ago

There's a bug in find previous match: it always finds the next match and keeps finding it in the same place. The std::regex related code may have the same issue.
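One common way to get a "previous match" out of a forward-only engine is to scan forward over the text before the caret and keep the last hit. The sketch below shows that idea with std::regex on a plain std::string; it is not the code from this PR, and FindPrevious is a hypothetical helper. It assumes that matches must end at or before the given position, and it is linear in the size of the searched range, so real code would likely restrict the scan (for example line by line).

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the start position of the last match that lies entirely inside
// [0, before), or -1 when there is none. Assumption: forward scanning only.
std::ptrdiff_t FindPrevious(const std::string &text, const std::regex &re, std::size_t before) {
    assert(before <= text.size());
    std::ptrdiff_t last = -1;
    const auto rangeEnd = text.begin() + static_cast<std::ptrdiff_t>(before);
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), rangeEnd, re); it != end; ++it) {
        // position() is measured from text.begin(), so it is an absolute offset.
        last = static_cast<std::ptrdiff_t>(it->position(0));
    }
    return last;
}
```

Calling it again with `before` set to the returned position moves one match further back, which avoids getting stuck on the same match.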

zufuliu commented 10 months ago

@atauzki can you rebase the code on main and change the maxTag loop to the following (see https://github.com/boostorg/regex/issues/197) to pass CI builds?

        const int maxTag = std::min(static_cast<int>(match.size()), RESearch::MAXTAG);
        for (int co = 0; co < maxTag; co++) {
            search.bopat[co] = match[co].first.Pos();
            search.eopat[co] = match[co].second.PosRoundUp();
        }

atauzki commented 10 months ago

Reverse search bugs that still exist:

  1. Cannot include newlines.
  2. Empty matches like \b return the last character. (Fixed)

And a forward search bug: \b doesn't go to the next occurrence, it just stays at the original place. (Fixed)

The ^ and $ behavior is problematic too.

zufuliu commented 10 months ago

It seems more work is required to make it usable. The speed is slower than expected: ByteIterator is a bit slower than Scintilla's builtin regex, and UTF8Iterator is even slower.

atauzki commented 10 months ago

The ^ and $ behavior is problematic too.

The boost::regex_constants::match_not_eol flag doesn't work as expected, and match_not_bol seems to be wrongly placed, which causes the search to keep finding the original position at the line start forever.
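For reference, here is a small sketch of what these two flags are meant to do when only part of a line is searched. It uses the std::regex equivalents (std::regex_constants::match_not_bol and match_not_eol), not Boost, so Boost's behavior may differ in details. MatchesInRange and its startsLine/endsLine parameters are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Searches text[start, end). When the range does not begin or end on a line
// boundary, the caller says so and ^ or $ must not match at that edge.
bool MatchesInRange(const std::string &text, std::size_t start, std::size_t end,
                    const std::regex &re, bool startsLine, bool endsLine) {
    assert(start <= end && end <= text.size());
    auto flags = std::regex_constants::match_default;
    if (!startsLine) {
        flags = flags | std::regex_constants::match_not_bol; // ^ fails at range start
    }
    if (!endsLine) {
        flags = flags | std::regex_constants::match_not_eol; // $ fails at range end
    }
    const auto first = text.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = text.begin() + static_cast<std::ptrdiff_t>(end);
    return std::regex_search(first, last, re, flags);
}
```

If a continued search after a match at the line start is restarted without match_not_bol, the pattern ^ can match at the same position again, which is one plausible way to end up with the endless loop described above.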

zufuliu commented 10 months ago

A basic version (synced from the std::regex code) has been committed to the boost_regex branch (for further development).