zyedidia / micro

A modern and intuitive terminal-based text editor
https://micro-editor.github.io
MIT License

highlighter: Change the parsing approach to significantly improve performance #3127

Open JoeKar opened 8 months ago

JoeKar commented 8 months ago

The performance of the current parsing approach can't be improved without changing the whole highlighter code. Because of this, the change isn't without risk, but it's definitely worth the try. Please see the following list, which was produced on the same host with micro -profile. Each test was stopped after the 80x24 view was completely highlighted, or aborted due to the known "endless" recursion (DNF). Afterwards the top1 entry was printed with pprof.

| file | pprof top1 before | pprof top1 after |
| --- | --- | --- |
| 1 | 1.93s 12.26% 12.26% 1.93s 12.26% unicode/utf8.DecodeRune | 10ms 100% 100% 10ms 100% runtime.futex |
| 2 | (DNF) 5.41s 14.64% 14.64% 5.41s 14.64% unicode/utf8.DecodeRune | 10ms 100% 100% 10ms 100% runtime.writeHeapBits.flush |
| 3 | 10ms 20.00% 20.00% 10ms 20.00% github.com/zyedidia/tcell/v2.(*tScreen).SetContent | 20ms 40.00% 40.00% 20ms 40.00% github.com/zyedidia/micro/v2/internal/util.CharacterCount |
| 4 | 10ms 20.00% 20.00% 10ms 20.00% crypto/md5.block | 10ms 25.00% 25.00% 10ms 25.00% gopkg.in/yaml%2ev2.yaml_parser_update_buffer |
| 5 | 10ms 50.00% 50.00% 10ms 50.00% gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar | 10ms 33.33% 33.33% 10ms 33.33% runtime.(*consistentHeapStats).acquire |
| 6 | (DNF) 8.79s 27.01% 27.01% 14.53s 44.65% regexp.(*Regexp).tryBacktrack | 10ms 20.00% 20.00% 10ms 20.00% github.com/zyedidia/micro/v2/internal/util.DecodeCharacter |
  1. tileset_env_test from #3115 (reduced version)
  2. tileset_env_test from #3115
  3. sample.md from #2839
  4. sample.md from #2839 (with inserted <script>)
  5. Firefox's new tab page (reduced version)
  6. Firefox's new tab page

My other available test files produced the same or even more complex highlighting results (e.g. pattern highlighting within regions in HTML files). Most probably the logic isn't in perfect shape yet, but it definitely works as a proof of concept.

Please help to test and improve it with a review. It took a lot of days to get this far, and it would be a shame if we didn't get this upstream in some form. :wink:

Fixes #2839 Fixes #3115 Closes #3242

JoeKar commented 8 months ago

By default Vim stops highlighting at a line length of 3k, but micro now doesn't (care): [screenshot]

dustdfg commented 8 months ago
  1. Can confirm it really solves https://github.com/zyedidia/micro/issues/3115. While testing it, I noticed nothing undesirable.
  2. Replaced my current executable with the one from this PR (+ some other PRs which I believe won't influence this PR), so I will report if I find regressions.

There is a problem, maybe related, maybe not: create a file with 1MB of characters on one line and micro will freeze, or be slow but still work. It starts to work normally once no characters from that big line are displayed on the screen.

JoeKar commented 8 months ago

Create a file with 1MB of characters on one line and micro will freeze, or be slow but still work.

This is caused by https://github.com/zyedidia/micro/blob/master/internal/util/unicode.go#L69, i.e. utf8.DecodeRune, which takes most of the time. The behavior was more or less the same before.

dustdfg commented 8 months ago

Create a file with 1MB of characters on one line and micro will freeze, or be slow but still work.

This is caused by https://github.com/zyedidia/micro/blob/master/internal/util/unicode.go#L69, i.e. utf8.DecodeRune, which takes most of the time. The behavior was more or less the same before.

Yeah, but I'm almost sure that's not the whole story. Why? I once tried assuming that my file is ASCII only, so no Unicode, and deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better, but not by much.

JoeKar commented 8 months ago

Yeah, but I'm almost sure that's not the whole story. Why? I once tried assuming that my file is ASCII only, so no Unicode, and deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better, but not by much.

Yes, you're right. Due to the randomness it finds a lot of "regions" within this long line and starts slicing them one after another. The problem is that this slicing process is also based on decoding the characters, to identify the correct positions of the code points within the byte slice. Currently this isn't optimized, but I have no good idea how it could be done better.

dustdfg commented 8 months ago

Yeah, but I'm almost sure that's not the whole story. Why? I once tried assuming that my file is ASCII only, so no Unicode, and deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better, but not by much.

Yes, you're right. Due to the randomness it finds a lot of "regions" within this long line and starts slicing them one after another. The problem is that this slicing process is also based on decoding the characters, to identify the correct positions of the code points within the byte slice. Currently this isn't optimized, but I have no good idea how it could be done better.

Randomness?

JoeKar commented 8 months ago

Create a file with 1MB of characters on one line and micro will freeze, or be slow but still work.

Didn't you use something like the following to create the file?

cat /dev/urandom | tr -dc 'A-Za-z0-9!"#$%&'\''()*+,-./:;<=>?@[]^_`{|}~' | head -c 1M > test.sh

That's what I used, and because of that I get some "" & '' regions.

dustdfg commented 8 months ago

Create a file with 1MB of characters on one line and micro will freeze, or be slow but still work.

Didn't you use something like the following to create the file?

cat /dev/urandom | tr -dc 'A-Za-z0-9!"#$%&'\''()*+,-./:;<=>?@[]^_`{|}~' | head -c 1M > test.sh

That's what I used, and because of that I get some "" & '' regions.

Yeah, but I used base64, which (as far as I know) doesn't include any quote characters, plus sed (to delete the \n, because micro couldn't do it).

dmaluka commented 8 months ago

I believe the problem with very long lines is that micro stores line data in memory simply as an array of raw bytes, not an array of unicode runes. I.e. it doesn't decode the line into runes beforehand. So whenever micro needs to know anything about a visual representation of any part of a line, it decodes it on the fly. (So for instance, if this is a very long line and we are closer to the end of it, micro has to decode almost the entire line, and it has to do that every time when redrawing the screen and so on. So things become extremely slow.)

Heck, micro doesn't even know how many characters there are in a line. It has to call util.CharacterCount() every time to find out.

If we change it to decode and store lines as slices of runes beforehand, it should solve these problems, and generally improve performance. I really don't get why it wasn't done this way from the beginning.
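To make the cost concrete, here is a minimal sketch (not micro's actual util.CharacterCount, just an assumed shape of the counting loop): with raw bytes, even just counting the characters in a line means decoding every rune to find where the next one starts.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// characterCount walks the whole byte slice, decoding each rune only to
// learn its byte width. Any per-character query on raw bytes pays this
// O(n) cost again and again.
func characterCount(b []byte) int {
	n := 0
	for len(b) > 0 {
		_, size := utf8.DecodeRune(b)
		b = b[size:]
		n++
	}
	return n
}

func main() {
	fmt.Println(characterCount([]byte("blé"))) // 3 characters in 4 bytes
}
```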

JoeKar commented 8 months ago

Yep, that's a good idea to solve this here as well. Maybe I'll take care of it in the next few days. If someone would like to work on it earlier, give it a go and we will merge it into this PR afterwards. :wink:

dustdfg commented 8 months ago

If we change it to decode and store lines as slices of runes beforehand, it should solve these problems, and generally improve performance. I really don't get why it wasn't done this way from the beginning.

I see one possible reason: rune is int32. It would be like using UTF-32 for all characters even if our document contains only ASCII.

dmaluka commented 8 months ago

It will be great if you take care of it. But why do it in this PR? It is not really a highlighter issue. It is an issue of efficient representation of a line in the basic data structures (so it is more fundamental than the highlighter issues, and affects us even when syntax highlighting is not used at all).

Initial rough idea of where we might start:

--- a/internal/buffer/line_array.go
+++ b/internal/buffer/line_array.go
@@ -41,10 +41,16 @@ type searchState struct {
    done       bool
 }

+type Character struct {
+   r     rune
+   combc []rune
+}
+
 // A Line contains the data in bytes as well as a highlight state, match
 // and a flag for whether the highlighting needs to be updated
 type Line struct {
    data []byte
+   char []Character

    state       highlight.State
    match       highlight.LineMatch

dmaluka commented 8 months ago

I see one possible reason: rune is int32. It would be like using UTF-32 for all characters even if our document contains only ASCII.

Yes, it costs memory. Isn't it worth it?

dustdfg commented 8 months ago

I see one possible reason: rune is int32. It would be like using UTF-32 for all characters even if our document contains only ASCII.

Yes, it costs memory. Isn't it worth it?

I don't know. I can only say that if I were a developer writing a new app, I would prefer to build on bytes because they are cheaper (in size), simply because that's the most obvious solution.

If we have a 1MB file, we would store 4MB in memory. That isn't so bad for most hardware, but I'm not sure I'd want it. Just as a user, I would prefer to have different modes, which isn't so easy for the developer...
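For concreteness, a rough sketch of the memory math (my own illustration, assuming pure-ASCII content):

```go
package main

import "fmt"

func main() {
	line := []byte("just ASCII text") // raw bytes: 1 byte per character
	runes := []rune(string(line))     // rune is int32: 4 bytes per character

	fmt.Println(len(line))      // 15 bytes of payload
	fmt.Println(4 * len(runes)) // 60 bytes of payload for the same text,
	                            // i.e. a 1MB ASCII file becomes ~4MB
}
```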

dustdfg commented 8 months ago

It will be great if you take care of it. But why do it in this PR? It is not really a highlighter issue. It is an issue of efficient representation of a line in the basic data structures (so it is more fundamental than the highlighter issues, and affects us even when syntax highlighting is not used at all).

Yeah it is a separate issue

Initial rough idea of where we might start:

--- a/internal/buffer/line_array.go
+++ b/internal/buffer/line_array.go
@@ -41,10 +41,16 @@ type searchState struct {
    done       bool
 }

+type Character struct {
+   r     rune
+   combc []rune
+}
+
 // A Line contains the data in bytes as well as a highlight state, match
 // and a flag for whether the highlighting needs to be updated
 type Line struct {
    data []byte
+   char []Character

    state       highlight.State
    match       highlight.LineMatch

Storing the data in bytes and storing the same data in ints? I can understand a replacement, but an addition? Why?

dmaluka commented 8 months ago

Storing the data in bytes and storing the same data in ints? I can understand a replacement, but an addition? Why?

As I said, this is just an initial rough idea. I hope we won't need to keep both representations. OTOH, for example, if we need to keep LineBytes() for compatibility with plugins... (Some of my plugins do use it, and I can change them if needed, but I'm not sure about other people's plugins...)

JoeKar commented 8 months ago

But why do it in this PR? It is not really a highlighter issue.

Yeah it is a separate issue

You're both right that the non-decoded bytes, and thus their repetitive decoding, are a general performance degradation independent of the highlighting, but the feature suffering the most from it is the highlighting... at least from the end user perspective. The impact is easier to "feel"/measure with syntax highlighting, and due to the common usage of byte streams this interface is tightly coupled with the highlighting process. A rebase is then most likely. :wink:

Anyway, I will find a solution for that lazy-developer reason. :smile:

dmaluka commented 8 months ago

but the feature suffering the most from it is the highlighting... at least from the end user perspective.

I don't think so. Try it: create a file long.txt with a long line containing 1 million characters, open it (the filetype will be unknown => no highlighting), move to the end of this long line and, for example, just try moving the cursor around. It's gonna be very slow. And if you also, for example, enable softwrap, it will be very slow even at the beginning of the long line.

Just as an example, how many times do we call RuneUnder() every time we refresh the display in displayBuffer()?

due to the common usage of byte streams this interface is tightly coupled with the highlighting process.

I don't think the highlighting is special here. Replacing byte streams with rune streams, although conceptually simple, is gonna be a huge change, reworking so many parts of code throughout micro (look at all the places where LineBytes() is used, for example). Changes in the highlighter would be just a small part of it, I think.

A rebase is then most likely.

Nothing's wrong with that.

JoeKar commented 8 months ago

Ok, you convinced me once again. I started yesterday evening with a small rework (independent from the highlighting): https://github.com/JoeKar/micro/tree/feature/perf-rune-lines

The branch is highly in flux and the result (the line handling) is far from production-ready. The start is done, and it now needs a lot of refactoring, fixing and adaptation in a lot of different places.

dustdfg commented 8 months ago
  1. Can confirm it really solves Micro does not respnoe when edit specific file #3115. While testing it, I noticed nothing undesirable.

  2. Replaced my current executable with the one from this PR (+ some other PRs which I believe won't influence this PR), so I will report if I find regressions.

There is a problem, maybe related, maybe not: create a file with 1MB of characters on one line and micro will freeze, or be slow but still work. It starts to work normally once no characters from that big line are displayed on the screen.

I still use it and haven't encountered any problems, so I think @dmaluka is right and the highlighting should be a separate pull request (and issue?)

niten94 commented 7 months ago

I was looking at /usr/include/unistd.h from the amd64 version of the libc6-dev package in Debian 11, and there were multi-line comments with text not highlighted as comment text, like this: [screenshot]

I was looking at this file when testing a bit: c-comment.c.txt [screenshot]

I think only the line where a multi-line comment starts is highlighted as comment text if it contains text that would be highlighted using regions outside a comment. The other text in that line is also not highlighted as comment text if such a region would end in it outside a comment.

The lines between the first and last line of a character literal are not highlighted as error text. The other text in the line where a character literal ends is highlighted as literal text if the literal ends on another line.

There are similar bugs, but I don't know whether they are actually the same ones, so I haven't written about them.

dmaluka commented 7 months ago

@niten94 I can also see this bug (and don't see it without this PR).

JoeKar commented 7 months ago

[...] so I think @dmaluka is right and the highlighting should be a separate pull request (and issue?)

Yes, this one shouldn't be mixed up with the other idea/improvement of optimizing how the line data is stored.

@niten94 I can also see this bug (and don't see it without this PR).

P.S. This is not the most important thing at the moment. I'd suggest to first address the regression reported by @niten94 in https://github.com/zyedidia/micro/pull/3127#issuecomment-1937714385 because maybe it means that the entire approach of this PR is wrong.

Yes, it's a regression, which will be fixed in the next few days. Sorry for the inconvenience. AFAICS it sounds again like a logic issue within the nested region detection, which now uses all indices found in a line. But thank you already for the testing, because it is really welcome here.

To make the API more intuitive, I'd suggest to remove statesOnly from Highlight() (but not from the internal highlight() function) and restore the ReHighlightStates() function which would call highlight() with statesOnly=true.

Side note: it seems we have redundant steps in this loop, which we can avoid by changing i++ to i = l + 1.

I will consider both.
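For illustration, the side note's i = l + 1 idea might look roughly like this (a self-contained sketch with invented data, not the highlighter's actual loop):

```go
package main

import "fmt"

// A sketch of the suggested change: after handling the span that covers
// indices i..l, continue at l+1 instead of stepping one index at a time.
func main() {
	spans := [][2]int{{0, 2}, {3, 3}, {4, 7}} // hypothetical matches: [first, last]
	s := 0
	for i := 0; i < 8; {
		if s < len(spans) && spans[s][0] == i {
			l := spans[s][1]
			fmt.Println("handle span", i, "..", l)
			s++
			i = l + 1 // instead of i++, which would re-visit covered indices
		} else {
			i++
		}
	}
}
```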

JoeKar commented 7 months ago

I was looking at this file when testing a bit: c-comment.c.txt

Thanks a lot for providing the example file. This helps to save time and invest it into fixing the bugs. :wink:

dustdfg commented 7 months ago

The PR mishandles tab characters:

When I start typing the name of a property for an svg element (at the end of the tag), it doesn't highlight it fully, only partially. It depends on the amount of tabs in the line. The number of non-highlighted symbols is countOfTabs + 1.

[screenshot]

dustdfg commented 7 months ago

One more strange thing about the svg highlighting (it exists in the whole PR, not only the last patch):

File: marker.txt

The new behavior is that it doesn't highlight marker and path as tag names: [screenshot]

The old behavior (which is also wrong...) is that it highlighted viewBox with the wrong color: [screenshot]

dustdfg commented 7 months ago

The last patch broke multi-line properties in xml, which are, yes, usually not used by people...

File: tmp.txt

This patch's behavior: [screenshot]

Previous version of this PR: [screenshot]

Before the PR: [screenshot]

The funniest thing is that all three versions have some mistakes, but different ones in each...

JoeKar commented 7 months ago

@dustdfg I will check all of your complaints, and thank you for the example files.

Doesn't look like it's going to leave me alone any time soon. :cry:

dustdfg commented 7 months ago

I just used the multi-line file one more time and found that the problem doesn't occur with tabs, but does occur with spaces, in the last patch (the 3rd problem).

JoeKar commented 7 months ago

Unfortunately the SVG highlighting is still not perfect: [screenshot] Look at the / at the end of <path d="M 0 0 L 10 5 L 0 10 z"/>, which is highlighted as identifier instead of symbol.tag. This happens because, in the first round, scanning for ` &=` already finds the subsequent matches within the upcoming string, but the string is parsed in the next round and the current region doesn't know about it. That is then indeed a point for the old approach of scanning/slicing region per region per region..., which was slow because it parsed the same data over and over again.

I can hear @dmaluka already thinking:

When does this guy stop riding his dead horse? The performance of the highlighting wouldn't be that bad any longer once the line data approach has changed from bytes to (combined) runes!

:smile:

To be honest, I'm running out of arguments for this changed highlighting mechanism, because the maintenance is becoming very complex and still not every corner case is 100% supported and thus perfectly highlighted.

Maybe now is a good point to decide whether it's really worth the effort to keep improving this "proof" of concept once more uncovered corner cases are found. What do you think? Better to invest the time in the rework of the line_array and everything depending on it?

dmaluka commented 7 months ago

I don't know. Maybe. It depends on how much of the stuff in your PR addresses issues caused specifically by slow slicing, and not something else (e.g. infinite recursion cannot possibly be fixed by optimizing slicing; it's an entirely different issue, right?). I still haven't had time to get familiar with your PR, or even with the existing highlighter code (which, it seems, got more complicated and non-obvious after your older PR #2840).

BTW it would be nice if you documented what exactly your change "highlighter: Rework of matching approach to speed up the processing" is doing.

JoeKar commented 7 months ago

It depends on how much of the stuff in your PR addresses issues caused specifically by slow slicing,

I switched from the "sliding window" slicing approach to identifying the regions of the same type in a line in one step. Nesting and the parts between the regions trigger further slicing.

[...] and not something else (e.g. infinite recursion cannot possibly be fixed by optimizing slicing; it's an entirely different issue, right?).

Yes.

[...] (which, it seems, got more complicated and non-obvious after your older PR #2840).

:(

This wasn't intended. Yes, it was and is far from perfect, but in the end at least it did the job better than before, because none of the previously introduced features really worked.

BTW it would be nice if you documented what exactly your change "highlighter: Rework of matching approach to speed up the processing" is doing.

Yep, this should be added to give more details about the intention.

dmaluka commented 7 months ago

Ok, thanks for adding the explanation in https://github.com/zyedidia/micro/pull/3127/commits/ebc80cfe102ba1c4f23e9c2cbba834855e87e301. It is now a bit more clear to me (not much).

...So if the only actual problem with the old approach is repetitive slicing, then indeed it might become a non-issue if we switch to pre-decoded runes, since the slicing per se will be very cheap, not involving any "processing".

Or perhaps it will be still too slow even with cheap slicing (e.g. because of repetitive regex search in sliced strings)? Seems we need a proof-of-concept with pre-decoded runes to find it out...

And what about the infinite recursion? (Your explanation still doesn't mention it.) What exactly causes it with the old approach (i.e. why does the old approach not guarantee that the recursion depth is limited)? Is it just because it doesn't ensure that a sliced string passed to the next level of recursion is not the same as the original string? So if we add the missing checks and logic for that, it will fix the infinite recursion without the need to rework the whole approach?
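To make that idea concrete, a sketch of such a progress check (names invented here; this isn't the highlighter's actual code): requiring every recursive call to receive a strictly shorter slice bounds the depth by the length of the line.

```go
package main

import "fmt"

// nextSubRegion is a hypothetical stand-in for slicing out a nested
// region; here it just strips one byte from each end.
func nextSubRegion(b []byte) []byte {
	if len(b) < 2 {
		return nil
	}
	return b[1 : len(b)-1]
}

// highlightSlice recurses only when the slice got strictly shorter, so
// the recursion depth is bounded by the length of the line.
func highlightSlice(b []byte, depth int) {
	sub := nextSubRegion(b)
	if len(sub) == 0 || len(sub) >= len(b) {
		return // no strict progress: stop instead of recursing forever
	}
	fmt.Printf("depth %d: %q\n", depth, sub)
	highlightSlice(sub, depth+1)
}

func main() {
	highlightSlice([]byte(`("nested")`), 0)
}
```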

dustdfg commented 7 months ago

About the efficient line storing: I just found https://en.wikipedia.org/wiki/Rope_(data_structure). I know it won't be implemented, and that it would be inefficient for highlighting etc., but I just wanted to share it.

X-Ryl669 commented 7 months ago

The use case of a 1MB single line is very artificial. Changing the whole algorithm for this use case doesn't look right to me.

In early SciTE, they used a design that's smart in a way. Their basic observation is that most files mainly use a subset of ASCII plus a few "high range" Unicode chars. So they store 3 arrays per line: the byte array for the line, an index array mapping the multi-byte runes into the last array, and the last array containing the decoded runes. They called this "spans" IIRC.

A bit like this:

string = "Le sac de blé est à 35€ le kg"(utf8) 
          = [4c 65 20 73 61 63 20 64 65 20 62 6c c3 a9 20 65 73 74 20 c3 a0 20 33 35 e2 82 ac 20 6c 65 20 6b 67]
#             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
index_array = [ 12, 19, 24 ]
rune_array = ['é', 'à', '€']
rune_size = [2, 2, 3]

string_len_in_char = string.len - sizes(0, string.len)
rune_pos_at_i = i - sizes(0, i)
...etc...

def sizes(start, end): 
{
   // Find index ends in index array (via binary search)
   int index_start = bin_search(index_array stop where index_array[i] > start)
   int index_end = bin_search(index_array stop where index_array[i] <= end)
   // Compute the sum of all runes size in the encoded string
   return sum(rune_size[index_start:index_end])
}

That way, it's possible to know the length of the line without having to decode the runes each time (and if the extra-byte counts are kept as a running sum, the total is just the last entry of the index). It doesn't change the whole interface; it just speeds up the operations where runes are needed from O(N), with N being the line's length in bytes, to O(log(M)), where M is the number of extended runes in the line (those where 1 char != 1 byte).

The storage is, at worst, N*12 + encoded_string_len_in_bytes. On average, however, the overhead will be negligible (in the order of M*12 + encoded_string_len_in_bytes, with one entry per actual extended rune in the line).
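A compact Go rendering of that span idea (my sketch of the description above, not SciTE's actual code): index only the multi-byte runes and answer length queries with binary searches.

```go
package main

import (
	"fmt"
	"sort"
	"unicode/utf8"
)

// spanIndex records only the multi-byte runes of a line; pure-ASCII
// content needs no entries at all.
type spanIndex struct {
	offsets []int // byte offset of each multi-byte rune, ascending
	sizes   []int // its encoded size in bytes (>= 2)
}

func buildIndex(b []byte) spanIndex {
	var x spanIndex
	for i := 0; i < len(b); {
		_, sz := utf8.DecodeRune(b[i:])
		if sz > 1 {
			x.offsets = append(x.offsets, i)
			x.sizes = append(x.sizes, sz)
		}
		i += sz
	}
	return x
}

// charCount converts a byte range into a character count: bytes in the
// range minus the continuation bytes of the indexed runes. The binary
// searches are O(log M); a prefix sum over sizes would make the whole
// query O(log M) as well.
func (x spanIndex) charCount(start, end int) int {
	lo := sort.SearchInts(x.offsets, start)
	hi := sort.SearchInts(x.offsets, end)
	extra := 0
	for i := lo; i < hi; i++ {
		extra += x.sizes[i] - 1
	}
	return (end - start) - extra
}

func main() {
	s := []byte("Le sac de blé est à 35€ le kg")
	x := buildIndex(s)
	fmt.Println(len(s), x.charCount(0, len(s))) // 33 bytes, 29 characters
}
```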

dmaluka commented 7 months ago

@dustdfg:

About the efficient line storing: I just found https://en.wikipedia.org/wiki/Rope_(data_structure). I know it won't be implemented, and that it would be inefficient for highlighting etc., but I just wanted to share it.

I've just found that micro actually used to use the rope data structure in its very early days, but soon switched to the currently used LineArray structure (a simple byte array): https://github.com/zyedidia/micro/commit/72f5808025af2d3d24bf1ccca899f7d291cb1dda

I'm wondering what was the exact motivation for that (considering that AFAICS that initial LineArray implementation already had this expensive O(n) random access, due to decoding characters every time). @zyedidia could you shed any light on this?

dmaluka commented 7 months ago

@X-Ryl669:

The use case of a 1MB single line is very artificial. Changing the whole algorithms for this use case doesn't look right to me.

Yeah. So the advantages of the rope structure (fast insertion/removal in large strings etc) seem to be dubious.

In early SciTE, they used a design that's smart in a way. Their basic observation is that most files mainly use a subset of ASCII plus a few "high range" Unicode chars. So they store 3 arrays per line: the byte array for the line, an index array mapping the multi-byte runes into the last array, and the last array containing the decoded runes. ...

Very interesting, thanks. Perhaps something like that is what we want as the final solution for micro, to have both fast random access and low memory overhead.

I can already suggest a generalization of this algorithm: instead of storing indices of all non-ASCII runes, store indices of all runes whose width differs from the width of the previous rune (a sort of run-length encoding). For example, in Cyrillic UTF-8 texts the majority of characters are 2-byte characters (letters) and only a minority are ASCII characters (spaces, punctuation, digits etc).
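A rough sketch of what such a run-length index could look like (my illustration, not an existing micro structure):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizeRun marks a byte offset where the encoded rune width changes; the
// run extends until the next entry's offset.
type sizeRun struct {
	startByte int // where this run begins
	size      int // bytes per rune throughout the run
}

func buildRuns(b []byte) []sizeRun {
	var runs []sizeRun
	prev := 0
	for i := 0; i < len(b); {
		_, sz := utf8.DecodeRune(b[i:])
		if sz != prev {
			runs = append(runs, sizeRun{startByte: i, size: sz})
			prev = sz
		}
		i += sz
	}
	return runs
}

func main() {
	// Mostly 2-byte Cyrillic with a few ASCII separators: only 3 runs.
	fmt.Printf("%v\n", buildRuns([]byte("привет, мир")))
}
```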

dustdfg commented 7 months ago

I can already suggest a generalization of this algorithm: instead of storing indices of all non-ASCII runes, store indices of all runes whose width differs from the width of the previous rune (a sort of run-length encoding). For example, in Cyrillic UTF-8 texts the majority of characters are 2-byte characters (letters) and only a minority are ASCII characters (spaces, punctuation, digits etc).

Very vaguely reminds me of LZ77

dmaluka commented 7 months ago

Yeah. So the advantages of the rope structure (fast insertion/removal in large strings etc) seem to be dubious.

Well... we may actually want to implement both a rune-array-based and a rope-based implementation (encapsulating them under a common interface, to switch between them easily), to measure and compare the performance and memory cost of both. If it turns out that a rope-based implementation allows working efficiently even with very long lines, and at the same time is not noticeably slower (or more battery-hungry) than a rune array even in scenarios with very frequent random accesses (e.g. in syntax highlighting with heavy nesting), then why not?
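A sketch of what that common interface might look like (all names invented here):

```go
package buffer

// LineContent is a hypothetical abstraction that both a rune-array and a
// rope implementation could satisfy, so the two could be benchmarked
// against each other behind the same API.
type LineContent interface {
	Len() int                  // number of characters in the line
	RuneAt(pos int) rune       // random access by character index
	Insert(pos int, rs []rune) // edits coming from the buffer
	Delete(pos, n int)         // remove n characters starting at pos
	Bytes() []byte             // raw encoding, e.g. for LineBytes() compatibility
}
```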

X-Ryl669 commented 7 months ago

If you RLE the index array, you'll certainly improve the memory footprint, but the index searching algorithm becomes O(N) again (you can't binary search the index array, because you need to decode the previous run lengths).

In real texts it's very likely that characters from the same Unicode range come together, so the RLE is going to be very efficient, and O(N) won't matter much, since N here is the number of RLE items, not the number of runes. So it's probably about as efficient as an O(log N) algorithm anyway.

Very vaguely reminds me of LZ77

I don't think so. LZ77 uses a sliding window where it codes a new rune span with a new index into an ever-growing dictionary if it hasn't been seen yet, or with the previous index of the same rune span if it was found before. I'm not sure it would be useful here to speed up processing the lengths of the runes; it might be useful to avoid the huge memory impact of storing UTF-32 for each rune, but the runtime cost would be huge in that case.

dmaluka commented 7 months ago

If you RLE the index array, you'll certainly improve the memory footprint, but the index searching algorithm becomes O(N) again (you can't binary search the index array, because you need to decode the previous run lengths).

Good point.

In real texts it's very likely that characters from the same Unicode range come together, so the RLE is going to be very efficient, and O(N) won't matter much, since N here is the number of RLE items, not the number of runes. So it's probably about as efficient as an O(log N) algorithm anyway.

Yep, O(N) is gonna be compensated by a much smaller N, except in pathological cases like 1€, 2€, 3€, 4€, 5€, 6€, 7€, 8€, ....

JoeKar commented 7 months ago

It is now a bit more clear to me (not much).

Too bad it's not yet convincing enough. :(

...So if the only actual problem with the old approach is repetitive slicing, then indeed it might become a non-issue if we switch to pre-decoded runes, since the slicing per se will be very cheap, not involving any "processing". Or perhaps it will be still too slow even with cheap slicing (e.g. because of repetitive regex search in sliced strings)?

At least that's one of the reasons why it's already a bit faster now, and would it hurt to combine both optimizations? There is some arguable benefit.

And what about the infinite recursion? (Your explanation still doesn't mention it.) What exactly causes it with the old approach (i.e. why does the old approach not guarantee that the recursion depth is limited)? Is it just because it doesn't ensure that a sliced string passed to the next level of recursion is not the same as the original string? So if we add the missing checks and logic for that, it will fix the infinite recursion without the need to rework the whole approach?

To be honest, I don't know. I saw in my logs that the old approach started parsing the same stuff over and over again, and I didn't waste more time diving any deeper. I decided to start this new approach instead, and it allowed highlighting stuff which couldn't be highlighted before, even with the expensive parsing.

Regarding the overall discussion about optimizing line access: does it still belong in this PR? A possible highlighter optimization is independent of it. I thought we had already agreed that these things should be handled separately, and thus we need to create a new issue to track all the ideas, pros and cons.

X-Ryl669 commented 7 months ago

I thought we had already agreed that these things should be handled separately, and thus we need to create a new issue to track all the ideas, pros and cons.

You're right. Sorry for hijacking your PR with unrelated ideas.

JoeKar commented 7 months ago

You're right. Sorry for hijacking your PR with unrelated ideas.

No problem. The whole thing started here a lot of comments ago. #3149 is now ready to continue the discussion. Maybe a lot more nice ideas and hints will join.

dmaluka commented 7 months ago

Yeah... back to the discussion of highlighting and old vs new approach.

Some of my suggestions above were rather useless, since I didn't really understand the problem your PR is addressing. Now I've realized that the problem is not poor performance but horrible performance.

Namely, for some reason the highlighting time grows exponentially with the number of regions in the same line which include nested regions which have start and stop patterns.

For example, highlighting of this trivial XML file:

<a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/><a x=""/>

takes 7 seconds on my machine. If I add one more <a x=""/> to it (in the same line), it takes 20 seconds. If I add another one, it takes 60 seconds, and so on.

This cannot be fixed by speeding up line access via a rune array. For example, my profiling of micro with tileset_env_test (endless parsing) shows that regardless of the test's duration, 40-45% of the time is spent inside the Go regex engine, and another 40% in micro's rune decoding. So by eliminating decoding during line accesses, we may make highlighting maybe twice as fast, which is good in itself, but in this case it doesn't matter: it will still take minutes or hours to highlight a small XML file.

Now, I don't know if this exponential behavior can be fixed in the existing highlighter implementation as just another bug, or rather it is inherent to this implementation and cannot be fixed without completely reworking it (so you are doing the right thing).

P.S. All this has nothing to do with endless recursion. With an endless recursion it wouldn't parse infinitely, it would quickly die with a stack overflow. You mentioned endless recursion so I was thinking that we still have cases when we die with a stack overflow, but AFAICS the only known case was https://github.com/zyedidia/micro/issues/2841 and it was fixed by your https://github.com/zyedidia/micro/pull/2840 (right?).

X-Ryl669 commented 7 months ago

Just a friendly reminder: you can't parse (HT/X)ML with regex. There is no way you can fix up the regex result to get a correct result for any (valid or invalid) XML document. In those markup languages you can have "this is not XML but it looks like it is" content inside any tag or attribute (think CDATA, or CDATA embedded in SVG in HTML), so you need a way to memorize the marker stacks to resolve the ambiguity, and that is impossible with a regex matcher.

So maybe having a "stop matching" limit in the process would work better here, falling back to a simple symbol matcher or, better, using a dedicated language server to provide the highlighter boundaries.

JoeKar commented 7 months ago

I read that Stack Overflow comment and enjoyed it, since it has a lot of drama. :wink:

At least it's right that this regex matcher will never cover every scenario in a simple and generic way. From my perspective the boundary isn't really reached by those markup languages alone, but by regions with more or less unlimited nesting. Until now it is definitely a best guess rather than accurate, which we will only get with LSP: leaving the task of parsing and tree-building to the dedicated software component responsible for just one language, instead of building our own approach trying to reach similarity for all languages we can imagine. So yes, LSP is most probably the most accurate approach, but integrating it into the highlighting steps is one of those longer roads already discussed in the past, but never started nor finished. I assume that the regex approach will remain as some kind of fallback (there was already such a discussion) in case there is no LSP available for a particular language, and then it should do the job as well as possible. This PR is here to break through the current limitation and at least give it a bit more shape than before... even if it's still not perfect.

dmaluka commented 7 months ago

I believe that's a completely different issue anyway. Our issue is not about incorrect highlighting. Correct or not, highlighting shouldn't take O(2^n) time to complete.

And I don't know why you guys are talking about the necessity of LSP for correct syntax highlighting (can't an editor do grammar-based parsing on its own?), but, whatever.

JoeKar commented 7 months ago

Unfortunately the SVG highlighting is still not perfect: [screenshot] Look at the / at the end of <path d="M 0 0 L 10 5 L 0 10 z"/>, which is highlighted as identifier instead of symbol.tag.

Ok, this can be considered a bug in xml.yaml#L20, which should be converted to end: "(\\?|\\/)?>":

[screenshots]

An empty-element tag is allowed and preferred in XML too: https://www.w3.org/TR/REC-xml/#sec-starttags [44]

And I don't know why you guys are talking about the necessity of LSP for correct syntax highlighting (can't an editor do grammar-based parsing on its own?), but, whatever.

Yes, it can. Sorry, that was just another sub-discussion not really fitting here. As you can see above, the "correctness" depends on the quality of our self-maintained grammar/syntax definition.

But as so often, you're right, and we return to the discussion of how much room there is for further improving the new approach. I used your example <a x=""/> as well; it becomes slower at around 200 repetitions on my machine. Profiling that proved that I run into the DecodeRune scenario: [screenshot]

dmaluka commented 7 months ago

Ok, this can be considered a bug in xml.yaml#L20, which should be converted to end: "(\\?|\\/)?>":

Wouldn't that just hide a problem with the highlighter? As I understand it, the problem is that / is considered to be inside an identifier region which starts with a ` ` (space) which is inside "M 0 0 L 10 5 L 0 10 z" (a constant.string region). So the problem is that this identifier region is detected as a region in the first place? These are sibling regions, not nested, so they are scanned in the same iteration, right? So it seems like what is missing in your PR is checking whether the found matches for regions overlap, and if they do, taking just the leftmost match and discarding the overlapping matches to the right of it?