ReHighlightStates: sanity-check `startline` value

zyedidia / micro

A modern and intuitive terminal-based text editor

https://micro-editor.github.io

MIT License

24.4k stars 1.16k forks source link

ReHighlightStates: sanity-check `startline` value #3237

Closed dmaluka closed 2 months ago

dmaluka commented 2 months ago

Check if startline value is valid before passing it to input.State(), to prevent a theoretically possible race when the number of lines changes in the meantime, causing an out of bounds access.

Actually this race cannot happen: ReHighlightStates() is only called from the main goroutine, and the line array is modified, again, only by the main goroutine. So for now this change is rather cosmetic: it is just to make the highligher API implementation self-sufficiently safe without assumptions about which goroutines are using which API functions and how.

dmaluka commented 2 months ago

In my version we avoid unneeded locking if startline is 0 (i.e. we are at the first line of the buffer) and thus h.lastRegion is gonna be nil anyway.

JoeKar commented 2 months ago

Yes, it "affects" the performance and should weight more...and readability is a matter of perspective. So go for yours. :wink:

dmaluka commented 2 months ago

Hmm... BTW what about the access to h.lastRegion? It seemed to me that the coarser-grained locking implemented in #3224 prevents also races between the initial async highlighting and the rehighlighting MarkModified() in the main goroutine, since the way how it uses input.Lock() + input.Unlock() ensures that the initial highlighting of states or matches for the given line N will be done either fully before a modification of this line, or fully after it. So even such a tricky sequence:

async goroutine: HighlightStates() for line N
main goroutine: modification of line N
async goroutine: HighlightMatches() for line N - uses outdated highlight states obtained at step 1
main goroutine: ReHighlightStates() for line N
main goroutine: HighlightMatches() for line N

should be fine, since the incorrect highlighting done at step 3 should be "overwritten" by correct highlighting at steps 4 and 5.

But, since both these async highlighting and sync rehighlighting rely on the shared value of h.lastRegion and both modify it, this actually still seems racy (just for a different reason). And it's not just a matter of locking. Each iteration of the loop in (Re)HighlightStates() depends on the value of h.lastRegion set by the previous iteration, but this value could be actually set by the other goroutine. So the results of highlighting may be completely unpredictable, right?

I guess this could be easily fixed by changing lastRegion from being "global" (shared between both goroutines) to being local to a single run of (Re)HighlightStates() for the given range of the lines in the given goroutine.

...Actually it seems it would be even better if we made the very concept of "line state" local as well, i.e. if we combined (Re)HighlightStates() and HighlightMatches() into a single function Highlight() (with a simple clear semantics: just (re)highlight the given line range) and thereby made the concept of "line state" local to a single run of this function, not a part of the global highlighter state. (That would mean removing this tricky notion of "line state" from the highlighter's external API, making it purely an implementation detail. And we don't really need to store all line states, we only need to remember the state of the previous line in the loop (input.State(i-1)), right? With such a rework, lastRegion and State(i-1) would become the same thing.) Even if such a rework didn't prevent any extra races, it would at least make things much simpler and cleaner and easier to reason about.

dmaluka commented 2 months ago

if we combined (Re)HighlightStates() and HighlightMatches() into a single function Highlight() (with a simple clear semantics: just (re)highlight the given line range)

I've recalled you already did that in #3127 (for performance reasons). So, there seems to be more than one good reason to do that.

JoeKar commented 2 months ago

Each iteration of the loop in (Re)HighlightStates() depends on the value of h.lastRegion set by the previous iteration, but this value could be actually set by the other goroutine. So the results of highlighting may be completely unpredictable, right?

Yes, this should be improved...

I guess this could be easily fixed by changing lastRegion from being "global" (shared between both goroutines) to being local to a single run of (Re)HighlightStates() for the given range of the lines in the given goroutine.

...as suggested.

Even if such a rework didn't prevent any extra races, it would at least make things much simpler and cleaner and easier to reason about.

Definitely, it makes it less complicated and thus much easier to understand. So this is definitely worth spending the effort. Is it ok for you, when I append this to my rework (#3237), where I already merged HighlightStates() and HighlightMatches()? At least you linked it already (but couldn't find that "recall"/trigger there :wink:).

I don't think that we should spend that much more effort in the current base, since we already found out the limitations of the present approach.

Edit:

And we don't really need to store all line states, we only need to remember the state of the previous line in the loop (input.State(i-1)), right?

Hm, in case we need to re-highlight from line 3..5, we need this info from line 2. We've to start from 0 (even when we should do that beginning from 3) in the moment we don't store this globally, otherwise we've no chance to shorten that path by directly asking for the state before. Because we don't know, when this state/group has been started.

dmaluka commented 2 months ago

Hm, in case we need to re-highlight from line 3..5, we need this info from line 2. We've to start from 0 (even when we should do that beginning from 3) in the moment we don't store this globally, otherwise we've no chance to shorten that path by directly asking for the state before. Because we don't know, when this state/group has been started.

Hmm, right. Yeah, I was mistaken, we should keep storing those states globally.

But lastRegion we can (and should) make local, right? Otherwise it may just refer to completely different lines, if it is used by both goroutines simultaneously.

So for now we can just make lastRegion local, that will (hopefully) fix all remaining races, and then combining of HighlightStates() with HighlightMatches() can be done separately, for example as a part of your #3127, which it already is.

JoeKar commented 2 months ago

But lastRegion we can (and should) make local, right?

Yes, at least kind of. I already tried that in the local state of my rework, but it isn't that easy as it looks in the first place. Good news, It isn't touched outside the locked loop. I can prepare the PR for the current master.

Otherwise it may just refer to completely different lines, if it is used by both goroutines simultaneously.

Fully ACK.

[...] and then combining of HighlightStates() with HighlightMatches() can be done separately, for example as a part of your https://github.com/zyedidia/micro/pull/3127, which it already is.

Yep, this is already done...just ReHighlightStates() remains for now, till we've an better idea to do a proper merge/combination with Highlight().

dmaluka commented 2 months ago

just ReHighlightStates() remains for now, till we've an better idea to do a proper merge/combination with Highlight().

Hm, yeah, I've recalled that you added it back in https://github.com/zyedidia/micro/pull/3127/commits/22de1d3f30cd98b46ddf34f9a575979e075d0997 for some reason... I don't really understand for what reason, and what is "bottom up changes".

JoeKar commented 2 months ago

I don't really understand for what reason, and what is "bottom up changes".

Well, yes...hm...I can only assume, that my scenario was to remove the end of an region, which afterwards wasn't properly highlighted downwards. So the given reason isn't really correct, since ReHighlightStates() takes the start and scans down, till it detects the same state again or the end, which afterwards is used as boundary for the real highlighting. Due to this we can't simply use the start & end only (obtained by the modification), because it wouldn't consider everything necessary for re-highlighting.

From my perspective it's correct to have fixed boundaries for the real highlighting, but these slightly faster state checks need to scan er certain region probably behind the initial end. That's also the reason why I wouldn't merge the ReHighlightStates() into the highlighter function, since it would need additional function parameters to differentiate between the approaches. Unless you would still be in favor of it to move the downward state scanning into the main highlighting for that boundary depending re-highlighting.

PS: I will update the commit message of https://github.com/zyedidia/micro/commit/22de1d3f30cd98b46ddf34f9a575979e075d0997 to better reflect the reason.

dmaluka commented 2 months ago

Well yeah, that's what I was thinking of when I was thinking of a combined Highlight() function: it would highlight not just up to endline but beyond it, until it finds a line whose highlighting doesn't need to be updated.

From my perspective it's correct to have fixed boundaries for the real highlighting, but these slightly faster state checks need to scan er certain region probably behind the initial end. That's also the reason why I wouldn't merge the ReHighlightStates() into the highlighter function, since it would need additional function parameters to differentiate between the approaches.

Do we really need this differentiation? Can't we scan beyond endline in all cases?

We have just 2 cases anyway:

initial highlighting processes the entire buffer up to the end anyway
MarkModified() determines the exact range of lines that need to be rehighlighted (the exact required value of endline), based on the values returned by RehighlightStates(). With the new approach this exact range would be determined by Highlight() itself, not by MarkModified()

And even if there were more cases, highlighting up to the last lines that requires it, without leaving lines past endline with broken highlighting, seems to be the right thing to do always.

JoeKar commented 2 months ago

Well yeah, that's what I was thinking of when I was thinking of a combined Highlight() function: it would highlight not just up to endline but beyond it, until it finds a line whose highlighting doesn't need to be updated.

Reasonable perspective.

From my perspective it's correct to have fixed boundaries for the real highlighting, but these slightly faster state checks need to scan er certain region probably behind the initial end. That's also the reason why I wouldn't merge the ReHighlightStates() into the highlighter function, since it would need additional function parameters to differentiate between the approaches.

Do we really need this differentiation? [...]

Yes, I would say so, since the "scanning" still does the regex searches and especially while editing the highlighter should be as performant as possible. So we should stop in the same moment as we do when scanning down to identify the real boundary.

We have just 2 cases anyway:

* initial highlighting processes the entire buffer up to the end anyway
* `MarkModified()` determines the exact range of lines that need to be rehighlighted (the exact required value of `endline`), based on the values returned by `RehighlightStates()`. With the new approach this exact range would be determined by `Highlight()` itself, not by `MarkModified()`

I will consider it in the rework...so there is only one interface to be used for conventional highlighting any longer.

And even if there were more cases, highlighting up to the last lines that requires it, without leaving lines past endline with broken highlighting, seems to be the right thing to do always.

Yep, here you're right...we should prevent broken highlighting anyway, so there is no real usage for endline, except some implicit identification to to a full or limited highlighting. In the latter case a more explicit parameter is most probably preferred.

Edit:

In the latter case a more explicit parameter is most probably preferred.

I decided against it (see #3127 resp dfbfd0a6ac5c345b180f047a6556dea60e7f6957), since endline was already more than sufficient.