zeek / spicy

C++ parser generator for dissecting protocols & files.
https://docs.zeek.org/projects/spicy
Other
243 stars 37 forks source link

Add tooling or guidance on how a Spicy parser with error recovery should deal with huge packet loss #1768

Closed bbannier closed 2 months ago

bbannier commented 2 months ago

Spicy parser can provide error recovery so that they can resynchronize after parse errors (probably most common: after parse errors triggered by gaps due to packet loss). During error recovery the Spicy parser will enter a search mode where it searches the input stream for a pattern to synchronize on.

This process works, but is not free since the search might be (much) more expensive than parsing, especially if synchronization fails repeatedly due to more gaps before a match is found, and users reported that parsers spend most of their CPU time in recovery if there is high packet loss. We currently have the following ways to deal with that:

If packet loss is high enough these will ultimately still fail (similar argument why even a more optimized search impl with zeek/spicy#1133 would at a certain point also be insufficient).

We should provide or at least document an approach for parsers to opt-out of further parsing of a connection if packet loss is too high. An automated approach could look at some packet loss metric (which one?) and then deactivate the parser if it exceeds a threshold (likely: configurable since I'd expect the trade-off between visibility and throughput to be site-specific). Alternatively we could also document an approach on how to implement such a mitigation (e.g., inspect stats counters and manually perform the deactivation). It could be that either of this requires additional Spicy-side hooks or metrics.

rsmmr commented 2 months ago

Here's an idea: we could add statistics to the current input stream tracking number of data chunks vs gaps. Then, during synchronization, an analyzer could retrieve those stats and decide if there's just too many gaps for it to reasonably continue and disable itself by calling zeek::skip_input(). To retrieve the stats(), we'd introduce some built-in function. As the decision point where to check the stats, one could use one of the existing synchronization hooks. And/or, I think it would be reasonable to extend the existing on_gap hook for sinks to trigger for normal input stream gaps as well. @Mohan-Dhawan, what do you think?

Mohan-Dhawan commented 2 months ago

@rsmmr Use of stats to guide synchronization would be perfect. Just one query, once disabled the analyzer cannot be re-enabled for that input, right?

rsmmr commented 2 months ago

Just one query, once disabled the analyzer cannot be re-enabled for that input, right?

That's correct (which seems reasonable to me, how would you make that decision to reenable?)

rsmmr commented 2 months ago

@Mohan-Dhawan Take a look at https://github.com/zeek/spicy/pull/1767, how does that look?

Mohan-Dhawan commented 2 months ago

Thanks @rsmmr. The stats information is great. Is stats collection expensive? If not, can the stats collection continue even when the analyzer is disabled? If yes, would it be possible to re-enable the analyzer if stats. num_data_bytes/stats.num_gap_bytes goes above a specified threshold.

Mohan-Dhawan commented 2 months ago

Also, is it possible to have a window of time (like the last 10 minutes) where the stats information is collected or is it always over the stream's lifetime?

rsmmr commented 2 months ago

Thanks @rsmmr. The stats information is great. Is stats collection expensive?

No, I don't think so because it's just increasing a couple of counters each time a new data chunk comes in.

If not, can the stats collection continue even when the analyzer is disabled?

No, because Spicy won't see the data anymore at that point, so it also cannot count further.

If yes, would it be possible to re-enable the analyzer if stats. num_data_bytes/stats.num_gap_bytes goes above a specified threshold.

Per above, Spicy is out of the picture once the analyzer is disabled. Maybe something could be implemented Zeek-side to reenable the analyzer somehow (which would need to start out in sync mode), but honestly, not sure that is worth the trouble. Do you really have long-lived, high-volume connections that have large number of gaps only for small periods of time?

rsmmr commented 2 months ago

Also, is it possible to have a window of time (like the last 10 minutes) where the stats information is collected or is it always over the stream's lifetime?

Right now it's always the lifetime. I'd prefer to keep this simple, both for usability and performance. A sliding window would need some kind of configuration mechanism as well. But I think you could implement a sliding window yourself with the data that statistics() gives you if that's important.

Mohan-Dhawan commented 2 months ago

Is it possible to skip the bytes in the stream instead of disabling the analyzer, and start analyzing again when the stats normalize?

rsmmr commented 2 months ago

Is it possible to skip the bytes in the stream instead of disabling the analyzer, and start analyzing again when the stats normalize?

Once we skip the bytes, the parser won't be processing anything, so no hooks will be running. How/where would you want to make the decision to start analyzing again?

Mohan-Dhawan commented 2 months ago

Is it possible to run the check say after skipping every 1MB (or any user-specificed value) before checking the stats and making the decision?

rsmmr commented 2 months ago

I think we have different interpretations of "skipping the bytes". That said, if I understand you right, I think what we could do is give Spicy a feature to not further process the next X bytes (i.e., consume them but discard immediately), and then automatically go into synchronization mode afterwards. You could use that mechanism to start skipping from inside %on_sync_advance when you see too many gaps. You'd then skip over that 1MB, and you will be back in %on_sync_advance right after that amount of data has passed through. And then you could check again to see if you want to discard another 1MB. And so on.

Honestly, though, I don't think this would all buy you much in the end, apart from more complexity. I'd just skip the rest of the connection on the Zeek side, saving cycles, and move on. In the end, at >90% gaps, your problems are elsewhere.

Mohan-Dhawan commented 2 months ago

I see the point. Just that in situations with 90% loss one has to explicitly turn off the analyzer. With the hooks you suggested, it might be easy to have spicy allowing us that freedom at runtime.