Open bbannier opened 2 years ago
A user reported that they see the search during recovery take a considerable amount of time (80-90% of total) when there are a lot of gaps (here: high packet loss in Zeek). They synchronize on a regex which is not complicated and matches fixed-width input, and it seems like most of the time is spent restarting regex searches at each new byte of the input.
I am moving this issue into the TODO state since it seems they could benefit from this (e.g., searching for a fixed 10-character pattern with Boyer-Moore could slash the time spent in recovery from 80-90% to <10%).
Dropping this from TODO again since the ultimate issue is high packet loss, and even a more optimized implementation could eventually become too slow.
A problem here is that while packet loss may be the initial trigger for this issue, the issue then causes high CPU usage, which causes more packet loss, and I don't think the process ever fully recovers.
What do you think about this idea? Basically, provide the tools for the analyzer to disable itself when things get crazy?
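For illustration, the self-disable idea could look something like the following watchdog. This is purely a hypothetical sketch (Spicy has no such API; all names here are made up), assuming the runtime could report how much processing time an analyzer spends in recovery:

```python
import time

class RecoveryWatchdog:
    """Hypothetical helper: disable an analyzer once resynchronization
    dominates its processing time within a measurement window."""

    def __init__(self, max_recovery_fraction=0.5, window=10.0):
        self.max_recovery_fraction = max_recovery_fraction
        self.window = window               # seconds per measurement window
        self.window_start = time.monotonic()
        self.busy = 0.0                    # total processing time in window
        self.recovering = 0.0              # time spent in recovery in window
        self.disabled = False

    def record(self, elapsed, in_recovery):
        """Account `elapsed` seconds of work; `in_recovery` marks resync time."""
        self.busy += elapsed
        if in_recovery:
            self.recovering += elapsed
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # Decide at window boundaries: if most time went into recovery,
            # the analyzer is likely thrashing -- disable it.
            if self.busy > 0 and self.recovering / self.busy > self.max_recovery_fraction:
                self.disabled = True
            self.window_start, self.busy, self.recovering = now, 0.0, 0.0
```

The point is the feedback-loop breaker: once recovery dominates, the analyzer stops consuming CPU instead of causing further packet loss.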
If I'm understanding things correctly, when Spicy is looking for a regex like "SMB" to resynchronize, it doesn't search for the regex inside the buffered data, but instead tries to match it anew at each possible offset, turning what could be a single linear scan into a quadratic operation.
Is what it is doing equivalent to these two functions in Python?
```python
import random
import re

fn = "smb_fast_10000.pcap"
with open(fn, 'rb') as f:
    data = f.read()

PKT_SIZE = 1500

def resync_match(pkt):
    for offset in range(0, PKT_SIZE):
        if re.match(b"SMB", pkt[offset:]):
            return offset

def resync_search(pkt):
    result = re.search(b"SMB", pkt)
    if result:
        return result.start()

def go(func):
    max_len = len(data) - PKT_SIZE
    offset = random.randint(0, max_len)
    func(data[offset:offset+PKT_SIZE])

def test_match():
    go(resync_match)

def test_search():
    go(resync_search)
```
That program reads in the pcap (it doesn't bother decoding it), picks a 1500-byte slice at random, then tries to find SMB in it using the two different search methods.
The performance difference between the two is huge:
```
❯ python -m timeit -u msec -n 1000000 -s 'import test_smb_sync' 'test_smb_sync.test_search()'
1000000 loops, best of 5: 0.00182 msec per loop
❯ python -m timeit -u msec -n 1000 -s 'import test_smb_sync' 'test_smb_sync.test_match()'
1000 loops, best of 5: 0.603 msec per loop
```
Does this match what spicy is doing?
> Does this match what spicy is doing?
In general: not really.
Typically Spicy parsers deal with data which arrives in chunks. For your example of searching for `SMB`, the engine might see a chunk `000SMB000` (which is what your demo models), but also chunks `000S`, `MB000`, ... (i.e., the data to match spread across multiple chunks). Spicy's regex engine is able to deal with data arriving piecemeal: on seeing `000S` it can communicate that it might have a match, but needs additional data to tell. This is important so that generated parsers can discard non-matching chunks (otherwise we'd need to buffer unbounded amounts of data like your fast example), but also keep enough chunks around to be able to ultimately process all required data.
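To make the piecemeal matching concrete, here is a toy streaming matcher for a fixed literal. This is only a sketch of the concept, not Spicy's actual engine (which handles full regular expressions, not just literals): it retains only the suffix that could still start a match, so chunks that cannot contribute to a match are discarded immediately.

```python
class StreamingLiteralMatcher:
    """Toy incremental matcher for a fixed byte literal.

    feed() returns the absolute offset of the first match, or None if no
    full match is possible yet with the data seen so far.  Only the last
    len(pattern)-1 bytes need to be retained across chunks."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.tail = b""        # carry-over that may start a partial match
        self.consumed = 0      # bytes already discarded before self.tail

    def feed(self, chunk):
        data = self.tail + chunk
        pos = data.find(self.pattern)
        if pos >= 0:
            return self.consumed + pos
        # No full match: keep only the suffix that could still begin one.
        keep = min(len(data), len(self.pattern) - 1)
        self.consumed += len(data) - keep
        self.tail = data[-keep:] if keep else b""
        return None
```

Feeding the chunks `b"000S"` and then `b"MB000"` reports "no match yet" on the first chunk and the match at absolute offset 3 on the second, mirroring the "might have a match, need more data" behavior described above.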
In the current implementation of literal search mode we simply advance the input by one byte until the given production can be parsed, e.g., here. While this works, it very likely is not the most efficient search algorithm. When searching for literals we could, e.g., use a Boyer-Moore search to find matches; similar improvements might be possible for regexp productions, but those should probably be implemented in https://github.com/rsmmr/justrx/.
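For reference, the Boyer-Moore family avoids the byte-at-a-time advance by precomputing how far the search may jump after a mismatch. A minimal sketch of the simplified Horspool variant (a generic illustration, not the implementation proposed for Spicy):

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool search: jump ahead by precomputed shifts
    instead of advancing one byte at a time.  Returns the offset of the
    first match, or -1 if there is none."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Shift table: how far we may jump when the haystack byte aligned
    # with the last needle position is `b` (bytes not in the needle
    # allow a full jump of m).
    shift = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        i += shift.get(haystack[i + m - 1], m)
    return -1
```

On inputs like the resync case above (a short literal in mostly unrelated bytes), the search skips up to `len(needle)` bytes per step instead of one.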