zeek / spicy

C++ parser generator for dissecting protocols & files.
https://docs.zeek.org/projects/spicy

Consider taking advantage of underlying data type during `LiteralMode::Search` #1133

Open bbannier opened 2 years ago

bbannier commented 2 years ago

In the current implementation of literal search mode we simply advance the input by one byte until the given production can be parsed, e.g., here.

While this works, it is very likely not the most efficient search algorithm. When searching for literals we could, e.g., use a Boyer-Moore search to find matches; similar improvements might be possible for regexp productions, but those should probably be implemented in https://github.com/rsmmr/justrx/.
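
For illustration, a minimal Python sketch of the difference (this is not Spicy's code; the function names and inputs are made up). naive_search models the current advance-by-one loop, while horspool_search uses the Boyer-Moore-Horspool variant, which skips ahead based on a precomputed table:

def naive_search(data: bytes, needle: bytes) -> int:
    # Current strategy: retry the match at every offset, advancing one
    # byte on failure -- O(len(data) * len(needle)) in the worst case.
    for i in range(len(data) - len(needle) + 1):
        if data[i:i + len(needle)] == needle:
            return i
    return -1

def horspool_search(data: bytes, needle: bytes) -> int:
    # Boyer-Moore-Horspool: on a mismatch, look at the input byte aligned
    # with the end of the needle and skip as far as that byte allows.
    m = len(needle)
    skip = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    i = 0
    while i + m <= len(data):
        if data[i:i + m] == needle:
            return i
        i += skip.get(data[i + m - 1], m)
    return -1

For a fixed literal like SMB, Horspool can skip up to len(needle) bytes per mismatch instead of always exactly one.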

bbannier commented 3 months ago

A user reported that they see the search during recovery take a considerable amount of time (80-90%) if there are a lot of gaps (here: high packet loss in Zeek). They synchronize on a regex which is not complicated and matches an input of fixed width, and it seems like most of the time is spent starting regex searches at each new byte of the input.

I am moving this issue into TODO state since it seems they could benefit from this (e.g., matching a fixed-width, 10-character regex with Boyer-Moore could cut the time spent in recovery from 80-90% to under 10%).

bbannier commented 3 months ago

Dropping this from TODO again since the underlying issue is high packet loss, and even a more optimized implementation could eventually become too slow.

JustinAzoff commented 3 months ago

A problem here is that while packet loss may be the initial trigger for this issue, the issue then causes high CPU usage, which causes more packet loss, and I don't think the process ever fully recovers.

rsmmr commented 3 months ago

> A problem here is that while packet loss may be the initial trigger for this issue, the issue then causes high CPU usage, which causes more packet loss, and I don't think the process ever fully recovers.

What do you think about this idea? Basically, provide the tools for the analyzer to disable itself when things get crazy?
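
One way to picture the idea, as a purely hypothetical sketch (the names and the threshold are invented; this is not an existing Spicy API): track how much input a parser consumes while resynchronizing, and let it disable itself once recovery dominates.

MAX_RECOVERY_FRACTION = 0.5  # invented threshold: give up once half the
                             # input is being spent in recovery

class RecoveryGuard:
    def __init__(self):
        self.total = 0
        self.in_recovery = 0

    def consumed(self, n: int, recovering: bool):
        # Called whenever the parser consumes n bytes of input.
        self.total += n
        if recovering:
            self.in_recovery += n

    def should_disable(self) -> bool:
        return (self.total > 0
                and self.in_recovery / self.total > MAX_RECOVERY_FRACTION)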

JustinAzoff commented 3 months ago

If I'm understanding things correctly, when Spicy is looking for a regex like "SMB" to resynchronize, it doesn't search for the regex inside the buffered data, but instead tries to match it at each possible offset, turning what could be a linear-time operation into a quadratic one.

Is what it is doing equivalent to these two functions in Python?

import random
import re

# Read the raw pcap bytes; we don't bother decoding the capture format.
fn = "smb_fast_10000.pcap"
with open(fn, 'rb') as f:
    data = f.read()

PKT_SIZE = 1500

def resync_match(pkt):
    # Suspected Spicy behavior: restart the regex match at every offset.
    for offset in range(0, PKT_SIZE):
        if re.match(b"SMB", pkt[offset:]):
            return offset

def resync_search(pkt):
    # Let the regex engine scan the whole buffer in a single pass.
    result = re.search(b"SMB", pkt)
    if result:
        return result.start()

def go(func):
    # Pick a random PKT_SIZE-byte slice of the capture and resync inside it.
    max_len = len(data) - PKT_SIZE
    offset = random.randint(0, max_len)
    func(data[offset:offset+PKT_SIZE])

def test_match():
    go(resync_match)

def test_search():
    go(resync_search)

That program reads in the pcap (it doesn't bother decoding it), picks a 1500-byte slice at random, then tries to find SMB in it using the two different search methods.

The performance difference between the two is huge:

❯ python -m timeit -u msec -n 1000000  -s 'import test_smb_sync' 'test_smb_sync.test_search()'
1000000 loops, best of 5: 0.00182 msec per loop
❯ python -m timeit -u msec -n 1000  -s 'import test_smb_sync' 'test_smb_sync.test_match()'   
1000 loops, best of 5: 0.603 msec per loop

Does this match what Spicy is doing?

bbannier commented 3 months ago

> Does this match what Spicy is doing?

In general: not really.

Typically Spicy parsers deal with data which arrives in chunks. For your example of searching for SMB, a parser might see a chunk 000SMB000 (which is what your model assumes), but also chunks 000S, MB000, ... (i.e., the data to match spread across multiple chunks). Spicy's regex engine is able to deal with data arriving piecemeal: on seeing 000S it can communicate that it might have a match, but needs additional data to tell. This is important so that generated parsers can discard non-matching chunks (otherwise we'd need to buffer unbounded amounts of data, which is what your fast resync_search variant implicitly relies on), but also keep enough chunks around to be able to ultimately process all required data.
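
To make the piecemeal behavior concrete, here is a toy model for a literal rather than a full regex (the class and its return values are invented for illustration; this is not Spicy's engine):

class StreamingLiteralMatcher:
    def __init__(self, needle: bytes):
        self.needle = needle
        self.carry = b""  # tail bytes that might still become a match

    def feed(self, chunk: bytes) -> str:
        data = self.carry + chunk
        if self.needle in data:
            return "match"
        # Keep only the longest suffix that is a proper prefix of the
        # needle; everything before it can safely be discarded.
        for k in range(min(len(self.needle) - 1, len(data)), 0, -1):
            if data[-k:] == self.needle[:k]:
                self.carry = data[-k:]
                return "need more data"
        self.carry = b""
        return "no match"

m = StreamingLiteralMatcher(b"SMB")
m.feed(b"000S")   # -> "need more data": only b"S" must be kept
m.feed(b"MB000")  # -> "match"

The three outcomes are what lets a generated parser discard non-matching chunks while holding on to just the bytes that might still complete a match.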