zeek / paraglob

A fairly quick data structure for matching a string against a large list of patterns.

Add Support For Zeek-Style Patterns #5

Open 0xekez opened 5 years ago

0xekez commented 5 years ago

Currently, paraglob supports glob-style patterns. Zeek's scripting layer uses a different pattern type, which has the same syntax as flex regular expressions. This pattern matching is implemented in src/RE.h/cc inside the zeek/zeek repo. Adding support for these patterns to paraglob could make it more useful for people using Zeek.

I think the current meta-word extraction approach should still work fine with some slightly more complicated parsing. Then it's just a matter of determining in the constructor what sort of patterns the paraglob contains, and matching with the appropriate method during get operations.
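To make the meta-word idea concrete for glob patterns, here is a minimal Python sketch. `ToyParaglob`, `meta_words`, and the regex used to split on glob metacharacters are hypothetical names invented for illustration, not paraglob's API, and the plain substring checks stand in for whatever multi-string search the real library uses. It shows the two phases: cheaply discard patterns whose fixed substrings are absent from the input, then run a full glob match only on the survivors.

```python
import re
from fnmatch import fnmatchcase

# Glob metacharacters: '*', '?', and bracket expressions such as [a-z].
# (Simplified: exotic bracket forms like '[]]' are not handled.)
_META = re.compile(r"\[[^\]]*\]|[*?]")


def meta_words(pattern):
    """Return the fixed substrings that every match of `pattern` must contain."""
    return [word for word in _META.split(pattern) if word]


class ToyParaglob:
    """Hypothetical stand-in: prefilter on meta-words, then verify the glob."""

    def __init__(self, patterns):
        # Extract the meta-words once, at construction time.
        self.entries = [(p, meta_words(p)) for p in patterns]

    def get(self, text):
        hits = []
        for pattern, words in self.entries:
            # Phase 1: all meta-words must occur somewhere in the text.
            # Phase 2: confirm with a real glob match.
            if all(w in text for w in words) and fnmatchcase(text, pattern):
                hits.append(pattern)
        return hits
```

For example, `ToyParaglob(["*.example.com", "mail.*", "*evil*[0-9]"]).get("mail.example.com")` never runs the glob matcher for the third pattern, because its only meta-word, `evil`, does not occur in the input. Supporting flex-style patterns would need an equivalent of `meta_words` for regular expressions.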

It would also be interesting to consider when combining regex-style patterns with a | might increase performance.

rsmmr commented 5 years ago

Actually, it was a deliberate decision not to use regular expressions: the underlying DFAs can get prohibitively large when many regexps are combined through '|'.
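A small Python sketch of that blow-up, under stated assumptions: the pattern family `.*x_i.*y_i.*` (with a distinct pair of symbols per pattern) and the helper names are made up for illustration and have nothing to do with Zeek's RE code. It builds the NFA for the union of k such patterns and counts the DFA states reachable through the standard subset construction, without minimization.

```python
from collections import deque


def nfa_step(state, symbol):
    """NFA transitions for the union of patterns .*x_i.*y_i.* (illustrative)."""
    kind, i = state
    targets = {state}  # every state loops on every symbol
    if kind == "S" and symbol[0] == "x":
        targets.add(("P", symbol[1]))  # saw x_i: pattern i is in progress
    elif kind == "P" and symbol == ("y", i):
        targets.add(("A", i))          # saw y_i after x_i: pattern i matched
    return targets


def count_dfa_states(k):
    """Count the subsets reachable by subset construction for k patterns."""
    alphabet = [("x", i) for i in range(k)] + [("y", i) for i in range(k)]
    start = frozenset({("S", -1)})
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for symbol in alphabet:
            nxt = frozenset(t for s in subset for t in nfa_step(s, symbol))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, count_dfa_states(k))
```

Each pattern on its own needs three states (nothing seen, `x_i` seen, matched), but the combined automaton has to track every pattern's progress independently, so the count grows as 3^k: 3, 9, 27, and so on up to 729 for six patterns.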

For individual patterns, I'm not sure paraglob could directly support regexps, as in general I don't think there's much of a way to tokenize them into fixed strings.

That all said, it was recently pointed out that hyperscan (https://www.hyperscan.io) can supposedly match large numbers of regexps in parallel. It would be interesting to understand whether it can support our use case and, if so, how it's implemented so that it doesn't run into the DFA state explosion.

(Sent by email in reply to Zeke Medley's message of Wed, May 29, 2019 at 12:09 -0700; thread: https://github.com/zeek/paraglob/issues/5)

-- Robin Sommer Corelight, Inc. robin@corelight.com * www.corelight.com