For example, given a list of patterns {*og, do*, ca*, plant} and an input string dog, paraglob will return {*og, do*}.
For any pattern, there exists a set of substrings that a string must contain in order to have any hope of matching that pattern. We call these meta-words. Here are some examples:
*og -> |og|
dog*fish -> |dog| |fish|
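The splitting shown above can be sketched in a few lines of Python. This is only an illustration, not Paraglob's own code: the function name meta_words is hypothetical, and it assumes that meta-words are the literal runs left over once the glob operators *, ? and bracket expressions such as [abc] are removed.

```python
import re

# Assumption: a meta-word is any literal run between glob operators.
# Operators treated as separators here: `*`, `?`, and `[...]` expressions.
_GLOB_OPERATORS = re.compile(r"\[[^\]]*\]|[*?]")


def meta_words(pattern):
    """Return the literal substrings a string must contain to match `pattern`."""
    # re.split leaves empty strings where operators sit at the edges or
    # next to each other, so those are filtered out.
    return [word for word in _GLOB_OPERATORS.split(pattern) if word]


print(meta_words("*og"))       # ['og']
print(meta_words("dog*fish"))  # ['dog', 'fish']
```

A pattern made only of operators, such as *, yields no meta-words under this rule, so a real implementation needs some separate handling for it.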
When a pattern is added to a Paraglob, the pattern is stored and split into
its meta-words. Those meta-words are then added to an Aho-Corasick data
structure provided by multifast-ac.
When Paraglob is given a query, it first gets the meta-words contained in the
query using multifast-ac. Then, it builds a set of all patterns associated with
those meta-words and runs fnmatch on the query and those patterns. It finally
returns a vector of all the patterns that match.
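The two-stage lookup described above can be sketched in Python as follows. This is a rough model under stated assumptions, not the actual implementation: the class name ParaglobSketch is hypothetical, a plain substring test stands in for the Aho-Corasick scan that multifast-ac performs, Python's fnmatch.fnmatchcase stands in for the C fnmatch call, and treating patterns without meta-words as always being candidates is an assumption made here so that a pattern like * can still match.

```python
import re
from fnmatch import fnmatchcase

# Assumption: meta-words are the literal runs between glob operators.
_GLOB_OPERATORS = re.compile(r"\[[^\]]*\]|[*?]")


class ParaglobSketch:
    """Toy model of the store-then-filter lookup (hypothetical class)."""

    def __init__(self, patterns):
        self.by_meta_word = {}      # meta-word -> set of patterns containing it
        self.no_meta_words = set()  # patterns such as "*" with no literal part
        for pattern in patterns:
            words = [w for w in _GLOB_OPERATORS.split(pattern) if w]
            if not words:
                self.no_meta_words.add(pattern)
            for word in words:
                self.by_meta_word.setdefault(word, set()).add(pattern)

    def get(self, query):
        # Stage 1: collect candidate patterns whose meta-words occur in the
        # query. The `in` test is a naive stand-in for an Aho-Corasick scan.
        candidates = set(self.no_meta_words)
        for word, patterns in self.by_meta_word.items():
            if word in query:
                candidates |= patterns
        # Stage 2: run the full glob match only on the candidates.
        return sorted(p for p in candidates if fnmatchcase(query, p))


pg = ParaglobSketch(["*og", "do*", "ca*", "plant"])
print(pg.get("dog"))  # ['*og', 'do*']
```

The point of the first stage is that the full glob match runs only on patterns sharing a meta-word with the query, so ca* and plant are never tested against dog.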
# ./configure && make && make test && make install
paraglob-test is a small benchmarking script that takes three parameters: the
number of patterns to generate, the number of queries to perform, and the
percentage of generated patterns that will match.
As an example, running paraglob-test 10000 50 50 will add 10,000 patterns,
perform 50 queries on them (of which 50% should match), and then return the
results.
Paraglob is integrated with Zeek and provides a simple API inside its
scripting language. In Zeek, paraglob is implemented as an OpaqueType, and its
syntax closely follows other similar constructs inside Zeek. A paraglob can
only be instantiated once from a vector of patterns and then only supports get
operations, which return a vector of all patterns matching an input string.
These patterns are different from the pattern type in Zeek in that they are
just strings. The syntax is as follows:
local v = vector("*", "d?g", "*og", "d?", "d[!wl]g");
local p = paraglob_init(v);
print paraglob_match(p, "dog");
out:
[*, *og, d?g, d[!wl]g]
Paraglob can make queries very quickly, but does not build instantly. Building takes about 1.5 seconds for 10,000 patterns and 3 seconds for 20,000, growing roughly linearly with the number of patterns. This is because of the time required to build the Aho-Corasick structure.