zimmski / tavor

A generic fuzzing and delta-debugging framework
MIT License

tavor is slow and uses a lot of memory with very recursive grammars #102

Open dkasak opened 7 years ago

dkasak commented 7 years ago

tavor's fuzzing functionality seems to slow down and use more memory very rapidly as grammars become more recursive. Informally speaking, the "more" recursive a grammar gets, the larger the memory usage. This is understandable to some degree, but I would mostly expect the factors involved to be constant, since generating a sentence of a grammar is essentially a random walk on a graph, and adding a recursive rule simply enlarges the graph with some extra edges. I haven't tried to quantify the effect yet, but it feels like there's exponential complexity somewhere; enough that adding a few recursive rules prevents generation from finishing before all memory is exhausted. This happens even if I use --max-repeat 1, and even in cases where the resulting sentence is relatively "shallow" (i.e. without taking many recursive steps when they are optional).
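To illustrate the kind of behaviour I'd naively expect, here is a rough sketch of sentence generation as a random walk where each expansion step does a constant amount of work. This is only a toy illustration with a made-up grammar encoding, not Tavor's actual code or data structures:

```go
// Toy illustration only: generating a sentence from a recursive grammar as
// a random walk whose per-step cost is constant, regardless of how many
// recursive rules the grammar has. The grammar encoding is invented for
// this example and has nothing to do with Tavor's internals.
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// A production is a sequence of symbols; symbols that appear as keys in the
// grammar are non-terminals, everything else is emitted literally.
type grammar map[string][][]string

func generate(g grammar, symbol string, out *strings.Builder, depth int) {
	rules, ok := g[symbol]
	if !ok {
		out.WriteString(symbol) // terminal
		return
	}
	// Crude depth limit so the walk terminates even if a recursive
	// alternative keeps getting picked; assumes the first alternative
	// of each rule is non-recursive.
	if depth > 8 {
		rules = rules[:1]
	}
	for _, sym := range rules[rand.Intn(len(rules))] {
		generate(g, sym, out, depth+1)
	}
}

func main() {
	g := grammar{
		"Expr": {{"Num"}, {"Expr", "+", "Expr"}}, // recursive rule
		"Num":  {{"1"}, {"2"}, {"3"}},
	}
	var out strings.Builder
	generate(g, "Expr", &out, 0)
	fmt.Println(out.String())
}
```

In a structure like this, adding recursive alternatives only adds edges to walk over, and memory stays roughly proportional to the generated sentence.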

I'm not very familiar with Go, so I can't easily tell whether this is to be expected from the way it is currently implemented; I thought I'd ask.

Just wanted to say, tavor is an awesome and extremely useful tool overall.

zimmski commented 7 years ago

Yes, you are right that Tavor is using too much memory (and I/O and CPU!). All algorithms are currently recursive and unoptimized. My intention was to write a "first version" with enough tests that it could be optimized later, but I never got around to refactoring that first version. It would be great if you could help out with this task.

The first step would be to profile which phase of Tavor needs to be refactored first (e.g. the parsing phase or the generation phase) and to check whether an optimization is possible with the current data structures. I think the biggest problem is that the generation phase (the fuzzing part) clones a lot of data and never throws it away. This could be improved immensely by reworking the data structures to store only the "configuration information" of a token plus a reference to the original token, instead of copying both all the time (that's the cloning part).

The second biggest problem, I think, is the unrolling of tokens (https://github.com/zimmski/tavor#unrolling), which is not a problem (as you mentioned) if you use --max-repeat 1, but if that number is greater than 1 the internal data structures get huge.
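To sketch what I mean by keeping only the configuration and a reference (the types below are invented for this example and are not Tavor's actual token types):

```go
// Rough sketch of the data-structure rework described above: instead of
// deep-cloning a token for every use, keep one immutable token definition
// and store per use only the small "configuration" (e.g. which alternative
// was chosen) together with a pointer back to the shared definition.
// The type names are made up for illustration.
package main

import "fmt"

// Token is the shared, immutable definition (here: an alternation of strings).
type Token struct {
	Name         string
	Alternatives []string
}

// TokenUse is what the generator would keep per occurrence: a reference to
// the shared definition plus the state that actually differs per use.
type TokenUse struct {
	Def    *Token // shared, never copied
	Choice int    // which alternative this particular use selected
}

func (u TokenUse) String() string {
	return u.Def.Alternatives[u.Choice]
}

func main() {
	def := &Token{Name: "Greeting", Alternatives: []string{"hello", "hi", "hey"}}

	// Many uses of the same token share one definition; each use costs only
	// a pointer and an int instead of a full clone of the token (sub)tree.
	uses := []TokenUse{{def, 0}, {def, 2}, {def, 1}}
	for _, u := range uses {
		fmt.Println(u)
	}
}
```

With a layout along those lines, higher --max-repeat values would mostly grow the number of small per-use records instead of duplicating whole token subtrees.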

(Sorry for the late answer, my open source backlog is crazy. And thank you for the kind words. I am happy that people are using the project. I would be interested to hear what you are doing with Tavor.)

dkasak commented 7 years ago

Thanks, these are very useful pointers on where to look when someone decides to tackle these problems. I'm currently a bit swamped myself and my Go knowledge is really lacking, but I'd potentially like to try working on this when I get some more time.

We've been experimenting with tavor as a generative fuzzer component for fuzzing various things as part of security work and bug bounties. For the cases where tavor fails to be useful because of slowness, I've been using QuickFuzz and its recently added BNFC backend, which allows you to specify a grammar in LBNF and get a generative fuzzer for that grammar.

Still, tavor is much nicer in many regards, in particular its various operators, ranged repeats, optional groups, and the ability to capture generated content in variables, all of which make it easier to produce sentences of the grammar that are both syntactically and semantically valid.

Also, I'm really digging the Sindarin name. 😎