A greedy (whitespace-consuming) %string% type?

zentures / sequence

(Unmaintained) High performance sequential log analyzer and parser

http://sequencer.io

A greedy (whitespace-consuming) %string% type? #5

Closed. alexzorin closed this issue 9 years ago

alexzorin commented 9 years ago

Consider the exim log message:

2015-02-11 11:04:40 H=(amoricanexpress.com) [64.20.195.132]:10246 F=<fxC4480@amoricanexpress.com> rejected RCPT <SCRUBBED@SCRUBBED.com>: Sender verify failed

I might make the pattern:

%msgtime% H=( %srchost% ) [ %srcipv4% ] : %srcport% F=< %srcemail% > rejected RCPT < %dstemail% >: Sender verify failed

But, I cannot find a way to turn the Sender verify failed bit into a single field, because %string% appears to break on whitespace.

Any ideas?

It's great to see the analyzer finally released with all the other bits, by the way. This project is amazing.

zhenjl commented 9 years ago

Thanks @alexzorin! This message helped identify a couple of scanner bugs that I had to fix.

I do have a nugget for you tho. Assuming you just want to consume all the tokens to the end of the string, you can do something like

%msgtime% H = ( %srchost% ) [ %srcipv4% ] : %srcport% F = < %srcemail% > %action% RCPT < %dstemail% > : %reason-%

Notice the %reason-% field: the "-" tells the parser to consume the rest of the tokens and put them in the %reason% field. So "Sender verify failed" will be put into %reason% in this case.

I haven't written much about it yet as I am still trying to test out a few meta characters to see how to make it easier for parsing.

Try it and let me know if it works for you.

Thanks for your help!

[edit: the - only works with Field types, and not Token types, at least not at this time. I should fix that]
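To make the trailing "-" idea concrete, here is a minimal Python sketch of "consume the rest of the tokens". It is not the sequence parser: the function name match_rest is made up for illustration, the tokenizer is assumed to be a plain whitespace split (the real scanner splits punctuation and timestamps itself), and an empty tail is allowed, which may not match the real behaviour.

```python
def match_rest(pattern, message):
    """Toy matcher: a trailing %name-% field takes all remaining tokens.

    Assumption: both pattern and message are tokenized by whitespace only.
    Returns a dict of field values, or None if the message does not match.
    """
    p_tokens = pattern.split()
    m_tokens = message.split()
    fields = {}
    for i, p in enumerate(p_tokens):
        is_field = len(p) > 2 and p.startswith("%") and p.endswith("%")
        if is_field and p.endswith("-%"):
            # "-" marker: join every remaining message token into this field.
            fields[p[1:-2]] = " ".join(m_tokens[i:])
            return fields
        if i >= len(m_tokens):
            return None  # message is shorter than the pattern
        if is_field:
            fields[p[1:-1]] = m_tokens[i]  # ordinary field: exactly one token
        elif p != m_tokens[i]:
            return None  # literal mismatch
    # No "-" field in the pattern: the message must be fully consumed.
    return fields if len(p_tokens) == len(m_tokens) else None


result = match_rest("%appname% : %reason-%", "sudo : Sender verify failed")
print(result)
```

With the simplified rule above, the three trailing words end up together in the reason field, which is the effect described for %reason-%.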

alexzorin commented 9 years ago

Oh, very cool, thanks

alexzorin commented 9 years ago

Given your thoughts from the blog post about the performance characteristics of regex parsers etc., I think it's doubtful, but do you think it would be within the realm of possibility for your scanner to do greedy consumption in the middle of a log string? Such as

# for messages ending in 'blah blah'
%time% %string-% blah blah

zhenjl commented 9 years ago

Certainly possible, though I don't think scanner is the place to do it. I try to keep the scanner purely for simple tokenization. However, we can do something in the parser.

In fact, if you have the following message and rule:

jan 14 10:15:56 testserver sudo: this is a weird log blah blah
%msgtime% %apphost% %appname% : %reason+% blah blah

Notice the "+" in the %reason% token: it tells the parser to continue adding %reason% fields until it hits the first "blah". The output is not exactly what I want yet:

%msgtime% "jan 14 10:15:56"
%apphost% "testserver"
%appname% "sudo"
%literal% ":"
%reason% "this"
%reason% "is"
%reason% "a"
%reason% "weird"
%reason% "log"
%literal% "blah"
%literal% "blah"

What I really want is to concatenate the 5 %reason% tokens into one...so stay tuned.
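The concatenation step can be sketched in a few lines of Python. This is only an illustration of merging consecutive tokens of the same field type, not the parser's actual code: the name merge_adjacent is hypothetical, tokens are modelled as (type, value) pairs, and values are joined with a single space, so the original spacing is assumed not to matter.

```python
def merge_adjacent(tokens):
    """Merge runs of consecutive tokens that share the same field type.

    Assumption: %literal% tokens are never merged, and merged values are
    joined with a single space.
    """
    merged = []
    for name, value in tokens:
        if name != "%literal%" and merged and merged[-1][0] == name:
            prev_name, prev_value = merged[-1]
            merged[-1] = (prev_name, prev_value + " " + value)
        else:
            merged.append((name, value))
    return merged


# The parser output listed above, as (type, value) pairs.
tokens = [
    ("%msgtime%", "jan 14 10:15:56"),
    ("%apphost%", "testserver"),
    ("%appname%", "sudo"),
    ("%literal%", ":"),
    ("%reason%", "this"),
    ("%reason%", "is"),
    ("%reason%", "a"),
    ("%reason%", "weird"),
    ("%reason%", "log"),
    ("%literal%", "blah"),
    ("%literal%", "blah"),
]

for name, value in merge_adjacent(tokens):
    print(name, repr(value))
```

The five %reason% tokens collapse into one entry, while the two "blah" literals stay separate.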

zhenjl commented 9 years ago

@alexzorin , I updated the parser to allow *, + and - as meta commands.

* means consume 0 or more tokens
+ means consume 1 or more tokens
- means consume the rest of the tokens

All three will merge the multiple tokens recognized.

So your rule can be written as any of the following:

"%msgtime% h = ( %srchost% ) [ %srcip% ] : %srcport% f = < %srcemail% > %action% rcpt < %dstemail% > : %reason:-%",
"%msgtime% h = ( %srchost% ) [ %srcip% ] : %srcport% f = < %srcemail% > %action% rcpt < %dstemail% > : %reason:+%",
"%msgtime% h = ( %srchost% ) [ %srcip% ] : %srcport% f = < %srcemail% > %action% rcpt < %dstemail% > : %reason:*%",
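For readers who want to see how the three meta commands differ, here is a rough Python model of the semantics listed above. It is a sketch under several assumptions and not the sequence implementation: match_rule is a made-up name, tokens come from a plain whitespace split, matching is case-sensitive, a plain field takes exactly one token (so multi-token timestamps are left out), a "*" or "+" field is assumed to be followed by a literal or to end the rule, and "-" is allowed to capture an empty tail.

```python
import re

# %name% or %name:X% where X is one of the meta commands *, + or -
FIELD = re.compile(r"^%(\w+)(?::([*+-]))?%$")


def match_rule(pattern, message):
    """Toy model of the *, + and - meta commands; returns fields or None."""
    p_tokens, m_tokens = pattern.split(), message.split()
    fields, pos = {}, 0
    for i, p in enumerate(p_tokens):
        m = FIELD.match(p)
        if m is None:
            # Literal: must equal the current message token.
            if pos >= len(m_tokens) or m_tokens[pos] != p:
                return None
            pos += 1
            continue
        name, meta = m.groups()
        if meta is None:
            # Plain field: exactly one token.
            if pos >= len(m_tokens):
                return None
            fields[name] = m_tokens[pos]
            pos += 1
            continue
        # Meta field: by default consume to the end of the message.
        end = len(m_tokens)
        nxt = p_tokens[i + 1] if i + 1 < len(p_tokens) else None
        if meta != "-" and nxt is not None and FIELD.match(nxt) is None:
            # "*" and "+" stop at the first token equal to the next literal.
            end = pos
            while end < len(m_tokens) and m_tokens[end] != nxt:
                end += 1
        if meta == "+" and end == pos:
            return None  # "+" needs at least one token
        fields[name] = " ".join(m_tokens[pos:end])  # merged into one value
        pos = end
    return fields if pos == len(m_tokens) else None


msg = "testserver sudo : this is a weird log blah blah"
print(match_rule("%apphost% %appname% : %reason:+% blah blah", msg))
print(match_rule("%apphost% %appname% : %reason:-%", msg))
```

In this model the "+" rule stops the reason field at the first "blah", while the "-" rule swallows everything after the colon, including both "blah" tokens.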

Let me know if you can test it out.

thx

Jian

alexzorin commented 9 years ago

Thanks, I'll try it out on my exim rules