y-scope / clp

Compressed Log Processor (CLP) is a free log management tool capable of compressing logs and searching the compressed logs without decompression.
https://yscope.com
Apache License 2.0

implementation for dns log #606


wdweng commented 5 days ago

Request

My graduation project is about DNS log compression and search. I have read your paper and found that the JSON version is very well suited to DNS logs, so I want to develop an implementation for DNS logs.

Possible implementation

A DNS log is a semi-structured text file with the format time--CIP--RIP--QType--QName--Resource Records, which is very similar to JSON. The resource records vary in length, and most values in each field are not repetitive. I want to change the code in clp-s to fit DNS log input.

gibber9809 commented 5 days ago

Hi @wdweng,

The simplest thing you could do is convert your data to newline-delimited JSON and then ingest that. That way everything should work for you out of the box without having to change any code.
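
For reference, a converter along these lines might look roughly like the sketch below. The field names ("time", "cip", etc.) and the "--" splitting are assumptions based on the format described above, and nlohmann/json is used purely for convenience; adjust to match your actual schema.

```cpp
// dns_to_ndjson.cpp -- minimal sketch: convert "--"-delimited DNS log lines
// into newline-delimited JSON that clp-s can ingest directly.
// Field names are placeholders; adapt them to your data.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

#include <nlohmann/json.hpp>  // assumes nlohmann/json is available

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        // Split the line on the "--" delimiter.
        std::vector<std::string> fields;
        std::size_t pos = 0;
        std::size_t next;
        while ((next = line.find("--", pos)) != std::string::npos) {
            fields.push_back(line.substr(pos, next - pos));
            pos = next + 2;
        }
        fields.push_back(line.substr(pos));

        if (fields.size() < 5) {
            continue;  // skip malformed lines
        }

        nlohmann::json record;
        record["time"] = fields[0];
        record["cip"] = fields[1];
        record["rip"] = fields[2];
        record["qtype"] = fields[3];
        record["qname"] = fields[4];
        // Everything after QName is treated as a variable-length list of
        // resource records.
        record["resource_records"]
                = std::vector<std::string>(fields.begin() + 5, fields.end());

        std::cout << record.dump() << '\n';
    }
    return 0;
}
```

The resulting newline-delimited JSON can then be compressed and searched with clp-s without any code changes.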

If you do want to directly ingest DNS logs there is a way to do that (discussed in my master's thesis), but it isn't very user-friendly at the moment. You will have to write a parser and a serializer for your DNS logs following a certain programming model. Additionally, you will have to change some parts of the code that currently assume every record is a JSON object, in particular here at ingestion, here during serialization, and here and here during search. Note that this is purely an issue with how the code is written right now -- the archive format itself can handle cases where records are not JSON.

When it comes to actually writing your parser and serializer, you will first have to add a type to this enum -- this is the type that gets encoded into the Merged Parse Tree and indicates what kind of structure is being represented.
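
As a rough sketch (the enum's actual name, location, and existing values in clp-s may differ from what is shown here), the addition would look something like:

```cpp
// Hypothetical sketch only -- the real enum in clp-s has a different set of
// values and may live elsewhere; the point is simply that a new value is
// needed to tag the custom DNS structure in the Merged Parse Tree.
#include <cstdint>

enum class NodeType : std::uint8_t {
    Object,
    StructuredArray,
    // ... other existing node types elided ...
    DnsRecord,  // new: a DNS log record stored as an unordered object
};
```

The places mentioned above that assume every record is a JSON object would then also need to accept this new type wherever they currently check for the object type.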

For writing the parser, hopefully this can act as a reference -- in particular, note the start_unordered_object and end_unordered_object calls that mark the start and end of parsing for this custom type. Between those calls you are free to call the "unordered" versions of the functions to manipulate the schema and values in a record -- performing parsing in this way guarantees that you see the same values in the same order at decompression time. You should hopefully be able to follow what we do in the parse() function in that same file and call your parser directly.
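
A skeleton of that flow might look like the following. The Parser interface and the split_on helper shown here are hypothetical and only mirror the call pattern described above; the real clp-s signatures will differ, so treat this purely as an outline of the control flow.

```cpp
// Hypothetical outline of a DNS-log parser following the pattern described
// above. None of these declarations are the real clp-s API; they only sketch
// the start_unordered_object / unordered-add / end_unordered_object flow.
#include <cstddef>
#include <string>
#include <vector>

enum class NodeType { DnsRecord /* ... */ };

// Stand-in for the parser's real interface.
class Parser {
public:
    void start_unordered_object(NodeType type);
    void add_unordered_value(std::string const& key, std::string const& value);
    void end_unordered_object();
};

// Splits `line` on the given delimiter; trivial helper, definition omitted.
std::vector<std::string> split_on(std::string const& line, std::string const& delim);

void parse_dns_line(Parser& parser, std::string const& line) {
    auto fields = split_on(line, "--");

    // Everything between these two calls is recorded as one unordered object,
    // so the same values come back in the same order at decompression time.
    parser.start_unordered_object(NodeType::DnsRecord);

    parser.add_unordered_value("time", fields.at(0));
    parser.add_unordered_value("cip", fields.at(1));
    parser.add_unordered_value("rip", fields.at(2));
    parser.add_unordered_value("qtype", fields.at(3));
    parser.add_unordered_value("qname", fields.at(4));

    // Resource records are variable length, so append whatever remains.
    for (std::size_t i = 5; i < fields.size(); ++i) {
        parser.add_unordered_value("resource_record", fields.at(i));
    }

    parser.end_unordered_object();
}
```

The idea is that, instead of the JSON parsing path, something like this would be called once per DNS log line from (or in place of) parse().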

For serialization it might be a bit more difficult to replicate what we do, since the code is heavily optimized for serializing JSON. Here in the code, the variable m_global_id_to_unordered_object should have everything you need to initialize the serializer for your special type. You can see how we use this information to prepare to serialize structurized arrays here. After preparing to serialize objects from a given table, the actual serialization code is here. I expect the details of how you'll initialize your serializer and actually serialize your data will be fairly different from what we do here.
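
The record-level piece you would end up writing is conceptually simple (rejoining fields with the original delimiter); the part you will have to adapt from the linked code is how the serializer is initialized from m_global_id_to_unordered_object and how the field values for each record are read back from the archive. A purely illustrative sketch of that record-level step:

```cpp
// Purely illustrative: given the field values recovered for one DNS record,
// rebuild the original "--"-delimited log line. Fetching `fields` from the
// archive (via the unordered-object metadata mentioned above) is the part
// that has to follow clp-s's serializer initialization code.
#include <cstddef>
#include <string>
#include <vector>

std::string serialize_dns_record(std::vector<std::string> const& fields) {
    std::string line;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (0 != i) {
            line += "--";
        }
        line += fields[i];
    }
    return line;
}
```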

Going forward, this should all become much simpler, but unfortunately support for custom parsing and serialization is not very mature right now.