wuyifan18 / DeepLog

Pytorch Implementation of DeepLog.
MIT License
361 stars · 154 forks

Encoding the data from logparser #41

Closed nagsubhadeep closed 3 years ago

nagsubhadeep commented 4 years ago

Is Deeplog supposed to work on any log format such as firewall logs? Or is it supposed to work on system event logs only?

wuyifan18 commented 4 years ago

@nagsubhadeep DeepLog can work on any log format. You just use a log parser to transform the logs into sequences of log key ids, where each id corresponds to a log template (e.g. `2 13 23 4 6 2 7`), and DeepLog can use these sequences for training and inference.

nagsubhadeep commented 3 years ago

Yifan,

When I executed the Drain log parser on my log data, it produced the parsed logs in two files: log_structured.csv and log_templates.csv.

The log_structured.csv contains the following fields: LineId,Month,Date,Time,IP,Component,Content,EventId,EventTemplate,ParameterList

The log_templates.csv contains the following fields: EventId,EventTemplate,Occurrences

My question is: how can I model the training data according to the format used in the data/hdfs_train file?

Any input will be appreciated.

Thanks, Deep

wuyifan18 commented 3 years ago

Deep,

You can use EventId in log_structured.csv as training data.

Yifan

nagsubhadeep commented 3 years ago

The EventId shows up as the following:

log_structured.csv: a7e180bc a7e180bc 6dcd196e 6dcd196e 6dcd196e 6dcd196e 6dcd196e 6dcd196e 6dcd196e a7e180bc a7e180bc a7e180bc 046873eb 046873eb a7e180bc 046873eb a7e180bc 046873eb a7e180bc 5ab8dc9c c6e0f51b

log_templates.csv: a7e180bc 6dcd196e 046873eb 5ab8dc9c c6e0f51b

And that looks very different from the kind of training data that you have. Can I proceed with this training data?

nagsubhadeep commented 3 years ago

I am not getting a single sequential sequence here, such as `22 39 42 45 1`. How can I achieve that? It seems that EventId does not generate that kind of sequence.

wuyifan18 commented 3 years ago

Encode each EventId to a number, e.g. a7e180bc -> 0, 6dcd196e -> 1, ...
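A minimal sketch of that encoding step (the function name and first-appearance ordering are my own choices for illustration, not part of DeepLog):

```python
def encode_event_ids(event_ids):
    """Map each distinct EventId hash to a small integer, in order of first appearance."""
    mapping = {}
    encoded = []
    for eid in event_ids:
        if eid not in mapping:
            mapping[eid] = len(mapping)
        encoded.append(mapping[eid])
    return encoded, mapping

seq, mapping = encode_event_ids(
    ["a7e180bc", "a7e180bc", "6dcd196e", "046873eb", "a7e180bc"])
# seq is [0, 0, 1, 2, 0]
```

Keep the mapping around so that test data is encoded with the same ids as the training data.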

nagsubhadeep commented 3 years ago

If I just encode the EventId from log_structured.csv, then my sequence of event ids will look somewhat like this:

Line 1: 2
Line 2: 6
Line 3: 3
Line 4: 5

...and this will continue throughout the length of the log_structured.csv dataset.

How do you get a format such as the following?

Line 1: 5 5 5 22 11 9 11 9 11 9 26 26 26 23 23 23 21 21 21
Line 2: 22 5 5 5 11 9 11 9 11 9 26 26 26
Line 3: 22 5 5 5 26 26 26 11 9 11 9 11 9 2 3 23 23 23 21 21 21
Line 4: 22 5 5 5 11 9 11 9 11 9 26 26 26

The only way I find it logical is if, along with the EventId, I also encode the contents of the parameter list against each EventId as unique numbers. Is that correct? E.g.

EventId: 10
ParameterList: ['outbound', '828584', '125.101.22.216/17472 (125.101.22.216/17472)', '125.101.22.94/51103 (125.101.22.94/51102)']

So here I encode it as: 10 22 44 55 66 [where 10 is the EventId and 22 44 55 66 is the parameter list encoded as unique numbers]

Is my understanding correct?

wuyifan18 commented 3 years ago

We do not use the ParameterList in the log key anomaly detection model. I use the sequences in this way, e.g.:

sequence: 5 5 5 22 11 9 11 9 11 9 26 26 26 23 23 23 21 21 21
window_size: 5

input1: 5 5 5 22 11 -> output1: 9
input2: 5 5 22 11 9 -> output2: 11

Think of each of your line1-line4 as one sequence: take subsequences of length window_size and use each one to predict the next log key. If I remember correctly, the data provided in the paper is grouped by instance_id, so there are many lines.
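The windowing described above can be sketched like this (a simplified illustration, not the repo's exact generate() code):

```python
def sliding_windows(sequence, window_size):
    """Yield (input, next_key) pairs: each window of window_size keys predicts the key after it."""
    for i in range(len(sequence) - window_size):
        yield sequence[i:i + window_size], sequence[i + window_size]

seq = [5, 5, 5, 22, 11, 9, 11, 9, 11, 9, 26, 26, 26]
pairs = list(sliding_windows(seq, 5))
# pairs[0] is ([5, 5, 5, 22, 11], 9)
# pairs[1] is ([5, 5, 22, 11, 9], 11)
```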

nagsubhadeep commented 3 years ago

And here goes my understanding:

I get the window size. input1 consists of the encoded EventIds of length window_size. output1 is the outcome of the model's conditional probability distribution over the encoded EventIds given input1, whose value you are getting as 9.

However, all I get from Drain is the EventId, which I am encoding in this phase and aggregating into input1, input2, etc. How do you generate the conditional probability distribution for each input1, input2, etc.?

Would you mind sharing the script to generate the training data that includes the window size, input and output?

wuyifan18 commented 3 years ago

I have written the code for generating the training data:

https://github.com/wuyifan18/DeepLog/blob/502aaf05be4c1251b7dc96f6439025c4fc988c66/LogKeyModel_train.py#L14-L28

As for the conditional probability distribution, it is generated by the LSTM:

https://github.com/wuyifan18/DeepLog/blob/502aaf05be4c1251b7dc96f6439025c4fc988c66/LogKeyModel_train.py#L81-L82
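For reference, the idea behind the linked LSTM can be sketched like this (the layer sizes and key count here are illustrative, not the repo's exact configuration):

```python
import torch
import torch.nn as nn

num_keys, window_size, hidden = 28, 5, 64  # illustrative sizes

class LogKeyModel(nn.Module):
    """LSTM over a window of encoded log keys; the final linear layer's softmax
    is the conditional distribution over the next log key."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_keys)

    def forward(self, x):                  # x: (batch, window_size, 1)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])      # logits over the next log key

model = LogKeyModel()
window = torch.tensor([[5, 5, 5, 22, 11]], dtype=torch.float).unsqueeze(-1)
probs = torch.softmax(model(window), dim=-1)  # shape (1, num_keys), sums to 1
```

During detection, a key is flagged as anomalous when it is not among the top-g most probable next keys.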

nagsubhadeep commented 3 years ago

So basically, does the name variable in the generate() function denote the file that contains all the EventIds from log_structured.csv? Otherwise, I am still trying to understand what the format of the file assigned to the name variable would be.

wuyifan18 commented 3 years ago

Yes, but it is not log_structured.csv here. The data I use in the code is HDFS data; you should modify the code according to your data.

nagsubhadeep commented 3 years ago

I get that. I want to know: what is the name variable in the generate() function here? Is it the file containing the entire output of the Drain parser, or just the log keys of your HDFS data?

In short, on what did you run the generate function? I am assuming that the current contents of the data folder (hdfs_test_abnormal, hdfs_test_normal and hdfs_train) were generated after you ran the generate function on some data assigned to the name variable. Right? What is that name variable? Is it the file containing the HDFS log keys from the Drain output?

wuyifan18 commented 3 years ago

The name variable in the generate() function is the name of a file containing the log keys from the Drain output, such as hdfs_train. The generate function produces the train data or the test data.
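So the file passed as name is plain text with one space-separated session of log keys per line, like hdfs_train. Writing your own sessions in that format could look like this (the file name and session values are made up for illustration):

```python
# Hypothetical sessions: one list of encoded log keys per session
# (e.g. one session per block or instance id).
sessions = [
    [5, 5, 5, 22, 11, 9, 11, 9, 11, 9, 26, 26, 26],
    [22, 5, 5, 5, 11, 9, 11, 9, 11, 9, 26, 26, 26],
]
with open("my_train", "w") as f:
    for s in sessions:
        f.write(" ".join(map(str, s)) + "\n")
```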

nagsubhadeep commented 3 years ago

Yifan, thank you for all your help explaining the problem.

OutOfBoundCats commented 2 years ago

Hi @wuyifan18

Thanks for all the work on this repository. I went through all the issues, and my understanding is that to generate the training dataset we need to do the following:

  1. Get the logs
  2. Use the Drain parser to generate files with log keys
  3. Encode the EventId or a similar field, depending on the logs
  4. Form sequences of the encoded EventIds

The one thing I am not sure about is how we go about forming the sequences or groups on different lines, given the encoded keys, like below:

5 5 5 22 11 9 11 9 11 9 26 26 26 23 23 23 21 21 21
22 5 5 5 11 9 11 9 11 9 26 26 26

OutOfBoundCats commented 2 years ago

I think I found a way to generate sequences from EventId while digging through the loglizer code:

  1. Read the log file parsed by Drain.

  2. Encode EventId to a number and generate a new file.

  3. Read the file:

```python
print("Loading", input)
struct_log = pd.read_csv(log_struc, engine='c', na_filter=False, memory_map=True)
```

  4. Generate the sequences and save the file (code taken from https://github.com/logpai/loglizer/blob/master/loglizer/dataloader.py, lines 82 to 89, plus my debug prints):

```python
import re
from collections import OrderedDict

import pandas as pd

data_dict = OrderedDict()
for idx, row in struct_log.iterrows():
    if idx < 1:
        print(row)
    # HDFS block ids look like blk_-1608999687919862906
    blkId_list = re.findall(r'(blk_-?\d+)', row['Content'])
    if idx < 1:
        print(blkId_list)
    blkId_set = set(blkId_list)
    if idx < 1:
        print(blkId_set)
    for blk_Id in blkId_set:
        if blk_Id not in data_dict:
            data_dict[blk_Id] = []
        data_dict[blk_Id].append(row['EventId'])
        if idx < 1:
            print(data_dict)
data_df = pd.DataFrame(list(data_dict.items()), columns=['BlockId', 'EventSequence'])
data_df.to_csv(output, index=False)
```

you should now have a file which looks like this

[screenshot: a CSV with BlockId and EventSequence columns]

Mine looks like the above because I haven't yet done step 2.

@wuyifan18 can you please comment on whether this is right or whether I missed something?

shoaib-intro commented 2 years ago

@OutOfBoundCats

Summary: you're right in the case of HDFS and BGL, which I had already grouped using loglizer, so that case is obvious. But how do we group encoded EventId sequences in the case of Windows/system/application logs, where we don't have any block_id as a grouping reference, only simple log statements?

Answer: I observed that Component can also be used as a reference for EventIds, as an alternative to the block_id used in loglizer lines 82-89.
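A sketch of that Component-based grouping, assuming a pandas DataFrame with the Drain output columns (the example values below are made up):

```python
import pandas as pd

# Hypothetical Drain output: no block id available, so group EventIds by Component instead.
df = pd.DataFrame({
    "Component": ["sshd", "kernel", "sshd", "kernel", "sshd"],
    "EventId":   ["a7e180bc", "6dcd196e", "046873eb", "6dcd196e", "a7e180bc"],
})
sequences = df.groupby("Component", sort=False)["EventId"].apply(list)
# sequences["sshd"] is ["a7e180bc", "046873eb", "a7e180bc"]
```

Whether Component gives sessions as meaningful as HDFS block ids depends on the logs; a time window per Component is another common fallback.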