wuyifan18 / DeepLog

Pytorch Implementation of DeepLog.
MIT License

Share how I transformed the logs into lines of IDs here #35

Open ying1016 opened 4 years ago

ying1016 commented 4 years ago

Hey guys, I used Drain3 to transform the HDFS logs into lines of IDs; the code is here: https://github.com/ying1016/Drain3.git. I hope it helps if you don't know where to start. One thing to note: the raw data is ordered by log time, not by block ID. If you want to transform the logs, you need data ordered by block ID, unlike the test data in my repository. But I think that might not be a problem.
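For anyone unsure what "ordered by block ID" means in practice, here is a minimal sketch of regrouping time-ordered lines into one ID sequence per block. It assumes each line has already been parsed into an `(event_id, raw_message)` pair; the function name `group_by_block`, the sample messages, and the event IDs are all made up for illustration and are not taken from the repository above.

```python
import re
from collections import defaultdict

# HDFS block IDs look like blk_<digits>, optionally with a minus sign.
BLOCK_ID = re.compile(r"blk_-?\d+")

def group_by_block(parsed_lines):
    """Group time-ordered (event_id, raw_message) pairs into per-block sequences.

    Hypothetical helper: the input format is an assumption, not the
    output format of any particular parser.
    """
    sessions = defaultdict(list)
    for event_id, message in parsed_lines:
        # One line can mention the same block twice; dict.fromkeys
        # removes duplicates while keeping first-seen order.
        for block in dict.fromkeys(BLOCK_ID.findall(message)):
            sessions[block].append(event_id)
    return dict(sessions)

# Illustrative, made-up lines in time order (two blocks interleaved).
lines = [
    (5, "Receiving block blk_1 src: /10.0.0.1:50010 dest: /10.0.0.2:50010"),
    (5, "Receiving block blk_2 src: /10.0.0.3:50010 dest: /10.0.0.2:50010"),
    (22, "BLOCK* NameSystem.allocateBlock: /tmp/file. blk_1"),
    (11, "PacketResponder 1 for block blk_2 terminating"),
    (11, "PacketResponder 0 for block blk_1 terminating"),
]

sessions = group_by_block(lines)
for block, events in sessions.items():
    # One line of IDs per block, which is the shape DeepLog-style input takes.
    print(" ".join(str(e) for e in events))
```

Because the input is scanned in time order, each block's sequence keeps its events in the order they were logged.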

DuoweiPan commented 3 years ago

@ying1016 Thank you for your implementation! I noticed that IDblks.log contains a lot of very short entries, such as 06 01 or 01, which are shorter than the window size and quite different from hdfs_train. Those entries will be detected as abnormal if I use the model trained on hdfs_train. Correct me if I'm wrong, but I think the original log data you used is the same as the log data that DeepLog used, so why are the log keys so different between them? Any hint would be helpful. Thank you!

edocorallo commented 3 years ago

Hello, @DuoweiPan

> I noticed that IDblks.log contains a lot of very short entries, such as 06 01 or 01, which are shorter than the window size and quite different from hdfs_train. Those entries will be detected as abnormal if I use the model trained on hdfs_train.

From what I understood, the length of a session should never be less than the window size during the training stage (e.g. with window_size=9, len(session)>=9), though I could be wrong.
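As a rough sketch of why short sessions are a problem, assume training samples are built as (window, next key) pairs with a sliding window. The helper name `make_windows` is hypothetical and this is not necessarily how the repository builds its samples. Under this pairing, a session needs at least window_size + 1 keys to contribute even one sample, so anything shorter yields nothing.

```python
def make_windows(session, window_size):
    """Build (history, next_key) pairs with a sliding window.

    Hypothetical helper for illustration. Sessions with
    len(session) <= window_size produce an empty list.
    """
    return [
        (session[i:i + window_size], session[i + window_size])
        for i in range(len(session) - window_size)
    ]

long_session = [5, 5, 22, 11, 9]
short_session = [6, 1]

long_pairs = make_windows(long_session, window_size=3)
short_pairs = make_windows(short_session, window_size=3)

print(long_pairs)   # two samples from a 5-key session
print(short_pairs)  # no samples: the session is shorter than the window
```

A session like 06 01 therefore has to be either skipped or padded before it can be fed to a model that expects fixed-size windows; which of the two is appropriate depends on the setup.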

> so why are the log keys so different between them?

Also, from what I understood, the log keys are fairly arbitrary. I enumerated them in order of appearance using a simple dictionary and saved the dictionary for later parsing. If I had started parsing from some random line instead, I would still get a good training set containing the same sequences of logs, just named differently. For example, the sequence [2 5 2 5 4 7 8] is equivalent to [6 8 6 8 1 9 13]: as long as the enumeration of the log keys is consistent across the entire dataset, DeepLog obtains similar results with either enumeration. Obviously, whichever enumeration you use for training has to be the same one you use for prediction.
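A minimal sketch of that idea, with hypothetical helper names (`encode`, `canonical`) and made-up template strings: IDs are assigned in order of first appearance, the dictionary is serialized so the same mapping can be reused at prediction time, and a small check shows that the two sequences above differ only by a renaming of keys.

```python
import json

def encode(templates, mapping):
    """Map template strings to integer IDs in order of first appearance.

    `mapping` is updated in place, so passing a previously saved
    mapping keeps the enumeration consistent across runs.
    """
    ids = []
    for template in templates:
        if template not in mapping:
            mapping[template] = len(mapping) + 1
        ids.append(mapping[template])
    return ids

def canonical(sequence):
    """Relabel keys by first appearance, so renamed sequences compare equal."""
    first_seen = {}
    return [first_seen.setdefault(key, len(first_seen)) for key in sequence]

# Training: build the mapping (template strings are illustrative only).
mapping = {}
train_ids = encode(
    ["Receiving block <*>", "PacketResponder <*> terminating", "Receiving block <*>"],
    mapping,
)

# Save and reload the mapping (here via a JSON string instead of a file).
reloaded = json.loads(json.dumps(mapping))

# Prediction: reuse the saved mapping so the same template gets the same ID.
test_ids = encode(["PacketResponder <*> terminating", "Receiving block <*>"], reloaded)

# The two enumerations from the comment describe the same pattern.
same_pattern = canonical([2, 5, 2, 5, 4, 7, 8]) == canonical([6, 8, 6, 8, 1, 9, 13])
print(train_ids, test_ids, same_pattern)
```

If the mapping were rebuilt from scratch on the test data instead of reloaded, the same template could get a different ID and the trained model's predictions would no longer line up.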

I hope this helps. Bye

OneStepAndTwoSteps commented 1 year ago

This is very helpful to me, thank you.