wuyifan18 / DeepLog

PyTorch implementation of DeepLog.
MIT License
374 stars · 155 forks

data conversion #1

Open Athiq opened 5 years ago

Athiq commented 5 years ago

Do you have a script that converts the log files (HDFS text files) to numbers?

https://github.com/wuyifan18/DeepLog/blob/master/data/hdfs_train

How did you get the above? Using Spell? After running the parser I still have text data. How did you convert it to numbers (vectors)? Do you have a script? Can you please upload it?

https://github.com/logpai/logparser/tree/master/logs/HDFS

Is this the data that was converted to numbers?

thanks in advance

wuyifan18 commented 5 years ago

I use the dataset provided by the author of the paper. For more details, please refer to the web page.

sotiristsak commented 5 years ago

Hello, and thanks @wuyifan18 for the great job. I think what @Athiq is talking about can be found in Section 4.3 of the published paper. I'm also trying to find out how this could be implemented! Any help would be much appreciated.

Athiq commented 5 years ago

@sotiristsak Exactly. I want the raw text that was converted to numbers in the provided data (I think it's TF-IDF). If so, it shouldn't be a problem to implement. Please let me know if that's the case, @wuyifan18.

wuyifan18 commented 5 years ago

@Athiq The raw text can be found on the web page.

Athiq commented 5 years ago

@wuyifan18 Thanks for the response. I am looking for text data so that I can use Spell and DeepLog, but where I fail is that after Spell I have parsed text data. I want to train DeepLog on this parsed data, but I am not sure how to convert the parsed output from Spell to numbers (is it TF-IDF?).

wuyifan18 commented 5 years ago

@Athiq You mean converting the data to numbers according to the log keys you parsed with Spell? If so, I have no idea. Maybe @sotiristsak can lend a hand.

sotiristsak commented 5 years ago

Sorry for the delayed reply. Unfortunately, I also don't have a clue. I'm thinking I have to implement Section 4.3.1 of the paper on my own because, I think, this is where the logs are split into tasks in order to be grouped into workflows. @Athiq What do you mean by TF-IDF? Also, is anyone interested in collaborating on the above work?

sotiristsak commented 5 years ago

Btw @Athiq, the numbers are not TF-IDF. They are the ids of the different log types. A sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows extracted from the raw log file of the normal execution.
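
For illustration, a minimal sketch of what this means in code, assuming each line of hdfs_train is a space-separated sequence of log-key ids for one block/session (an assumption based on the description above, not taken from the repo):

```python
# Minimal sketch: read hdfs_train, assuming one space-separated
# sequence of log-key ids per line (one line = one block/session).
with open('data/hdfs_train') as f:
    sessions = [[int(k) for k in line.split()] for line in f]

# Each entry is one workflow, i.e. the ordered log-key ids of a session.
print(sessions[0])
```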

wuyifan18 commented 5 years ago

@sotiristsak You're right.

Athiq commented 5 years ago

@sotiristsak @wuyifan18 What I am trying to do is run DeepLog on the data below:

https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log

I have successfully run Spell (the parser) on this data, and I now have two files as below.



Sample structured_file.csv:

LineId Date Time Pid Level Component Content EventId EventTemplate
1 81109 203615 148 INFO dfs.DataNode$PacketResponder PacketResponder 1 for block blk_38865049064139660 terminating ead21f08 PacketResponder for block terminating
2 81109 203807 222 INFO dfs.DataNode$PacketResponder PacketResponder 0 for block blk_-6952295868487656571 terminating ead21f08 PacketResponder for block terminating
3 81109 204005 35 INFO dfs.FSNamesystem BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.73.220:50010 is added to blk_7128370237687728475 size 67108864 54e007d2 BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size
4 81109 204015 308 INFO dfs.DataNode$PacketResponder PacketResponder 2 for block blk_8229193803249955061 terminating ead21f08 PacketResponder for block terminating
5 81109 204106 329 INFO dfs.DataNode$PacketResponder PacketResponder 2 for block blk_-6670958622368987959 terminating ead21f08 PacketResponder for block terminating
6 81109 204132 26 INFO dfs.FSNamesystem BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.43.115:50010 is added to blk_3050920587428079149 size 67108864 54e007d2 BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size


Sample template_file.csv:

EventId EventTemplate Occurrences
ead21f08 PacketResponder for block terminating 311
54e007d2 BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size 314
74cae9fd Received block of size from * 292
dd632e5d Receiving block src dest 50010 292


Now the big question is how to run DeepLog on this structured file and template file. Is this possible, or am I missing something?

thanks in advance

wuyifan18 commented 5 years ago

@Athiq You should convert the structured file to numbers according to the template file you obtained from Spell.
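
Not an official script, but a minimal sketch of that conversion, assuming pandas and the EventId columns shown in @Athiq's samples above (the file names are illustrative):

```python
import pandas as pd

# Illustrative file names; adjust to your actual Spell output.
templates = pd.read_csv('template_file.csv')
structured = pd.read_csv('structured_file.csv')

# Assign each log key (EventId) a small integer id, starting at 0 here;
# adjust if your model expects 1-based keys.
key_to_id = {eid: i for i, eid in enumerate(templates['EventId'])}

# Map every parsed log line to its log-key id, preserving line order.
sequence = [key_to_id[eid] for eid in structured['EventId']]
print(sequence[:10])
```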

Hammadtcs commented 5 years ago

@wuyifan18: Thanks for your response. Sorry, but I am also struggling with how to convert structured files into numbers. Can you guide us with an example of how to do it, please? Any example would help.

williamceli commented 5 years ago

Hello! From my understanding, once raw text logs have been parsed (using Spell or any other parsing tool), they should be converted into sequences of log templates to be fed to the LSTM model.

hzxGoForward commented 5 years ago

> Hello! From my understanding, once raw text logs have been parsed (using Spell or any other parsing tool), they should be converted into sequences of log templates to be fed to the LSTM model.

I agree with your opinion; that's why I am confused about the format of the training data. I am also confused about why the author divides the log into lines where each line has a different length. I think this is not the correct format for the training data according to the paper. Do you have any idea?

williamceli commented 5 years ago

@hzxGoForward I think there is a preprocessing step missing, which is, for each line (block/session), building sequences of the same length. I guess that is not the actual final input for training. My problem is that I don't get the same number of block lines: if I group by block in the first 100K log lines, I get a different number of sessions. Maybe I am extracting the wrong block id from each line.
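
For reference, a minimal sketch of grouping lines into sessions by block id, assuming pandas, the Content and EventId columns from the samples above, and HDFS block ids of the form blk_<digits> or blk_-<digits> (all assumptions, not the repo's actual preprocessing):

```python
import re
from collections import defaultdict

import pandas as pd

BLOCK_RE = re.compile(r'blk_-?\d+')  # assumed HDFS block-id pattern

structured = pd.read_csv('structured_file.csv')  # illustrative file name
sessions = defaultdict(list)
for content, event_id in zip(structured['Content'], structured['EventId']):
    m = BLOCK_RE.search(content)
    if m:
        sessions[m.group()].append(event_id)

print(len(sessions))  # number of distinct blocks/sessions found
```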

wuyifan18 commented 5 years ago

@williamceli Exactly, the actual final input for training needs padding so that each input window has the length given by the hyperparameter window_size.
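
A minimal sketch of that windowing step, assuming a session is one log-key id sequence as above and that short sessions are padded with a sentinel value (the padding value of -1 is an assumption for illustration, not taken from this repo):

```python
window_size = 10  # the hyperparameter mentioned above

def make_windows(session, window_size, pad=-1):
    # Pad short sessions so at least one (window, label) pair exists.
    if len(session) < window_size + 1:
        session = session + [pad] * (window_size + 1 - len(session))
    # Slide a fixed-length window over the session; the key right after
    # each window is the label the LSTM learns to predict.
    return [(session[i:i + window_size], session[i + window_size])
            for i in range(len(session) - window_size)]

windows = make_windows([5, 5, 5, 22, 11, 9, 11, 9, 26], window_size)
```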

hzxGoForward commented 5 years ago

> @hzxGoForward I think there is a preprocessing step missing, which is, for each line (block/session), building sequences of the same length. I guess that is not the actual final input for training. My problem is that I don't get the same number of block lines: if I group by block in the first 100K log lines, I get a different number of sessions. Maybe I am extracting the wrong block id from each line.

Maybe you can use the log-key numbering from the following dataset: http://iiis.tsinghua.edu.cn/~weixu/sospdata.html (DeepLog's authors cited this dataset, and it contains the log keys and their numbers).

Hammadtcs commented 5 years ago

@wuyifan18 @hzxGoForward: Can you add the preprocessing, i.e., how you converted lines to numerical sequences using the window_size hyperparameter or timestamps for the LSTM?

We are working with the OpenStack logs; for your reference, here is the log: https://github.com/logpai/logparser/blob/master/logs/OpenStack/OpenStack_2k.log

We are able to convert unstructured logs to structured logs using Spell or logparser, but after that we are unable to feed the data to training. I understand that you do the conversion using the window_size hyperparameter. Can you add those details or sample source code?

Huhu-ooo commented 4 years ago

> Btw @Athiq, the numbers are not TF-IDF. They are the ids of the different log types. A sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows extracted from the raw log file of the normal execution.

@Athiq Hi, thanks for your response; it also helps me a lot! I have something to verify: do you mean that I can validate my workflow-extraction code against the hdfs_train file? Thank you so much!

stuti-madaan commented 4 years ago

@Athiq Hi! I am going through the same issue: I have parsed the logs and I am clueless about how to convert them into numbers for processing. Were you able to find a solution?

Nightmare2334 commented 1 year ago

> @sotiristsak @wuyifan18 What I am trying to do is run DeepLog on the data below:
>
> https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log
>
> [quotes the rest of @Athiq's earlier comment, including the structured_file.csv and template_file.csv samples]

@Athiq Hello buddy, I have already obtained the template file and the templated log file, but how can I turn them into numeric sequence files like the author's hdfs_train data? Do you have a way? I hope you can reply when you see this; it is very important to me. Thank you!