[Open] Athiq opened this issue 5 years ago
I use the dataset provided by the author of this paper. For more details, please refer to the web page.
Hello, and thanks @wuyifan18 for the great work. I think what @Athiq is talking about can be found in Section 4.3 of the published paper. I'm also trying to figure out how this could be implemented; any help would be much appreciated.
@sotiristsak Exactly. I want the raw text that was converted into the numbers in the provided data (I think it's TF-IDF). If that's the case, it shouldn't be a problem to implement. Please let me know if so, @wuyifan18.
@wuyifan18 Thanks for the response. I am looking for the text data so that I can use Spell and DeepLog. Where I get stuck is after Spell: I have text parsed data that I want to train DeepLog with, but I am not sure how to convert this parsed data from Spell into numbers (is it TF-IDF?).
@Athiq You mean converting the data to numbers according to the log keys you parsed with Spell? If so, I have no idea. Maybe @sotiristsak can give a hand.
Sorry for the delayed reply. Unfortunately, I also don't have a clue. I'm thinking I have to implement Section 4.3.1 of the paper on my own, because, I think, that is where the logs are split into tasks in order to be grouped into workflows. @Athiq, what do you mean by TF-IDF? Also, is anyone interested in collaborating on this?
Btw, @Athiq , the numbers are not TF-IDF. They are the ids of each different log type. So, a sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows that were extracted from the raw log file of the normal execution.
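To make that concrete: as far as I can tell, each line in hdfs_train is one session (one HDFS block), written as a space-separated sequence of log-key IDs, something like this (made-up values, just to show the shape):

```
5 5 5 22 11 9 11 9 11 9 26 26 26
22 5 5 11 9 11 9 11 9 26 26 26
```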
@sotiristsak You're right.
@sotiristsak @wuyifan18 What I am trying to do is run DeepLog on the data below:
https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log.
I have successfully run Spell (the parser) on this data, and now I have the two files below.
Sample structured_file.csv

| LineId | Date | Time | Pid | Level | Component | Content | EventId | EventTemplate |
|---|---|---|---|---|---|---|---|---|
| 1 | 81109 | 203615 | 148 | INFO | dfs.DataNode$PacketResponder | PacketResponder 1 for block blk_38865049064139660 terminating | ead21f08 | PacketResponder for block terminating |
| 2 | 81109 | 203807 | 222 | INFO | dfs.DataNode$PacketResponder | PacketResponder 0 for block blk_-6952295868487656571 terminating | ead21f08 | PacketResponder for block terminating |
| 3 | 81109 | 204005 | 35 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.73.220:50010 is added to blk_7128370237687728475 size 67108864 | 54e007d2 | BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size |
| 4 | 81109 | 204015 | 308 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_8229193803249955061 terminating | ead21f08 | PacketResponder for block terminating |
| 5 | 81109 | 204106 | 329 | INFO | dfs.DataNode$PacketResponder | PacketResponder 2 for block blk_-6670958622368987959 terminating | ead21f08 | PacketResponder for block terminating |
| 6 | 81109 | 204132 | 26 | INFO | dfs.FSNamesystem | BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.251.43.115:50010 is added to blk_3050920587428079149 size 67108864 | 54e007d2 | BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size |
Sample template_file.csv

| EventId | EventTemplate | Occurrences |
|---|---|---|
| ead21f08 | PacketResponder for block terminating | 311 |
| 54e007d2 | BLOCK NameSystem.addStoredBlock blockMap updated 50010 is added to size | 314 |
| 74cae9fd | Received block of size from * | 292 |
| dd632e5d | Receiving block src dest 50010 | 292 |
Now the big question: is it possible to run DeepLog on these structured and template files, or am I missing something?
Thanks in advance.
@Athiq You should convert the structured_file to numbers according to the template file you obtained from Spell.
@wuyifan18: Thanks for your response. Sorry, but I am also struggling with how to convert the structured files into numbers. Could you guide us with an example of how to do it? Any example would help.
Hello! From my understanding, once the raw text logs have been parsed (using Spell or any other parsing tool), they should be converted into sequences of log templates (keys) to be fed to the LSTM model.
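Concretely, here is a rough sketch of that conversion. This is my own take, not the author's script; it assumes the structured_file.csv / template_file.csv columns shown above plus the usual HDFS blk_ id format:

```python
import re
from collections import defaultdict

import pandas as pd

# Spell output files as named earlier in this thread (assumption).
templates = pd.read_csv('template_file.csv')     # EventId, EventTemplate, Occurrences
structured = pd.read_csv('structured_file.csv')  # LineId, ..., Content, EventId, EventTemplate

# 1. Map each EventId hash to a small integer log key (1..n).
event_to_key = {eid: i + 1 for i, eid in enumerate(templates['EventId'])}

# 2. Group log lines into sessions by block id and record the key sequence.
block_pattern = re.compile(r'blk_-?\d+')  # assumption: HDFS block ids look like blk_123 or blk_-123
sessions = defaultdict(list)
for _, row in structured.iterrows():
    match = block_pattern.search(str(row['Content']))
    if match is None:
        continue  # line mentions no block id, skip it
    sessions[match.group(0)].append(event_to_key[row['EventId']])

# 3. Write one space-separated key sequence per session, like hdfs_train.
with open('my_train', 'w') as f:
    for keys in sessions.values():
        f.write(' '.join(map(str, keys)) + '\n')
```

For anomaly detection you would also keep the EventId-to-key mapping and the per-block labels around, but this is the basic shape of it as far as I understand.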
I agree with your opinion; that's why I am confused about the format of the training data. I am also confused about why the author divides the log into lines, with each line having a different length. I don't think that is the correct format of the training data according to the paper. Do you have any idea?
@hzxGoForward I think there is a preprocessing step missing, which is, for each line (block/session), building sequences of the same length. I guess that is not the actual final input for training. My problem is that I don't get the same number of block lines: if I group by block in the first 100K log lines, I get a different number of sessions. Maybe I am extracting the wrong block ID from each line.
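For what it's worth, this is how I am counting sessions to sanity-check the block-id extraction (the regex is my guess at the HDFS block-id format):

```python
import re

block_pattern = re.compile(r'blk_-?\d+')

block_ids = set()
with open('HDFS.log') as f:  # raw HDFS log; only the first 100K lines
    for i, line in enumerate(f):
        if i >= 100_000:
            break
        block_ids.update(block_pattern.findall(line))

print(len(block_ids), 'distinct blocks / sessions')
```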
@williamceli Exactly, the actual final input for training needs to be padded to a fixed length, which is the hyperparameter window_size.
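In other words (my reading of it, not necessarily the exact code in this repo): each session line is turned into fixed-length windows of window_size keys with the following key as the label, and sessions shorter than that are padded, e.g.:

```python
def make_windows(session, window_size=10, pad_key=0):
    """Turn one session (list of log-key ids) into (window, next_key) pairs.

    Sessions shorter than window_size + 1 are padded with pad_key so that
    every input fed to the LSTM has the same length.
    """
    if len(session) < window_size + 1:
        session = session + [pad_key] * (window_size + 1 - len(session))
    pairs = []
    for i in range(len(session) - window_size):
        pairs.append((session[i:i + window_size], session[i + window_size]))
    return pairs

# make_windows([5, 5, 22, 11, 9, 26], window_size=4)
# -> [([5, 5, 22, 11], 9), ([5, 22, 11, 9], 26)]
```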
Maybe you can use the number of each log key extracted from the following dataset: http://iiis.tsinghua.edu.cn/~weixu/sospdata.html. DeepLog's authors cited this dataset, and it contains the log keys and their numbers.
@wuyifan18 @hzxGoForward: Can you describe the preprocessing, i.e., how you converted lines to numeric values (using the window_size hyperparameter or timestamps) for the LSTM?
We are referring to the OpenStack logs; for your reference, I have attached the log: https://github.com/logpai/logparser/blob/master/logs/OpenStack/OpenStack_2k.log
We are able to convert unstructured logs to structured logs using Spell or another log parser, but after that we are unable to feed the data to training. I understand that you convert it using the window_size hyperparameter. Can you add those details or some sample source code?
> Btw, @Athiq , the numbers are not TF-IDF. They are the ids of each different log type. So, a sequence of such numbers denotes the workflow of a specific task pattern. The hdfs_train file contains the workflows that were extracted from the raw log file of the normal execution.
@Athiq Hi, thanks for your response; it helps me a lot! I have something to verify: do you mean that I can verify my workflow-construction code against the hdfs_train file? Thank you so much!
@Athiq hi! I am going through the same issue, I have parsed the logs and I am clueless on how to convert them into numbers for processing. Were you able to find a solution?
@Athiq Hello, I have already obtained the template file and the templated log file, but how can I turn them into numeric sequence files, like the author's hdfs_train data? Do you have a way? I hope you can reply when you see this; it is very important to me. Thank you!
Do you have a script that converts the log files (HDFS files, text) to numbers?
https://github.com/wuyifan18/DeepLog/blob/master/data/hdfs_train
How did you get the above? Using Spell? After running the parser I still have text data; how did you convert it to numbers (vectors)? Do you have a script, and can you please upload it?
https://github.com/logpai/logparser/tree/master/logs/HDFS
Is this the data above converted to numbers?
Thanks in advance.