
hdfs_train sequence file doesn't correspond to the sequence file generated from the 100k structured file provided in the repository #21

Open
Zanis92 opened this issue Mar 25, 2021 · 21 comments

Comments

@Zanis92

Zanis92 commented Mar 25, 2021

Hi,

Can you kindly let me know how you got 4855 sequences in hdfs_train? I used your 'sample_hdfs.py' script to generate a sequence file from the 100k structured file provided by you, and it generates 7940 sequences. Any help would be highly appreciated.

Thanks

@zeinabfarhoudi

Hi @donglee-afar,
I have the same question: can you let me know how to get the sequences in hdfs_train? The sequences in the HDFS_sequence.csv file are different from hdfs_train.
Thanks for sharing your code.

@ZanisAli

ZanisAli commented May 7, 2021

There is a sample file in the code that generates the sequences specifically for the BGL dataset. You can use that, but group by Block IDs instead, since the anomaly_label.csv file for HDFS contains labels per Block ID.
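For illustration, here is a minimal sketch of that grouping. The file and column names (HDFS.log_structured.csv, Content, EventId, BlockId, Label) are assumptions based on the usual loghub HDFS dataset and Drain output, not the author's exact script:

```python
import re
from collections import defaultdict

import pandas as pd

# Assumed file names: the Drain-parsed structured log and the loghub label file.
struct_file = pd.read_csv("HDFS.log_structured.csv")
labels = pd.read_csv("anomaly_label.csv").set_index("BlockId")["Label"]

# Group the parsed event IDs into one sequence per HDFS block.
block_sequences = defaultdict(list)
for _, row in struct_file.iterrows():
    # Each HDFS log line mentions a block id such as blk_-1608999687919862906.
    block_id = re.search(r"blk_-?\d+", row["Content"]).group(0)
    block_sequences[block_id].append(row["EventId"])

# Keep only sessions labelled Normal, since DeepLog trains on normal data.
normal_sequences = {
    blk: seq for blk, seq in block_sequences.items()
    if labels.get(blk) == "Normal"
}
print(len(block_sequences), "sessions,", len(normal_sequences), "normal")
```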

@JinYang88

Same question.

@ZanisAli

@JinYang88, I understood that the file used for generating the sequence file is different from the one provided for HDFS. It was most probably generated by some template identification / log parsing technique, so there is no issue anymore. :-)

@JinYang88

@ZanisAli

I am sorry, I do not really understand that. I found that the training data can seriously affect the results; could you please explain how to get hdfs_train?

@ZanisAli

@JinYang88 True, the training data does affect the results a lot, but what I mean to say is that you don't need the exact same training data, because the author never said it came from the same structured file that is provided in the repository. Here is the script provided by the author to generate the data: https://github.com/donglee-afar/logdeep/blob/master/data/sampling_example/sample_hdfs.py

Moreover, after generating it, you can divide it however you like, e.g. 20% or 80% for training; it all depends on you.
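For example, a minimal sketch of such a split, assuming you already have the per-block normal sequences from a script like sample_hdfs.py (the 20% fraction and the variable names are arbitrary):

```python
import random

def split_sequences(sequences, train_fraction=0.2, seed=42):
    """Randomly split session sequences into a train set and a test set."""
    sequences = list(sequences)
    random.Random(seed).shuffle(sequences)
    cut = int(len(sequences) * train_fraction)
    return sequences[:cut], sequences[cut:]

# Hypothetical usage with the normal per-block sequences from the earlier sketch:
# train, test = split_sequences(normal_sequences.values(), train_fraction=0.2)
```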

@JinYang88

@ZanisAli I would like to use the semantic information for each template instead of only IDs, so I want to know how hdfs_train is obtained from the raw log data, in order to reconstruct the raw template for each ID. Do you have any clues?

@ZanisAli

@JinYang88 What I understand from you is that you want semantic information, so most probably you are talking about the event2semantic_vector.json file provided by the author, which is not used by DeepLog at all. For the event-to-semantic vector, the author provided code in Issue #3, and that will give you the semantic information about the templates. hdfs_train doesn't carry any semantic information; it doesn't care about the templates themselves, only about the IDs of the templates.

@JinYang88

@ZanisAli Thanks for your helpful advice!

I checked Issue #3, which is really what I want, but in the code the author provided, the mapping eventid2template.json is missing; it should be used to find the corresponding templates in hdfs_train.

@ZanisAli

@JinYang88 In the first code snippet, eventid2template.json can be found as an output file.

@JinYang88

@ZanisAli Many thanks!!!!!!

@JinYang88

@ZanisAli But the file templates.txt for HDFS is also missing.

@ZanisAli

@JinYang88 templates.txt contains the templates identified by a log parsing technique such as Drain. You can read the templates from the output of the log parser and write them to a txt file. There are many other things that are not provided by the author, like en_core_web_sm, which you can get from the spaCy library, or cc.en.300.vec, which you can get from https://fasttext.cc/docs/en/crawl-vectors.html. So, what I mean is that you need to research things a bit, as the author can't provide 4-5 GB of data in the GitHub repository :-). I hope that answers your question.
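For example, a minimal sketch of producing templates.txt from Drain's template output; the file name HDFS.log_templates.csv and the EventTemplate column are assumptions based on the usual loghub/Drain output:

```python
import pandas as pd

# Assumed Drain output: one row per parsed template.
templates = pd.read_csv("HDFS.log_templates.csv")

# Write one template per line; the line order then defines the template IDs
# used by the downstream scripts.
with open("templates.txt", "w") as f:
    for template in templates["EventTemplate"]:
        f.write(template + "\n")
```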

@JinYang88

@ZanisAli Thanks a lot for your help. I understand how to generate the templates myself, but the question is that I would like to get exactly the same template-to-ID mapping used by the author, because the position of a template in templates.txt determines the ID used in hdfs_train.

@ZanisAli

@JinYang88 As far as I know, the template IDs don't matter as long as they are used consistently. For example, you may want template T1 to get ID1, but in general, if you give T1 ID5 and T2 ID6 and so on, the actual numbers don't matter, because the anomaly detection technique doesn't care whether a template is ID1 or ID2. One caveat: if you start from ID5, you might need to change a lot of the implementation, because many things are hard-coded.

Coming back to your question, the author started from ID1, so you can use the mapping {item: i for i, item in enumerate(struct_file['EventId'].unique(), start=1)}; that way Template1 gets ID1, Template2 gets ID2, and so on.
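Put together, a minimal sketch of that mapping (the structured-file name is an assumption):

```python
import pandas as pd

struct_file = pd.read_csv("HDFS.log_structured.csv")

# IDs are assigned in order of first appearance, starting at 1,
# so the first template seen becomes ID 1, the second ID 2, and so on.
event_id_map = {
    item: i for i, item in enumerate(struct_file["EventId"].unique(), start=1)
}

# Map every parsed log line to its numeric template ID.
struct_file["EventNum"] = struct_file["EventId"].map(event_id_map)
```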

@JinYang88

@ZanisAli Great, thanks for your help.

BTW, do you happen to know where to download the full OpenStack dataset used in the original DeepLog paper? The link to Min Du's homepage does not work anymore.

@ZanisAli

ZanisAli commented May 12, 2021 via email

@JinYang88

@ZanisAli Yes it is, but the OpenStack data maintained in loghub is not the full version with more than 10M logs.

@ZanisAli

@JinYang88 According to them, this is the complete log: https://zenodo.org/record/3227177#.YJuhypMza3c

@JinYang88

@ZanisAli The OpenStack data at that link (only < 10 MB) is not the complete version.

@tongxiao-cs

The DeepLog paper says,
"DeepLog needs a small fraction of normal log entries to train its
model. In the case of HDFS log, only less than 1% of normal sessions
(4,855 sessions parsed from the first 100,000 log entries compared
to a total of 11,197,954) are used for training."

So I'm still wondering how to get the 4855 sequences in hdfs_train. Any ideas?
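One plausible reading of that setup, sketched below, is to take only the first 100,000 raw HDFS log lines, group them into per-block sessions, and keep the sessions labelled Normal in anomaly_label.csv. This is an assumption about the paper's procedure, not the authors' actual script, and the count may differ from 4,855 depending on the parser and on how sessions cut off at the 100,000-line boundary are handled:

```python
import re
from collections import defaultdict

import pandas as pd

labels = pd.read_csv("anomaly_label.csv").set_index("BlockId")["Label"]

# Collect sessions from the first 100,000 raw log lines only.
sessions = defaultdict(list)
with open("HDFS.log") as f:
    for i, line in enumerate(f):
        if i >= 100_000:
            break
        block_id = re.search(r"blk_-?\d+", line).group(0)
        sessions[block_id].append(line)

# Keep only sessions labelled Normal, since DeepLog trains on normal data only.
normal_sessions = [blk for blk in sessions if labels.get(blk) == "Normal"]
print(len(normal_sessions))  # may or may not reproduce the 4,855 figure exactly
```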
