hdfs_train sequence file doesn't correspond to the sequence file generated from the 100k structured file provided in the repository #21
Hi @donglee-afar,
There is a sample file in the code to generate the sequences, specifically for the BGL dataset. You can use that with block IDs instead, since the anomaly_label.csv file for HDFS contains labels per block ID.
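For concreteness, here is a minimal sketch of that block-ID grouping, assuming a loghub-style file layout (an HDFS.log_structured.csv parser output with Content and EventId columns, and anomaly_label.csv with BlockId and Label columns; those file and column names are assumptions, not something confirmed in this thread):

```python
import re
from collections import defaultdict

import pandas as pd

# Assumed inputs (loghub-style): a parsed/structured HDFS log and the
# per-block label file. File and column names are assumptions.
struct_log = pd.read_csv("HDFS.log_structured.csv")
labels = pd.read_csv("anomaly_label.csv").set_index("BlockId")["Label"]

# Group parsed events into one template-ID sequence per HDFS block ID.
block_seqs = defaultdict(list)
for content, event_id in zip(struct_log["Content"], struct_log["EventId"]):
    for blk in set(re.findall(r"blk_-?\d+", content)):
        block_seqs[blk].append(event_id)

# DeepLog trains on normal data only, so keep the blocks labeled "Normal".
normal_seqs = [seq for blk, seq in block_seqs.items()
               if labels.get(blk) == "Normal"]
```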
Same question.
@JinYang88, I understood that the file used for generating the sequence file is different from the one provided for HDFS. Most probably it was generated by some template identification / log parsing technique. So there is no issue anymore. :-)
I am sorry, I do not really understand that. I found that the training data can seriously affect the results. Could you please explain how to get hdfs_train?
@JinYang88 True, the training data does affect the results a lot, but what I mean is that you don't need the exact training data, because the author never said it came from the same structured file that is provided in the repository. Here is the author's script for generating the data: https://github.com/donglee-afar/logdeep/blob/master/data/sampling_example/sample_hdfs.py Moreover, after generating the sequences, you can split them however you wish, e.g. 20% or 80% for training; it all depends on you, and a split like that can be as simple as the sketch below.
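This is only an illustrative sketch (the split_sequences helper, the 80/20 ratio, and the fixed seed are arbitrary choices for the example, not the author's setup):

```python
import random

def split_sequences(seqs, train_ratio=0.8, seed=42):
    """Shuffle block-level sequences and split them into train and test.

    train_ratio and seed are illustrative defaults, not values taken
    from the author's experiments.
    """
    rng = random.Random(seed)
    seqs = list(seqs)
    rng.shuffle(seqs)
    cut = int(len(seqs) * train_ratio)
    return seqs[:cut], seqs[cut:]

# e.g. train_seqs, test_seqs = split_sequences(normal_seqs, train_ratio=0.8)
```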
@ZanisAli I would like to use the semantic information for each template instead of only IDs, so I want to know how hdfs_train was derived from the raw log data, in order to reconstruct the raw template for each ID. Do you have any clues?
@JinYang88 If you want semantic information, then most probably you are talking about the event2semantic_vector.json file that is provided by the author, which is not used by DeepLog at all. For the event2semantic vector, the author provided the code in Issue#3, and that will give you the semantic information about the templates. hdfs_train doesn't provide any semantic information, since DeepLog doesn't care about the templates themselves, only about the template IDs.
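To make that concrete: a DeepLog-style model consumes nothing but integer template IDs through a sliding window, as in this minimal sketch (a window size of 10 is a common default in DeepLog setups, but treat it here as an assumption):

```python
def sliding_windows(seq, window_size=10):
    """Yield (history, next_id) pairs from one sequence of template IDs.

    This is the only view of the data the model sees: integer IDs,
    no template text and no semantic vectors.
    """
    for i in range(len(seq) - window_size):
        yield seq[i:i + window_size], seq[i + window_size]

# Example: each window of 10 IDs is paired with the ID that follows it.
# pairs = list(sliding_windows([5, 5, 5, 22, 11, 9, 11, 9, 26, 26, 11], 10))
```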
@ZanisAli Thanks for your helpful advice! I checked Issue#3, which is really what I want, but in the code the author provided, the mapping eventid2template.json is missing, and it should be used to find the corresponding templates in hdfs_train.
@JinYang88 In the code 1,
@ZanisAli Many thanks!!!!!!
@ZanisAli But the file templates.txt for HDFS is also missing.
@JinYang88
@ZanisAli Really, thanks for your help. I understand how to generate the templates myself, but I would like to get exactly the same template-ID mapping used by the author, because the order of templates in templates.txt is the ID used in hdfs_train.
@JinYang88 As far as I know, the template IDs don't matter as long as they are used consistently in one context. For example, template T1 may get ID1, which is what you want, but in general, if you assign ID5 to T1, ID6 to T2, and so on, the numbering doesn't matter at all, because the anomaly detection technique doesn't care whether you gave ID1 or ID2. One caveat: if you start from ID5, you might need to change a lot of the implementation, because many things are hard-coded. Coming back to your question: the author started from ID1, so you can use your own mapping that does the same, as in the sketch below.
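A consistent 1-based renumbering can look like this (remap_ids is a hypothetical helper for illustration, not a function from the repository):

```python
def remap_ids(sequences, start=1):
    """Renumber template IDs into a contiguous range beginning at `start`.

    The concrete numbers are arbitrary; what matters is that each template
    keeps the same ID everywhere and that numbering starts where the
    (partly hard-coded) implementation expects it to, i.e. at 1.
    """
    mapping = {}
    remapped = []
    for seq in sequences:
        remapped.append(
            [mapping.setdefault(e, len(mapping) + start) for e in seq]
        )
    return remapped, mapping

# Example: remap_ids([["E5", "E22", "E5"], ["E22", "E11"]])
# -> ([[1, 2, 1], [2, 3]], {"E5": 1, "E22": 2, "E11": 3})
```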
@ZanisAli Great, thanks for your help. BTW, do you happen to know where to download the full OpenStack dataset used in the original DeepLog paper? The link to Min Du's homepage doesn't work anymore.
Hi,
They are all available in the LogHub git repository. You can just search for the name, and it will be among the first or second links.
@ZanisAli Yes it is, but the OpenStack data maintained in LogHub is not the full version with more than 10M logs.
@JinYang88 This is the one that, according to them, is the complete log: https://zenodo.org/record/3227177#.YJuhypMza3c
@ZanisAli The OpenStack data in this link (only < 10MB) is not the complete version.
The DeepLog paper says: [quoted passage not captured here]. So I'm still wondering how to get the 4855 sequences in hdfs_train. Any ideas?
Hi,
Can you kindly let me know how you got the 4855 sequences in hdfs_train? I used your sample_hdfs.py script to generate a sequence file from the 100k structured file provided by you, and it generated 7940 sequences. Any help would be highly appreciated.
Thanks