ransomware ds requirements #196

Merged Jun 28, 2022 · 4 commits
17 changes: 17 additions & 0 deletions models/README.md
@@ -148,3 +148,20 @@ An anomalous score of transactions indicates a probability score of being a fraud
- https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage
- https://github.com/rapidsai/clx/blob/branch-0.20/examples/forest_inference/xgboost_training.ipynb
- Rafaël Van Belle, Charles Van Damme, Hendrik Tytgat, Jochen De Weerdt, Inductive Graph Representation Learning for fraud detection (https://www.sciencedirect.com/science/article/abs/pii/S0957417421017449)

## Ransomware Detection via AppShield
### Model Overview
This model shows an application of DOCA AppShield, using data from volatile memory to classify processes as ransomware or benign. The model applies a sliding window over time and feeds the derived features into one of several random forest classifiers, chosen by window length depending on the amount of data collected.
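The sliding-window feature derivation can be sketched as follows. This is a minimal illustration in pure Python, assuming a hypothetical `handle_count` counter per snapshot; the real features come from Volatility plugins in DOCA AppShield.

```python
from collections import deque

def window_features(snapshots, window_len):
    """Derive aggregate features over a sliding window of process snapshots.

    `snapshots` is a list of dicts; `handle_count` is a hypothetical
    stand-in for the per-process counters AppShield actually collects.
    """
    buf = deque(maxlen=window_len)
    features = []
    for snap in snapshots:
        buf.append(snap["handle_count"])
        features.append({
            "n_snapshots": len(buf),            # how much of the window is filled
            "handle_mean": sum(buf) / len(buf),  # mean over the window
            "handle_max": max(buf),              # max over the window
        })
    return features
```

Each new snapshot evicts the oldest one once the window is full, so the derived features always reflect the most recent `window_len` observations.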
### Model Architecture
The model uses input from Volatility plugins in DOCA AppShield to aggregate and derive features over snapshots in time. The features are used as input into three random forest binary classifiers.
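The three-classifier arrangement can be sketched with scikit-learn on synthetic data. The specific window lengths, feature dimensions, and hyperparameters below are illustrative assumptions, not the shipped model's values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One binary classifier per window length; these lengths are assumptions.
WINDOW_LENGTHS = (5, 10, 20)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # stand-in for derived AppShield features
y = rng.integers(0, 2, size=200)     # 1 = ransomware, 0 = benign (synthetic)

classifiers = {
    n: RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    for n in WINDOW_LENGTHS
}

def score(features, n_snapshots):
    """Score one process using the classifier for the longest filled window."""
    usable = [n for n in WINDOW_LENGTHS if n <= n_snapshots] or [min(WINDOW_LENGTHS)]
    clf = classifiers[max(usable)]
    return clf.predict_proba([features])[0, 1]  # probability of ransomware
```

Routing by window length lets a short-window classifier give an early verdict while longer windows accumulate more evidence.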
### Training
Training data consists of 87968 labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities.
### How To Use This Model
Combined with host data from DOCA AppShield, this model can be used to detect ransomware. A training notebook is also included so that users can update the model as more labeled data is collected.
#### Input
Snapshots collected from DOCA AppShield
#### Output
For each process_id and snapshot there is a probability score between 0 and 1, where 1 is ransomware and 0 is benign.
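A downstream consumer might turn these per-process, per-snapshot probabilities into records like the sketch below. The 0.5 threshold and field names are illustrative assumptions, not part of the model.

```python
def label_scores(scores, threshold=0.5):
    """Attach a verdict to per-(process_id, snapshot) probabilities.

    `scores` maps (process_id, snapshot_id) -> ransomware probability;
    the threshold is an illustrative default, not the shipped one.
    """
    return [
        {"process_id": pid, "snapshot": snap, "prob": p,
         "verdict": "ransomware" if p >= threshold else "benign"}
        for (pid, snap), p in sorted(scores.items())
    ]
```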
### References
- Cohen, A., & Nissim, N. (2018). Trusted detection of ransomware in a private cloud using machine learning methods leveraging meta-features from volatile memory. Expert Systems With Applications. (https://www.sciencedirect.com/science/article/abs/pii/S0957417418301283)
- https://developer.nvidia.com/networking/doca
3 changes: 3 additions & 0 deletions models/datasets/training-data/ransomware-training-data.csv
Git LFS file not shown
3 changes: 2 additions & 1 deletion models/model-information.csv
@@ -4,4 +4,5 @@ phishing-bert-20211006.onnx,phishing-detection,Gorkem Batmaz,0.1.0,Phishing email
sid-minibert-20211021.onnx,sensitive-information-detection,Rachel Allen,0.2.0,"SID is a classifier, designed to detect sensitive information (e.g., AWS credentials, GitHub credentials) in unencrypted data. This example model classifies text containing these 10 categories of sensitive information- address, bank account, credit card number, email address, government id number, full name, password, phone number, secret keys, and usernames.",Compact BERT-mini transformer model,Training consisted of fine-tuning the original pretrained [model from google](https://huggingface.co/google/bert_uncased_L-4_H-256_A-4). The labeled training dataset is 2 million synthetic pcap payloads generated using the [faker package](https://github.com/joke2k/faker) to mimic sensitive and benign data found in nested jsons from web APIs and environmental variables.,This model is an example of customized transformer-based sensitive information detection. It can be further fine-tuned for specific detection needs or retrained for alternative categorizations using the fine-tuning scripts in the repo.,English text from PCAP payloads,Multi-label sequence classification for 10 sensitive information categories,This model version is designed for english language text data. It may not perform well on other languages.,N/A,"Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, 2019, https://arxiv.org/abs/1908.08962",1,32,V100,0.96,0.96,0.9875,43MB,N/A,bert-base-uncased,256,64,TRUE,FALSE,11,1.8,3.8.10,18.04.5 LTS,4.5
hammah-user123-20211017.pkl and hammah-role-g-20211017.pkl,digital-fingerprinting/ humans-as-machines,Gorkem Batmaz,0.1.0,This model is one example of an Autoencoder trained from a baseline for benign activity from synthetic `user-123` and `role-g`. This model combined with validation data from Morpheus examples can be used to test the HAMMAH Morpheus pipeline. It has little utility outside of testing.,"The model is an ensemble of an Autoencoder and a fast Fourier transform reconstruction. The reconstruction loss of new log data through the trained Autoencoder is used as an anomaly score. Concurrently, the timestamps of user/entity activity are used for a time series analysis to flag activity with poor reconstruction after a fast Fourier transform.",,This model is one example of an Autoencoder trained from a baseline for benign activity from synthetic `user-123` and `role-g`. This model combined with validation data from Morpheus examples can be used to test the HAMMAH Morpheus pipeline. It has little utility outside of testing.,aws-cloudtrail logs,"Anomalous score of Autoencoder, Binary classification of time series anomaly detection",This particular model is an example based on a synthetic users baseline behavior. Use on other datasets will require retraining.,N/A,https://github.com/AlliedToasters/dfencoder/blob/master/dfencoder/autoencoder.py https://github.com/rapidsai/clx/blob/branch-22.04/notebooks/anomaly_detection/FFT_Outlier_Detection.ipynb Rasheed Peng Alhajj Rokne Jon: Fourier Transform Based Spatial Outlier Mining 2009 - https://link.springer.com/chapter/10.1007/978-3-642-04394-9_39,25,,V100,1,1,1,3MB and 9MB,"ae=4, ts=4",N/A,N/A,N/A,N/A,N/A,11,1.7.1,3.8.10,18.04.5 LTS,N/A
hinsage-model.pt and xgb.pth,fraud-detection,Tad Zemicheal,0.1.0,"This model shows an application of a graph neural network for fraud detection in a credit card transaction graph. A transaction dataset that includes three types of nodes, transaction, client, and merchant nodes is used for modeling. A combination of `GraphSAGE` along `XGBoost` is used to identify frauds in the transaction networks.","It uses a bipartite heterogeneous graph representation as input for `GraphSAGE` for feature learning and `XGBoost` as a classifier. Since the input graph is heterogenous, a heterogeneous implementation of `GraphSAGE` (HinSAGE) is used for feature embedding.",This model is an example of a fraud detection pipeline using a graph neural network and gradient boosting trees. This can be further retrained or fine-tuned to be used for similar types of transaction networks with similar graph structures.,This model is an example of a fraud detection pipeline using a graph neural network and gradient boosting trees. This can be further retrained or fine-tuned to be used for similar types of transaction networks with similar graph structures.,"Transaction data with nodes including transaction, client, and merchant.",An anomalous score of transactions indicates a probability score of being a fraud.,These particular model files are based on a synthetic transaction graph. Use with other datasets will require retraining.,N/A,"https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage https://github.com/rapidsai/clx/blob/branch-0.20/examples/forest_inference/xgboost_training.ipynb [Rafaël Van Belle, Charles Van Damme, Hendrik Tytgat, Jochen De Weerdt, Inductive Graph Representation Learning for fraud detection](https://www.sciencedirect.com/science/article/abs/pii/S0957417421017449)",30,5,V100,NA,0.96,NA,756KB,N/A and 0.5,N/A,N/A,N/A,N/A,N/A,11.0/11.4,1.9.1,3.8.10,18.04.5 LTS,N/A
log-parsing-20220418.onnx,log-parsing,Rachel Allen,0.1.0,"This model is an example of using Named Entity Recognition (NER) for log parsing, specifically apache web logs.",bert-base-cased transformer model,Training consisted of fine-tuning the original pretrained [model from google](https://huggingface.co/bert-base-cased). The labeled training dataset is 1000 parsed apache web logs from a public dataset [logpai](https://github.com/logpai/loghub),This model is one example of a BERT-model trained to parse raw logs. It can be used to parse apache web logs or retrained to parse other types of logs as well. The model file has a corresponding config.json file with the names of the fields it parses.,raw apache web logs,parsed apache web log as jsonlines,This model version is designed for english language text data. It may not perform well on other languages.,N/A,[1](https://arxiv.org/abs/1810.04805) [2](https://medium.com/rapids-ai/cybert-28b35a4c81c4) [3](https://www.splunk.com/en_us/blog/it/how-splunk-is-parsing-machine-logs-with-machine-learning-on-nvidia-s-triton-and-morpheus.html),2,32,V100,0.99,0.99,0.999,431MB,N/A,bert-base-cased,256,64,FALSE,FALSE,11,1.9.1,3.8.10,18.04.5 LTS,4.18
ransomw-model-short-rf-20220126.sav,ransomware-detection,Haim Elisha,0.1.0,This model detects ransomware from host volatile memory features collected from DOCA AppShield,Binary random forest classifier,Training data consists of 87968 labeled AppShield processes from 32 snapshots collected from 256 unique benign and ransomware activities.,"Combined with host data from DOCA AppShield, this model can be used to detect ransomware. A training notebook is also included so that users can update the model as more labeled data is collected.",Snapshots collected from DOCA AppShield,"For each process_id and snapshot there is a probability score between 0 and 1, where 1 is ransomware and 0 is benign.",This model was trained in the lab on Windows machines,N/A,"Cohen, A., & Nissim, N. (2018). Trusted detection of ransomware in a private cloud using machine learning methods leveraging meta-features from volatile memory. Expert Systems With Applications. (https://www.sciencedirect.com/science/article/abs/pii/S0957417418301283)",,,V100,recall= 0.9,,,946KB,N/A,,,,,,,,,,
Binary file not shown.