|
Prototype-001 is a proof-of-concept system combining cybersecurity / SIEM / ML insights into an end-to-end pipeline.
It’s designed for SOC analysts, threat researchers, or ML engineers who want to explore data-driven security event detection and anomaly analysis.
Prototype-002 introduces a HybridSecurityModel designed to perform three key security analyses in one: anomaly detection, classification, and time-series analysis. This prototype can handle both numeric and log-based data, making it a versatile tool for security monitoring.
Motivation:
- Many SIEM-related ML prototypes lack modularity, visualization, or reproducibility.
- Prototype-001 aims to provide a clean foundation: data loading, feature extraction, model training, evaluation & visualization — all in one Jupyter notebook.
- Multi-Faceted Analysis: It's a base model that can perform anomaly detection, classification, and time-series analysis simultaneously in Prototype_002.
- Hybrid Data Support: The model includes separate pipelines for analyzing both numeric data and log data.
- Custom Classifier: It features a SimpleLogisticRegression classifier built from scratch.
- Rule-Based Logic: The prototype uses rule-based logic for tasks like flagging rare events in logs as anomalies and classifying log content based on keywords.
Key Topics / Tags: training
· logs
· ml
· cybersecurity
· siem
· socs
- 🚧 Modular pipeline: Data ingestion → Feature extraction → Model training → Evaluation
- 📊 Rich visualizations (ROC curves, confusion matrices, feature importances)
- 📂 Flexible log format support (plug your own datasets)
- 🔧 Hyperparameter tuning (grid search or alternatives)
- ✅ Reproducible experiment tracking (seeds, logging)
- 🧪 Notebook-first demonstration (interactive, exploratory)
[Raw Logs / Events] ↓ [Preprocessing / Parsing] ↓ [Feature Extraction / Aggregation] ↓ [Train / Validate / Test Split] ↓ [Model(s)] ↓ [Evaluation & Visualization]
- Data Ingestion → Parsers for CSV/JSON/syslog
- Feature Extraction → Time windows, counts, embeddings
- Models → Random Forest, XGBoost, Neural Nets, anomaly detectors
- Evaluation → Precision, Recall, F1, ROC-AUC
- Visualization → Matplotlib / Seaborn / Plotly plots
- Python 3.8+
pip
orconda
- (Optional) GPU if training deep models
git clone https://github.com/CyberMetrics/Prototype-001.git
cd Prototype-001
🧪 Usage / Demo
Notebook Mode
bash
Copy code
jupyter notebook Prototype_001.ipynb
Walk through the notebook:
Data Loading & Preprocessing
Feature Engineering
Train / Validation / Test Split
Model Training & Evaluation
Visualizations & Analysis
Script Mode (if modularized)
bash
Copy code
python run.py --input path/to/logs.json --model random_forest --output results/
📊 Experiments & Results
Model Precision Recall F1-Score ROC-AUC
Random Forest 0.85 0.78 0.81 0.92
XGBoost 0.87 0.80 0.83 0.94
Neural Net 0.82 0.75 0.78 0.91
Sample Visualizations
ROC curves for multiple models
Confusion matrices
Feature importance plots (e.g., SHAP)
Time series anomaly detection charts
Insights:
Random Forest provided stable results across datasets.
Feature X showed highest importance.
Future work: reduce false positives, test deep anomaly detection.
###📂 Project Structure
Prototype-001/
├── LICENSE
├── README.md
└── Prototype_001.ipynb ← main demo notebook
🤝 Contributing
Contributions are welcome!
Fork this repository
Create a feature branch (git checkout -b feature/your-feature)
Commit changes (git commit -m "Add new feature")
Push and open a Pull Request
Ideas:
Add support for more log formats (Zeek, Suricata, syslog)
Add deep models (autoencoders, LSTMs)
Integrate MLflow or W&B for experiment tracking
Improve visualization dashboards (Plotly, Streamlit)
##📜 License This project is licensed under the MIT License — see the LICENSE file.
##🙏 Acknowledgements
Open-source libraries: NumPy, Pandas, scikit-learn, Matplotlib, Seaborn
Inspired by research in ML for SIEM / anomaly detection
Thanks to the open-source community for tools & feedback