A Graph Neural Network (GNN) implementation for detecting illicit Bitcoin transactions using the Elliptic dataset from Kaggle. The project uses PyTorch Geometric and PyTorch Lightning to train a GCN-based classifier on the transaction graph, then scores unlabeled transactions to flag those that are potentially illicit.
- Graph Neural Network Architecture: Multi-layer GCN with batch normalization and dropout
- Automated Data Loading: Downloads Elliptic dataset from Kaggle automatically
- Model Training: PyTorch Lightning integration with callbacks and logging
- Anomaly Detection: Scores transactions by predicted illicit probability to flag suspicious behavior
- Evaluation Metrics: ROC-AUC, classification reports, and confusion matrices
- Suspicious Node Detection: Identifies unknown transactions likely to be illicit through anomaly scoring
- Visualization: t-SNE embeddings and subgraph visualization around suspicious nodes
- Model Checkpointing: Automatic saving of best performing models
The Elliptic Dataset contains Bitcoin transaction data with:
- 203,769 transactions (nodes)
- 234,355 directed edges representing Bitcoin flows
- 166 anonymized node features: 94 local transaction features (including the time step) and 72 features aggregated from neighboring transactions
- Labels: Illicit (1), Licit (2), or Unknown
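As a rough illustration of how this structure might be turned into model inputs, the sketch below maps the raw class labels to integer targets and remaps transaction IDs to contiguous node indices. The column names (txId, class, txId1, txId2) and the target encoding (1 = illicit, 0 = licit, -1 = unknown) are assumptions for this example and may differ from what the project uses internally.

```python
import csv
import io

# Dataset convention: "1" = illicit, "2" = licit, "unknown" = unlabeled.
# The integer targets (1, 0, -1) are an assumption made for this sketch.
LABEL_MAP = {"1": 1, "2": 0, "unknown": -1}


def read_classes(handle):
    """Return {txId: target} from a CSV with columns txId,class."""
    reader = csv.DictReader(handle)
    return {row["txId"]: LABEL_MAP[row["class"]] for row in reader}


def read_edges(handle, node_index):
    """Return (source_idx, target_idx) pairs, skipping edges with unseen txIds."""
    reader = csv.DictReader(handle)
    edges = []
    for row in reader:
        src, dst = row["txId1"], row["txId2"]
        if src in node_index and dst in node_index:
            edges.append((node_index[src], node_index[dst]))
    return edges


# Tiny in-memory stand-ins for the real CSV files.
classes_csv = io.StringIO("txId,class\n1001,unknown\n1002,2\n1003,1\n")
edges_csv = io.StringIO("txId1,txId2\n1001,1002\n1002,1003\n")

labels = read_classes(classes_csv)
# Map raw transaction IDs to contiguous node indices 0..N-1.
node_index = {tx_id: i for i, tx_id in enumerate(labels)}
edges = read_edges(edges_csv, node_index)
targets = [labels[tx_id] for tx_id in node_index]
```

Nodes with target -1 would be excluded from the training loss but kept in the graph, since their edges still carry information for their labeled neighbors.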
See requirements.txt for full dependencies. Key packages:
- PyTorch & PyTorch Geometric
- PyTorch Lightning
- Pandas, NumPy, Scikit-learn
- NetworkX, Matplotlib
- Kaggle API (for automatic dataset download)
- Clone the repository:
git clone https://github.com/yourusername/Kaggle_Elliptic_Dataset.git
cd Kaggle_Elliptic_Dataset
- Install dependencies:
pip install -r requirements.txt
- Set up Kaggle API credentials (for automatic dataset download):
  - Install the Kaggle API:
    pip install kaggle
  - Follow the Kaggle API setup instructions
  - Accept the dataset terms on the Elliptic Dataset page
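Before relying on the automatic download, it can help to confirm that credentials are in place. The Kaggle API client reads them either from the KAGGLE_USERNAME and KAGGLE_KEY environment variables or from a kaggle.json file (by default in ~/.kaggle, or in the directory named by KAGGLE_CONFIG_DIR). The helper below is a hypothetical pre-flight check and is not part of this project.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle credentials are found in env vars or kaggle.json.

    Hypothetical helper: `env` and `config_dir` exist only to make the
    check easy to test; by default the real environment is inspected.
    """
    env = os.environ if env is None else env
    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the config directory.
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR") or Path.home() / ".kaggle"
    return (Path(config_dir) / "kaggle.json").is_file()
```

If the check returns False, the dataset files can still be downloaded manually and placed in the data directory.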
from kaggle_elliptic_dataset import BitcoinIllicitGNNDetector
# Initialize detector
detector = BitcoinIllicitGNNDetector(hidden_dim=128, learning_rate=0.001)
# Load data (downloads automatically if needed)
detector.load_elliptic_data(data_dir='./elliptic_data')
# Prepare graph data
detector.prepare_graph_data()
# Train model
detector.train_model(max_epochs=200, patience=20)
# Evaluate on test set
y_true, y_pred, y_prob = detector.evaluate_model()
# Find suspicious unknown nodes through anomaly detection
suspicious_nodes = detector.identify_suspicious_unknown_nodes(threshold=0.7, top_k=50)
# Analyze learned embeddings
embeddings, embeddings_2d = detector.analyze_node_embeddings()
# Visualize suspicious node neighborhoods
detector.visualize_suspicious_subgraph(suspicious_nodes[0]['txId'], num_hops=2)
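The threshold and top_k arguments above suggest a filter-then-rank step over the predicted illicit probabilities of unlabeled transactions. The sketch below shows one plausible version of that selection logic. It is an assumption about the behavior, not the project's actual implementation, and the illicit_prob key is a hypothetical name (only txId appears in the usage above).

```python
def select_suspicious(scores, threshold=0.7, top_k=50):
    """Pick the highest-scoring transactions at or above a probability threshold.

    scores: dict mapping txId -> predicted illicit probability.
    Returns a list of dicts sorted by probability, highest first.
    """
    flagged = [
        {"txId": tx_id, "illicit_prob": prob}
        for tx_id, prob in scores.items()
        if prob >= threshold
    ]
    # Rank the flagged transactions so the riskiest come first.
    flagged.sort(key=lambda item: item["illicit_prob"], reverse=True)
    return flagged[:top_k]


example_scores = {"a": 0.90, "b": 0.50, "c": 0.75, "d": 0.95}
top_two = select_suspicious(example_scores, threshold=0.7, top_k=2)
```

Raising the threshold trades recall for precision, while top_k simply caps how many candidates are passed on for manual review or visualization.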
The GNN model consists of:
- 3 GCN layers with batch normalization and ReLU activation
- Dropout layers for regularization
- Fully connected layers for final classification
- Adam optimizer with learning rate scheduling
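To make the role of a GCN layer concrete, here is a minimal dense sketch of a single propagation step, ReLU(D^-1/2 (A + I) D^-1/2 X W), in plain Python. It assumes a small undirected graph given as an adjacency matrix and leaves out batch normalization, dropout, and bias terms. It is not the project's GNNModel, which would typically build on PyTorch Geometric's GCNConv layer instead.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W) on dense lists."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees of the self-looped graph (always >= 1).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization keeps feature scales comparable across degrees.
    norm = [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    aggregated = matmul(norm, features)   # mix each node with its neighbors
    projected = matmul(aggregated, weight)  # learned linear projection
    return [[max(0.0, v) for v in row] for row in projected]


# Two connected nodes with one feature each and an identity weight.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
hidden = gcn_layer(adjacency, node_features, [[1.0]])
```

Stacking three such layers lets each transaction's representation depend on neighbors up to three hops away, which is why deeper stacks can capture longer transaction chains.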
The model achieves:
- ROC-AUC: ~0.81 on test set
- Precision/Recall: Balanced performance on both classes
- Anomaly Detection: Identifies high-risk unknown transactions through pattern analysis
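For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen illicit transaction receives a higher score than a randomly chosen licit one. The sketch below computes it directly from that definition on made-up toy data. It is illustrative only and unrelated to the reported result; in practice scikit-learn's roc_auc_score would be the usual choice.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(P * N) time, which is fine for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (1 = illicit, 0 = licit) and scores, not real model output.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the metric depends only on the ranking of scores, it is unaffected by the classification threshold and is less sensitive to class imbalance than accuracy.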
├── kaggle_elliptic_dataset.py # Main implementation
├── elliptic_data/ # Dataset directory
│ ├── elliptic_txs_features.csv
│ ├── elliptic_txs_classes.csv
│ └── elliptic_txs_edgelist.csv
├── lightning_logs/ # Training logs and checkpoints
├── requirements.txt # Python dependencies
└── README.md # This file
- BitcoinIllicitGNNDetector: Main wrapper class for the entire pipeline
- BitcoinGNNDetector: PyTorch Lightning module for training
- GNNModel: Core GNN architecture with GCN layers
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code in your research, please cite:
@dataset{elliptic_dataset,
title={The Elliptic Data Set},
author={Weber, Mark and Domeniconi, Giacomo and Chen, Jie and Weidele, Daniel Karl I. and Bellei, Claudio and Robinson, Tom and Leiserson, Charles E.},
url={https://www.kaggle.com/ellipticco/elliptic-data-set},
year={2019}
}
- Elliptic for providing the dataset
- PyTorch Geometric team for the excellent GNN library
- PyTorch Lightning for the training framework