Streamline-DAQ is a distributed system for real-time data acquisition, processing, and monitoring. It enables efficient ingestion and analysis of high-frequency data streams from multiple sources, making it well suited to scientific experiments, IoT systems, and other high-performance environments. The project was created to demonstrate expertise in distributed systems, real-time data processing, and monitoring for scalable, fault-tolerant operations.
The project integrates ATLAS Top Tagging Open Data from the CERN Open Data portal to demonstrate its capabilities on high-energy physics datasets. Features include:
- Real-time ingestion using Apache Kafka (a producer sketch follows this list).
- Preprocessing and feature engineering for scientific analysis.
- Scalable storage with PostgreSQL and MongoDB.
- Monitoring of data ingestion rates and tagging accuracy in Grafana.
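As a rough illustration of the ingestion side, the sketch below streams synthetic ATLAS-style jet records into Kafka using the kafka-python client. The topic name (`atlas-top-tagging`) and the event schema are placeholders, not the project's actual configuration:

```python
import json
import random
import time

from kafka import KafkaProducer

# JSON-serialize each event before sending it to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Hypothetical jet record with a few kinematic features; the real
    # ATLAS Top Tagging samples carry far richer constituent-level data.
    event = {
        "timestamp": time.time(),
        "jet_pt": random.uniform(350.0, 3000.0),  # GeV
        "jet_eta": random.uniform(-2.0, 2.0),
        "jet_mass": random.uniform(50.0, 300.0),  # GeV
        "is_top": random.random() < 0.5,          # truth label
    }
    producer.send("atlas-top-tagging", event)
    time.sleep(0.001)  # ~1,000 events/s per producer
```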
- Real-Time Data Acquisition: Simulates high-frequency data sources using Apache Kafka for distributed message brokering.
- Processing Pipelines: Processes time-series data with Python libraries (Pandas, NumPy), ensuring high throughput and minimal latency.
- Monitoring Dashboard: Provides actionable insights through real-time visualization of performance metrics (e.g., data rates, errors) using Grafana.
- Scalable Storage: Implements structured and unstructured data storage using PostgreSQL and MongoDB.
- Fault-Tolerant Architecture: Includes error recovery and retry mechanisms for uninterrupted data flow (a retry sketch follows this list).
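The repository's actual recovery code is not reproduced here, but a retry-with-exponential-backoff wrapper of roughly this shape is a common way to implement such fault tolerance; `with_retries` and `write_batch` are hypothetical names used only for illustration:

```python
import functools
import time

def with_retries(max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up after the final attempt
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

@with_retries(max_attempts=3)
def write_batch(rows):
    """Hypothetical sink, e.g. an INSERT into PostgreSQL that may
    fail transiently while the database restarts."""
    ...
```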
- Data Sources: Simulate IoT devices, sensors, or scientific instruments generating high-frequency data streams.
- Data Ingestion: Apache Kafka ensures reliable, distributed ingestion of time-series data.
- Processing Layer: Python scripts process, clean, and transform data for downstream tasks (an end-to-end sketch follows this list).
- Storage Layer: Stores processed data in PostgreSQL or MongoDB for querying and analysis.
- Monitoring Layer: Visualizes system health and performance metrics in Grafana.
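To make the layering concrete, here is a minimal sketch of how ingestion, processing, and storage could fit together: consume from Kafka, clean a micro-batch with Pandas, and persist it to PostgreSQL. The topic, table, and column names are assumptions, not the project's real schema:

```python
import json

import pandas as pd
import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "atlas-top-tagging",                     # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
conn = psycopg2.connect("dbname=streamline user=postgres host=localhost")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:                    # process in micro-batches
        df = pd.DataFrame(batch).dropna()    # basic cleaning step
        with conn, conn.cursor() as cur:     # one commit per batch
            cur.executemany(
                "INSERT INTO processed_data (ts, jet_pt, jet_eta) "
                "VALUES (%s, %s, %s)",
                list(df[["timestamp", "jet_pt", "jet_eta"]]
                     .itertuples(index=False, name=None)),
            )
        batch.clear()
```

Committing once per micro-batch rather than per event is one way to keep the insert path fast enough for high-frequency streams.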
- Scientific Data Systems: Real-time acquisition and analysis of experimental data.
- IoT Monitoring: Managing and visualizing high-frequency sensor data from IoT devices.
- Operational Monitoring: Tracking system performance metrics in distributed environments.
- Operating System: Linux/macOS/Windows
- Languages: Python 3.9+
- Tools: Apache Kafka, PostgreSQL, Grafana, Docker
1. Clone the repository:

   ```bash
   git clone https://github.com/danigallegdup/streamline-daq.git
   cd streamline-daq
   ```

2. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Start services with Docker (a quick broker check is sketched after these steps):

   ```bash
   docker-compose up
   ```

4. Access the Grafana dashboard at http://localhost:3000 (default credentials: admin/admin).
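Before running the pipeline, you can confirm that the broker started by docker-compose is reachable. This check assumes the kafka-python client and the default localhost:9092 listener; adjust it if your compose file maps a different port:

```python
from kafka import KafkaAdminClient

# Lists the broker's topics; a connection error here means Kafka
# is not reachable yet.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
print(admin.list_topics())
admin.close()
```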
1. Simulate Data Sources:

   - Run the `data_generator.py` script to simulate multiple high-frequency data streams:

     ```bash
     python scripts/data_generator.py
     ```

2. Monitor System Performance:

   - Access Grafana to view throughput, latency, and error rates.

3. Query Processed Data:

   - Use SQL to query the PostgreSQL database for analysis (a Python follow-up is sketched after these steps):

     ```sql
     SELECT * FROM processed_data WHERE error_rate > 0.05;
     ```
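To pull the same query into Python for further analysis, something like the following should work; the database name and credentials are assumptions and should match your compose configuration:

```python
import pandas as pd
import psycopg2

# Connection parameters are illustrative; use the values from your setup.
conn = psycopg2.connect("dbname=streamline user=postgres host=localhost")
df = pd.read_sql("SELECT * FROM processed_data WHERE error_rate > 0.05;", conn)
print(df.describe())
conn.close()
```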
- Python: For data processing and pipeline development (Pandas, NumPy).
- Apache Kafka: Distributed message brokering for reliable data ingestion.
- PostgreSQL & MongoDB: For structured and unstructured data storage.
- Grafana: For real-time monitoring and visualization.
- Docker: For containerized service management.
- High Throughput: Handles over 10,000 events per second with low latency.
- Scalability: Supports dynamic scaling for increased data loads.
- Reliability: Fault-tolerant architecture ensures uninterrupted data flow.
- Implement machine learning models for real-time anomaly detection (a simple baseline is sketched below).
- Extend monitoring dashboards with predictive analytics.
- Add integration with cloud storage solutions for long-term data archiving.
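For the anomaly-detection item, one simple baseline (not the planned model, just a sketch) is a rolling z-score over a streaming metric: flag any value that sits far from the recent mean:

```python
import statistics
from collections import deque

def is_anomalous(value, history, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from
    the mean of the recent window."""
    if len(history) < 3:  # need a minimal baseline first
        return False
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history) or 1e-9  # avoid division by zero
    return abs(value - mu) / sigma > threshold

window = deque(maxlen=100)
for rate in [100, 102, 98, 101, 500]:  # the last value should be flagged
    if is_anomalous(rate, window):
        print(f"anomalous ingestion rate: {rate}")
    window.append(rate)
```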
Contributions are welcome! Please submit a pull request or open an issue for any suggestions or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.