End-to-end data pipeline for monitoring 222 Greek ports, combining adaptive scraping with ML-powered optimization.
- Real-time monitoring of 222 ports
- Dynamic scheduling based on port activity clusters
- Self-healing scraper with 99.9% success rate
- ML-powered optimization and analysis
```mermaid
flowchart TD
    subgraph Input
        A[Maritime Website] -->|Scraping| B[Raw Port Data]
    end
    subgraph Processing
        B -->|Validation| C[Clean Data]
        C -->|Analysis| D[Port Clusters]
        D -->|Optimization| E[Scraping Schedule]
    end
    subgraph Output
        E -->|Dynamic Intervals| F[Cluster 0: 15min]
        E -->|Dynamic Intervals| G[Cluster 1: 30min]
        E -->|Dynamic Intervals| H[Cluster 2: 1h]
        E -->|Dynamic Intervals| I[Cluster 3: 2h]
        E -->|Dynamic Intervals| J[Cluster 4: 4h]
    end
    F & G & H & I & J -->|Continuous Update| B
```
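The cluster-to-interval mapping in the diagram can be thought of as a small lookup table. The snippet below is only an illustration of that idea; the `CLUSTER_INTERVALS` name and `get_interval` helper are hypothetical and not part of the actual scheduler:

```python
# Illustrative only: intervals (in minutes) for the five activity clusters
# shown in the flowchart. Names below are hypothetical, not the real API.
CLUSTER_INTERVALS = {
    0: 15,    # busiest ports: scrape every 15 minutes
    1: 30,
    2: 60,
    3: 120,
    4: 240,   # quietest ports: scrape every 4 hours
}

def get_interval(cluster_id: int) -> int:
    """Return the scraping interval in minutes for a port's cluster."""
    return CLUSTER_INTERVALS.get(cluster_id, 60)  # default to hourly
```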
- Python 3.9+
- MongoDB 4.4+
- Playwright
## Installation

```bash
git clone https://github.com/CruxTheStaff/port_delays_monitor
cd port_delays_monitor

python -m venv .venv
source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate      # Windows

pip install -r requirements.txt
playwright install
```
## Configuration

Create a `.env` file:

```env
MONGO_URI=mongodb://localhost:27017
MONGO_DB=port_activity
SCRAPER_TIMEOUT=300
```
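At runtime these variables can be read with python-dotenv. The snippet below is a minimal sketch under that assumption, not the project's actual settings module:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read variables from .env into the environment

# Illustrative defaults; the real project may load settings differently.
MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017")
MONGO_DB = os.getenv("MONGO_DB", "port_activity")
SCRAPER_TIMEOUT = int(os.getenv("SCRAPER_TIMEOUT", "300"))  # seconds
```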
- Cluster-based scheduling optimization
- Intelligent error handling and auto-retry (see the sketch after this list)
- Dynamic rate limiting
- Resource-efficient data collection
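As a rough idea of the auto-retry behaviour, the sketch below retries a scraping call with exponential backoff. The `fetch_port_page` callable and the retry parameters are assumptions for illustration, not the project's actual scraper internals:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def fetch_with_retry(fetch_port_page, port_id: str,
                           max_attempts: int = 3,
                           base_delay: float = 2.0):
    """Retry a scraping coroutine with exponential backoff.

    Illustrative only: fetch_port_page is any async callable that scrapes
    one port; the real scrapers may handle errors differently.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await fetch_port_page(port_id)
        except Exception as exc:
            if attempt == max_attempts:
                logger.error("Port %s failed after %d attempts: %s",
                             port_id, attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            logger.warning("Port %s attempt %d failed, retrying in %.0fs",
                           port_id, attempt, delay)
            await asyncio.sleep(delay)
```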
```python
# Sample clustering code snippet
from src.analytics.ports_clusters import PortClusterAnalyzer

analyzer = PortClusterAnalyzer()
clusters = analyzer.find_optimal_clusters()
```
- Live vessel tracking
- Delay prediction alerts
- Historical trends analysis
- Performance metrics dashboard
| Metric | Before V2 | After V2 |
|---|---|---|
| Data Accuracy | 68% | 95% |
| Scraping Success Rate | 71% | 95% |
| Processing Time | 4.2s | 1.1s |
```bash
# Start main scheduler
python run_schedulers.py --cluster-optimized

# Run analytics pipeline
python -m src.analytics.ports_clusters --update-intervals
```
```bash
# Live scraping monitor
python -m src.monitoring.scrape_dashboard

# Database health check
python -m src.database.healthcheck
```
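For context, a database health check of this kind can be as small as the sketch below, which pings MongoDB with pymongo and prints collection counts. This is an illustration only, not the contents of `src.database.healthcheck`:

```python
import os
import sys

from pymongo import MongoClient
from pymongo.errors import PyMongoError

def main() -> int:
    """Ping MongoDB and report basic collection counts.

    Illustrative sketch only; the real healthcheck module may check more.
    """
    uri = os.getenv("MONGO_URI", "mongodb://localhost:27017")
    db_name = os.getenv("MONGO_DB", "port_activity")
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)
        client.admin.command("ping")  # fails fast if the server is unreachable
        db = client[db_name]
        for name in db.list_collection_names():
            print(f"{name}: {db[name].estimated_document_count()} documents")
        return 0
    except PyMongoError as exc:
        print(f"MongoDB health check failed: {exc}", file=sys.stderr)
        return 1

if __name__ == "__main__":
    sys.exit(main())
```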
```
.
├── src/
│   ├── analytics/          # Analysis and clustering
│   ├── database/           # Database operations
│   ├── data_collection/    # Data collection
│   ├── models/             # Data models
│   ├── scrapers/           # Web scraping
│   └── verification/       # Data verification
├── tests/                  # Test files
└── requirements.txt        # Dependencies
```
- Automated daily/weekly cluster analysis
- Dynamic scheduler updates based on cluster results
- Data consistency monitoring and validation
- Performance metrics tracking
- Real-time operation dashboard
- Scraping performance metrics
- Clustering effectiveness visualization
- System health monitoring
- Interactive data exploration tools
- Streamlit dashboards for:
  - System performance
  - Data quality metrics
  - Clustering results
- Weather data collection for each port
- Historical weather patterns analysis
- Weather-port activity correlation
- Integration with existing data pipeline
- Delay patterns identification and visualization
- Port congestion analysis dashboards
- Predictive modeling with real-time updates for:
  - Expected delays
  - Port capacity utilization
  - Seasonal patterns
- Interactive Streamlit dashboards for:
  - Pattern exploration
  - Prediction visualization
  - Trend analysis
  - Model performance monitoring
- Phase 1: Automation & Monitoring (Items 1-2)
  - Ensure system stability
  - Establish reliable metrics
  - Deploy initial Streamlit dashboards
  - System performance visualization

- Phase 2: Data Enhancement (Item 3)
  - Weather data integration
  - Data pipeline adaptation
  - Initial correlation analysis
  - Weather impact dashboards

- Phase 3: Analytics Development (Item 4)
  - Develop analytics models
  - Implement ML pipeline
  - Deploy prediction systems
  - Create comprehensive analytics dashboards
- Visualization and Dashboards
  - Streamlit (interactive dashboards)
  - Plotly (interactive plots)
  - Matplotlib/Seaborn (static visualizations)
While primarily a personal project, we welcome:
- Bug reports via issues
- Documentation improvements
- Performance optimization suggestions
Before contributing, please read our contribution guidelines.
This project demonstrates ethical web scraping best practices:
- Respects robots.txt rules
- Implements a 2s delay between requests
- Uses public data only
- No authentication bypass attempts
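As an illustration of the fixed delay between requests, a polite scraping loop might look like the sketch below, using Playwright's async API. The `port_urls` list and the extraction step are placeholders, not the project's actual scraper:

```python
import asyncio
from playwright.async_api import async_playwright

REQUEST_DELAY = 2.0  # seconds between requests, per the policy above

async def scrape_politely(port_urls: list[str]) -> None:
    """Visit each URL sequentially with a fixed delay between requests.

    Illustrative only; the real scrapers add validation and error handling.
    """
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        for url in port_urls:
            await page.goto(url, timeout=30_000)
            # ... extract port data from the page here ...
            await asyncio.sleep(REQUEST_DELAY)  # respect the 2s delay
        await browser.close()
```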
Copyright (c) 2024 Stavroula Kamini
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- LinkedIn: Stavroula Kamini
- Email: cruxthestaff@helicondata.com
- GitHub Issues: For bug reports and feature requests
For inquiries about maritime data analysis and collaboration opportunities, feel free to reach out through any of the above channels.