This is the final project for the Data Engineering Zoomcamp course. It is an end-to-end data pipeline that streams train movement data, along with a basic analysis of train operator performance.
The data is a public feed provided by the UK’s Network Rail company through an ActiveMQ interface. The feed carries eight message types; this project focuses only on type "0003", train movements. A movement message records the arrival, passing, or departure of a train, together with the time of the event.
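For orientation, an abridged movement message looks roughly like this (field names follow Network Rail's Train Movement documentation; the values are illustrative, not real captures):

```python
# An abridged, illustrative type "0003" message (values are made up; field
# names follow Network Rail's Train Movement documentation).
movement_message = {
    "header": {"msg_type": "0003", "source_system_id": "TRUST"},
    "body": {
        "event_type": "ARRIVAL",               # or "DEPARTURE"
        "planned_timestamp": "1680000000000",  # ms since epoch
        "actual_timestamp": "1680000060000",
        "variation_status": "LATE",            # EARLY / ON TIME / LATE / OFF ROUTE
        "toc_id": "25",                        # train operating company
        "loc_stanox": "87701",                 # location code
        "train_id": "165Y52MD09",
    },
}
```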
Using this simple analysis, you can see how each train operator performs based on whether its trains run early, late, on time, or off route (a sketch of the computation follows the dataset list below).
- train movements: the streamed type "0003" movement events
- toc: Train Operator name
- ukrail_loc: UK railway station location
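The analysis boils down to counting movement events per operator and variation status. A minimal sketch, assuming the events have been loaded into a pandas DataFrame (in the real pipeline the transformation is done with dbt; the join to the toc lookup for operator names is elided):

```python
# A minimal sketch of the per-operator performance breakdown; a pandas
# stand-in for the dbt transformation, with a tiny inline sample.
import pandas as pd

movements = pd.DataFrame({
    "toc_id": ["25", "25", "61", "61"],
    "variation_status": ["LATE", "ON TIME", "EARLY", "LATE"],
})

# Share of EARLY / ON TIME / LATE / OFF ROUTE events per operator.
breakdown = (
    movements.groupby("toc_id")["variation_status"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(breakdown)
```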
- Prefect: to orchestrate the workflow
- Docker Compose: manages 15 containers, including Prefect, Kafka, Schema Registry, dbt, a Postgres DB, the Ofelia scheduler, etc.
- Docker: to containerize the code and infrastructure
- Google BigQuery: as the data warehouse
- Google Cloud Storage: as the data lake
- Ofelia: a Docker job scheduler used to run dbt Core
- dbt: runs the data transformations
- Metabase: visualizes the findings
- ingest: datafeeder_avro.py is the entrypoint. It streams the TRAIN_MVT_ALL_TOC topic from the ActiveMQ interface and sends each message to the Kafka producer. The module includes an MVListener and an Avro Kafka producer; the flow is sketched below.
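A minimal sketch of that flow, assuming stomp.py and confluent-kafka (the class wiring, hostnames, credentials, schema path, and Kafka topic name are assumptions, not the project's exact code):

```python
# A minimal sketch of the ingest flow: STOMP listener -> Avro Kafka producer.
import json

import stomp
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext


class MVListener(stomp.ConnectionListener):
    """Forwards type "0003" movement messages from STOMP to Kafka."""

    def __init__(self, producer, serializer, topic):
        self.producer = producer
        self.serializer = serializer
        self.topic = topic

    def on_message(self, frame):  # stomp.py >= 7 delivers a single Frame
        for msg in json.loads(frame.body):  # the feed sends batches of messages
            if msg["header"]["msg_type"] == "0003":  # train movements only
                value = self.serializer(
                    msg["body"],
                    SerializationContext(self.topic, MessageField.VALUE),
                )
                self.producer.produce(topic=self.topic, value=value)
        self.producer.flush()


schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
with open("resources/train_movement.avsc") as f:  # schema path is an assumption
    serializer = AvroSerializer(schema_registry, f.read())
producer = Producer({"bootstrap.servers": "kafka:9092"})

conn = stomp.Connection([("publicdatafeeds.networkrail.co.uk", 61618)])
conn.set_listener("", MVListener(producer, serializer, "train_movements"))
conn.connect("NR_USERNAME", "NR_PASSWORD", wait=True)  # feed credentials
conn.subscribe(destination="/topic/TRAIN_MVT_ALL_TOC", id=1, ack="auto")
```

Downstream, the dbt runs are scheduled by Ofelia; the job configuration below runs `dbt run` in the dbt container every 10 minutes: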
```
[global]
save-folder = /etc/logs

[job-exec "job-dbt-scheduler"]
# schedule = @every 90s        # alternative: prebuilt interval syntax
# schedule = 0 10 */1 * * *    # alternative: at minute 10 of every hour
schedule = 0 */10 * * * *      # every 10 minutes (6-field cron, seconds first)
container = dbt-bq-train
command = dbt run
```
- resources: includes the Avro schemas shared by the Prefect agent and datafeeder Docker containers
- utilities: includes the Kafka producer settings and the TrainRecord class (sketched after this list)
- streamlit_app: gives you a simple dashboard to view straight away after running `make start`
- terraform_gcp: includes all the scripts to set up Google Cloud Storage and BigQuery
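As referenced above, here is a hypothetical sketch of the TrainRecord class (the real definition lives in utilities; the field selection here is an assumption based on the movement message):

```python
# A hypothetical sketch of TrainRecord; the real class lives in utilities.
from dataclasses import asdict, dataclass


@dataclass
class TrainRecord:
    train_id: str
    event_type: str        # ARRIVAL / DEPARTURE
    actual_timestamp: str  # ms since epoch, as delivered by the feed
    variation_status: str  # EARLY / ON TIME / LATE / OFF ROUTE
    toc_id: str
    loc_stanox: str

    @classmethod
    def from_message(cls, body: dict) -> "TrainRecord":
        """Build a record from a movement message body, keeping known fields."""
        return cls(**{name: body[name] for name in cls.__dataclass_fields__})

    def to_dict(self) -> dict:
        """Dict form, e.g. for the Avro serializer."""
        return asdict(self)
```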
As I couldn't figure out how to allow others to replicate the charts in Metabase, below is a Streamlit version of the dashboard. You can view it at http://localhost:8501/ after running `make start`. The discrepancy between this dashboard and the one above is due to the time difference when I extracted the data.
There are roughly two full days of data in the data sets. A total of 13,479 records were captured, of which 34% are late, 36% are early, 28% are on time, and the rest (about 2%) are off route. West Coast Railways and DC Rail have the highest percentage of delays, with more than 50% of their trains running late.
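The Streamlit view that renders this breakdown can be very small. A minimal sketch (the real streamlit_app queries the warehouse; the hard-coded figures below are just the headline numbers above):

```python
# A minimal sketch of the Streamlit dashboard; the real streamlit_app reads
# from the warehouse instead of hard-coding the headline figures.
import pandas as pd
import streamlit as st

st.title("UK train performance by operator")
st.metric("Records captured", "13,479")

breakdown = pd.DataFrame(
    {"EARLY": [0.36], "ON TIME": [0.28], "LATE": [0.34], "OFF ROUTE": [0.02]},
    index=["All operators"],
)
st.bar_chart(breakdown)  # one bar per variation-status share
```

Running `make start` serves this at http://localhost:8501/, as noted above.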
Please see the detailed instructions in setup.md
- Create a Prefect deployment to run dbt core
- Create a CI/CD with Github actions
- Improve the setup.sh to fully automate the process
- Use Streamlit for the Analytics
- STOMP - The Simple Text Oriented Messaging Protocol
- Python Rest Client Schema Registry
- Train Movement wiki
This project would not have been possible without the DataTalksClub team behind these courses and all the members who helped me along the way.