DataTalksClub's Data Engineering Zoomcamp Project

UK Network Rail Operator Performance Analysis

This is the final project for the Data Engineering Zoomcamp course. It implements an end-to-end data pipeline that streams train movement data, together with a basic analysis of train operator performance.

Architecture diagram

Problem description

The data is a public feed provided by the UK's Network Rail through an ActiveMQ interface. The feed carries eight message types; this project focuses only on type "0003", train movements. Each message records the arrival, passing, or departure of a train, along with the time of the event.

Using this simple analysis, you can see how each train operator performs based on whether trains are early, late, on time, or off route.
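
For orientation, a single type "0003" movement message looks roughly like the snippet below. This is a simplified illustration based on the public Network Rail feed documentation, not code or data from this repository; the variation_status values correspond to the early/late/on-time/off-route categories used in the analysis.

```
# Illustrative shape of one TRAIN_MVT_ALL_TOC message of type "0003" (simplified).
# Field names follow the public feed documentation; the values are made up.
movement_message = {
    "header": {"msg_type": "0003"},
    "body": {
        "train_id": "172F47MZ19",             # hypothetical example value
        "toc_id": "23",                       # train operating company code
        "event_type": "ARRIVAL",              # ARRIVAL or DEPARTURE
        "planned_timestamp": "1679999940000",
        "actual_timestamp": "1680000000000",  # event time, ms since epoch
        "variation_status": "LATE",           # EARLY, ON TIME, LATE or OFF ROUTE
        "loc_stanox": "87701",                # location code of the event
    },
}
```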

Datasets

  • train movements
  • toc: train operator names
  • ukrail_loc: UK railway station locations

Tech Stack

  • Prefect: workflow orchestration
  • Docker Compose: manages 15 containers, including Prefect, Kafka, Schema Registry, dbt, Postgres, the Ofelia scheduler, etc.
  • Docker: containerizes the code and infrastructure
  • Google BigQuery: data warehouse
  • Google Cloud Storage: data lake
  • Ofelia: a Docker job scheduler used to run dbt Core
  • dbt: data transformations
  • Metabase: visualization of the findings

Structure of the project

  • ingest: datafeeder_avro.py is the entrypoint. It streams the TRAIN_MVT_ALL_TOC topic data from an ActiveMQ interface and sends each message to a Kafka producer. The module includes an MVListener and an Avro Kafka producer.
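
A minimal sketch of this listener-to-producer pattern is shown below, assuming the stomp.py (v8+) and confluent-kafka Avro APIs; the credentials and the schema path are placeholders rather than values taken from this repository.

```
# Sketch only: ActiveMQ (STOMP) listener forwarding type "0003" movements to Kafka.
# Assumes stomp.py >= 8 and confluent-kafka[avro]; credentials and schema path are placeholders.
import json
import time

import stomp
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.load("resources/train_movement.avsc")  # placeholder schema file

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)


class MVListener(stomp.ConnectionListener):
    """Forward each train movement message to the train_mov Kafka topic."""

    def on_message(self, frame):
        for msg in json.loads(frame.body):
            if msg["header"]["msg_type"] == "0003":
                producer.produce(topic="train_mov", value=msg["body"])
        producer.flush()


conn = stomp.Connection([("publicdatafeeds.networkrail.co.uk", 61618)])
conn.set_listener("", MVListener())
conn.connect("YOUR_USERNAME", "YOUR_PASSWORD", wait=True)
conn.subscribe(destination="/topic/TRAIN_MVT_ALL_TOC", id=1, ack="auto")

while True:  # keep the process alive while the listener runs in the background
    time.sleep(10)
```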

  • flows: This includes all the Prefect flows.
      • etl_kafka_to_gcs_avro.py: uses an Avro consumer to subscribe to the train_mov topic, then writes the messages to the GCS data lake. The data is partitioned by TOC code.
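
The consume-and-land step might look roughly like the sketch below (not the repository's exact code), assuming confluent-kafka's AvroConsumer, the google-cloud-storage client, and a Prefect flow; the bucket name and object paths are placeholders.

```
# Sketch only: consume Avro messages from train_mov and land them in GCS,
# partitioned by TOC code. Bucket name and paths are placeholders.
import json
from collections import defaultdict

from confluent_kafka.avro import AvroConsumer
from google.cloud import storage
from prefect import flow


@flow
def etl_kafka_to_gcs(bucket_name: str = "your-data-lake-bucket", batch_size: int = 500):
    consumer = AvroConsumer({
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
        "group.id": "train_mov_gcs_writer",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["train_mov"])
    bucket = storage.Client().bucket(bucket_name)

    records = defaultdict(list)  # messages grouped by TOC code
    for _ in range(batch_size):
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            continue
        value = msg.value()      # already deserialised to a dict
        records[value["toc_id"]].append(value)

    for toc_id, rows in records.items():
        blob = bucket.blob(f"train_movements/toc={toc_id}/batch.json")
        blob.upload_from_string("\n".join(json.dumps(r) for r in rows))
    consumer.close()
```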

      • el_gcs_to_bq_batch.py: runs an incremental batch load from GCS into BigQuery.
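
A hedged sketch of an append-only load from GCS into BigQuery using the google-cloud-bigquery client follows; the URI, table name, and source format are assumptions, and the repository's incremental logic may differ.

```
# Sketch only: incremental (append) load of newline-delimited JSON from GCS into
# BigQuery. The bucket URI and dataset/table name are placeholders.
from google.cloud import bigquery
from prefect import flow


@flow
def el_gcs_to_bq(
    uri: str = "gs://your-data-lake-bucket/train_movements/*",
    table_id: str = "your-project.rail.train_movements",
):
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append new batches only
        autodetect=True,
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```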

  • dbt-bq-train: transformations using dbt Core with BigQuery.

  • ofelia scheduler: a [scheduler](https://github.com/mcuadros/ofelia) that runs cron jobs from a running Docker container. It is used to run dbt on the schedule configured below.

```
[global]
save-folder = /etc/logs

[job-exec "job-dbt-scheduler"]
# schedule = @every 90s  # use prebuilt intervals
# schedule = 0 10 */1 * * *
schedule = 0 */10 * * * *
container = dbt-bq-train
command = dbt run
```

  • resources: the Avro schemas, which are shared between the Prefect agent and datafeeder Docker containers

  • utilities: the Kafka producer settings and the TrainRecord class
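
For illustration only, such a utilities module typically pairs a producer configuration dictionary with a small record class; the field names below are hypothetical and not taken from the repository.

```
# Hypothetical sketch of utilities of this kind; the real TrainRecord fields and
# producer settings live in the repository's utilities module.
from dataclasses import asdict, dataclass

KAFKA_PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",         # placeholder broker address
    "schema.registry.url": "http://localhost:8081",
}


@dataclass
class TrainRecord:
    train_id: str
    toc_id: str
    event_type: str        # ARRIVAL or DEPARTURE
    actual_timestamp: str
    variation_status: str  # EARLY, ON TIME, LATE or OFF ROUTE

    def to_dict(self) -> dict:
        return asdict(self)
```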

  • streamlit_app: lets you view a simple dashboard straight away after running `make start`.
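
A minimal sketch of what a dashboard page like this can look like, assuming the movement records are readable with pandas; the file path and column names are placeholders.

```
# Sketch only: a simple operator-performance dashboard page.
# The data source path and column names are assumptions, not repository code.
import pandas as pd
import streamlit as st

st.title("UK Network Rail operator performance")

df = pd.read_parquet("data/train_movements.parquet")  # placeholder data source
summary = (
    df.groupby(["toc_id", "variation_status"])        # hypothetical column names
      .size()
      .unstack(fill_value=0)
)
st.bar_chart(summary)
st.dataframe(summary)
```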

  • terraform_gcp: the scripts to set up Google Cloud Storage and BigQuery.

Results

Streamlit simple dashboard

As I couldn't figure out how to let others replicate the charts in Metabase, below is a Streamlit version of the dashboard. You can view it at http://localhost:8501/ after running `make start`. The figures differ slightly from those above because the data was extracted at a different time.

Conclusion

There are roughly two full days of data in the datasets. A total of 13,479 records were captured, of which 34% are late, 36% early, 28% on time, and the rest (roughly 2%) off route. West Coast Railways and DC Rail have the highest percentage of delays, with more than 50% of their trains running late.

To replicate

Please see the detailed instructions in setup.md

Improvements

  • Create a Prefect deployment to run dbt Core
  • Set up CI/CD with GitHub Actions
  • Improve setup.sh to fully automate the process
  • Use Streamlit for the analytics

References

Credits

This project would not have been possible without the DataTalksClub team behind these courses and all the members who helped me along the way.
