This is the final project for the Data Engineering Zoomcamp course. It is an end-to-end data pipeline that streams train movement data, along with a basic analysis of train operator performance.
The data is a public feed provided by the UK’s Network Rail company through an ActiveMQ interface. The feed carries eight message types; this project focuses only on type "0003", train movements. A movement message records the arrival, passing, or departure of a train, together with the time of the event.
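For orientation, an abridged movement message looks roughly like this (field names follow Network Rail's Train Movement documentation; the values are illustrative, not real captures):

```python
# An abridged, illustrative type "0003" message (values are made up; field
# names follow Network Rail's Train Movement documentation).
movement_message = {
    "header": {"msg_type": "0003", "source_system_id": "TRUST"},
    "body": {
        "event_type": "ARRIVAL",               # or "DEPARTURE"
        "planned_timestamp": "1680000000000",  # ms since epoch
        "actual_timestamp": "1680000060000",
        "variation_status": "LATE",            # EARLY / ON TIME / LATE / OFF ROUTE
        "toc_id": "25",                        # train operating company
        "loc_stanox": "87701",                 # location code
        "train_id": "165Y52MD09",
    },
}
```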
Using this simple analysis, you can see how each train operator performs based on whether its trains run early, late, on time, or off route (a sketch of the computation follows the dataset list below).
- train movements: the streamed type "0003" movement events
- toc: Train Operator name
- ukrail_loc: UK railway station location
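The analysis boils down to counting movement events per operator and variation status. A minimal sketch, assuming the events have been loaded into a pandas DataFrame (in the real pipeline the transformation is done with dbt; the join to the toc lookup for operator names is elided):

```python
# A minimal sketch of the per-operator performance breakdown; a pandas
# stand-in for the dbt transformation, with a tiny inline sample.
import pandas as pd

movements = pd.DataFrame({
    "toc_id": ["25", "25", "61", "61"],
    "variation_status": ["LATE", "ON TIME", "EARLY", "LATE"],
})

# Share of EARLY / ON TIME / LATE / OFF ROUTE events per operator.
breakdown = (
    movements.groupby("toc_id")["variation_status"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(breakdown)
```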
- Prefect: to orchestrate the workflow
- Docker Compose: manages 15 containers, including Prefect, Kafka, Schema Registry, dbt, a Postgres DB, the Ofelia scheduler, etc.
- Docker: to containerize the code and infrastructure
- Google BigQuery: as the data warehouse
- Google Cloud Storage: as the data lake
- Ofelia: a Docker job scheduler used to run dbt Core
- dbt: runs the data transformations
- Metabase: visualizes the findings
- ingest: datafeeder_avro.py is the entrypoint. It streams the TRAIN_MVT_ALL_TOC topic from the ActiveMQ interface and sends each message to the Kafka producer. The module includes an MVListener and an Avro Kafka producer; the flow is sketched below.
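A minimal sketch of that flow, assuming stomp.py and confluent-kafka (the class wiring, hostnames, credentials, schema path, and Kafka topic name are assumptions, not the project's exact code):

```python
# A minimal sketch of the ingest flow: STOMP listener -> Avro Kafka producer.
import json

import stomp
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext


class MVListener(stomp.ConnectionListener):
    """Forwards type "0003" movement messages from STOMP to Kafka."""

    def __init__(self, producer, serializer, topic):
        self.producer = producer
        self.serializer = serializer
        self.topic = topic

    def on_message(self, frame):  # stomp.py >= 7 delivers a single Frame
        for msg in json.loads(frame.body):  # the feed sends batches of messages
            if msg["header"]["msg_type"] == "0003":  # train movements only
                value = self.serializer(
                    msg["body"],
                    SerializationContext(self.topic, MessageField.VALUE),
                )
                self.producer.produce(topic=self.topic, value=value)
        self.producer.flush()


schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
with open("resources/train_movement.avsc") as f:  # schema path is an assumption
    serializer = AvroSerializer(schema_registry, f.read())
producer = Producer({"bootstrap.servers": "kafka:9092"})

conn = stomp.Connection([("publicdatafeeds.networkrail.co.uk", 61618)])
conn.set_listener("", MVListener(producer, serializer, "train_movements"))
conn.connect("NR_USERNAME", "NR_PASSWORD", wait=True)  # feed credentials
conn.subscribe(destination="/topic/TRAIN_MVT_ALL_TOC", id=1, ack="auto")
```

Downstream, the dbt runs are scheduled by Ofelia; the job configuration below runs `dbt run` in the dbt container every 10 minutes: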
```
[global]
save-folder = /etc/logs

[job-exec "job-dbt-scheduler"]
# schedule = @every 90s        # alternative: prebuilt interval syntax
# schedule = 0 10 */1 * * *    # alternative: at minute 10 of every hour
schedule = 0 */10 * * * *      # every 10 minutes (6-field cron, seconds first)
container = dbt-bq-train
command = dbt run
```
- resources: includes the Avro schemas shared by the Prefect agent and datafeeder Docker containers
- utilities: includes the Kafka producer settings and the TrainRecord class (sketched after this list)
- streamlit_app: gives you a simple dashboard to view straight away after running `make start`
- terraform_gcp: includes all the scripts to set up Google Cloud Storage and BigQuery
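As referenced above, here is a hypothetical sketch of the TrainRecord class (the real definition lives in utilities; the field selection here is an assumption based on the movement message):

```python
# A hypothetical sketch of TrainRecord; the real class lives in utilities.
from dataclasses import asdict, dataclass


@dataclass
class TrainRecord:
    train_id: str
    event_type: str        # ARRIVAL / DEPARTURE
    actual_timestamp: str  # ms since epoch, as delivered by the feed
    variation_status: str  # EARLY / ON TIME / LATE / OFF ROUTE
    toc_id: str
    loc_stanox: str

    @classmethod
    def from_message(cls, body: dict) -> "TrainRecord":
        """Build a record from a movement message body, keeping known fields."""
        return cls(**{name: body[name] for name in cls.__dataclass_fields__})

    def to_dict(self) -> dict:
        """Dict form, e.g. for the Avro serializer."""
        return asdict(self)
```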
As I couldn't figure out how to allow others to replicate the charts in Metabase, below is a Streamlit version of the dashboard. You can view it at http://localhost:8501/ after running `make start`. The discrepancy between this dashboard and the one above is due to the time difference when I extracted the data.
There are roughly two full days of data in the data sets. A total of 13,479 records were captured, of which 34% are late, 36% are early, 28% are on time, and the rest (about 2%) are off route. West Coast Railways and DC Rail have the highest percentage of delays, with more than 50% of their trains running late.
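The Streamlit view that renders this breakdown can be very small. A minimal sketch (the real streamlit_app queries the warehouse; the hard-coded figures below are just the headline numbers above):

```python
# A minimal sketch of the Streamlit dashboard; the real streamlit_app reads
# from the warehouse instead of hard-coding the headline figures.
import pandas as pd
import streamlit as st

st.title("UK train performance by operator")
st.metric("Records captured", "13,479")

breakdown = pd.DataFrame(
    {"EARLY": [0.36], "ON TIME": [0.28], "LATE": [0.34], "OFF ROUTE": [0.02]},
    index=["All operators"],
)
st.bar_chart(breakdown)  # one bar per variation-status share
```

Running `make start` serves this at http://localhost:8501/, as noted above.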
Please see the detailed instructions in setup.md
- Create a Prefect deployment to run dbt core
- Create a CI/CD with Github actions
- Improve the setup.sh to fully automate the process
- Use Streamlit for the Analytics
- STOMP - The Simple Text Oriented Messaging Protocol
- Python Rest Client Schema Registry
- Train Movement wiki
This project would not have been possible without the DataTalksClub team behind these courses and all the members who helped me along the way.