This repository automates the collection and deployment of KBO data using Apache Airflow. It manages the flow of data from collection to visualization in the Data Portal.
This pipeline runs on Apache Airflow and is deployed using Docker Compose. To set up and run the pipeline, ensure Docker is installed and configured properly.
To run the pipeline locally using Docker Compose:
- Clone this repository and initialize submodules:

  ```bash
  git clone --recurse-submodules https://github.com/leewr9/kbo-data-pipeline.git
  cd kbo-data-pipeline
  ```
- Ensure that your GCP service account key is placed in the `config` folder and renamed to `key.json`:

  ```bash
  mv your-service-account-key.json config/key.json
  ```
- Start the Airflow services using Docker Compose:

  ```bash
  docker-compose up -d
  ```
- Access the Airflow web UI at http://localhost:8080
- Log in with Username: `admin`, Password: `admin`
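Once the services are running, you can confirm that the webserver and scheduler are healthy through Airflow's built-in `/health` endpoint. A minimal sketch using only the Python standard library (the exact response shape depends on your Airflow version):

```python
# Check the Airflow webserver health endpoint after docker-compose up.
# Assumes the web UI is exposed on localhost:8080 as described above.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/health", timeout=10) as resp:
    health = json.load(resp)

# Typically reports metadatabase and scheduler status, e.g.
# {"metadatabase": {"status": "healthy"}, "scheduler": {"status": "healthy", ...}}
print(health)
```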
The following DAGs are currently implemented:
- `fetch_kbo_games_daily` - Runs daily at 00:00, parsing the latest KBO game results.
- `fetch_kbo_players_weekly` - Runs every Sunday at 00:00, parsing player records up to the current week.
- `fetch_kbo_schedules_weekly` - Runs every Sunday at 00:00, parsing the schedule for the upcoming week.
- `fetch_kbo_historical_data` - Runs every year on January 1st at 00:00, parsing the schedule for the upcoming year.
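These schedules map directly to Airflow cron expressions. As a rough sketch of how a weekly DAG could be declared (the actual DAG definitions in this repository use the kbo-data-collector modules and may differ; the callable below is a hypothetical placeholder):

```python
# Hypothetical sketch of a weekly DAG declaration, not the repository's
# actual DAG code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_players():
    # Placeholder: the real task would invoke the kbo-data-collector parsers.
    pass


with DAG(
    dag_id="fetch_kbo_players_weekly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * 0",  # every Sunday at 00:00
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_players", python_callable=fetch_players)
```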
The collected data is stored in Google Cloud Storage (GCS) under the `kbo-data` bucket with the following structure:

- `schedules/`
  - `weekly/` (Upcoming game schedules, weekly basis)
  - `historical/` (Past game schedules by year)
- `games/`
  - `daily/` (Game details collected daily)
  - `historical/` (Historical game details by year)
- `players/`
  - `daily/` (Player statistics per game)
  - `weekly/` (Aggregated player statistics per week)
  - `historical/` (Past player statistics by year)
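With the service account key from the setup step, the collected objects can be browsed programmatically. A hedged example using the `google-cloud-storage` client (the object naming inside each prefix is an assumption for illustration):

```python
# List collected game files under games/daily/ in the kbo-data bucket,
# authenticating with the key placed at config/key.json during setup.
from google.cloud import storage

client = storage.Client.from_service_account_json("config/key.json")
bucket = client.bucket("kbo-data")

for blob in bucket.list_blobs(prefix="games/daily/"):
    print(blob.name)
```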
The parsing modules are managed through the `kbo-data-collector` repository, which is included as a Git submodule in this project.
This project is licensed under the MIT License. See the LICENSE file for details.