Technologies • About the project • Conceptual architecture • Phase 1 • Phase 2 • Phase 3 - Final Stage • Data source • Looker report • Setup
An educational project to build an end-to-end pipeline for near real-time and batch processing of data, which is then used for visualisation 👀 and a machine learning model 🧠.
The project is designed to produce an analytical summary of how METAR weather reports for airports in European countries have varied over the years.
Read more about METAR here ➡️ METAR
In addition, the aim is to build a web application, using the Streamlit library and machine learning algorithms, that predicts the trend of change in upcoming METAR reports.
The project is divided into 3 phases according to the attached diagrams:
Phase 1: retrieval of archive data from the source; initial transformation; transfer of the data to the Data Lake (Google Cloud Storage); transfer of the data to the Data Warehouse; transformations using PySpark on a Dataproc cluster; visualisation of the aggregated data on an interactive dashboard in Looker.
Phase 2: preparing the environment for near-real-time data retrieval; transformations of archived and live data using PySpark; preparation of the data for the machine learning model; training and tuning of the model.
Phase 3 (Final Stage): collection of analytical reports for historical data; preparation of a web dashboard able to display the prediction of the nearest METAR report for a given airport and the likely trend of change.
💿 IOWA STATE UNIVERSITY ASOS-AWOS-METAR Data
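For orientation, the archived observations behind this project can be downloaded over plain HTTP from the Iowa Environmental Mesonet (IEM) service that hosts this dataset. The snippet below is only a minimal, hedged sketch of such a request; the endpoint parameters, station code and date range are illustrative assumptions, not code taken from this repository.

```python
import requests

# Public IEM ASOS/METAR download endpoint (parameters below are examples only).
BASE_URL = "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py"

params = {
    "station": "EPWA",              # example airport: Warsaw Chopin
    "data": "metar",                # return the raw METAR string
    "year1": 2022, "month1": 1, "day1": 1,
    "year2": 2022, "month2": 1, "day2": 31,
    "tz": "Etc/UTC",
    "format": "onlycomma",          # comma-separated output
}

response = requests.get(BASE_URL, params=params, timeout=60)
response.raise_for_status()

# First line is the header, the remaining lines are station,valid,metar rows.
print("\n".join(response.text.splitlines()[:5]))
```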
The report generated in Looker provides averages of METAR data, broken down by temperature, winds, directions, and weather phenomena, with accompanying charts. The data was scraped via URL and stored in raw form in Cloud Storage. PySpark and Dataproc were then used to prepare SQL tables with aggregation functions, which were saved in BigQuery. The Looker report directly utilizes these tables from BigQuery.
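As an illustration of that step only, the aggregation could look roughly like the sketch below; the column names, input path and inferred schema are assumptions made for the example, not the repository's actual pyspark_sql.py.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metar-aggregates").getOrCreate()

# Raw, URL-scraped METAR records previously landed in the Cloud Storage data lake.
metar = spark.read.csv("gs://<your-batch-bucket>/data/ES__ASOS/*/*",
                       header=True, inferSchema=True)

# Monthly averages per station - the kind of aggregated table the Looker report reads.
monthly_avg = (
    metar.withColumn("valid_ts", F.to_timestamp("valid"))
         .groupBy("station",
                  F.year("valid_ts").alias("year"),
                  F.month("valid_ts").alias("month"))
         .agg(F.avg("tmpf").alias("avg_temp_f"),
              F.avg("sknt").alias("avg_wind_knots"))
)

monthly_avg.show(10)  # in the pipeline, tables like this end up in BigQuery
```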
Additionally, it's possible to prepare a similar report for other networks. Below is an example for PL__ASOS.
Check: PL__ASOS
For more information, please refer to the "Setup" section.
Before you begin, make sure you have Spark, PySpark, the Google Cloud Platform SDK, Prefect and Terraform installed and configured.
1. Clone the repo:
$ git clone https://github.com/MarieeCzy/METAR-Data-Engineering-and-Machine-Learning-Project.git
2. Create a new Python virtual environment:
$ python -m venv venv
3. Activate the new virtual environment using `source` (Unix systems) or `.\venv\Scripts\activate` (Windows systems):
$ source venv/bin/activate
4. Install packages from requirements.txt using pip. Make sure the requirements.txt file is in your current working directory:
$ pip install -r requirements.txt
5. Create a new project on the GCP platform, set it as the default and authorize:
$ gcloud config set project <your_project_name>
$ gcloud auth login
6. Configure variables for Terraform:
6.1. In terraform.tfvars, replace the project name with the name of your project created within the Google Cloud Platform:
project = <your_project_name>
7. Go to the terraform directory:
$ cd terraform/
and initialize, plan and apply the cloud resource creation:
$ terraform init
$ terraform plan
$ terraform apply
8. Configure the upload data. Go to:
~/prefect_orchestration/deployments/flows/config.json
8.1. Complete the variables:
- network - select one network, e.g. FR__ASOS,
- start_year, start_month, start_day - complete the start date; make sure the digits are not preceded by "0",
- batch_bucket_name - enter the name of the created Google Cloud Storage bucket.
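For orientation, a completed config.json consistent with the variables above might look like the hedged sketch below; the exact set of keys, and whether values are stored as numbers or strings, depends on the repository's flow code, so treat this purely as an example built from the list above.

```python
import json
from pathlib import Path

# Example values only - adjust the network, start date and bucket to your own setup.
config = {
    "network": "FR__ASOS",
    "start_year": 2022,   # plain digits, no leading zeros
    "start_month": 1,
    "start_day": 1,
    "batch_bucket_name": "<your-batch-bucket>",
}

path = Path("prefect_orchestration/deployments/flows/config.json")
path.write_text(json.dumps(config, indent=4))
print(path.read_text())
```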
9. Set up Prefect, the task orchestration tool:
9.1. Generate a new KEY for the storage service account:
On the Google Cloud Platform go to IAM & Admin > Service Accounts, click on "storage-service-acc", go to KEYS and click on ADD KEY > Create new key in JSON format.
Save it in a safe place; do not share it on GitHub or any other public place.
In order not to change the code in the gcp_credentials_blocks.py block, create a .secrets directory: ~/METAR-Data-Engineering-and-Machine-Learning-Project/.secrets and put the downloaded key in it under the name gcp_credentials_key.json.
9.2. Run Prefect server
$ prefect orion start
Go to: http://127.0.0.1:4200
9.3. In ~/prefect_orchestration/prefect_blocks run the commands below in the console to create the Credentials and GCS Bucket blocks (a sketch of what these scripts might contain follows this step):
$ python gcp_credentials_blocks.py
$ python gcs_buckets_blocks.py
9.4. Configure the Prefect Deployment:
$ python prefect_orchestration/deployments/deployments_config.py
9.5. Run a Prefect Agent to enable deployments in the "default" queue:
$ prefect agent start -q "default"
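For context, the two block scripts from 9.3 could be built on the prefect-gcp collection roughly as sketched below; the block names, bucket placeholder and key path are illustrative assumptions rather than the repository's exact code.

```python
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block backed by the service-account key saved in step 9.1.
GcpCredentials(
    service_account_file=".secrets/gcp_credentials_key.json"
).save("gcp-credentials", overwrite=True)

# Register a GCS bucket block that the batch flows can load when uploading data.
GcsBucket(
    bucket="<your-batch-bucket>",
    gcp_credentials=GcpCredentials.load("gcp-credentials"),
).save("batch-bucket", overwrite=True)
```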
10. Start deployment stage 1 - S1: downloading data and uploading it to the Google Cloud Storage bucket.
Go to:
~/prefect_orchestration/deployments
and run in the command line:
$ python deployments_run.py --stage="S1"
☝️ You can observe the running deployment flow in the Prefect UI.
After the deployment is complete, you will find the data in the GCS bucket.
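If you prefer to trigger the same deployment from Python instead of through deployments_run.py, recent Prefect 2 releases expose run_deployment; the deployment name below is purely illustrative and should be replaced with the name shown in your Prefect UI.

```python
from prefect.deployments import run_deployment

# "<flow-name>/<deployment-name>" as listed under Deployments in the Prefect UI.
flow_run = run_deployment(name="metar-flow/S1", timeout=0)  # timeout=0: submit and return immediately
print(f"Submitted flow run {flow_run.id}")
```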
11. Configure and run stage 2 - S2: data transformation using PySpark and moving the data to BigQuery using Dataproc.
11.1. Go to ~/prefect_orchestration/deployments, open gcloud_submit_job.sh and check that the given paths and names are correct. As long as you haven't changed names or settings other than those listed in this manual, everything should be fine.
$ gcloud dataproc jobs submit pyspark \
    --cluster=metar-cluster \
    --region=europe-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --files=gs://code-metar-bucket-2/code/sql_queries_config.yaml \
    gs://code-metar-bucket-2/code/pyspark_sql.py \
    -- \
    --input=gs://batch-metar-bucket-2/data/ES__ASOS/*/* \
    --bq_output=reports.ES__ASOS \
    --temp_bucket=dataproc-staging-bucket-metar-bucket-2
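The three flags after the bare -- are forwarded to pyspark_sql.py itself. Purely as a hedged sketch of how such a script might consume them and hand the result to the Spark-BigQuery connector (this is not the repository's actual implementation):

```python
import argparse
from pyspark.sql import SparkSession

# Parse the job arguments passed after "--" in gcloud_submit_job.sh.
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)        # GCS path with the raw METAR data
parser.add_argument("--bq_output", required=True)    # target BigQuery dataset.table
parser.add_argument("--temp_bucket", required=True)  # staging bucket for the connector
args = parser.parse_args()

spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()
spark.conf.set("temporaryGcsBucket", args.temp_bucket)

df = spark.read.csv(args.input, header=True, inferSchema=True)
# ... aggregations driven by sql_queries_config.yaml would be applied here ...
df.write.format("bigquery").option("table", args.bq_output).mode("overwrite").save()
```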
11.2. Upload pyspark_sql.py and the config file sql_queries_config.yaml to the code bucket. In ~/prefect_orchestration/deployments/flows:
$ gsutil cp pyspark_sql.py gs://code-metar-bucket-2/code/pyspark_sql.py
$ gsutil cp sql_queries_config.yaml gs://code-metar-bucket-2/code/sql_queries_config.yaml
11.3. Run deployment stage S2 GCS -> BigQuery on Dataproc cluster:
$ python deployments_run.py --stage="S2"
If the Job was successful, you can go to BigQuery, where the generated data is located. Now you can copy my Looker report and replace the data sources, or prepare your own. 😎