Goal - Detect anomalous cell towers from network and customer data using serverless Spark (Dataproc Serverless).
The lab consists of the following modules:
1. Understanding Data
2. Solution Diagram
3. Uploading DAG files to DAGs folder
4. Execution of Airflow DAG
5. BQ Output Tables
6. Logging
Understanding Data
The datasets used for this project are:
1. telecom_customer_churn_data.csv
2. service_threshold_data.csv
3. customer_data
- Telecom Customer Churn Data - This dataset contains information about the services provided to customers by the cell towers.
- Service Threshold Data - This dataset contains the performance-metric threshold information for the cell towers.
- Customer Raw Data - This is a folder of Parquet files holding the customer data.
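Before wiring these datasets into the pipeline, it can help to load and inspect them with PySpark. The following is a minimal sketch; the bucket path gs://your-bucket/cell-tower-anomaly-detection and the assumption that the CSVs carry header rows are placeholders to adjust for your environment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-preview").getOrCreate()

# Hypothetical bucket/prefix -- replace with your own GCS layout.
BUCKET = "gs://your-bucket/cell-tower-anomaly-detection"

# The churn and threshold datasets are CSV files (assumed to have header rows).
churn_df = spark.read.option("header", True).csv(f"{BUCKET}/telecom_customer_churn_data.csv")
threshold_df = spark.read.option("header", True).csv(f"{BUCKET}/service_threshold_data.csv")

# customer_data is a folder of Parquet files.
customer_df = spark.read.parquet(f"{BUCKET}/customer_data/")

# Print row counts and schemas as a quick sanity check.
for name, df in [("churn", churn_df), ("threshold", threshold_df), ("customer", customer_df)]:
    print(name, df.count(), "rows")
    df.printSchema()
```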
Model Pipeline
The model pipeline involves the following steps:
- Create buckets in GCS
- Create Dataproc and Persistent History Server Cluster
- Copy the raw data files, PySpark and notebook files into GCS
- Create a Cloud Composer environment and Airflow jobs to run the serverless Spark jobs
- Create BigQuery tables summarizing the anomalous cell towers
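For orientation, the sketch below shows the shape of an Airflow task that submits a serverless Spark batch using DataprocCreateBatchOperator from the Airflow Google provider. All names here (project, region, bucket, script path, batch_id) are placeholders; the authoritative DAG is the cell-tower-airflow.py file referenced in the next module.

```python
from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

PROJECT_ID = "your-project-id"    # assumption: set to your project
REGION = "us-central1"            # assumption: set to your region
CODE_BUCKET = "your-code-bucket"  # assumption: bucket holding the PySpark files

# Hypothetical serverless Spark batch spec; the real DAG defines the
# authoritative version, including arguments and dependencies.
BATCH_CONFIG = {
    "pyspark_batch": {
        "main_python_file_uri": f"gs://{CODE_BUCKET}/00-scripts/curate_customer_data.py",
    },
}

with models.DAG(
    dag_id="your_name_cell_tower_anomaly_detection",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the Airflow UI
    catchup=False,
) as dag:
    curate_customer_data = DataprocCreateBatchOperator(
        task_id="curate_customer_data",
        project_id=PROJECT_ID,
        region=REGION,
        batch=BATCH_CONFIG,
        batch_id="your-name-curate-customer-data",  # batch IDs must be unique
    )
```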
Uploading DAG files to DAGs folder
- From the code repository, download the file located at: cell-tower-anomaly-detection>00-scripts>cell-tower-airflow.py
- Rename the file to <your_name_here>-cell-tower-airflow.py
- Open the file and replace <your_name_here> with your name on line 21
- Navigate to Composer><composer_environment>
- Next, navigate to Environment Configuration>DAGs folder URI
- Next, upload the DAG file to the GCS bucket corresponding to the DAGs folder URI
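If you prefer scripting the upload over using the console, here is a sketch with the google-cloud-storage client; the Composer bucket name and DAG filename are assumptions to replace with the values from your DAGs folder URI.

```python
from google.cloud import storage

# Assumptions: replace with the bucket shown in your DAGs folder URI
# (gs://<composer-bucket>/dags) and your renamed DAG file.
COMPOSER_BUCKET = "your-composer-environment-bucket"
DAG_FILE = "your-name-cell-tower-airflow.py"

client = storage.Client()
bucket = client.bucket(COMPOSER_BUCKET)

# Composer picks up any .py file placed under the dags/ prefix.
bucket.blob(f"dags/{DAG_FILE}").upload_from_filename(DAG_FILE)
print(f"Uploaded {DAG_FILE} to gs://{COMPOSER_BUCKET}/dags/")
```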
- Navigate to Composer><your_environment>>Open Airflow UI
- Once the Airflow UI opens, navigate to DAGs and open your respective DAG
- Next, trigger your DAG by clicking on the Trigger DAG button
- Once triggered, the DAG can be monitored directly in the Airflow UI as well as in the Dataproc>Serverless>Batches window
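Besides the two consoles, the batches can also be polled programmatically. A sketch with the google-cloud-dataproc client library, assuming placeholder project and region values:

```python
from google.cloud import dataproc_v1

PROJECT_ID = "your-project-id"  # assumption
REGION = "us-central1"          # assumption

# The batch controller endpoint is regional.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# List the serverless Spark batches in the region and their states.
for batch in client.list_batches(parent=f"projects/{PROJECT_ID}/locations/{REGION}"):
    print(batch.name, batch.state.name)
```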
BQ Output Tables
Navigate to the BigQuery console and check the cell_tower_anomaly_detection dataset.
Once the Airflow DAG execution completes, two new tables, '<your_name_here>_cell_tower_performance_data' and '<your_name_here>_customer_service_metrics_data', will be created.
To view the data in these tables:
- Select the table from BigQuery Explorer by navigating 'project_id' > 'dataset' > 'table_name'
- Click on the Preview button to see the data in the table
Note: If the Preview button is not visible, run the queries below to view the data. Be aware that these queries are billed for a full table scan.
```sql
SELECT * FROM `<project_name>.<dataset_name>.<your_name_here>_cell_tower_performance_data` LIMIT 1000;
SELECT * FROM `<project_name>.<dataset_name>.<your_name_here>_customer_service_metrics_data` LIMIT 1000;
```
Note: Edit all occurrences of <project_name> and <dataset_name> to match the values of the variables PROJECT_ID and BQ_DATASET_NAME, respectively.
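Alternatively, the same preview can be done from Python without running a query: the BigQuery client's list_rows method reads rows via the tabledata.list API, which, like the console's Preview button, does not incur query charges. A sketch, with the table ID as a placeholder:

```python
from google.cloud import bigquery

# Assumption: replace with your PROJECT_ID, dataset, and table names.
TABLE_ID = "your-project.cell_tower_anomaly_detection.your_name_cell_tower_performance_data"

client = bigquery.Client()
table = client.get_table(TABLE_ID)

# list_rows uses the tabledata.list API -- unlike a SELECT *, it does
# not run a query or bill for a full table scan.
for row in client.list_rows(table, max_results=10):
    print(dict(row.items()))
```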
Logging
- To view the logs for any step of the DAG execution, click that step's Log button in the Airflow UI
Logs associated with the application can be found in the logging console under
Dataproc > Serverless > Batches > <batch_name>.
You can also click the "View Logs" button on the Dataproc batches monitoring page to open the Logging page for the specific Spark job.
To view the Persistent History Server logs, click the 'View History Server' button on the Dataproc batches monitoring page; the Spark History Server UI opens with the event logs for the job.
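Batch logs can also be pulled programmatically from Cloud Logging. The sketch below assumes serverless batches log under the cloud_dataproc_batch resource type with a batch_id label; verify the exact labels on an entry in the Logs Explorer before relying on this filter.

```python
from google.cloud import logging

PROJECT_ID = "your-project-id"  # assumption
BATCH_ID = "your-batch-id"      # assumption: the serverless batch name

client = logging.Client(project=PROJECT_ID)

# Assumed filter: serverless Spark batches logging under the
# cloud_dataproc_batch resource type, keyed by batch_id.
log_filter = (
    'resource.type="cloud_dataproc_batch" '
    f'AND resource.labels.batch_id="{BATCH_ID}"'
)

# Print the most recent matching entries.
for entry in client.list_entries(filter_=log_filter, max_results=50):
    print(entry.timestamp, entry.payload)
```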