# Cell Tower Anomaly Detection using Serverless Spark through Google Cloud Composer

Goal - Detect anomalous cell towers using network and customer data, processed with Dataproc Serverless Spark.

Following are the lab modules:

1. Understanding the data
2. Solution Diagram
3. Uploading DAG files to the DAGs folder
4. Execution of the Airflow DAG
5. BigQuery output tables
6. Logging

## 1. Understanding the data

The datasets used for this project are:

1. telecom_customer_churn_data.csv
2. service_threshold_data.csv
3. customer_data

- Telecom Customer Churn Data - This dataset contains information about the services provided to customers by the cell towers.
- Service Threshold Data - This dataset contains the performance metric threshold information for the cell towers.
- Customer Raw Data - This folder contains files in Parquet format that hold the customer data.
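
If you want to inspect these inputs before running the pipeline, a minimal PySpark sketch such as the one below works; the `gs://` paths are placeholders for wherever you stage the files, not the exact layout of this repository.

```python
# A minimal sketch for inspecting the three inputs; replace the
# placeholder paths with the locations where you staged the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explore-inputs").getOrCreate()

# CSV datasets (a header row is assumed)
churn_df = spark.read.option("header", True).csv(
    "gs://<bucket_name>/<path_to_data>/telecom_customer_churn_data.csv"
)
threshold_df = spark.read.option("header", True).csv(
    "gs://<bucket_name>/<path_to_data>/service_threshold_data.csv"
)

# Customer raw data: a folder of Parquet files
customer_df = spark.read.parquet(
    "gs://<bucket_name>/<path_to_data>/customer_data"
)

churn_df.printSchema()
threshold_df.show(5)
customer_df.show(5)
```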

## 2. Solution Diagram



### Model Pipeline

The model pipeline involves the following steps:

- Create buckets in GCS
- Create a Dataproc Persistent History Server cluster
- Copy the raw data files, PySpark scripts, and notebook files into GCS
- Create a Cloud Composer environment and an Airflow DAG to run the serverless Spark job
- Create BigQuery tables with a summary of anomalous cell towers
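
For orientation, here is a minimal sketch of the core of such an Airflow DAG: submitting a Dataproc Serverless batch wired to the Persistent History Server. All names below are placeholders; the actual DAG in cell-tower-airflow.py defines its own IDs, arguments, and task sequence.

```python
# A hedged sketch of an Airflow DAG that submits a Dataproc Serverless
# batch; project, region, bucket, and cluster names are placeholders.
from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

PROJECT_ID = "<project_id>"
REGION = "<region>"
PHS_CLUSTER = "<persistent_history_server_cluster>"

BATCH_CONFIG = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://<bucket_name>/<path_to_pyspark_script>.py",
    },
    "environment_config": {
        "peripherals_config": {
            "spark_history_server_config": {
                # Attach the Persistent History Server so Spark UI data survives
                "dataproc_cluster": (
                    f"projects/{PROJECT_ID}/regions/{REGION}/clusters/{PHS_CLUSTER}"
                ),
            },
        },
    },
}

with models.DAG(
    dag_id="<your_dag_id>",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    run_batch = DataprocCreateBatchOperator(
        task_id="run_serverless_spark_batch",
        project_id=PROJECT_ID,
        region=REGION,
        batch=BATCH_CONFIG,
        batch_id="<your_batch_id>",
    )
```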


## 3. Uploading DAG files to the DAGs folder

- From the code repository, download the file located at: cell-tower-anomaly-detection > 00-scripts > cell-tower-airflow.py
- Rename the file to <your_name_here>-cell-tower-airflow.py
- Open the file and substitute your name on line 21
- Navigate to Composer > <composer_environment>
- Next, navigate to Environment Configuration > DAGs folder URI
- Next, upload the DAG file to the GCS bucket corresponding to the DAGs folder URI
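
If you prefer the command line to the console upload, a sketch using the google-cloud-storage client library is below; the bucket name is a placeholder for the bucket shown in the DAGs folder URI.

```python
# Optional: upload the renamed DAG file programmatically instead of via
# the console; <composer_dags_bucket> is the bucket from the DAGs folder URI.
from google.cloud import storage

client = storage.Client(project="<project_id>")
bucket = client.bucket("<composer_dags_bucket>")

# Composer reads DAG files from the dags/ prefix of its bucket.
blob = bucket.blob("dags/<your_name_here>-cell-tower-airflow.py")
blob.upload_from_filename("<your_name_here>-cell-tower-airflow.py")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```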






## 4. Execution of the Airflow DAG

- Navigate to Composer > <your_environment> > Open Airflow UI
- Once the Airflow UI opens, navigate to DAGs and open your respective DAG
- Next, trigger your DAG by clicking on the Trigger DAG button
- Once the DAG is triggered, it can be monitored directly through the Airflow UI, as well as through the Dataproc > Serverless > Batches window
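
As an alternative to the Trigger DAG button, a Cloud Composer 2 environment also exposes Airflow's stable REST API; a sketch is below, where <airflow_web_server_url> and <your_dag_id> are placeholders taken from your environment details page and your DAG file.

```python
# Optional: trigger the DAG through Airflow's stable REST API instead of
# the UI. Assumes a Cloud Composer 2 environment with IAM-based access.
import google.auth
from google.auth.transport.requests import AuthorizedSession

AIRFLOW_URL = "<airflow_web_server_url>"  # from the environment details page
DAG_ID = "<your_dag_id>"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# POST a new DAG run for the given DAG ID.
response = session.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
)
response.raise_for_status()
print(response.json())
```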

## 5. BigQuery output tables

Navigate to the BigQuery console and check the cell_tower_anomaly_detection dataset.
Once the Airflow DAG execution completes, two new tables, '<your_name_here>_cell_tower_performance_data' and '<your_name_here>_customer_service_metrics_data', will be created:



To view the data in these tables:

- Select the table in the BigQuery Explorer by navigating to 'project_id' > 'dataset' > 'table_name'
- Click on the Preview button to see the data in the table


Note: If the Preview button is not visible, run the queries below to view the data. Be aware that these queries incur charges for a full table scan.

```sql
SELECT * FROM `<project_name>.<dataset_name>.<your_name_here>_cell_tower_performance_data` LIMIT 1000;
SELECT * FROM `<project_name>.<dataset_name>.<your_name_here>_customer_service_metrics_data` LIMIT 1000;
```

Note: Edit all occurrences of <project_name> and <dataset_name> to match the values of the variables PROJECT_ID and BQ_DATASET_NAME respectively.
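
As a no-scan alternative, the BigQuery client library's list_rows() reads rows through the tabledata.list API rather than running a query, so it avoids the full-table-scan charge; the sketch below assumes the same placeholder names as the queries above.

```python
# Optional: preview rows without a query. list_rows() uses tabledata.list,
# so unlike SELECT * it does not bill a table scan. Replace the
# placeholders with your PROJECT_ID and BQ_DATASET_NAME values.
from google.cloud import bigquery

client = bigquery.Client(project="<project_name>")

for table_name in (
    "<your_name_here>_cell_tower_performance_data",
    "<your_name_here>_customer_service_metrics_data",
):
    rows = client.list_rows(
        f"<project_name>.<dataset_name>.{table_name}", max_results=10
    )
    print(f"--- {table_name} ---")
    for row in rows:
        print(dict(row))
```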



## 6. Logging

### 6.1 Airflow logging

- To view the logs of any step of the DAG execution, click on the task and then click the Log button

### 6.2 Serverless Batch logs

Logs associated with the application can be found in the Logging console under Dataproc > Serverless > Batches > <batch_name>.
You can also click the 'View Logs' button on the Dataproc Batches monitoring page to get to the Logging page for the specific Spark job.
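
If you prefer to pull these logs programmatically, a sketch using the Cloud Logging client library is below; the batch name is a placeholder, and the filter assumes the cloud_dataproc_batch resource type used by Dataproc Serverless batches.

```python
# Optional: fetch a batch's log entries with the Cloud Logging client.
# <batch_name> is a placeholder for the batch ID shown in the console.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="<project_id>")

log_filter = (
    'resource.type="cloud_dataproc_batch" '
    'AND resource.labels.batch_id="<batch_name>"'
)

# Print the most relevant fields of the first 20 matching entries.
for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)
```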


### 6.3 Persistent History Server logs

To view the Persistent History Server logs, click the 'View History Server' button on the Dataproc Batches monitoring page, and the logs will be displayed in the Spark History Server UI: