
COVID-19 and Affected Sectors Data Warehouse

1. Problem Definition

The COVID-19 pandemic has caused unexpected interruptions to economic activity around the world. In an effort to slow the spread of COVID-19, many states and countries shut down entire economic areas such as tourism and entertainment venues, restaurants, personal services, and some manufacturing facilities, pushing many regions into recession. According to UNIDO, the economic crisis unleashed by the outbreak of COVID-19 is hurting economies regardless of income level.

In this capstone we Extract, Transform, and Load the data. The data is then used to visualize the progression of COVID-19, in terms of the number of cases and deaths, along with its effect on areas such as the stock market, the unemployment rate, and the housing market. Time series data is collected from sources that represent each of these areas. The capstone looks for correlations across the questions below.

| Sector   | COVID-19       | S&P 500 Stock                        | Housing Market                                          | Unemployment Rate                        |
|----------|----------------|--------------------------------------|---------------------------------------------------------|------------------------------------------|
| Question | World/USA cases | The stocks that rose/dropped the most | How housing prices/inventory changed in each city/state | Unemployment rate by industry/race/state |

2. Expected Results

A time series data warehouse, with data that can be analysed using line charts, column charts, and combinations of both.

3. Architecture

Architecture

4. Datasets

In this section we describe in detail how the data is obtained from each source. All of the selected data is time series data.

| Sector    | COVID-19              | S&P 500 Stock Prices | Housing Market | Unemployment Rate |
|-----------|-----------------------|----------------------|----------------|-------------------|
| Source    | World Data / USA Data | Yahoo Finance        | Zillow.com     | bls.gov           |
| Frequency | Daily                 | Daily                | Monthly        | Monthly           |
| Data Type | CSV                   | CSV                  | CSV            | Table/JSON        |

**Detailed information about data sources is captured in Data-Extraction.**

5. Data Extraction (Landing Zone):

  • Extracted data files are saved to the file system.
  • Raw tables holding that data are created in Azure MySQL. An example cloud landing zone would be Amazon S3 or Azure Blob Storage. More information on the different zones: Data-Zones
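
As a sketch of the landing step, the helper below writes raw extracted rows to a CSV file in a landing directory on the file system. The function name, directory layout, and row shape are illustrative, not the project's actual code:

```python
import csv
import pathlib

def land_csv(rows, landing_dir, name):
    """Save raw extracted rows to the landing zone on the file system.

    rows: iterable of row lists (header first), name: dataset name
    used for the file, e.g. "covid_world" -> covid_world.csv.
    """
    path = pathlib.Path(landing_dir) / f"{name}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path
```

The same rows would then be bulk-inserted, unchanged, into the raw Azure MySQL tables.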

6. Data Transformation

Reasons for using Apache Spark for data transformation:

  • Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers.
  • Supports languages such as Java, Python, R, and Scala and integration of leading storage solutions.
  • Simple to use, with a rich set of APIs
  • Speed

More about the configuration is here.
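
The core rollup this pipeline needs, daily series aggregated to monthly values, can be sketched in plain Python; in Spark the same shape is a `groupBy` on the month followed by `avg`. The column handling here is illustrative:

```python
from collections import defaultdict
from statistics import mean

def monthly_average(daily_rows):
    """Roll daily (date, value) rows up to per-month averages.

    daily_rows: iterable of ("YYYY-MM-DD", float) pairs.
    Returns {"YYYY-MM": average} -- the same result shape a Spark
    groupBy(month).avg(value) would produce.
    """
    buckets = defaultdict(list)
    for date, value in daily_rows:
        buckets[date[:7]].append(value)  # key on the YYYY-MM prefix
    return {month: mean(vals) for month, vals in buckets.items()}
```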

7. Processed Zone (Data Load):

  • Data from the landing zone is transformed using a processing engine and loaded into structured tables.
  • In this case that is MySQL / Azure MySQL. More information on the different zones: Data-Zones
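
A minimal sketch of the load step, using stdlib sqlite3 as a stand-in for Azure MySQL (with a MySQL driver the pattern is the same, with `%s` placeholders instead of `?`); the table layout is hypothetical:

```python
import sqlite3

def load_processed(conn, table, rows):
    """Load transformed (month, avg_value) rows into a structured table.

    conn: an open DB connection; table: target table name;
    rows: iterable of ("YYYY-MM", float) pairs from the transform step.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (month TEXT PRIMARY KEY, avg_value REAL)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
```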

8. Data Model

The star schema data model is Here. Reasons for using a star schema:

  • Queries are simpler
  • Easier business insights reporting
  • Better-performing queries
  • Provides data to OLAP systems
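
To make the star-schema idea concrete, here is a hypothetical fact/dimension layout (the project's actual model is in the linked data-model page), expressed as DDL run through sqlite3:

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to date and sector
# dimension tables; metric_value stands in for cases, prices, etc.
schema = """
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_sector (sector_key INTEGER PRIMARY KEY, sector_name TEXT);
CREATE TABLE fact_daily (
    date_key     INTEGER REFERENCES dim_date(date_key),
    sector_key   INTEGER REFERENCES dim_sector(sector_key),
    metric_value REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```

Queries stay simple because every report is the fact table joined to one or two small dimensions.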

9. Initial Data Exploration

Initial exploratory analysis suggests an interesting correlation between COVID-19 and the affected sectors. The EDA is Here.
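
The kind of correlation the EDA looks for can be quantified with Pearson's r; a minimal stdlib sketch (the notebooks would more likely use pandas or NumPy for this):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.

    Returns a value in [-1, 1]: +1 for a perfect positive linear
    relationship, -1 for a perfect negative one, ~0 for none.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Running this on, say, monthly COVID-19 cases versus the monthly unemployment rate gives a single number to back up what the line charts show.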

10. ETL/ELT Workflow Orchestration

Now that the ETL pipeline is in place, how do we schedule it daily or monthly while honouring all the dependencies? Apache Airflow is used for the reasons below:

  • Dynamic pipeline generation (DAG configuration as code)
  • Extensibility and Functionality
  • Easily integrates with all cloud platforms
  • Large user base

More details are on the project wiki and in the DAGS folder.

ETL DAG run for the stocks data: Example-Dag-Run
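
A minimal Airflow DAG sketch of the daily extract-transform-load chain (DAG id, task names, and callables are illustrative; the real DAGs are in the DAGS folder):

```python
# Minimal Airflow DAG sketch -- "DAG configuration as code".
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():   ...   # pull CSVs into the landing zone
def transform(): ...   # Spark job: daily -> monthly aggregates
def load():      ...   # write structured tables to Azure MySQL

with DAG(
    dag_id="stocks_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3   # dependencies expressed in code
```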

11. Spark and Azure MySQL Optimisation

In the development lifecycle, more time is spent optimising and maintaining code than writing it. The main areas of optimisation were:

  • Code optimisation
  • Increasing compute size and storage for Azure MySQL
  • Dynamically selecting the batch size based on the source
  • Spark optimisation

More on optimisation is Here.
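
The batch-size point above can be sketched as a small selection function; the thresholds are hypothetical, chosen only to show the idea of sizing inserts per source rather than hard-coding one value:

```python
def pick_batch_size(source_rows: int) -> int:
    """Choose an insert batch size based on how large the source feed is.

    Small daily feeds (e.g. one day of COVID cases) go in a single
    batch; large monthly feeds (e.g. Zillow history) use bigger batches.
    """
    if source_rows <= 1_000:
        return source_rows or 1   # tiny feed: one batch (at least 1 row)
    if source_rows <= 100_000:
        return 5_000
    return 20_000                 # large feed: bigger batches
```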

12. REST API

  • A Java Spring Boot REST API.
  • The API exposes the monthly-average tables for the stocks, COVID-19, unemployment rate, and housing data.
  • Built with Spring Boot, Swagger, and JPA.
  • For more info, check out Here.

13. Jupyter Notebook Integration - Correlation Plots

  • Docker Jupyter notebook integration, using the jupyter/pyspark-notebook image, enables end users to analyse the data.
  • Users can either run Spark in driver mode or submit jobs to Spark clusters running in standalone mode.
  • End users may also call the REST API above to get aggregated data.

Observations from the loaded data (for more plots, see the Python notebook and the exploratory data analysis):

  • The Leisure and Hospitality sector had the most job losses; Construction, Transportation and Utilities is second on the list. Sector

  • The states with the highest unemployment rates are Nevada and Hawaii, as both depend heavily on tourism. In addition, Nevada, like Michigan, also depends on manufacturing and mining. State

14. Resources

References

  • Reference1
  • Reference2
  • Reference3
  • Reference4