The COVID-19 virus pandemic has led to unexpected interruptions in economic activity around the world.In efforts to slow down the spread of COVID-19, many states and countries shut down several economic areas such as tourism and entertainment venues, restaurants, personal services, and some manufacturing facilities. This has led many areas into a recession. According to UNIDO, the economic crisis unleashed by the outbreak of COVID-19 is hurting economies, regardless of income level.
In this capstone we will Extract, Transform and Load the data. Data will then be used to visualize the progression of COVID-19 in terms of number of cases and deaths along with its effect on areas such as the stock market, unemployment rate, housing market. Time series data will be collected from various sources that represent these areas. This capstone will try to provide correlation between below questions.
Sector | COVID - 19 | S&P 500 Stock | Housing Market | Unemployment Rate |
---|---|---|---|---|
World/ USA cases | The most raised / dropped stocks | How housing prices/Inventory changed in each city/state | Unemployment rate by Industry/ Race/ state |
Time Series Data warehouse.Data that can be used for analysis using line chart, column chart and combination of both.
In this section, we describe in detail how to get data from data source. All data selected is time series data
Sector | COVID - 19 | S&P 500 Stock prices | Housing Market | Unemployment Rate |
---|---|---|---|---|
Source | World Data USA Data | Yahoo Finance | Zillow.com | bls.gov |
Frequency | Daily | Daily | Monthly | Monthly |
Data Type | CSV | CSV | CSV | Table/JSON |
**Detailed information about data sources is captured in Data-Extraction **
- Extracted Data files are saved to the file system.
- Raw tables with that data are created in Azure mysql Example landing zone in cloud (Amazon S3 / Azure Blob storage) More information on different zonesData-Zones
Reasons for using Apache Spark for data transformation:
- Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers.
- Supports languages such as Java, Python, R, and Scala and integration of leading storage solutions.
- Simple to use with rich set of APIs
- Speed More about the configuration is here
- Data from the landing zone is transformed using a processing engine and loaded into structured tables.
- In this case it is mysql /Azure mysql. More information on different zones Data-Zones
Star Schema data model is Here Reasons for using Star Schemas
- Queries are simpler
- Easier business insights reporting
- Better-performing queries:
- Provides data to OLAP systems:
Initial Data exploratory analysis suggests An interesting correlation between COVID-19 and affected sectors. EDA is Here
Now that we have ETL pipeline in place, how do we schedule daily or monthly while integrating all dependencies Apache Airflow is used for below reasons:
- Dynamic Pipeline Generation ( DAG configuration as code )
- Extensibility and Functionality
- Easily integrates with all cloud platforms
- Large user base
More details on project wiki and DAGS folder
In the development lifecycle, More time is spent on optimising and maintaining the code than on writing it. Following were main areas of optimisation
- Code optimization
- Increase Compute size and storage Azure Mysql
- Dynamically selecting batch size based on source.
- Spark optimization
- Java Springboot REST API.
- API interfaces to Stocks,COVID, Unemployment Rate, and Housing data monthly average tables.
- Spring Boot, Swagger, JPA
- For More Info Checkout Here
- Docker jupyter notebook integration using image jupyter/pyspark-notebook enables end user analyse data.
- Users can either use spark in driver mode or submit to spark clusters running in standalone mode.
- End users may also call above REST API to get aggregated data. Observations from data loaded:
For more plots Look at python note book And also exploratory data analysis
-
The Leisure and Hospitality sector had most job losses. Construction, Transportation and utilities is second in the list.
-
The States with the most unemployment rate - Nevada, Hawaii have the most unemployment rate as these 2 states depend on tourism. In addition Nevada , like Michigan, is also dependent on Manufacturing and Mining.