This project is the capstone project of the Udacity Data Engineering program.
The goal of this capstone project is to enrich US zip code data with hospital locations, population size, income information, and zip code classification. The source data comes from Kaggle, and only minimal transformation is performed throughout the pipeline. The data tables included in this project are documented in the data dictionary linked below.
Data Dictionary of this data model - https://github.com/ZawWin/kaggle-n-redshift/blob/main/data_dictionary.xlsx
-
Why this data model?
- The main reason I chose the zip code data for this data model is its well-defined nature: minimal cleanup is required. Even though the dataset is mainly composed of dimensional data at this stage, you can easily add a fact table of your choice and build a star schema on top of it, using the zip code column as the key that ties the entities together.
-
What technologies are being used and why?
- For this project, I am using Python 3.7 with Apache Airflow as the workflow management tool, plus S3 and Redshift to stage and store the data for end users to access. The entire technology stack is built inside a virtual environment (created with the conda command) so that we stay in control of our environment and use the most stable version of each technology, not necessarily the latest development version.
- It is obvious why I would choose Python for this project (user-friendly, intuitive, many available libraries, and one of the most in-demand skills among data professionals, to name a few reasons). For workflow management, I chose Apache Airflow because it provides a user-friendly web interface for managing workflows. It also has a strong open-source community that is continuously improving the tool and contributing a great deal to make it easier for everyone to use.
You will need to install:
- Python 3.6 or above. You can install it via the Anaconda distribution: https://www.anaconda.com/products/individual
- Additional Python libraries: kaggle
- Apache Airflow 2.0
- An S3 bucket and a Redshift data warehouse in AWS. Make sure both of them are in the same region. (Creating these is out of scope for this project.)
-
First, download the Anaconda distribution from the URL above and install it on your computer. No fancy command line is required for this step. The great thing about this installation method is that it includes the most commonly used Python libraries (pandas, numpy, scipy, and so forth), so minimal additional setup is required.
-
Install the kaggle package by following the steps given on the Kaggle website: https://www.kaggle.com/docs/api
- This step is required to use the Kaggle API to download the dataset via Python.
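- Once the API token is set up, the download can be scripted. Below is a minimal sketch using the kaggle Python package; the dataset slug and destination folder are placeholders, not the exact ones used in this project.

    # Minimal sketch: download a Kaggle dataset via the Python API.
    # The dataset slug and destination path below are placeholders.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads the kaggle.json token created during setup
    api.dataset_download_files(
        "owner/us-zip-code-dataset",  # placeholder slug; use the real one
        path="data",                  # local destination folder
        unzip=True,                   # extract the CSV files after download
    )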
-
Install Apache Airflow 2.0 by following the steps below.
- First, open the terminal and run the following commands.
> conda create -n "airflow" python=3.7
> conda activate airflow
- Create a home folder where you would like your project files to live.
> mkdir ~/Desktop/airflow
- Set AIRFLOW_HOME to that folder.
> export AIRFLOW_HOME=~/Desktop/airflow
- Now you are ready to install Apache Airflow.
> pip install apache-airflow
- Since you will be connecting to the AWS environment, including S3 and Redshift, you will also need to install the following provider packages. (This step only applies to Apache Airflow 2.0.)
> pip install apache-airflow-providers-postgres
> pip install apache-airflow-providers-amazon
- You are now ready to fire up Airflow.
> airflow db init
> airflow scheduler
- While the scheduler is running (leave it running), open a new terminal window and run the following.
> conda activate airflow
> export AIRFLOW_HOME=~/Desktop/airflow
> airflow webserver -p 8080
- You can now go to localhost:8080 to access the Airflow UI.
- The first time you start Airflow, it will require you to create a username and password. This is new functionality in Airflow 2.0.
- Open another new terminal window and follow the same steps.
> conda activate airflow
> export AIRFLOW_HOME=~/Desktop/airflow
> airflow users create [-h] -e EMAIL -f FIRSTNAME -l LASTNAME [-p PASSWORD] -r ROLE [--use-random-password] -u USERNAME
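- For example (the values below are placeholders; choose your own username, name, email, and password):
> airflow users create -u admin -r Admin -f Jane -l Doe -e jane.doe@example.com -p yourpassword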
- After the user is created, go back to the Airflow UI and log in with the newly created user.
- Download this repository into the ~/Desktop/airflow folder you created; that is your Airflow home directory. When you go back to the Airflow UI, you will see an error message. That's okay, because you haven't created your Redshift connection and s3_bucket variable in Airflow yet.
- Create an 's3_bucket' entry under Variables in the Airflow UI. Name the variable exactly 's3_bucket'; the code looks for that name (see the sketch after this list).
- Under Connections in the Airflow UI, create the following two connections:
  - An Amazon Web Services connection. Make sure the connection ID is 'aws_credentials'.
  - A Postgres connection pointing at your Redshift cluster. Make sure the connection ID is 'redshift'.
- If everything runs as expected, you should see the error message clear up and your DAG appear in the DAG list.
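- The names above matter because the DAG code looks them up at run time. The snippet below is a rough sketch of how such lookups generally work in Airflow 2.0; it is not the exact code in this repository.

    # Sketch: how the variable and connection names are typically referenced.
    from airflow.models import Variable
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    s3_bucket = Variable.get("s3_bucket")                 # the Airflow Variable created above
    s3 = S3Hook(aws_conn_id="aws_credentials")            # the AWS connection
    redshift = PostgresHook(postgres_conn_id="redshift")  # the Redshift (Postgres) connection

    print(s3.list_keys(bucket_name=s3_bucket))            # e.g. list the staged files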
What questions can be answered using this dataset?
- How many hospitals are located in low-income zip codes?
- What are the total population and average income in low-income zip codes?
- What are the hospital locations and contact information, along with their ratings?
- What is the correlation between the number of hospitals and the total population in each zip code, and which areas are underserved?
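- As an illustration, the first question could be answered with a query along the lines of the sketch below. The table and column names and the income threshold are assumptions for illustration only; refer to the data dictionary for the actual schema.

    # Hypothetical query sketch: hospitals in low-income zip codes.
    # Table/column names and the $40,000 threshold are placeholders.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    redshift = PostgresHook(postgres_conn_id="redshift")
    records = redshift.get_records("""
        SELECT COUNT(*) AS hospital_count
        FROM hospitals AS h
        JOIN income AS i ON h.zip_code = i.zip_code
        WHERE i.median_income < 40000;
    """)
    print(records)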
-
If the data increased by 100x, we could use a big data technology such as Apache Spark to process the data in a coordinated, parallelized cluster environment. In such an environment, we would also have the option of choosing the compute capacity and processing power to match our workload.
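As a rough sketch, the same zip-code enrichment could be expressed in PySpark as below. The S3 paths and the zip_code column name are placeholders, and the cluster is assumed to already have S3 access configured.

    # PySpark sketch: distributed zip-code enrichment. Paths and column
    # names are placeholders; S3 access must already be configured.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zipcode-enrichment").getOrCreate()

    zipcodes = spark.read.csv("s3a://<your-bucket>/zipcodes.csv", header=True, inferSchema=True)
    hospitals = spark.read.csv("s3a://<your-bucket>/hospitals.csv", header=True, inferSchema=True)

    # Join on the zip code key, exactly as in the Redshift model.
    enriched = zipcodes.join(hospitals, on="zip_code", how="left")
    enriched.write.mode("overwrite").parquet("s3a://<your-bucket>/enriched/")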
-
Scheduling the job to run at 7 AM daily is as simple as changing the schedule_interval in the DAG parameters to: 0 7 * * *
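For instance, in the DAG definition (the dag_id and start_date below are placeholders; only the schedule_interval line is the point here):

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="zipcode_enrichment",      # placeholder name
        start_date=datetime(2021, 1, 1),  # placeholder start date
        schedule_interval="0 7 * * *",    # run daily at 7:00 AM
        catchup=False,
    )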
-
The more people accessing the database, the more CPU resources you need to keep queries fast. By using a distributed database, you can use replication and partitioning to get faster query results for each user. The other challenge with an increasing number of users is managing security around those users, ensuring that the principle of least privilege is applied to each of them. This is especially important if your dataset contains PHI.
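As one example of least privilege in Redshift, read-only access could be granted to a group of analysts along the lines of the sketch below; the group and schema names are placeholders.

    # Sketch: create a read-only analyst group in Redshift via the Airflow hook.
    # Group and schema names are placeholders.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run("CREATE GROUP analysts;")
    redshift.run("GRANT USAGE ON SCHEMA public TO GROUP analysts;")
    redshift.run("GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP analysts;")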