This project demonstrates an end-to-end data pipeline built with Apache Airflow: data is extracted from an S3 bucket, processed with Pandas, and loaded into Amazon Redshift. Concretely, the pipeline:
- Extracts a CSV file from an S3 bucket.
- Loads the CSV content into a Pandas DataFrame.
- Inserts the data into a table in Amazon Redshift.
Requirements:

- Python 3.8+
- Apache Airflow 2.0+
- LocalStack (for local development)
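For local development, one way to stand up LocalStack and seed it with test data might look like this (the bucket and file names are placeholders, not values from this repository):

```bash
# Start LocalStack's AWS emulation on its default edge port (4566).
docker run --rm -d -p 4566:4566 localstack/localstack

# Create a test bucket and upload a sample CSV (placeholder names).
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-bucket
aws --endpoint-url=http://localhost:4566 s3 cp data.csv s3://my-bucket/data.csv
```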
It's recommended to use a virtual environment for dependency management.
```bash
python3 -m venv .venv
source .venv/bin/activate
```
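With the environment active, the core dependencies can be installed with pip; the package list below is an assumption (the repository may ship its own requirements file with pinned versions):

```bash
pip install "apache-airflow>=2.0" pandas boto3
```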
The Airflow DAG is responsible for orchestrating the data pipeline. Here’s a simplified example:
```python
from airflow.decorators import task
from airflow.exceptions import AirflowException


@task
def run_data_pipeline():
    try:
        data = extract_data_from_s3()      # read the CSV from S3 into a DataFrame
        load_data_into_redshift(data)      # insert the DataFrame into Redshift
    except Exception as e:
        raise AirflowException(f"Data pipeline failed: {e}")


# init and done are the DAG's boundary tasks, defined elsewhere in the DAG file.
init >> run_data_pipeline() >> done
```
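For reference, here is a minimal sketch of what the two helpers might look like, assuming boto3 for S3 access and a SQLAlchemy/psycopg2 connection to Redshift; the bucket, key, table name, and connection URL are placeholders, not values from this repository:

```python
import io

import boto3
import pandas as pd
import sqlalchemy


def extract_data_from_s3(bucket="my-bucket", key="data.csv"):
    """Download the CSV object from S3 and return it as a Pandas DataFrame."""
    # When running against LocalStack, pass endpoint_url="http://localhost:4566".
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


def load_data_into_redshift(df, table="my_table"):
    """Append the DataFrame rows to a Redshift table over its PostgreSQL interface."""
    # Placeholder connection string; Redshift listens on port 5439.
    engine = sqlalchemy.create_engine(
        "postgresql+psycopg2://user:password@redshift-host:5439/dev"
    )
    df.to_sql(table, engine, index=False, if_exists="append")
```

For large files, Redshift's `COPY` command loading directly from S3 is typically faster than row-by-row inserts; `to_sql` is used here only to keep the sketch self-contained.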
- Activate the virtual environment.
- Set the required environment variables (see the example after this list).
- Start Airflow (webserver and scheduler).
- Trigger the DAG from the Airflow UI.
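A hedged sketch of the environment variables and commands these steps refer to; the values below are placeholders for a LocalStack-based local run, not settings taken from this repository:

```bash
# Placeholder values for a local, LocalStack-backed run.
export AIRFLOW_HOME=~/airflow
export AWS_ACCESS_KEY_ID=test                    # LocalStack accepts dummy credentials
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://localhost:4566    # LocalStack edge endpoint (recent boto3 releases honor this variable)

# Initialize the metadata database, then start the webserver and scheduler.
airflow db init
airflow webserver --port 8080 &
airflow scheduler
```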
This project is licensed under the MIT License.