manuelandersen/reddit-pipeline

Reddit ELT Pipeline

This repo implements an ELT pipeline for Reddit data using Airflow and AWS cloud services.

Overview

What the pipeline does:

  • Extract Reddit data through the Reddit API.
  • Load the data to an S3 bucket.
  • Perform some transformations to the data using AWS Glue.
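
The extract and load steps above could be sketched roughly as follows. This is not the repository's code: the field list, the `reddit/raw/` key prefix, and all function names are illustrative assumptions. The upload helper uses boto3's `put_object` and imports boto3 lazily, so the other helpers run without AWS access. The Glue transformation is left out because it runs as a job inside AWS.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical subset of post fields to keep; the repository may use others.
POST_FIELDS = ("id", "title", "score", "num_comments", "created_utc")


def extract_rows(posts):
    """Keep only POST_FIELDS from each raw post dict (missing keys become None)."""
    return [{field: post.get(field) for field in POST_FIELDS} for post in posts]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def s3_key(subreddit, run_date):
    """Build a hypothetical object key, e.g. reddit/raw/<subreddit>/<YYYYMMDD>.csv."""
    return f"reddit/raw/{subreddit}/{run_date:%Y%m%d}.csv"


def upload_csv(csv_text, bucket, key):
    """Upload CSV text to S3 (needs boto3 installed and AWS credentials configured)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=csv_text.encode("utf-8"))


# Made-up sample post standing in for an API response.
sample_posts = [
    {"id": "abc123", "title": "Hello", "score": 10, "num_comments": 2,
     "created_utc": 1700000000, "author": "someone"},
]
rows = extract_rows(sample_posts)
csv_text = to_csv(rows)
key = s3_key("python", datetime(2024, 1, 31, tzinfo=timezone.utc))
```

Calling `upload_csv(csv_text, "my-bucket", key)` would then write the file to S3; the bucket name here is a placeholder.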

Installation

  1. Clone the repository and move into it:
git clone https://github.com/manuelandersen/reddit-pipeline.git
cd reddit-pipeline
  2. Create a virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
  3. Install the dependencies:
pip install -r requirements.txt
  4. Rename the configuration file:
mv config/config.conf.example config/config.conf

Warning

Make sure to add the credentials you need to the new config.conf file.

  5. Build and run the Docker container:
docker compose up -d --build
  6. Open the Airflow web UI:

In your browser, go to http://localhost:8080. You will see the DAGs, which you can then run.
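
Before starting the containers, it may help to check that config/config.conf actually contains the credentials mentioned in the warning. The sketch below uses Python's standard `configparser` for that. The section and key names in `REQUIRED` are assumptions made for illustration; adjust them to match the entries in config.conf.example.

```python
import configparser

# Hypothetical sections and keys; replace with those in config.conf.example.
REQUIRED = {
    "reddit": ("client_id", "client_secret"),
    "aws": ("aws_access_key_id", "aws_secret_access_key", "bucket_name"),
}


def missing_settings(path):
    """Return 'section.key' names that are absent or empty in the config file."""
    # interpolation=None so that '%' characters in secrets are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is skipped silently, so every key is reported
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            value = parser.get(section, key, fallback="").strip()
            if not value:
                missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    problems = missing_settings("config/config.conf")
    print("Missing settings:", ", ".join(problems) if problems else "none")
```

An empty result means every listed key has a value; it does not verify that the credentials are valid.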

About

Reddit data extraction to S3 bucket
