This repository contains a real-time pipeline that ingests data from a stream of events and deploys a streaming application to aggregate those events in real time.
The web application feeds event data into Kinesis Data Streams; the raw data is stored in an S3 bucket via Amazon Kinesis Data Firehose, and the aggregated data is stored in a subfolder on Amazon S3 (see the diagram below).
- Clone this repository.
- Set up an AWS account and download your access key and secret key from AWS IAM.
- Configure the AWS CLI in your terminal by running `aws configure` and typing in your credentials when prompted.
- Navigate to the `IAC` directory in your terminal and run `terraform init`, then `terraform apply`, to set up the AWS infrastructure.
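The exact resources live in the `IAC` directory; as a rough sketch, the Terraform configuration presumably defines at least the Kinesis stream and the S3 bucket referenced in the steps below (resource and bucket names here are illustrative, not taken from the repo):

```hcl
# Hypothetical sketch of resources the IAC directory likely defines.
resource "aws_kinesis_stream" "raw_data_stream" {
  name        = "raw_data_stream"
  shard_count = 1
}

resource "aws_s3_bucket" "raw_data_bucket" {
  # Bucket names must be globally unique; this one is illustrative.
  bucket = "my-raw-data-bucket"
}
```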
Step 1: Navigate to AWS Kinesis Data Streams, select `raw_data_stream`, and click on Process data in real time.

Step 2: Create an Apache Flink Studio notebook.

Step 3: Select a database (this was already created by Terraform).

Step 4: Run the notebook and click Open Apache Zeppelin

Step 5: In the notebook, run the `CREATE TABLE raw_data_table` statement from the `sql_queries` file in the repo.
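The exact statement is in `sql_queries`; as a hedged sketch, a Flink SQL source table over the stream typically looks like the following (the column names and region are assumptions, not taken from the repo):

```sql
%flink.ssql
-- Hypothetical schema; see the sql_queries file for the real one.
CREATE TABLE raw_data_table (
  event_id   VARCHAR,
  event_type VARCHAR,
  event_ts   TIMESTAMP(3),
  -- Watermark makes event_ts usable as an event-time attribute for windows.
  WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
) WITH (
  'connector'           = 'kinesis',
  'stream'              = 'raw_data_stream',
  'aws.region'          = 'us-east-1',   -- assumed region
  'scan.stream.initpos' = 'LATEST',
  'format'              = 'json'
);
```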
Step 6: Run the `producer.py` file to produce data.
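The repo's `producer.py` is the source of truth; a minimal sketch of such a producer, assuming JSON events with hypothetical fields (`event_id`, `event_type`, `event_ts`) sent to `raw_data_stream` via boto3, might look like:

```python
import json
import datetime
import random

STREAM_NAME = "raw_data_stream"  # assumed to match the Terraform-created stream


def build_record():
    """Build one JSON-serializable event; the fields here are illustrative."""
    return {
        "event_id": random.randint(1, 1_000_000),
        "event_type": random.choice(["click", "view", "purchase"]),
        "event_ts": datetime.datetime.utcnow().isoformat(),
    }


def encode_record(record):
    """Serialize a record to the bytes payload Kinesis expects."""
    return json.dumps(record).encode("utf-8")


def main():
    # boto3 is imported here so the pure helpers above work without AWS set up.
    import boto3

    kinesis = boto3.client("kinesis")
    while True:
        record = build_record()
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=encode_record(record),
            PartitionKey=str(record["event_id"]),
        )


if __name__ == "__main__":
    main()
```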

Step 7: Run the query to perform aggregations from the `sql_queries` file in the repo.
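The real query is in `sql_queries`; as an illustration only, a windowed count over the hypothetical schema above (assuming a watermarked `event_ts` column) would look something like:

```sql
%flink.ssql(type=update)
-- Count events per type over one-minute tumbling windows (illustrative).
SELECT
  event_type,
  TUMBLE_START(event_ts, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS event_count
FROM raw_data_table
GROUP BY event_type, TUMBLE(event_ts, INTERVAL '1' MINUTE);
```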
Step 8: Run the `CREATE TABLE aggregated_data_table` and `INSERT INTO aggregated_data_table` statements from the `sql_queries` file.
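Again deferring to `sql_queries` for the real statements, a sketch of a sink table plus the insert — assuming the aggregates are written to a second, hypothetical stream (`aggregated_data_stream`) that Firehose then delivers to S3 — could be:

```sql
%flink.ssql
-- Hypothetical sink; stream name and columns are assumptions.
CREATE TABLE aggregated_data_table (
  event_type   VARCHAR,
  window_start TIMESTAMP(3),
  event_count  BIGINT
) WITH (
  'connector'  = 'kinesis',
  'stream'     = 'aggregated_data_stream',
  'aws.region' = 'us-east-1',   -- assumed region
  'format'     = 'json'
);

INSERT INTO aggregated_data_table
SELECT
  event_type,
  TUMBLE_START(event_ts, INTERVAL '1' MINUTE),
  COUNT(*)
FROM raw_data_table
GROUP BY event_type, TUMBLE(event_ts, INTERVAL '1' MINUTE);
```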
Step 9: Create a Firehose delivery stream connected to S3, specifying the source and the directory structure.
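For the directory structure, Firehose supports custom S3 prefixes with timestamp expressions; a hypothetical dated-subfolder pattern (folder names are illustrative) might be:

```
aggregated/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
```

Note that when a custom prefix with expressions is used, Firehose also requires an S3 error output prefix to be configured.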


