
A production-ready, fully managed, scalable serverless AWS solution for bulk batch data processing


muditha-silva/data-ingestion-service


Solution Overview

Architecture

The following diagram shows the data ingestion architecture.

[Architecture diagram]

The proposed solution is a production-ready, fully managed, scalable serverless AWS solution for bulk batch data processing.

AWS Services

  • S3 as the object store

  • Aurora Serverless (PostgreSQL) as the back end

  • SQS for buffering S3 Object Create events

  • Lambda function (sqs-event-listener-lambda) batch-processes events using Lambda event source configurations (Batch Size and Batch Window), triggers a State Machine execution, and passes the event batches as input to the Step Function Lambdas. code here

  • Step Function processes S3 Object Create events in batches, retrieves the S3 objects, and ingests the data into PostgreSQL.

    • Batch Process S3 Lambda code here
    • Batch Data Ingestion Lambda code here
  • CDK for Infrastructure provisioning and deployment code here
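
The event-listener flow above can be sketched as a pure helper: collect the S3 object references carried in an SQS batch and build the Step Functions execution input. This is an illustrative sketch, not the repository's actual code (see the links above); it assumes the SQS messages contain standard S3 ObjectCreated event notifications, and names like extractS3Objects are invented here.

```typescript
// Illustrative sketch (not the repository's code): turn a batch of SQS
// records carrying S3 ObjectCreated notifications into the input payload
// for a Step Functions StartExecution call.

interface SQSRecord {
  body: string; // JSON-encoded S3 event notification
}

interface S3ObjectRef {
  bucket: string;
  key: string;
}

// Extract (bucket, key) pairs from every S3 record in the SQS batch.
function extractS3Objects(records: SQSRecord[]): S3ObjectRef[] {
  const objects: S3ObjectRef[] = [];
  for (const record of records) {
    const s3Event = JSON.parse(record.body);
    for (const rec of s3Event.Records ?? []) {
      objects.push({
        bucket: rec.s3.bucket.name,
        // S3 URL-encodes keys in event notifications; decode them here.
        key: decodeURIComponent(rec.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return objects;
}

// In the real Lambda this payload would be passed to the Step Functions
// StartExecution API (e.g. via @aws-sdk/client-sfn).
function buildExecutionInput(records: SQSRecord[]): string {
  return JSON.stringify({ objects: extractS3Objects(records) });
}
```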

Programming Language

  • TypeScript

Getting started

Prerequisites

  • An AWS account
  • AWS CLI configured
  • Node.js 14+

To deploy this project, follow these steps.

Clone the project

git clone https://github.com/muditha-silva/data-ingestion-service.git

Install dependencies

npm install

Run the build

npm run build

Deployment Stack Configurations

The default stack configurations can be found here.

Important

  • Change the RawDataBucketName property to a unique S3 bucket name. Note that the bucket name is suffixed with {aws-region}; for the default configuration, the bucket name is raw-data-lake-dev-eu-west-1
  • For the default configuration, the table name property RawDataTableName is set to raw_data
  • Lambda SQS event source mapping configurations for Batch Size and Batch Window (in minutes)
    • "SQSBatchSize":"100"
    • "SQSBatchWindow":"1"

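The bucket-name suffixing and the string-typed batching settings above could be handled as sketched below; IngestionConfig, buildBucketName, and parseBatchSettings are illustrative names, not the repository's.

```typescript
// Illustrative config handling (names are assumptions, not repository code).
interface IngestionConfig {
  RawDataBucketName: string;
  RawDataTableName: string;
  SQSBatchSize: string;   // e.g. "100"
  SQSBatchWindow: string; // minutes, e.g. "1"
}

// The deployed bucket name is the configured name suffixed with the region,
// e.g. raw-data-lake-dev + eu-west-1 -> raw-data-lake-dev-eu-west-1.
function buildBucketName(baseName: string, region: string): string {
  return `${baseName}-${region}`;
}

// Parse the string-typed batching settings into numbers for the
// Lambda SQS event source mapping.
function parseBatchSettings(cfg: IngestionConfig): { batchSize: number; batchWindowMinutes: number } {
  return {
    batchSize: Number(cfg.SQSBatchSize),
    batchWindowMinutes: Number(cfg.SQSBatchWindow),
  };
}
```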
Deploy the stack

Install the CDK globally

npm install -g aws-cdk

This stack uses assets; therefore, the toolkit stack (CDKToolkit) must be deployed to the environment if it does not already exist.

cdk bootstrap aws://{aws-account}/{aws-region}
cdk deploy

Create the raw_data table

Connect to Database

  • Select Query Editor from the RDS dashboard.
  • From the dropdown menu, select the database whose name begins with dataingestionservice-auroradatacluster.
  • For the database username, select Connect with a Secrets Manager ARN (use the ARN of the secret whose name begins with AuroraDataClusterSecret in Secrets Manager).
  • For the default configuration, use RawDataDB as the database name.

Create Table Script

    CREATE TABLE raw_data (
    id TEXT NOT NULL PRIMARY KEY,
    data JSONB,
    createDate TEXT NOT NULL
    );

Table Design

  • data column: the JSONB data type is used to store JSON documents and supports querying, filtering, and indexing of JSON data.
  • createDate column: if date-based partitioning is required, create an index on this column.
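
Since Aurora Serverless is typically reached through the RDS Data API (which the Query Editor also uses), the ingestion Lambda's batched insert can be sketched as below. The parameter-set shape matches the Data API's BatchExecuteStatement, but the RawRecord type and function names are illustrative, not the repository's.

```typescript
// Illustrative sketch of preparing a batched insert into the raw_data table.
// Each parameter set matches the shape expected by the RDS Data API's
// BatchExecuteStatement (one set per row); the actual send is shown as a
// comment because it requires AWS credentials and @aws-sdk/client-rds-data.

interface RawRecord {
  id: string;
  data: unknown;       // arbitrary JSON payload, stored in the JSONB column
  createDate: string;
}

const INSERT_SQL =
  "INSERT INTO raw_data (id, data, createDate) " +
  "VALUES (:id, CAST(:data AS JSONB), :createDate)";

type SqlParameter = { name: string; value: { stringValue: string } };

// Build one Data API parameter set per record.
function toParameterSets(records: RawRecord[]): SqlParameter[][] {
  return records.map((r) => [
    { name: "id", value: { stringValue: r.id } },
    { name: "data", value: { stringValue: JSON.stringify(r.data) } },
    { name: "createDate", value: { stringValue: r.createDate } },
  ]);
}

// In the real Lambda:
// await rdsData.send(new BatchExecuteStatementCommand({
//   resourceArn, secretArn, database: "RawDataDB",
//   sql: INSERT_SQL, parameterSets: toParameterSets(records),
// }));
```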

Testing

Upload a batch of files into the S3 raw data bucket.
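
For a quick test, the uploaded files can be simple JSON documents matching the raw_data schema; the layout below is an assumption for illustration, as the exact file format expected by the batch-process Lambda is defined in the code linked above.

```typescript
// Illustrative generator for sample test files (the expected file format
// may differ; check the Batch Process S3 Lambda linked above).
function makeSampleFiles(count: number): { name: string; content: string }[] {
  const files: { name: string; content: string }[] = [];
  for (let i = 0; i < count; i++) {
    files.push({
      name: `sample-${i}.json`,
      content: JSON.stringify({
        id: `record-${i}`,
        data: { value: i },
        createDate: new Date().toISOString(),
      }),
    });
  }
  return files;
}
// Each file would then be uploaded to the raw data bucket, e.g.
// aws s3 cp sample-0.json s3://raw-data-lake-dev-eu-west-1/
```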

[Screenshot: files uploaded to the raw data bucket]

State Machine Execution

[Screenshot: State Machine execution]

Query Data

[Screenshots: querying the raw_data table from the Query Editor]
