
A production-ready, fully managed, scalable serverless AWS solution for bulk batch data processing


muditha-silva/data-ingestion-service


Solution Overview

Architecture

The following diagram shows the data ingestion architecture.

[Architecture diagram]

The proposed solution is a production-ready, fully managed, scalable serverless AWS solution for bulk batch data processing.

AWS Services

  • S3 as the object store

  • Aurora Serverless (PostgreSQL) as the back end

  • SQS for buffering S3 Object Create events

  • Lambda function (sqs-event-listener-lambda) batch-processes events using Lambda event source configurations (Batch Size and Batch Window), triggers a State Machine execution, and passes the event batches as input to the Step Function Lambdas. code here

  • Step Function processes S3 Object Create events in batches, retrieves the S3 objects, and ingests the data into PostgreSQL.

    • Batch Process S3 Lambda code here
    • Batch Data Ingestion Lambda code here
  • CDK for Infrastructure provisioning and deployment code here
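
The event-listener flow above can be sketched as a pure helper: collect the S3 object references carried in an SQS batch and build the Step Functions execution input. This is an illustrative sketch, not the repository's actual code (see the links above); it assumes the SQS messages contain standard S3 ObjectCreated event notifications, and names like extractS3Objects are invented here.

```typescript
// Illustrative sketch (not the repository's code): turn a batch of SQS
// records carrying S3 ObjectCreated notifications into the input payload
// for a Step Functions StartExecution call.

interface SQSRecord {
  body: string; // JSON-encoded S3 event notification
}

interface S3ObjectRef {
  bucket: string;
  key: string;
}

// Extract (bucket, key) pairs from every S3 record in the SQS batch.
function extractS3Objects(records: SQSRecord[]): S3ObjectRef[] {
  const objects: S3ObjectRef[] = [];
  for (const record of records) {
    const s3Event = JSON.parse(record.body);
    for (const rec of s3Event.Records ?? []) {
      objects.push({
        bucket: rec.s3.bucket.name,
        // S3 URL-encodes keys in event notifications; decode them here.
        key: decodeURIComponent(rec.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return objects;
}

// In the real Lambda this payload would be passed to the Step Functions
// StartExecution API (e.g. via @aws-sdk/client-sfn).
function buildExecutionInput(records: SQSRecord[]): string {
  return JSON.stringify({ objects: extractS3Objects(records) });
}
```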

Programming Language

  • TypeScript

Getting started

Prerequisites

  • An AWS account
  • AWS CLI configured
  • Node.js 14+

To deploy this project, follow these steps.

Clone the project

git clone https://github.com/muditha-silva/data-ingestion-service.git

Install dependencies

npm install

Run the build

npm run build

Deployment Stack Configurations

The default stack configurations can be found here.

Important

  • Change the RawDataBucketName property to a unique S3 bucket name. Note that the bucket name is suffixed with {aws-region}; for the default configuration, the bucket name is raw-data-lake-dev-eu-west-1
  • For the default configuration, the table name property RawDataTableName is set to raw_data
  • Lambda SQS event source mapping configurations for Batch Size and Batch Window (in minutes)
    • "SQSBatchSize":"100"
    • "SQSBatchWindow":"1"

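The bucket-name suffixing and the string-typed batching settings above could be handled as sketched below; IngestionConfig, buildBucketName, and parseBatchSettings are illustrative names, not the repository's.

```typescript
// Illustrative config handling (names are assumptions, not repository code).
interface IngestionConfig {
  RawDataBucketName: string;
  RawDataTableName: string;
  SQSBatchSize: string;   // e.g. "100"
  SQSBatchWindow: string; // minutes, e.g. "1"
}

// The deployed bucket name is the configured name suffixed with the region,
// e.g. raw-data-lake-dev + eu-west-1 -> raw-data-lake-dev-eu-west-1.
function buildBucketName(baseName: string, region: string): string {
  return `${baseName}-${region}`;
}

// Parse the string-typed batching settings into numbers for the
// Lambda SQS event source mapping.
function parseBatchSettings(cfg: IngestionConfig): { batchSize: number; batchWindowMinutes: number } {
  return {
    batchSize: Number(cfg.SQSBatchSize),
    batchWindowMinutes: Number(cfg.SQSBatchWindow),
  };
}
```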
Deploy the stack

Install the CDK globally

npm install -g aws-cdk

This stack uses assets; therefore, the toolkit stack (CDKToolkit) must be deployed to the environment if it does not already exist.

cdk bootstrap aws://{aws-account}/{aws-region}
cdk deploy

Create the raw_data table

Connect to Database

  • Select Query Editor from the RDS dashboard.
  • From the dropdown menu, select the database whose name begins with dataingestionservice-auroradatacluster.
  • For the database username, select Connect with a Secrets Manager ARN (use the ARN of the secret whose name begins with AuroraDataClusterSecret in Secrets Manager).
  • For the default configuration, use RawDataDB as the database name.

Create Table Script

    CREATE TABLE raw_data (
    id TEXT NOT NULL PRIMARY KEY,
    data JSONB,
    createDate TEXT NOT NULL
    );

Table Design

  • data column: the JSONB data type is used to store JSON documents and supports querying, filtering, and indexing of JSON data.
  • createDate column: if date-based partitioning is required, create an index on this column.
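
Since Aurora Serverless is typically reached through the RDS Data API (which the Query Editor also uses), the ingestion Lambda's batched insert can be sketched as below. The parameter-set shape matches the Data API's BatchExecuteStatement, but the RawRecord type and function names are illustrative, not the repository's.

```typescript
// Illustrative sketch of preparing a batched insert into the raw_data table.
// Each parameter set matches the shape expected by the RDS Data API's
// BatchExecuteStatement (one set per row); the actual send is shown as a
// comment because it requires AWS credentials and @aws-sdk/client-rds-data.

interface RawRecord {
  id: string;
  data: unknown;       // arbitrary JSON payload, stored in the JSONB column
  createDate: string;
}

const INSERT_SQL =
  "INSERT INTO raw_data (id, data, createDate) " +
  "VALUES (:id, CAST(:data AS JSONB), :createDate)";

type SqlParameter = { name: string; value: { stringValue: string } };

// Build one Data API parameter set per record.
function toParameterSets(records: RawRecord[]): SqlParameter[][] {
  return records.map((r) => [
    { name: "id", value: { stringValue: r.id } },
    { name: "data", value: { stringValue: JSON.stringify(r.data) } },
    { name: "createDate", value: { stringValue: r.createDate } },
  ]);
}

// In the real Lambda:
// await rdsData.send(new BatchExecuteStatementCommand({
//   resourceArn, secretArn, database: "RawDataDB",
//   sql: INSERT_SQL, parameterSets: toParameterSets(records),
// }));
```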

Testing

Upload a batch of files into the S3 raw data bucket.
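
For a quick test, the uploaded files can be simple JSON documents matching the raw_data schema; the layout below is an assumption for illustration, as the exact file format expected by the batch-process Lambda is defined in the code linked above.

```typescript
// Illustrative generator for sample test files (the expected file format
// may differ; check the Batch Process S3 Lambda linked above).
function makeSampleFiles(count: number): { name: string; content: string }[] {
  const files: { name: string; content: string }[] = [];
  for (let i = 0; i < count; i++) {
    files.push({
      name: `sample-${i}.json`,
      content: JSON.stringify({
        id: `record-${i}`,
        data: { value: i },
        createDate: new Date().toISOString(),
      }),
    });
  }
  return files;
}
// Each file would then be uploaded to the raw data bucket, e.g.
// aws s3 cp sample-0.json s3://raw-data-lake-dev-eu-west-1/
```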

[Screenshot: files uploaded to the raw data bucket]

State Machine Execution

[Screenshot: State Machine execution]

Query Data

[Screenshots: querying the raw_data table from the Query Editor]
