The following diagram shows the data ingestion architecture.
The proposed solution is a production-ready, fully managed, scalable, serverless AWS solution for batch processing of bulk data.
- S3 for object storage
- Serverless Aurora PostgreSQL as the back end
- SQS for buffering S3 Object Create events
- Lambda Function (sqs-event-listener-lambda): batch-processes events using the Lambda event source configuration (Batch Size and Batch Window), then triggers a State Machine execution and passes the event batches as input to the Step Function Lambda.
- Step Function: processes S3 Object Create events in batches, retrieves the S3 objects, and ingests the data into PostgreSQL.
- CDK for infrastructure provisioning and deployment
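The hand-off from SQS to the state machine can be sketched as follows. This is a minimal illustration, not the repository's code: the trimmed-down types, `S3ObjectRef`, `decodeS3Key`, and `toStateMachineInput` are hypothetical names, and the sketch assumes each SQS message body is a standard S3 event notification in JSON. The call that actually starts the state machine execution is left out.

```typescript
// Minimal shapes for the parts of the events this sketch reads.
// These are hypothetical, trimmed-down types, not the full AWS definitions.
interface SqsRecord { messageId: string; body: string; }
interface SqsEvent { Records: SqsRecord[]; }
interface S3ObjectRef { bucket: string; key: string; }

// S3 notifications URL-encode object keys and use "+" for spaces.
function decodeS3Key(rawKey: string): string {
  return decodeURIComponent(rawKey.replace(/\+/g, " "));
}

// Flatten one SQS batch into the list of S3 objects it refers to.
function toStateMachineInput(event: SqsEvent): { objects: S3ObjectRef[] } {
  const objects: S3ObjectRef[] = [];
  for (const record of event.Records) {
    const notification = JSON.parse(record.body);
    // Bodies without a Records array (for example s3:TestEvent) are skipped.
    for (const entry of notification.Records ?? []) {
      const isCreate =
        typeof entry.eventName === "string" && entry.eventName.startsWith("ObjectCreated:");
      if (isCreate) {
        objects.push({ bucket: entry.s3.bucket.name, key: decodeS3Key(entry.s3.object.key) });
      }
    }
  }
  return { objects };
}

// Example: one SQS message wrapping one S3 Object Create notification.
const sampleBody = JSON.stringify({
  Records: [
    {
      eventName: "ObjectCreated:Put",
      s3: { bucket: { name: "raw-data-lake-dev-eu-west-1" }, object: { key: "batch+1/file%3D1.json" } },
    },
  ],
});
const input = toStateMachineInput({ Records: [{ messageId: "m-1", body: sampleBody }] });
console.log(JSON.stringify(input));
```

Passing only bucket and key pairs keeps the state machine input small; the objects themselves are retrieved later by the Step Function.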
- TypeScript
- An AWS account
- AWS CLI configured
- Node.js 14+
To deploy this project, follow these steps.
git clone https://github.com/muditha-silva/data-ingestion-service.git
cd data-ingestion-service
npm install
npm run build
The default stack configuration can be found here.
Important
- Change the RawDataBucketName property to a unique S3 bucket name. Note that the bucket name is suffixed with {aws-region}. For the default configuration, the bucket name is raw-data-lake-dev-eu-west-1.
- For the default configuration, the table name property RawDataTableName is set to raw_data.
- The Lambda SQS event source mapping is configured with a Batch Size and a Batch Window (in minutes):
"SQSBatchSize":"100"
"SQSBatchWindow":"1"
Install the CDK globally
npm install -g cdk
This stack uses assets, so the toolkit stack (CDKToolkit) must be deployed to the environment if it does not already exist.
cdk bootstrap aws://{aws-account}/{aws-region}
cdk deploy
Connect to Database
- Select Query Editor from the RDS dashboard.
- Select the database whose name begins with dataingestionservice-auroradatacluster from the dropdown menu.
- For the Database username, select Connect with a Secrets Manager ARN (use the ARN of the secret whose name begins with AuroraDataClusterSecret from Secrets Manager).
- For the default configuration, use RawDataDB as the database name.
Create Table Script
CREATE TABLE raw_data (
id TEXT NOT NULL PRIMARY KEY,
data JSONB,
createDate TEXT NOT NULL
);
Table Design
- data column: the JSONB data type is used to store JSON data, and it supports querying, filtering, and indexing of that data.
- createDate column: if date-specific partitioning is required, create an index on this column.
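As an illustration of this table design, the sketch below builds a parameterized insert for one ingested object. It is a hypothetical helper, not the repository's ingestion code: `buildInsert`, the choice of the S3 object key as `id`, and the `YYYY-MM-DD` text format for `createDate` are all assumptions. The `{ text, values }` result follows the query shape accepted by PostgreSQL clients such as node-postgres.

```typescript
interface InsertStatement { text: string; values: string[]; }

// Build a parameterized INSERT for the raw_data table.
// Assumptions: id is the S3 object key, createDate is stored as YYYY-MM-DD text.
function buildInsert(objectKey: string, payload: unknown, createdAt: Date): InsertStatement {
  return {
    text:
      "INSERT INTO raw_data (id, data, createDate) " +
      "VALUES ($1, $2::jsonb, $3) " +
      "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, createDate = EXCLUDED.createDate",
    values: [
      objectKey,
      JSON.stringify(payload),              // serialized here, cast to JSONB by the database
      createdAt.toISOString().slice(0, 10), // date part only, in UTC
    ],
  };
}

// Example: one JSON document read from the raw data bucket.
const stmt = buildInsert(
  "2021/05/orders-001.json",
  { orderId: 42, status: "NEW" },
  new Date("2021-05-17T10:30:00Z"),
);
console.log(stmt.text);
console.log(stmt.values);
```

The upsert is a design choice for this sketch: SQS delivers messages at least once, so a redelivered event would otherwise fail on the primary key. A sortable date format also keeps an index on createDate usable for range filters.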
Upload a batch of files to the S3 raw data bucket.
State Machine Execution
Query Data
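A query against the ingested rows might look like the following sketch. The `buildQuery` helper and the sample filter are hypothetical; the sketch assumes `createDate` holds `YYYY-MM-DD` text and uses PostgreSQL's JSONB containment operator `@>` to filter on the `data` column.

```typescript
interface QueryStatement { text: string; values: string[]; }

// Select the rows for one day whose JSON document contains the given key/value pairs.
function buildQuery(createDate: string, filter: Record<string, unknown>): QueryStatement {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(createDate)) {
    throw new Error(`createDate must be YYYY-MM-DD, got "${createDate}"`);
  }
  return {
    text: "SELECT id, data FROM raw_data WHERE createDate = $1 AND data @> $2::jsonb ORDER BY id",
    values: [createDate, JSON.stringify(filter)],
  };
}

// Example: all documents ingested on one day with status NEW.
const query = buildQuery("2021-05-17", { status: "NEW" });
console.log(query.text);
console.log(query.values);
```

Keeping the filter as a bound parameter rather than concatenating it into the SQL text avoids injection issues when the filter comes from user input.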