Data lake demo using change data capture (CDC) on AWS.
- Part 1 Database and Local Development
- Part 2 CDC with Amazon MSK
- Part 3 Hudi Table and Dashboard Creation
- Employing the transactional outbox pattern, the source database publishes change event records to the CDC event table. The event records are generated by triggers that listen to insert and update events on source tables.
- CDC is implemented in a streaming environment, with Amazon MSK providing the streaming infrastructure. To process the real-time CDC event records, a source connector and a sink connector are set up in Amazon MSK Connect. The Debezium connector for PostgreSQL is used as the source connector and the Lenses S3 connector is used as the sink connector. The sink connector pushes messages to an S3 bucket.
- Hudi DeltaStreamer is run on Amazon EMR. As a Spark application, it reads files from the S3 bucket and upserts Hudi records to another S3 bucket. The Hudi table is created in the AWS Glue Data Catalog.
- Because the Hudi table is registered in the AWS Glue Data Catalog, it can be queried in Amazon Athena.
- Dashboards are created in Amazon QuickSight, where the dataset is created using Amazon Athena.
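
To make the upsert step in the pipeline above more concrete, here is a minimal conceptual sketch in plain Python. It is not Hudi's API and not the demo's code: the function `upsert_events` and the field names `id`, `status`, and `event_ts` are hypothetical, chosen only to illustrate how a stream of CDC event records can be folded into one latest row per record key.

```python
from typing import Any, Dict, Hashable, Iterable

Record = Dict[str, Any]


def upsert_events(
    table: Dict[Hashable, Record],
    events: Iterable[Record],
    key_field: str = "id",             # hypothetical record key field
    ordering_field: str = "event_ts",  # hypothetical ordering field
) -> Dict[Hashable, Record]:
    """Fold CDC events into `table`, keeping the latest event per key.

    An event replaces the stored row only if its ordering value is not
    older than the stored one, so late-arriving stale events are ignored.
    """
    for event in events:
        key = event[key_field]
        current = table.get(key)
        if current is None or event[ordering_field] >= current[ordering_field]:
            table[key] = event
    return table


# Illustrative events (made-up data): two inserts, one update, one stale event.
events = [
    {"id": 1, "status": "CREATED", "event_ts": 1},
    {"id": 2, "status": "CREATED", "event_ts": 2},
    {"id": 1, "status": "SHIPPED", "event_ts": 3},
    {"id": 2, "status": "STALE", "event_ts": 1},
]

latest = upsert_events({}, events)
for row in latest.values():
    print(row)
```

In this sketch the ordering field decides which event wins when several events share a key. That is an assumption of the example rather than a statement about how the demo configures Hudi.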