Analytics
The analytics component forwards the data from the backend DynamoDB stream to S3 so that it can be analyzed with Athena. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
Data is stored in S3. Files will look like the following:
Applications/year=2019/month=10/day=11/hour=16/realworld-serverless-application-analytic-Firehose-12J7YC29T8FAY-1-2019-10-11-16-58-58-c0068baf-ab5b-4a61-a9a7-1e100983c696.parquet
There are 2 important optimizations in the way the data is stored:
- Data is partitioned by time. Prefixes of the form year=2019 partition the data by time. Partitioning reduces the amount of data that Athena has to scan to execute a query, which reduces the cost.
- Data is stored in .parquet files. Parquet is a columnar data storage format from the Apache Hadoop ecosystem. It provides efficient data compression and enhanced performance for handling complex data in bulk. Parquet files are typically much smaller than the equivalent JSON text files, so using Parquet reduces both storage and query costs.
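To make the partitioning scheme concrete, here is a minimal Python sketch that pulls the name=value partition segments out of an object key like the one shown above. The helper name parse_partitions is hypothetical and not part of this project; it only illustrates how the prefix encodes the partition values.

```python
from typing import Dict


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract name=value partition segments from an S3 object key.

    Hypothetical helper for illustration only; not part of the project.
    """
    partitions = {}
    # Every segment except the last is part of the prefix; the last is the file name.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Skip segments such as "Applications" that are not name=value pairs.
        if sep and name and value:
            partitions[name] = value
    return partitions


key = (
    "Applications/year=2019/month=10/day=11/hour=16/"
    "realworld-serverless-application-analytic-Firehose-12J7YC29T8FAY-1-"
    "2019-10-11-16-58-58-c0068baf-ab5b-4a61-a9a7-1e100983c696.parquet"
)
print(parse_partitions(key))
# → {'year': '2019', 'month': '10', 'day': '11', 'hour': '16'}
```

Each extracted name corresponds to a partition that Athena can use to skip files whose prefix does not match a query's filter.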
The Analytics component sets up a Firehose delivery stream configured to output the files in the format described above.
To get started, navigate to the Athena console and select the realworld_serverless_application_analytics_* database from the list.
Run the following query first to load new data partitions:
MSCK REPAIR TABLE applications;
Run a sample query:
SELECT detail.eventname,
       detail.dynamodb.keys.applicationid.s AS applicationid,
       detail.dynamodb.keys.userid.s AS userid,
       detail.dynamodb.newimage.author.s AS author,
       detail.dynamodb.newimage.description.s AS description
FROM applications;
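The sample query scans every partition. Because the data is partitioned by time, adding a filter on the partition columns should let Athena read only the matching prefixes. The Python sketch below builds such a filter for a single hour. It rests on assumptions that are not confirmed by this page: that the table exposes partition columns named year, month, day, and hour, that they are declared as strings with zero-padded values, and that the prefixes are written in UTC. The function name hourly_partition_filter is hypothetical.

```python
from datetime import datetime, timezone


def hourly_partition_filter(ts: datetime) -> str:
    """Build a WHERE clause that selects the hourly partition containing ts.

    Assumes string-typed partition columns year, month, day, hour
    (an assumption; check the table definition before relying on it).
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    # Assumption: partition prefixes are written in UTC.
    ts = ts.astimezone(timezone.utc)
    return (
        f"WHERE year = '{ts.year:04d}' AND month = '{ts.month:02d}' "
        f"AND day = '{ts.day:02d}' AND hour = '{ts.hour:02d}'"
    )


ts = datetime(2019, 10, 11, 16, 58, 58, tzinfo=timezone.utc)
query = (
    "SELECT detail.eventname,\n"
    "       detail.dynamodb.keys.applicationid.s AS applicationid\n"
    "FROM applications\n"
    f"{hourly_partition_filter(ts)};"
)
print(query)
```

If the partition columns turn out to be declared as integers, the quoted values would need to be replaced with unquoted numbers.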