Scaling Athena Redshift and APIs
While there are multiple technology options for the data warehouse that Athena uses, Amazon Redshift was chosen as the starting point for this programming challenge. Below, we discuss the approaches we could take if Redshift had to scale to exabytes of data for Athena.
The GC business data is volume heavy, with data arriving into the data warehouse (DW) from a variety of sources. Keeping in view historical data spanning more than a decade, the volume could run into exabytes. Three kinds of interactions with the DW are anticipated:
- Real-time queries: primarily from APIs that serve real-time requests, possibly for visualization purposes.
- Batch processing queries: these would scan massive amounts of historical data and do heavy number crunching, producing materialized views of the data. These materialized views would likely be the query targets of the APIs.
- EDA: exploratory data analysis by data scientists.
In essence, heavy number crunching by data scientists during certain hours of the day, complex query execution in nightly batch cycles, and simplified queries serving the APIs are all in the mix. We ought to estimate traffic volume and data volume correctly to plan capacity.
Depending on how this platform is envisaged to serve customers, and on how many of those customers want real-time access to insights through APIs, the API layer would need to scale to potentially thousands of requests per hour, or more as needs grow. Elastic scalability of the microservices exposing the API endpoints is therefore necessary.
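As a first pass at such capacity planning, a back-of-envelope calculation helps size the API tier. All figures below (traffic, peak factor, per-pod throughput) are illustrative assumptions, not measurements from the platform:

```python
import math

# All numbers below are illustrative assumptions for sizing, not measured figures.
requests_per_hour = 5_000   # assumed steady-state API traffic
peak_factor = 10            # assumed peak-to-average ratio
rps_per_pod = 50            # assumed throughput of a single API pod

avg_rps = requests_per_hour / 3600
peak_rps = avg_rps * peak_factor
pods_needed = max(2, math.ceil(peak_rps / rps_per_pod))  # keep >= 2 pods for HA

print(f"avg={avg_rps:.1f} rps, peak={peak_rps:.1f} rps, pods={pods_needed}")
```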
The Redshift cluster can be elastically resized: nodes are added to scale up, or removed to scale down. While the resize operation is in progress, query capability stops briefly, typically for a few minutes.
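A minimal sketch of triggering such a resize with boto3 follows; the cluster identifier and target node count are hypothetical. Passing `Classic=False` requests an elastic (rather than classic) resize:

```python
import boto3

redshift = boto3.client("redshift")

# Request an elastic resize to 8 nodes. Classic=False asks for an elastic
# resize, which only pauses queries briefly; a classic resize takes far longer.
redshift.resize_cluster(
    ClusterIdentifier="athena-dw",  # hypothetical cluster name
    NumberOfNodes=8,
    Classic=False,
)

# Optionally poll the resize progress.
progress = redshift.describe_resize(ClusterIdentifier="athena-dw")
print(progress["Status"])
```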
It may not be cost effective to do massive number crunching on the Redshift cluster itself. A cheaper approach is to leverage Amazon Redshift Spectrum, which is serverless in nature. Introducing Spectrum into the mix allows querying data directly on S3, at exabyte scale. Redshift pushes suitable operations (scans, filters, aggregations) down to Spectrum, where thousands of independent, AWS-managed nodes perform the heavy lifting.
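A sketch of how this could look, using the Redshift Data API: an external schema is registered once against the AWS Glue Data Catalog, after which S3-resident tables can be queried like local ones. The cluster, database, user, IAM role ARN, and table names are all illustrative assumptions:

```python
import time

import boto3

data_api = boto3.client("redshift-data")

def run_sql(sql: str) -> None:
    """Submit a statement via the Redshift Data API and wait for completion."""
    stmt = data_api.execute_statement(
        ClusterIdentifier="athena-dw",  # hypothetical cluster name
        Database="athena",              # hypothetical database
        DbUser="analyst",               # hypothetical database user
        Sql=sql,
    )
    while True:
        status = data_api.describe_statement(Id=stmt["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(2)

# One-time setup: expose an AWS Glue Data Catalog database over S3 data
# as an external schema that Spectrum can query.
run_sql("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS gc_spectrum
    FROM DATA CATALOG
    DATABASE 'gc_events'
    IAM_ROLE 'arn:aws:iam::123456789012:role/AthenaSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")

# The scan, filter, and aggregation below are pushed down to the Spectrum
# fleet; only the aggregated result flows back into the cluster.
run_sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM gc_spectrum.transactions
    WHERE event_date >= '2019-01-01'
    GROUP BY region;
""")
```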
The number of users of the APIs in this case is not expected to run into millions. Even so, to scale the APIs elastically, a Kubernetes cluster may be used. The APIs, deployed as microservices (Docker containers), would be scaled elastically: the number of microservice pods would vary automatically in the Kubernetes cluster, by virtue of a Horizontal Pod Autoscaler (HPA) configuration.
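A minimal sketch of such an HPA, created with the official Kubernetes Python client (assuming a recent client that exposes the autoscaling/v2 API). The Deployment name, namespace, and scaling thresholds are illustrative assumptions:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

# Scale the (hypothetical) "athena-api" Deployment between 2 and 20 pods,
# targeting 70% average CPU utilization across the pods.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="athena-api-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="athena-api"
        ),
        min_replicas=2,   # keep two pods for availability
        max_replicas=20,  # cap the scale-out
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=70
                    ),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```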
Amazon Redshift is probably not the only solution for this business case; more cost-effective options (with their respective pros and cons) may exist. For example, data could be written to Cassandra and queried there. Well-designed Cassandra keyspaces and tables would allow one to perform efficient queries at scale over massive amounts of distributed data. Furthermore, Spark SQL may be used on either the Cassandra data or the S3 data. This is expected to be cheaper from an OPEX standpoint, but comes at the cost of the complexity of managing and scaling the Cassandra cluster; a fully managed Cassandra offering might alleviate some of these issues.
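A minimal PySpark sketch of running Spark SQL over Cassandra data via the DataStax spark-cassandra-connector. The keyspace, table, host, and connector version are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("athena-gc-analytics")
    # Hypothetical Cassandra contact point; the connector JAR must be available.
    .config("spark.cassandra.connection.host", "cassandra.internal")
    .config(
        "spark.jars.packages",
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .getOrCreate()
)

# Load a (hypothetical) Cassandra table as a distributed DataFrame.
transactions = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="gc", table="transactions")
    .load()
)
transactions.createOrReplaceTempView("transactions")

# Number crunching expressed as plain SQL over the distributed data.
spark.sql(
    """
    SELECT region, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY region
    """
).show()
```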
To summarize: for the APIs, elastically scaling the Kubernetes cluster is recommended to keep costs low and services available.
Adopting fully managed services like Amazon Redshift and Spectrum provides a number of conveniences that come at a price. Elastically scaling capacity, for example, is a non-trivial job that demands a significant amount of expertise if done by hand. It is best to weigh the pros and cons carefully and decide on a technology stack so that a positive ROI can be achieved within the agreed break-even horizon.
Athena (c) 2020. Arun Patra