This repository serves as a playground and exercise in how to build out an ML focused application platform from scratch on top of AWS EKS that adheres to Infrastructure as Code (IaC) and GitOps practices using tools like Terraform, Terragrunt, Kubernetes, GitHub Actions and Kubeflow.
Should you find any of this project useful, please consider donating through,
At a minimum it helps with the AWS bill.
If you're following along, at a minimum you'll need the following,
- AWS Account with the following service quotas (a quick way to check them is sketched after this list),
- Amazon EC2 Instances - Running On-Demand G and VT instances = 32
- Amazon EC2 Instances - All G and VT Spot Instance Requests = 32
- Docker - for containerization.
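If you want to confirm the EC2 quotas listed above before provisioning anything, a check along these lines should work (the JMESPath filter simply matches the two quota names above):

```sh
# List the current G and VT instance quotas for EC2 (read-only; requires Service Quotas permissions)
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'G and VT')].{Name:QuotaName,Value:Value}" \
  --output table
```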
I manage tool installations using asdf, but to each their own. If you do use asdf, there is a .tool-versions file in the root of the project which can be used to install all of the tools listed below by running asdf install. After doing this, refresh your shell by running exec $SHELL.
- AWS CLI - setup with administrative access to make demonstration easy.
- GitHub CLI - for managing GitHub resources.
- Terraform - for infrastructure as code.
- Terragrunt - for managing multiple Terraform environments.
- Kubectl - for managing Kubernetes clusters.
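If you're going the asdf route and don't already have the plugins set up, the flow is roughly the following (the plugin names are the common community ones and may differ from what the repo assumes):

```sh
# Add the asdf plugins for the tools above (one-time), then install the versions pinned in .tool-versions
asdf plugin add awscli
asdf plugin add github-cli
asdf plugin add terraform
asdf plugin add terragrunt
asdf plugin add kubectl
asdf install
exec $SHELL
```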
Verify you have the prerequisites installed by running the following,
make
You should see version output for each of the tools listed above.
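If the make target doesn't behave as expected, the equivalent manual checks are simply each tool's version command:

```sh
aws --version
gh --version
terraform version
terragrunt --version
kubectl version --client
docker --version
```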
Let's first fork and clone the repo,
gh repo fork archegos-labs/platform --clone ~/projects/platform; cd ~/projects/platform
Next, validate that the IaC setup will run with,
make plan-all org_name="ExampleOrg"
This runs Terragrunt / Terraform plan on all the modules in the repository.
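Under the hood this is presumably a thin wrapper around Terragrunt's run-all command. If you want to drop down to Terragrunt directly, something like the following should be roughly equivalent, assuming org_name is wired through as a Terraform variable (check the Makefile for the exact invocation):

```sh
# Plan every module from the repository root
TF_VAR_org_name="ExampleOrg" terragrunt run-all plan

# Or plan a single module by running from its directory
cd <path-to-module> && TF_VAR_org_name="ExampleOrg" terragrunt plan
```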
Our first step is to set up a Virtual Private Cloud (VPC) and its subnets where our EKS cluster will live. The VPC will look like the following,
The notable features of the VPC setup are,
- There are sufficient IP addresses for the cluster and apps on it. The IP CIDR block is 10.0.0.0/16.
- Subnets in multiple availability zones (AZ) for high availability.
- There are private and public subnets in each availability zone for granular control of inbound and outbound traffic. Private subnets are for the EKS nodes with no direct internet access and public subnets for receiving and managing internet traffic.
- A NAT Gateway in the public subnet of each availability zone to ensure a zone-independent architecture and reduce cross-AZ data transfer costs.
- The default NACL is associated with each subnet in the VPC.
If your AWS account and CLI are set up, you can run the following to create the VPC and subnets,
make deploy-vpc
After the VPC is created, you can view many of its components as illustrated in the diagram above by running,
aws resourcegroupstaggingapi get-resources \
--resource-type-filters \
ec2:vpc \
ec2:subnet \
ec2:natgateway \
ec2:internet-gateway \
ec2:route-table \
ec2:elastic-ip \
ec2:network-interface \
ec2:security-group \
ec2:network-acl \
--query 'ResourceTagMappingList[].{ARN:ResourceARN,Name:Tags[?Key==`Name`].Value | [0]}' \
--output table
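To specifically confirm the one-NAT-Gateway-per-AZ layout described above, you can also map each NAT Gateway to its subnet and cross-reference the subnets against their availability zones:

```sh
# Each NAT Gateway should live in a different public subnet (and therefore a different AZ)
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[].{Id:NatGatewayId,Subnet:SubnetId}' \
  --output table

# Subnets with their AZs and CIDR ranges
aws ec2 describe-subnets \
  --query 'Subnets[].{Subnet:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}' \
  --output table
```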
Next, we'll set up an EKS cluster within the VPC laid out above. The basics of the setup will look like,
The most notable features of the EKS cluster setup are,
- Two node groups, one for general purpose workloads and one for GPU workloads.
- API server endpoint access is public and private.
- Appropriate security groups attached to network interfaces.
- A default set of addons fundamental to the operation of the cluster.
- EKS Pod Identity - Allows you to assign IAM roles to Kubernetes service accounts, and is the preferred way to grant AWS permissions to workloads.
- VPC CNI - Provides native integration with the AWS VPC and works in underlay mode. In underlay mode, Pods and hosts sit at the same network layer and share the VPC address space, so a Pod's IP address is consistent from both the cluster and VPC perspective.
- Kube Proxy - Maintains network rules on each worker node so that traffic to a Kubernetes Service is routed and load balanced across the Service's backing Pods.
- CoreDNS - Provides service discovery and DNS resolution for Kubernetes.
- To deploy the EKS cluster, run the following,
make deploy-eks
- Configure kubectl for access to the cluster by running,
make add-cluster
- Lastly, let's verify that we can access the cluster by running kubectl cluster-info. You should see output similar to,
Kubernetes control plane is running at https://7D3A825AA8E29A730955A485709E89D2.gr7.us-east-1.eks.amazonaws.com
CoreDNS is running at https://7D3A825AA8E29A730955A485709E89D2.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
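Before moving on, you can also sanity-check the node groups and default addons described above. The cluster name below is a placeholder; use whatever name the Terraform configuration gave your cluster:

```sh
# Managed addons installed on the cluster (expect the set listed above, e.g. vpc-cni, kube-proxy,
# coredns and the Pod Identity agent)
aws eks list-addons --cluster-name <your-cluster-name>

# Nodes from both node groups with their instance types (GPU nodes should show a G-family type)
kubectl get nodes -L node.kubernetes.io/instance-type
```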
Congratulations! You've deployed a barebones EKS cluster.
In addition to the barebones EKS cluster, we'll need additional functionality in the form of addons on our road to getting Kubeflow up and running.
- Cert Manager - Cert-manager creates TLS certificates for workloads in your Kubernetes or OpenShift cluster and renews the certificates before they expire.
- AWS Load Balancer Controller - Helps manage Elastic Load Balancers for a Kubernetes cluster.
- External DNS - Makes Kubernetes resources discoverable via public DNS servers. Unlike KubeDNS, however, it's not a DNS server itself, but merely configures other DNS providers (e.g. AWS Route 53 or Google Cloud DNS) accordingly.
make deploy-eks-addons addons='vpc-cni cert-manager awslb-controller external-dns'
After the addons are deployed, ExternalDNS requires a restart. I'm not entirely sure why. Run
kubectl rollout restart deployment/external-dns -n kube-system
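To see how the AWS Load Balancer Controller and ExternalDNS cooperate, here is a hypothetical Ingress: the alb ingress class tells the controller to provision an ALB, and the hostname annotation tells ExternalDNS to publish a matching Route 53 record. The namespace, hostname and backend service are placeholders, and this assumes ExternalDNS is allowed to manage a hosted zone for your domain; cert-manager isn't shown here since it's mostly consumed by in-cluster components (such as webhooks) installed later.

```sh
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  namespace: default
  annotations:
    # Handled by the AWS Load Balancer Controller
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Handled by ExternalDNS
    external-dns.alpha.kubernetes.io/hostname: demo.example.com
spec:
  ingressClassName: alb
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port:
                  number: 80
EOF
```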
Our next step on the journey of getting Kubeflow up on AWS is setting up Istio. Istio's powerful features provide a uniform and efficient way to secure, connect, and monitor services. Kubeflow is a collection of tools, frameworks and services that are stitched together under Istio to provide a seamless ML platform. Below are some of the features Kubeflow leverages,
- Secure service-to-service communication in a cluster with mutual TLS encryption, strong identity-based authentication and authorization
- Automatic load balancing for HTTP, gRPC, WebSocket, and TCP traffic
- Fine-grained control of traffic behavior with rich routing rules, retries, failovers, and fault injection
- A pluggable policy layer and configuration API supporting access controls, rate limits and quotas
- Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress
This installation of Istio has been set up in ambient mode.
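In ambient mode there are no sidecars to inject; workloads are enrolled into the mesh by labeling their namespace. For example (the namespace is a placeholder):

```sh
# Enroll a namespace into the ambient mesh; its pods are then handled by the node-level ztunnel
kubectl label namespace demo istio.io/dataplane-mode=ambient
```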
- Istio and a number of components we will be installing depend on Prometheus for monitoring. We won't be covering it in depth here, only installing it. Run,
make deploy-prometheus
Prometheus was installed in the monitoring namespace. Verify the pods are running,
kubectl -n monitoring get pods
- To deploy Istio to our EKS cluster run,
make deploy-istio
After the installation is complete, you can verify that the Istio control plane is up with,
kubectl get pods -n istio-system
In addition to the Istio control plane, the following tools are installed to support the service mesh,
- Kiali - Configure, visualize, validate and troubleshoot your mesh! Kiali is a console for the Istio service mesh.
- Prometheus - Prometheus is an open-source systems monitoring and alerting toolkit.
To access the Kiali dashboard, run the following,
kubectl port-forward svc/kiali 20001:20001 -n istio-system
Then navigate to http://localhost:20001/ in your preferred web browser. Learn more about accessing the Kiali dashboard here.
If you want the operator to re-process the Kiali CR (called “reconciliation”) without having to change the Kiali CR’s spec fields, you can modify any annotation on the Kiali CR itself. This will trigger the operator to reconcile the current state of the cluster with the desired state defined in the Kiali CR, modifying cluster resources if necessary to get them into their desired state. Here is an example illustrating how you can modify an annotation on a Kiali CR:
kubectl annotate kiali my-kiali -n istio-system --overwrite kiali.io/reconcile="$(date)"
For more details on the CR, see Kiali CR.
The installation of Kubeflow is done by leveraging the Terraform provided by AWS Labs in the Kubeflow on AWS project. In addition to the addons installed for the baseline EKS cluster above, we're also setting up the following addons to support Kubeflow,
- EBS-CSI Driver - Provides a CSI interface used by Container Orchestrators to manage the lifecycle of Amazon EBS volumes.
- EFS-CSI Driver - Provides a CSI interface used by Container Orchestrators to manage the lifecycle of Amazon EFS volumes.
- FSx CSI Driver - Provides a CSI specification for container orchestrators (CO) to manage the lifecycle of Amazon FSx for Lustre filesystems.
- NVIDIA GPU Operator - The NVIDIA GPU Operator simplifies the deployment and management of GPU-accelerated applications on Kubernetes.
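Once these addons are deployed, a quick way to confirm them is to list the registered CSI drivers and check that the GPU operator is advertising GPU capacity on the GPU node group:

```sh
# CSI drivers registered with the cluster (expect entries for EBS, EFS and FSx)
kubectl get csidrivers

# Allocatable GPUs per node, exposed by the NVIDIA device plugin that the GPU operator manages
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```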
The Kubeflow Training Operator allows you to use Kubernetes workloads to train large models, either through Kubernetes Custom Resource APIs or using the Training Operator Python SDK. The operator's primary use case is the ability to run distributed training and fine-tuning. After installing the operator we'll demonstrate how to run a training job.
Before running the example you'll need to,
- Have Kubectl installed.
- Run make add-cluster to add the dev cluster created above to your kubeconfig.
- Have Docker installed.
With the above in place let's go through a few examples.
We're going to run a Jupyter notebook through Docker locally to submit and monitor a distributed Pytorch training job. To run the Jupyter notebook,
- Run the following at the root of the project,
docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -e GRANT_SUDO=yes \
--user root \
-v ~/.kube:/home/jovyan/.kube \
-v ~/.aws:/home/jovyan/.aws \
-v ./examples:/home/jovyan/work \
quay.io/jupyter/pytorch-notebook
- Navigate to the URL provided in the output of the command above. For example,
...
Or copy and paste one of these URLs:
http://e16d3018c90e:8888/lab?token=161c548d8a560266b0e76323276322a1f3ecaf8da32d1de2
http://127.0.0.1:8888/lab?token=161c548d8a560266b0e76323276322a1f3ecaf8da32d1de2
- Open the notebook at work/training-operator/pytorchjobs/python-sdk-distributed-training.ipynb and follow the instructions.
That's it! You've run your first distributed training job using the Kubeflow Training Operator.
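As mentioned above, the Training Operator can also be driven purely through its Custom Resources rather than the Python SDK. Here is a minimal PyTorchJob sketch; the image and command are placeholders for your own training container:

```sh
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch              # the container must be named "pytorch"
              image: <your-training-image>
              command: ["python", "/workspace/train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>
              command: ["python", "/workspace/train.py"]
EOF

# Follow the job and its pods
kubectl get pytorchjobs
kubectl get pods -l training.kubeflow.org/job-name=pytorch-dist-demo
```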
Coming Soon
We make use of FSx for Lustre to provide a high-performance file system for Kubeflow. This supports fast loading of large datasets for model training.
TODO: How does this get leveraged in jobs?
We're using the familiar good ole PR-based workflow. This means IaC changes are validated and planned in the PR, and once approved, the infrastructure is deployed/applied on merge to main. The workflow is as follows:
- Infrastructure changes are made on a branch and a PR is created against main
- Terragrunt validate and plan are run on any changes.
- Validation and planning are run on every push to a branch
- Reviews and approvals happen on the PR. Once the PR is approved, it is merged into main
- IaC changes from the PR merge are then applied.
AWS is accessed from GitHub Actions using OpenID Connect. GitHub acts as an Identity Provider (IdP) and AWS as a Service Provider (SP). Authentication happens on GitHub, and then GitHub “passes” our user to an AWS account, saying “this is really John Smith”, and AWS performs the “authorization”, that is, AWS checks whether this John Smith can create new resources.
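Typically this means the workflow requests an OIDC token (permissions: id-token: write) and exchanges it for short-lived credentials via the aws-actions/configure-aws-credentials action, while on the AWS side there is an IAM OIDC identity provider for token.actions.githubusercontent.com and an IAM role whose trust policy restricts which repository may assume it. The role name below is a placeholder, but you can inspect what the Terraform created with:

```sh
# The OIDC identity provider that GitHub Actions authenticates against
aws iam list-open-id-connect-providers

# The trust policy of the role the workflow assumes (role name is a placeholder)
aws iam get-role \
  --role-name <github-actions-role> \
  --query 'Role.AssumeRolePolicyDocument' \
  --output json
```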