This repository serves as a playground and exercise in how to build out an ML focused application platform from scratch on top of AWS EKS that adheres to Infrastructure as Code (IaC) and GitOps practices using tools like Terraform, Terragrunt, Kubernetes, GitHub Actions and Kubeflow.
Should you find any of this project useful, please consider donating through,
At a minimum it helps with the AWS bill.
If you're following along, at a minimum you'll need the following,
- AWS Account with the following service quotas (a quick way to check them is sketched after this list),
- Amazon EC2 Instances - Running On-Demand G and VT instances = 32
- Amazon EC2 Instances - All G and VT Spot Instance Requests = 32
- Docker - for containerization.
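If you want to confirm the EC2 quotas listed above before provisioning anything, a check along these lines should work (the JMESPath filter simply matches the two quota names above):

```sh
# List the current G and VT instance quotas for EC2 (read-only; requires Service Quotas permissions)
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'G and VT')].{Name:QuotaName,Value:Value}" \
  --output table
```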
I manage tool installations using asdf, but to each their own. If you do use asdf, there is a .tool-versions file in the root of the project which can be used to install all of the tools listed below by running asdf install. After doing this, refresh your shell by running exec $SHELL.
- AWS CLI - setup with administrative access to make demonstration easy.
- GitHub CLI - for managing GitHub resources.
- Terraform - for infrastructure as code.
- Terragrunt - for managing multiple Terraform environments.
- Kubectl - for managing Kubernetes clusters.
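If you're going the asdf route and don't already have the plugins set up, the flow is roughly the following (the plugin names are the common community ones and may differ from what the repo assumes):

```sh
# Add the asdf plugins for the tools above (one-time), then install the versions pinned in .tool-versions
asdf plugin add awscli
asdf plugin add github-cli
asdf plugin add terraform
asdf plugin add terragrunt
asdf plugin add kubectl
asdf install
exec $SHELL
```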
Verify you have the prerequisites installed by running the following,
make
You should see version output for each of the tools listed above.
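If the make target doesn't behave as expected, the equivalent manual checks are simply each tool's version command:

```sh
aws --version
gh --version
terraform version
terragrunt --version
kubectl version --client
docker --version
```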
Let's first fork and clone the repo,
gh repo fork archegos-labs/platform --clone ~/projects/platform; cd ~/projects/platform
Next, validate that the IaC setup will run with,
make plan-all org_name="ExampleOrg"
This runs Terragrunt / Terraform plan on all the modules in the repository.
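Under the hood this is presumably a thin wrapper around Terragrunt's run-all command. If you want to drop down to Terragrunt directly, something like the following should be roughly equivalent, assuming org_name is wired through as a Terraform variable (check the Makefile for the exact invocation):

```sh
# Plan every module from the repository root
TF_VAR_org_name="ExampleOrg" terragrunt run-all plan

# Or plan a single module by running from its directory
cd <path-to-module> && TF_VAR_org_name="ExampleOrg" terragrunt plan
```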
Our first step is to set up a Virtual Private Cloud (VPC) and its subnets where our EKS cluster will live. The VPC will look like the following,
The notable features of the VPC setup are,
- There are sufficient IP addresses for the cluster and apps on it. The IP CIDR block is 10.0.0.0/16.
- Subnets in multiple availability zones (AZ) for high availability.
- There are private and public subnets in each availability zone for granular control of inbound and outbound traffic. Private subnets are for the EKS nodes with no direct internet access and public subnets for receiving and managing internet traffic.
- A NAT Gateway in the public subnet of each availability zone to ensure a zone-independent architecture and reduce cross-AZ data transfer costs.
- The default NACL is associated with each subnet in the VPC.
If your AWS account and CLI are set up, you can run the following to create the VPC and subnets,
make deploy-vpc
After the VPC is created, you can view many of its components as illustrated in the diagram above by running,
aws resourcegroupstaggingapi get-resources \
--resource-type-filters \
ec2:vpc \
ec2:subnet \
ec2:natgateway \
ec2:internet-gateway \
ec2:route-table \
ec2:elastic-ip \
ec2:network-interface \
ec2:security-group \
ec2:network-acl \
--query 'ResourceTagMappingList[].{ARN:ResourceARN,Name:Tags[?Key==`Name`].Value | [0]}' \
--output table
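To specifically confirm the one-NAT-Gateway-per-AZ layout described above, you can also map each NAT Gateway to its subnet and cross-reference the subnets against their availability zones:

```sh
# Each NAT Gateway should live in a different public subnet (and therefore a different AZ)
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[].{Id:NatGatewayId,Subnet:SubnetId}' \
  --output table

# Subnets with their AZs and CIDR ranges
aws ec2 describe-subnets \
  --query 'Subnets[].{Subnet:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock}' \
  --output table
```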
Next, we'll set up an EKS cluster within the VPC laid out above. The basics of the setup will look like,
The most notable features of the EKS cluster setup are,
- Two node groups, one for general purpose workloads and one for GPU workloads.
- API server endpoint access is public and private.
- Appropriate security groups attached to network interfaces.
- A default set of addons fundamental to the operation of the cluster.
- EKS Pod Identity - Allows you to assign IAM roles to Kubernetes service accounts, and is the preferred way to grant AWS permissions to workloads.
- VPC CNI - Provides native integration with the AWS VPC and works in underlay mode. In underlay mode, Pods and hosts sit at the same network layer and share the VPC address space, so a Pod's IP address is consistent from both the cluster and VPC perspective.
- Kube Proxy - Maintains network rules on each worker node so that traffic to a Kubernetes Service is routed and load balanced across the Service's backing Pods.
- CoreDNS - Provides service discovery and DNS resolution for Kubernetes.
- To deploy the EKS cluster, run the following,
make deploy-eks
- Configure kubectl for access to the cluster by running,
make add-cluster
- Lastly, let's verify that we can access the cluster by running kubectl cluster-info. You should see output similar to,
Kubernetes control plane is running at https://7D3A825AA8E29A730955A485709E89D2.gr7.us-east-1.eks.amazonaws.com
CoreDNS is running at https://7D3A825AA8E29A730955A485709E89D2.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
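Before moving on, you can also sanity-check the node groups and default addons described above. The cluster name below is a placeholder; use whatever name the Terraform configuration gave your cluster:

```sh
# Managed addons installed on the cluster (expect the set listed above, e.g. vpc-cni, kube-proxy,
# coredns and the Pod Identity agent)
aws eks list-addons --cluster-name <your-cluster-name>

# Nodes from both node groups with their instance types (GPU nodes should show a G-family type)
kubectl get nodes -L node.kubernetes.io/instance-type
```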
Congratulations! You've deployed a barebones EKS cluster.
In addition to the barebones EKS cluster, we'll need additional functionality in the form of addons on our road to getting Kubeflow up and running.
- Cert Manager - Cert-manager creates TLS certificates for workloads in your Kubernetes or OpenShift cluster and renews the certificates before they expire.
- AWS Load Balancer Controller - Helps manage Elastic Load Balancers for a Kubernetes cluster.
- External DNS - Makes Kubernetes resources discoverable via public DNS servers. Unlike KubeDNS, however, it's not a DNS server itself, but merely configures other DNS providers (e.g. AWS Route 53 or Google Cloud DNS) accordingly.
make deploy-eks-addons addons='vpc-cni cert-manager awslb-controller external-dns'
After the addons are deployed, ExternalDNS requires a restart. I'm not entirely sure why. Run
kubectl rollout restart deployment/external-dns -n kube-system
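To see how the AWS Load Balancer Controller and ExternalDNS cooperate, here is a hypothetical Ingress: the alb ingress class tells the controller to provision an ALB, and the hostname annotation tells ExternalDNS to publish a matching Route 53 record. The namespace, hostname and backend service are placeholders, and this assumes ExternalDNS is allowed to manage a hosted zone for your domain; cert-manager isn't shown here since it's mostly consumed by in-cluster components (such as webhooks) installed later.

```sh
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo
  namespace: default
  annotations:
    # Handled by the AWS Load Balancer Controller
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Handled by ExternalDNS
    external-dns.alpha.kubernetes.io/hostname: demo.example.com
spec:
  ingressClassName: alb
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo
                port:
                  number: 80
EOF
```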
Our next step on the journey of getting Kubeflow up on AWS is setting up Istio. Istio's powerful features provide a uniform and efficient way to secure, connect, and monitor services. Kubeflow is a collection of tools, frameworks and services that are stitched together under Istio to provide a seamless ML platform. Below are some of the features Kubeflow leverages,
- Secure service-to-service communication in a cluster with mutual TLS encryption, strong identity-based authentication and authorization
- Automatic load balancing for HTTP, gRPC, WebSocket, and TCP traffic
- Fine-grained control of traffic behavior with rich routing rules, retries, failovers, and fault injection
- A pluggable policy layer and configuration API supporting access controls, rate limits and quotas
- Automatic metrics, logs, and traces for all traffic within a cluster, including cluster ingress and egress
This installation of Istio has been set up in ambient mode.
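In ambient mode there are no sidecars to inject; workloads are enrolled into the mesh by labeling their namespace. For example (the namespace is a placeholder):

```sh
# Enroll a namespace into the ambient mesh; its pods are then handled by the node-level ztunnel
kubectl label namespace demo istio.io/dataplane-mode=ambient
```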
- Istio and a number of components we will be installing depend on Prometheus for monitoring. We won't be covering it in depth here, only installing it. Run,
make deploy-prometheus
Prometheus was installed in the monitoring namespace. Verify the pods are running,
kubectl -n monitoring get pods
- To deploy Istio to our EKS cluster run,
make deploy-istio
After the installation is complete, you can verify that the Istio control plane is up with,
kubectl get pods -n istio-system
In addition to the Istio control plane, the following tools are installed to support the service mesh,
- Kiali - Configure, visualize, validate and troubleshoot your mesh! Kiali is a console for the Istio service mesh.
- Prometheus - Prometheus is an open-source systems monitoring and alerting toolkit.
To access the Kiali dashboard, run the following,
kubectl port-forward svc/kiali 20001:20001 -n istio-system
Then navigate to http://localhost:20001/ in your preferred web browser. Learn more about accessing the Kiali dashboard here.
If you want the operator to re-process the Kiali CR (called “reconciliation”) without having to change the Kiali CR’s spec fields, you can modify any annotation on the Kiali CR itself. This will trigger the operator to reconcile the current state of the cluster with the desired state defined in the Kiali CR, modifying cluster resources if necessary to get them into their desired state. Here is an example illustrating how you can modify an annotation on a Kiali CR:
kubectl annotate kiali my-kiali -n istio-system --overwrite kiali.io/reconcile="$(date)"
For more details on the CR, see Kiali CR.
The installation of Kubeflow is done by leveraging the Terraform provided by AWS Labs in the Kubeflow on AWS project. In addition to the addons installed for the baseline EKS cluster above, we're also setting up the following addons to support Kubeflow,
- EBS-CSI Driver - Provides a CSI interface used by Container Orchestrators to manage the lifecycle of Amazon EBS volumes.
- EFS-CSI Driver - Provides a CSI interface used by Container Orchestrators to manage the lifecycle of Amazon EFS volumes.
- FSx CSI Driver - Provides a CSI specification for container orchestrators (CO) to manage the lifecycle of Amazon FSx for Lustre filesystems.
- NVIDIA GPU Operator - The NVIDIA GPU Operator simplifies the deployment and management of GPU-accelerated applications on Kubernetes.
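Once these addons are deployed, a quick way to confirm them is to list the registered CSI drivers and check that the GPU operator is advertising GPU capacity on the GPU node group:

```sh
# CSI drivers registered with the cluster (expect entries for EBS, EFS and FSx)
kubectl get csidrivers

# Allocatable GPUs per node, exposed by the NVIDIA device plugin that the GPU operator manages
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```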
The Kubeflow Training Operator allows you to use Kubernetes workloads to train large models, either through Kubernetes Custom Resource APIs or using the Training Operator Python SDK. The operator's primary use case is the ability to run distributed training and fine-tuning. After installing the operator we'll demonstrate how to run a training job.
Before running the example you'll need to,
- Have Kubectl installed.
- Run make add-cluster to add the dev cluster created above to your kubeconfig.
- Have Docker installed.
With the above in place let's go through a few examples.
We're going to run a Jupyter notebook through Docker locally to submit and monitor a distributed Pytorch training job. To run the Jupyter notebook,
- Run the following at the root of the project,
docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -e GRANT_SUDO=yes \
--user root \
-v ~/.kube:/home/jovyan/.kube \
-v ~/.aws:/home/jovyan/.aws \
-v ./examples:/home/jovyan/work \
quay.io/jupyter/pytorch-notebook
- Navigate to the URL provided in the output of the command above. For example,
...
Or copy and paste one of these URLs:
http://e16d3018c90e:8888/lab?token=161c548d8a560266b0e76323276322a1f3ecaf8da32d1de2
http://127.0.0.1:8888/lab?token=161c548d8a560266b0e76323276322a1f3ecaf8da32d1de2
- Open the notebook at work/training-operator/pytorchjobs/python-sdk-distributed-training.ipynb and follow the instructions.
That's it! You've run your first distributed training job using the Kubeflow Training Operator.
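As mentioned above, the Training Operator can also be driven purely through its Custom Resources rather than the Python SDK. Here is a minimal PyTorchJob sketch; the image and command are placeholders for your own training container:

```sh
kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch              # the container must be named "pytorch"
              image: <your-training-image>
              command: ["python", "/workspace/train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>
              command: ["python", "/workspace/train.py"]
EOF

# Follow the job and its pods
kubectl get pytorchjobs
kubectl get pods -l training.kubeflow.org/job-name=pytorch-dist-demo
```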
Coming Soon
We make use of FSx for Lustre to provide a high-performance file system for Kubeflow. This supports fast loading of large datasets for model training.
TODO: How does this get leveraged in jobs?
We're using the familiar good ole PR-based workflow. This means IaC changes are validated and planned in the PR, and once approved, the infrastructure is deployed/applied on merge to main. The workflow is as follows:
- Infrastructure changes are made on a branch and a PR is created against main
- Terragrunt validate and plan are run on any changes.
- Validation and planning are run on every push to a branch
- Reviews and approvals happen on the PR. Once the PR is approved, it is merged into main
- IaC changes from the PR merge are then applied.
AWS is accessed from GitHub Actions using OpenID Connect. GitHub acts as an Identity Provider (IdP) and AWS as a Service Provider (SP). Authentication happens on GitHub, and then GitHub “passes” our user to an AWS account, saying “this is really John Smith”, and AWS performs the “authorization”, that is, AWS checks whether this John Smith can create new resources.
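Typically this means the workflow requests an OIDC token (permissions: id-token: write) and exchanges it for short-lived credentials via the aws-actions/configure-aws-credentials action, while on the AWS side there is an IAM OIDC identity provider for token.actions.githubusercontent.com and an IAM role whose trust policy restricts which repository may assume it. The role name below is a placeholder, but you can inspect what the Terraform created with:

```sh
# The OIDC identity provider that GitHub Actions authenticates against
aws iam list-open-id-connect-providers

# The trust policy of the role the workflow assumes (role name is a placeholder)
aws iam get-role \
  --role-name <github-actions-role> \
  --query 'Role.AssumeRolePolicyDocument' \
  --output json
```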