Amazon-eks-gpu AMI nvidia-container-toolkit dependency questions #1560
Hi @jrleslie - thank you for the issue! In terms of your first question, are you encountering errors running containers on the EKS GPU AMI? Any details you can share about your setup would be quite helpful.

On the EKS GPU AMI we do provide the necessary host components (the NVIDIA driver, the NVIDIA container runtime and OCI hook, the EFA kernel module, etc.), but the CUDA runtimes, libraries, and tools should be provided by the application containers. You can see this split depicted in the NVIDIA docs here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver. It also aligns with the documentation in the container toolkit project: https://github.com/NVIDIA/nvidia-container-toolkit#getting-started
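Roughly, this is the split you'd expect to see; a hedged sketch, where the CUDA install path and hook names are assumptions that can vary by AMI release and image layout:

```bash
# On the node (EKS GPU AMI): the driver is preinstalled, so nvidia-smi reports a driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The NVIDIA container runtime / OCI hook also ship on the AMI (binary names assumed here)
command -v nvidia-container-runtime nvidia-container-runtime-hook

# Inside the application container: the CUDA toolkit is expected to come from the image itself,
# typically under /usr/local/cuda (an assumption; adjust to your image layout)
ls /usr/local/cuda/lib64/libcudart.so* 2>/dev/null || echo "CUDA runtime not found; provide it in the image"
```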
The drivers are provided on the AMI; the CUDA runtime/libraries/tools can be supplied via the application container, since they have compatibility constraints with the frameworks that leverage them (e.g. one application may require a PyTorch version that is tied to a particular CUDA version).
The Deep Learning AMIs are not maintained by EKS. We're doing a major rework of our NVIDIA dependencies to address #1494, and those changes will land in an AMI release soon.
@cartermckinnon what is the timeline for the release of an AMI with the nvidia-container-toolkit installed? And can you share the target version?

@bryantbiggs yes - we're seeing issues when executing PyTorch jobs against nodes running amazon-eks-gpu-node-1.26-v20231230. I believe our app container has everything it needs, but CUDA is not exposed properly without the nvidia-container-toolkit installed directly on the instance (we don't leverage the gpu-operator). Our issue is similar to what's described here: https://stackoverflow.com/questions/63751883/using-gpu-inside-docker-container-cuda-version-n-a-and-torch-cuda-is-availabl. When we run nvidia-smi from inside our app container, the CUDA version comes back as "CUDA Version: N/A". After we install the nvidia-container-toolkit on the host, everything works fine inside the running container.
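For reference, a minimal way to reproduce the check described above (the pod name is a placeholder):

```bash
# Run nvidia-smi inside the application container (pod name is a placeholder)
kubectl exec -it <gpu-pod> -- nvidia-smi

# A healthy setup prints both "Driver Version: ..." and "CUDA Version: ..." in the header.
# "CUDA Version: N/A" typically means the driver-side CUDA libraries (libcuda) were not
# injected into the container, e.g. because NVIDIA_DRIVER_CAPABILITIES does not include "compute".
```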
@cartermckinnon to be more clear, can you share the target version of nvidia-container-toolkit that will be installed?
@jrleslie are you using the NVIDIA container images as your base? If not, you might need to expose some environment variables (e.g. https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.2.0/centos7/base/Dockerfile?ref_type=heads#L36-37 - there may be others depending on which image layer, but just for reference). For example, a pod like the following should surface the nvidia-smi output:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: nvidia/cuda:12.2.0-runtime-ubi8
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi && tail -f /dev/null']
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'
```
It'll be the latest one 👍
@bryantbiggs we aren't using the nvidia/cuda image. We have a custom image with the CUDA libraries and PyTorch installed on top. The libcuda.so driver version isn't exposed in our app container. If my understanding is correct, how is that supposed to be surfaced from the host without the toolkit or gpu-operator?
@jrleslie I can't say for certain; you can look through the NVIDIA images to see the different values they add to their containers. At a minimum, I would suggest adding the NVIDIA environment variables (e.g. NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES, as in the Dockerfile further down this thread) to your image to see if that resolves your issue. Note that the path where CUDA is installed in the container may vary depending on how you installed it, so adjust any CUDA-related paths accordingly.
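A rough way to verify this from inside the running container (the CUDA path below is an assumption and may differ in your image):

```bash
# Check the variables the NVIDIA container runtime keys off of
env | grep -E 'NVIDIA_VISIBLE_DEVICES|NVIDIA_DRIVER_CAPABILITIES'

# If the host-side runtime injected the driver, these libraries should be resolvable
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'

# CUDA toolkit libraries come from the image itself; /usr/local/cuda is an assumed install path
ls /usr/local/cuda/lib64 2>/dev/null | head
```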
tl;dr - everything does work today on the EKS GPU AMI, but the container image needs to ensure the appropriate NVIDIA configuration values are exposed (which the NVIDIA container images currently provide).
Hey @bryantbiggs, I'm on @jrleslie's team and wanted to add some additional context to our query. The issue we're currently running into is that the NVIDIA driver files, which are available on the host through the AMI, are not available inside the container.

In non-EKS clusters we rely on the NVIDIA container toolkit to mount the drivers for use in the container, so we are wondering what mechanism is exposing the drivers on the host to the container here if you're not using the container toolkit? Or do we have to set up the mounts manually? It's possible we have something misconfigured, so if there's any containerd config, log paths, etc. you could point us to to help us debug, that would be really helpful. Thanks!
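For anyone debugging the same question, a hedged sketch of where to look on the node for the host-side pieces that perform this injection; the exact binary names and config paths are assumptions and may differ between AMI releases:

```bash
# The NVIDIA container runtime / CLI / hook that ship on the GPU AMI
command -v nvidia-container-runtime nvidia-container-cli nvidia-container-runtime-hook

# Runtime hook configuration, if present (path is an assumption)
cat /etc/nvidia-container-runtime/config.toml 2>/dev/null

# How containerd is wired up to the NVIDIA runtime, if at all
grep -i nvidia /etc/containerd/config.toml
```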
This issue is resolved now - setting the NVIDIA environment variables suggested above in our image did the trick.
Hi @alenawang - happy to hear you were able to resolve it! FYI - you won't find the driver libraries (e.g. libcuda.so) baked into the container image itself; they are injected from the host by the NVIDIA container runtime when the pod requests GPUs.
Any images that use the NVIDIA container images as their base, or that themselves export the necessary NVIDIA environment configuration values, will work. Since you are creating a custom image that does not build on top of those images, you need to add the environment variables yourself. For example, even this simple container image works for the nvidia-smi test:

```dockerfile
FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
```

With a simple deployment like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: <CUSTOM-IMAGE-URI>
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi && tail -f /dev/null']
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'
```

Running `kubectl logs gpu-test-nvidia-smi` then returns the expected nvidia-smi output.
And for transparency, this is my cluster configuration (below) - the only component I have added to the cluster is the nvidia-device-plugin to expose and allocate the GPU(s):

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.20"

  cluster_name                   = local.name
  cluster_version                = "1.28"
  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    # This nodegroup is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 2
      desired_size = 2
    }
    gpu = {
      ami_type       = "AL2_x86_64_GPU"
      instance_types = ["g4dn.8xlarge"]

      min_size     = 1
      max_size     = 1
      desired_size = 1

      taints = {
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "exists"
          effect = "NO_SCHEDULE"
        }
      }

      labels = {
        gpu = "true"
      }
    }
  }
}
```
Yes, we do have the nvidia-device-plugin installed on the EKS clusters.
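For completeness, a quick hedged way to confirm the device plugin is running and that the node advertises GPUs (labels and namespaces vary by install method, and the node name is a placeholder):

```bash
# Find the device plugin DaemonSet, wherever it was installed
kubectl get daemonsets -A | grep -i nvidia-device-plugin

# Confirm the GPU node advertises allocatable nvidia.com/gpu resources
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'
```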
Environment:
- EKS Platform version (`aws eks describe-cluster --name <name> --query cluster.platformVersion`):
- Kubernetes version (`aws eks describe-cluster --name <name> --query cluster.version`):
- Kernel (`uname -a`): … 2023 x86_64 x86_64 x86_64 GNU/Linux
- Release information (`cat /etc/eks/release` on a node):

While testing with the latest amazon-eks-gpu-node-1.26-v20231230 AMI, we noticed there are some package discrepancies and drift between it and the Deep Learning AMI (Amazon Linux 2):

1. Why does the amazon-eks-gpu-* image not include the nvidia-container-toolkit and nvidia-container-toolkit-base packages, while the DLAMI does? Is it possible to have these included in the amazon-eks-gpu-* image? If not, what is the thinking behind not including them? NVIDIA's docs call out the toolkit as the recommended approach for the NVIDIA container stack to function properly.
2. The amazon-eks-gpu-node-* AMI is using libnvidia-container supporting dependencies that were released in 2020 (all pinned to 1.4.0-1). Are there plans to update them? The Deep Learning AMI carries the same dependencies at 1.13.5-1, released in July 2023. (A quick way to check the installed versions on a node is sketched after this list.)
3. Is the Deep Learning AMI (Amazon Linux 2) compatible with EKS, or does the amazon-eks-gpu-node AMI need to be used? The docs are a bit unclear on this.
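A hedged sketch of how to compare what a node actually ships (package names are the usual ones from the NVIDIA repositories; adjust as needed):

```bash
# On a node launched from the AMI in question (Amazon Linux 2 is rpm-based),
# list the NVIDIA container stack packages and their versions
rpm -qa | grep -Ei 'nvidia-container|libnvidia-container'

# Driver version for reference
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```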