Amazon-eks-gpu AMI nvidia-container-toolkit dependency questions #1560

Closed
jrleslie opened this issue Jan 8, 2024 · 14 comments

Comments

@jrleslie

jrleslie commented Jan 8, 2024

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): p4d, p4de
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion):
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version):
  • AMI Version: amazon-eks-gpu-node-1.26-v20231230
  • Kernel (e.g. uname -a): 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0014dd7b6a19f6ad5"
BUILD_TIME="Sat Dec 30 05:25:52 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
  1. While testing with the latest amazon-eks-gpu-node-1.26-v20231230 AMI, we noticed some package discrepancies and drift between it and the Deep Learning AMI (Amazon Linux 2). We're wondering why the amazon-eks-gpu-* image does not include the nvidia-container-toolkit and nvidia-container-toolkit-base packages, while the DLAMI does. Is it possible to have these included in the amazon-eks-gpu-* image? If not, what is the reasoning for leaving them out of the EKS GPU image? NVIDIA's docs call out the toolkit as the recommended way to get the NVIDIA container stack functioning properly.

  2. The amazon-eks-gpu-node-* AMI ships libnvidia-container and its supporting dependencies at versions released in 2020 (all pinned to 1.4.0-1). Are there plans to update them? The Deep Learning AMI has those same dependencies at 1.13.5-1, released in July 2023.

  3. Is the Deep Learning AMI (Amazon Linux 2) compatible with EKS, or does the amazon-eks-gpu-node AMI need to be used? The docs are a bit unclear on this.

@bryantbiggs
Contributor

bryantbiggs commented Jan 8, 2024

Hi @jrleslie - thank you for the issue! In terms of your first question, are you encountering errors running containers on the EKS GPU AMI? Any details you might be able to share about your setup would be quite helpful

On the EKS GPU AMI, we do provide the necessary host components (NVIDIA driver, the NVIDIA container runtime and OCI hook, EFA kernel module, etc.), but the CUDA runtimes, libraries, and tools should be provided via the application containers. You can see this depicted in the NVIDIA docs here https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver

This also aligns with the documentation on the container toolkit project https://github.com/NVIDIA/nvidia-container-toolkit#getting-started

Make sure you have installed the NVIDIA driver for your Linux Distribution. Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed.

The drivers are provided on the AMI; the CUDA runtime/libraries/tools can be supplied via the application container, since they have compatibility constraints with the frameworks that use them. For example, one application may require a PyTorch version that uses CUDA 11.7, while another application running a different PyTorch version may require CUDA 12.2. Providing the framework and CUDA runtime/library components in the container isolates these dependencies and allows both applications to run on the same set of GPU-enabled EKS nodes.
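To make that split concrete, here is a rough sketch of what you would expect to see on the host versus inside an application container (assuming the application image ships its own CUDA toolkit and PyTorch; adjust for your image):

# On the node (host): only the NVIDIA driver is present, provided by the AMI
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Inside the application container: the CUDA toolkit/runtime is whatever the image ships
nvcc --version                                        # present only if the image installs the CUDA toolkit
python -c 'import torch; print(torch.version.cuda)'   # the CUDA version the PyTorch wheel was built against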

@cartermckinnon
Member

The Deep Learning AMIs are not maintained by EKS.

We're doing a major rework of our NVIDIA dependencies to address #1494, and those changes will land in an AMI release soon. The nvidia-container-toolkit will be installed.

@jrleslie
Author

jrleslie commented Jan 8, 2024

@cartermckinnon what is the timeline for release of an AMI with the nvidia-container-toolkit installed? And can you share the target version?

@bryantbiggs yes - we're seeing issues when executing PyTorch jobs on nodes running amazon-eks-gpu-node-1.26-v20231230. I believe our app container has everything it needs, but CUDA isn't exposed properly without the nvidia-container-toolkit installed directly on the instance (we don't leverage the gpu-operator). Our issue is similar to what's described here: https://stackoverflow.com/questions/63751883/using-gpu-inside-docker-container-cuda-version-n-a-and-torch-cuda-is-availabl

When we run nvidia-smi from inside our app container, the CUDA version comes back as CUDA Version: N/A. After we install the nvidia-container-toolkit on the host, everything works fine inside the running container.
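For reference, this is roughly how we're checking it (a sketch; the pod name is a placeholder for our actual workload, which has PyTorch installed):

kubectl exec gpu-app-pod -- nvidia-smi
kubectl exec gpu-app-pod -- python -c 'import torch; print(torch.cuda.is_available())'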

@jrleslie
Author

jrleslie commented Jan 8, 2024

@cartermckinnon to be more clear, can you share the target version of nvidia-container-toolkit that will be installed?

@bryantbiggs
Contributor

bryantbiggs commented Jan 8, 2024

@jrleslie are you using the NVIDIA container images as your base? If not, you might need to expose some environment variables (e.g. https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/12.2.0/centos7/base/Dockerfile?ref_type=heads#L36-37 - there may be others depending on which image layer, but that's a useful reference).

For example, this should show the libcuda.so version (driver CUDA version) in the output:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: nvidia/cuda:12.2.0-runtime-ubi8
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi && tail -f /dev/null']
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'

@cartermckinnon
Member

cartermckinnon commented Jan 8, 2024

can you share the target version of nvidia-container-toolkit that will be installed to be more clear.

It'll be the latest one 👍

@jrleslie
Author

jrleslie commented Jan 8, 2024

@bryantbiggs we aren't using the nvidia/cuda image. We have a custom image with the CUDA libraries and PyTorch installed on top. The libcuda.so driver version isn't exposed in our app container. If my understanding is correct, how is that supposed to be surfaced from the host without the toolkit or gpu-operator?

@bryantbiggs
Contributor

@jrleslie I can't say for certain, but you can look through the NVIDIA images to see the different values they add into their containers. At minimum, I would suggest adding the following to your image to see if it resolves your issue. Note that the path where CUDA is installed in the container may vary depending on how you installed it, so you may need to adjust it if it differs from what's shown below:

# Make the NVIDIA driver libraries resolvable by the dynamic linker
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# Tell the NVIDIA container runtime which devices and driver capabilities to expose
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
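A quick way to confirm the driver bits were injected into a running pod built from that image (a sketch; the pod name is a placeholder):

kubectl exec gpu-app-pod -- sh -c 'ldconfig -p | grep libcuda'   # should list libcuda.so.* once the runtime mounts the driver
kubectl exec gpu-app-pod -- nvidia-smi                           # should report a CUDA Version rather than N/A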

@bryantbiggs
Contributor

tl;dr - everything does work today on the EKS GPU AMI, but the container image needs to ensure the appropriate NVIDIA configuration values are exposed (which the NVIDIA containers currently provide)

@alenawang

alenawang commented Jan 9, 2024

Hey @bryantbiggs, I'm on @jrleslie's team and wanted to add some additional context to our query. The issue we're currently running into is that the NVIDIA driver files, which are available on the host through the AMI, are not available to the container; e.g. running find locates libcuda.so on the host but not in the container. In the container image we install the CUDA runtime/libraries/etc. but not the driver.

In non-EKS clusters we rely on the NVIDIA container toolkit to mount the drivers for use in the container, so we are wondering what mechanism exposes the host drivers to the container here if the container toolkit isn't used. Or do we have to set up the mounts manually? It's possible we have something misconfigured, so any containerd config, log paths, etc. you could point us to for debugging would be really helpful. Thanks!
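For reference, these are the host-side checks we're planning to run (a sketch; the paths assume a default containerd setup and may differ on the EKS AMI):

# Confirm the NVIDIA runtime/OCI hook is wired into containerd
grep -A5 -i nvidia /etc/containerd/config.toml

# List the driver files libnvidia-container would inject into a container
nvidia-container-cli list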

@alenawang

This issue is resolved now - setting the NVIDIA_DRIVER_CAPABILITIES env var per your suggestion controlled whether the driver was mounted. This isn't necessary with the DLAMI for some reason; not sure if that's due to the nvidia container toolkit version. Thank you @bryantbiggs for your help!

@jrleslie jrleslie closed this as completed Jan 9, 2024
@bryantbiggs
Contributor

bryantbiggs commented Jan 9, 2024

Hi @alenawang - happy to hear you were able to resolve it! FYI - you won't find libcuda.so in the container image since it is installed as part of the driver installation, so you will only see it on the host itself.

This isn't necessary for some reason with the DLAMI,

Any image that uses the NVIDIA container images as its base, or that itself exports the necessary NVIDIA environment variables, will work. Since you are creating a custom image that does not build on top of those images, you need to add the environment variables yourself.

For example, even this simple container image will work for the nvidia-smi command:

FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

With a simple pod spec like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-demo
      image: <CUSTOM-IMAGE-URI>
      command: ['/bin/sh', '-c']
      args: ['nvidia-smi && tail -f /dev/null']
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: 'nvidia.com/gpu'
      operator: 'Equal'
      value: 'true'
      effect: 'NoSchedule'

Returns:

kubectl logs gpu-test-nvidia-smi 
Tue Jan  9 19:25:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

@bryantbiggs
Contributor

bryantbiggs commented Jan 9, 2024

And for transparency, this is my cluster configuration (below). The only component I have added to the cluster is the nvidia-device-plugin to expose and allocate the GPU(s):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.20"

  cluster_name    = local.name
  cluster_version = "1.28"

  cluster_endpoint_public_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    # This nodegroup is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 2
      desired_size = 2
    }

    gpu = {
      ami_type       = "AL2_x86_64_GPU"
      instance_types = ["g4dn.8xlarge"]

      min_size     = 1
      max_size     = 1
      desired_size = 1

      taints = {
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "exists"
          effect = "NO_SCHEDULE"
        }
      }

      labels = {
        gpu = "true"
      }
    }
  }
}
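For reference, the device plugin can be installed with something like the following (a sketch using the upstream Helm chart; the release name and namespace are arbitrary choices):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin --namespace kube-system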

@jsonmp-k8

Yes, we do have the nvidia-device-plugin installed on the EKS clusters.
