Update use of CDI API #12825
Conversation
libpod/container_internal_linux.go
Outdated
registry := cdi.GetRegistry()
unresolved, err := registry.InjectDevices(g.Config, c.config.CDIDevices...)
if len(unresolved) > 0 {
	logrus.Debugf("Could not resolve CDI devices: %v", unresolved)
Seems like this ought to be more serious than a `Debugf`? `logrus.Warnf` maybe?
Yes, this should probably be a warning. This would indicate that the state of the registry has changed between determining the list of CDI devices (where `GetDevice` is called) and here, where the actual injection takes place.
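A minimal sketch of what the suggested change might look like, reusing the variables from the diff above (the surrounding function's return shape and the error wrapping are assumptions, not Podman's actual code):

```go
unresolved, err := registry.InjectDevices(g.Config, c.config.CDIDevices...)
if len(unresolved) > 0 {
	// Unresolved devices suggest the on-disk CDI specs changed between
	// device detection and injection, so surface this as a warning.
	logrus.Warnf("Could not resolve CDI devices: %v", unresolved)
}
if err != nil {
	return nil, fmt.Errorf("error setting up CDI devices: %w", err)
}
```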
One question I did have was whether there would be a recommended way to propagate a reference to the registry from `MakeContainer` in `create_container.go` to its use here in `generateSpec`?
The container's image name and image ID are stored in its configuration at this point, but there's no guarantee those are actually populated, since Podman supports creating containers from user-specified root filesystems.
By "registry" in this case I mean a cache of the CDI devices defined by spec files available on the system (see the interface here). The idea is that spec files in, for example, `/etc/cdi/*.json` are loaded once and then these specifications are used.
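A rough sketch of how such a cache is consumed (method names follow my reading of the `pkg/cdi` Registry interface in the container-device-interface library around this revision; treat the exact API as an assumption):

```go
package main

import (
	cdi "github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
	"github.com/sirupsen/logrus"
)

func logAvailableCDIDevices() {
	// GetRegistry returns a process-wide cache that loads CDI spec files
	// from directories such as /etc/cdi once and reuses them afterwards.
	registry := cdi.GetRegistry()
	for _, device := range registry.DeviceDB().ListDevices() {
		logrus.Debugf("CDI device available: %s", device)
	}
}
```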
Ahh. Sorry, registry usually means a very different thing for containers.
I checked, and there's no location for the spec files currently listed that I can find. If one were to be added, `containers.conf` would be the most logical location: it's a centralized config file that all containers can access, where we could put the path.
I think I'm still miscommunicating what I need to know.
When processing the original list of devices from the config in `create_container.go`, we instantiate a cache that stores the state of the CDI specs for the system (defining where these are read from is currently out of scope). The question is whether there is a way to pass this instance of the cache along so that it is also used in `generateSpec` in `container_internal_linux.go`. The motivation for this is that the files defining the CDI devices may have changed on disk, so that the list of CDI devices being processed here does not have the desired effect -- hence the warning.
To be clear, I don't think this is a blocker for this particular PR, but it may be good to consider how to plumb this through in future if there is something that makes sense and does not add too much additional overhead. On the other hand, maybe @klihub has insights into the lifetime of the singleton instantiated in the `cdi` package when used at these two locations -- if we have some guarantee that these are in fact the same object, then there would be no need to pass it through explicitly (except to make the link clearer).
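For reference, my understanding (an assumption about this version of the library, not confirmed here) is that `GetRegistry` is a lazily-initialized singleton, so both call sites would see the same object:

```go
var (
	initOnce sync.Once
	reg      Registry
)

// GetRegistry returns the same cached Registry instance on every call;
// spec files are loaded on first use and reused afterwards.
func GetRegistry() Registry {
	initOnce.Do(func() {
		reg = newRegistry() // hypothetical constructor that scans /etc/cdi
	})
	return reg
}
```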
There are temporary-files directories that you could use, but they're all internal to Libpod, so accessing them from `pkg/specgen` could prove to be a problem. Is this a global directory, or a per-container one? We could definitely do a global CDI state directory inside Libpod's temporary files directory and expose that to the world.
This is a global directory. For what it's worth, I have changed the implementation to only look at whether the device specified is a valid CDI device name. This means that we are no longer instantiating this cache twice. I think this is sufficient for the time being. If this becomes a requirement we can open a discussion as to how this should be done correctly.
/approve
@elezar is this still a draft?
Ok, let's get their approval.
@elezar LGTM, with a nitpick related to CDI device reference detection.
logrus.Debugf("CDI HasDevice Error: %v", err)
}
if err == nil && isCDIDevice {
	if d := cdiDeviceDB.GetDevice(device.Path); d != nil {
nitpick: you could also do syntactic filtering/checking for CDI device references here (using `IsQualifiedName`).
One benefit of that would be that unresolvable CDI devices would go through the CDI resolution/injection attempt and come back as unresolved. This would potentially allow you to produce better/more contextually accurate error messages. AFAICT, with the current implementation such devices/errors are only caught later in the container creation pipeline, probably only in the runtime once it tries to inject the device into the container.
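Something along these lines is presumably what is meant (a sketch; `device.Path`, `s.CDIDevices`, and the surrounding loop are stand-ins for the actual specgen code, not quotes from this PR):

```go
// Purely syntactic filtering: a fully-qualified reference such as
// "vendor.com/class=name" is routed through CDI resolution, where an
// unresolvable device fails with a precise "unresolved" error instead
// of a generic stat failure later in the pipeline.
if cdi.IsQualifiedName(device.Path) {
	s.CDIDevices = append(s.CDIDevices, device.Path)
	continue
}
```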
I wanted to avoid relying on `IsQualifiedName` here. One motivation is that I want to have some flexibility (in our library) for determining whether or not a device is a CDI device, without requiring that clients such as Podman be updated to support these changes. I understand that they would require an update, but this would be limited to updating a dependency.
Another motivation is that a fully-qualified device name does not mean that the device has an associated CDI spec.
| example | device | fully qualified | has CDI spec | action at filtering | result at injection |
|---|---|---|---|---|---|
| 1 | `nvidia.com/gpu=invalid` | yes | no | assumed to be standard device | errors with `Error: stat nvidia.com/gpu=invalid: no such file or directory` |
| 2 | `nvidia.com/gpu=gpu0` | yes | yes | identified as CDI device | resolves and injects modifications for `nvidia.com/gpu=gpu0` |
| 3 | `gpu0` | no | yes | assumed to be standard device (CDI currently requires a fully-qualified name) | errors with `Error: stat gpu0: no such file or directory` |
| 4 | `/dev/nvidia0` | no | no | assumed to be standard device | injects standard device node `/dev/nvidia0` |
Using `IsQualifiedName` would change the filtering behaviour for example 1 and would result in a warning and the container being launched without injecting the device. Is this better behaviour than exiting with the error shown? Is it more desirable to have the container start with no modifications to the spec than to error out for an invalid device?
Thinking about this a bit more, I think raising an error for unresolved CDI devices instead of just issuing a warning would be more in keeping with the current behaviour.
Update: The injection will fail, as an unresolved device is considered an error by `InjectDevices`, which is used to make the spec modifications.
That's not how it would behave IMO. If device resolution/injection fails for any of the devices, inject will also return a non-nil error.
Ah. Good point. I just confirmed that `InjectDevices` also returns an error if `unresolved` is non-empty, meaning that injection will fail for any unresolved devices.
@klihub I have reworked the implementation to use `IsQualifiedName` as discussed. I abstracted this behind a function `isCDIDevice`, which could later be moved into our API if this makes sense.
This also removes the requirement that the registry be instantiated here, meaning that the error presented to the user should be clearer. When selecting a fully-qualified but unresolvable CDI device we now see:
sudo ./bin/podman run -ti --rm --net=host --device=nvidia.com/gpu=gpu3 ubuntu:18.04 nvidia-smi -L
WARN[0000] Ignoring global metacopy option, not supported with booted kernel
Error: error setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=gpu3
UPDATE: Note that I have removed the warning that was printed, as it repeats what is in the error message. If there were errors creating the registry, these are also made available in the debug logs (in the event of resolution errors).
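The helper described above could be as thin as the following (a hypothetical shape, not a quote from the PR):

```go
// isCDIDevice reports whether a --device argument should be treated as a
// CDI device reference rather than a host device path. This is a purely
// syntactic check; it does not guarantee that a matching CDI spec exists.
func isCDIDevice(device string) bool {
	return cdi.IsQualifiedName(device)
}
```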
opts = ExtractCDIDevices(s)
registry := cdi.GetRegistry()
if errs := registry.GetErrors(); len(errs) > 0 {
	logrus.Debugf("Error creating CDI Registry: %v", errs)
Should we fail container creation here?
No. These errors may not be related to the devices that we're interested in. For example, the spec from `vendorA` may be malformed, but we are using devices from `vendorB`.
ok, makes sense
logrus.Warnf("The following errors were triggered when creating the CDI Registry: %v", errs)
}
}
if err != nil {
I think the err check should happen before the data check?
If we return on `err != nil` we don't provide the warning / error output. With that said, I could include the output in the `if err != nil` block, as there would (or at least should) not be a case where `unresolved` is non-empty and `err == nil`.
OK. I have simplified things a bit. The error message will already show the unresolved devices, so there is no need to print them again. I am now also relying on the `Debugf` output to ensure that registry creation errors are available to the user, instead of repeating them.
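A sketch of the simplified flow (assuming the same `registry` and generator variables as in the earlier diffs; the exact error wrapping is an assumption):

```go
if _, err := registry.InjectDevices(g.Config, c.config.CDIDevices...); err != nil {
	// InjectDevices already names the unresolved devices in its error, so
	// they are not repeated here; registry-creation errors remain visible
	// through the Debugf output emitted when the registry was built.
	return nil, fmt.Errorf("error setting up CDI devices: %w", err)
}
```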
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: elezar, mheon, rhatdan.
This change updates the CDI API to commit 46367ec063fda9da931d050b308ccd768e824364, which addresses some inconsistencies in the previous implementation.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
/lgtm
@bart0sh: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
Catching up with this now, how does this address being able to dynamically change which GPUs are passed to which podman instances?
@vans163 your question seems out of scope for this PR, which was "just" updating the CDI dependency. Also, what do you mean by "instances"? If you mean containers, you should be able to start three of them, each selecting a different CDI device via its own `--device` flag.
If this is not what you mean, please open an issue that describes your use case and how you would expect this to work.
Yeah, this is exactly what I mean, so I would need to register devices 0-3 over here?
Because then won't the envvar `NVIDIA_VISIBLE_DEVICES={gpuX}` conflict? Also, does this mean we can keep cgroups enabled in nvidia-container-runtime?
Having a bit of difficulty with this. Running: … I get: … Though I can confirm I have the exact file in …
@TheFloatingBrain We have a CDI test utility in the repo that you could use for diagnosis. You can use it to get the list of vendors or devices your CDI Specs declare, check whether your Specs have any errors, test how an OCI Spec gets modified when you inject a particular CDI device into it, and do a few other things. You can give it a try by cloning the CDI repo (
This change updates the implementation from #10081 to use the CDI API changes by @klihub merged in cncf-tags/container-device-interface@46367ec. This also serves to indicate how CDI could be used to address #11088.
From a user perspective, the major change is that the device name specified through the `podman` command line (when executing `podman run`) must be a fully-qualified CDI device name.
This can be tested on an NVIDIA GPU-based system assuming the following:
- `nvidia-container-toolkit` is installed
- a CDI spec file exists at `/etc/cdi/nvidia.json` (a minimal illustrative spec is sketched below; note this could also be a YAML file, `/etc/cdi/nvidia.yaml`)
In which case selecting the `nvidia.com/gpu=gpu0` device yields: …
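For illustration, a minimal CDI spec declaring a single `gpu0` device might look like this (a sketch only: the `cdiVersion`, the device-node paths, and the other field values are assumptions, not the actual file from the PR description, which was elided from this page):

```json
{
  "cdiVersion": "0.3.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" },
          { "path": "/dev/nvidiactl" }
        ]
      }
    }
  ]
}
```

With such a spec in place, `--device nvidia.com/gpu=gpu0` resolves against the `nvidia.com/gpu` kind and the named `gpu0` device.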