Add LXD GPU passthrough tests (New) #1577

Merged
Hook25 merged 57 commits into main from add-lxd-gpu-tests
Nov 14, 2024

@pedro-avalos pedro-avalos commented Nov 4, 2024

Description

  • Created gpu_passthrough.py within the GPGPU provider.
  • Added LXD container tests for GPU passthrough setups.
  • Created LXD and LXDVM classes that can be used to wrap an LXD container or an LXD virtual machine.
    • These could probably go into checkbox-support?
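
As a rough illustration of the wrapper idea (and of the later "Make LXD and LXDVM context managers" and "Force delete in cleanup" commits), here is a minimal sketch. It is not the code from gpu_passthrough.py: the class name LXDContainer, its methods, and the injectable runner parameter are all hypothetical, chosen so the sketch can be exercised without LXD installed. It assumes the standard lxc client commands (launch, config device add, exec, delete --force) and the nvidia.runtime instance option.

```python
import shlex
import subprocess
from typing import Callable, List


def run_command(cmd: List[str]) -> None:
    """Log a command, then run it; raises CalledProcessError on failure."""
    print("Running:", shlex.join(cmd))  # shlex.join requires Python 3.8+
    subprocess.run(cmd, check=True)


class LXDContainer:
    """Hypothetical wrapper around one LXD container (not the PR's class)."""

    def __init__(
        self,
        name: str,
        image_alias: str,
        runner: Callable[[List[str]], None] = run_command,
    ) -> None:
        self.name = name
        self.image_alias = image_alias
        self.runner = runner  # injectable so the sketch can run without lxc

    def __enter__(self) -> "LXDContainer":
        # Launch with the NVIDIA runtime enabled at creation time.
        self.runner(
            ["lxc", "launch", self.image_alias, self.name,
             "-c", "nvidia.runtime=true"]
        )
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Force delete so a still-running instance is removed as well;
        # this keeps repeated test runs idempotent. Returning None lets
        # any exception from the with-body propagate.
        self.runner(["lxc", "delete", "--force", self.name])

    def add_gpu(self) -> None:
        """Attach a GPU device named "gpu" to the container."""
        self.runner(
            ["lxc", "config", "device", "add", self.name, "gpu", "gpu"]
        )

    def run(self, cmd: List[str]) -> None:
        """Run a command inside the container."""
        self.runner(["lxc", "exec", self.name, "--"] + cmd)
```

Used as `with LXDContainer("gpu-test", "some-image-alias") as c: c.add_gpu(); c.run(["nvidia-smi"])`, the instance is deleted even if a command inside the block fails. The instance name and image alias here are placeholders.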

Resolved issues

Documentation

n/a

Tests

Tested locally on a laptop with an NVIDIA GPU. Also tested on torchtusk.

Submission from torchtusk: https://certification.canonical.com/submissions/status/293943

@pedro-avalos pedro-avalos added the enhancement New feature or request label Nov 4, 2024

codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 94.90741% with 11 lines in your changes missing coverage. Please review.

Project coverage is 91.13%. Comparing base (bdf6739) to head (d08cf53).
Report is 7 commits behind head on main.

Files with missing lines                  Patch %   Lines
providers/gpgpu/bin/gpu_passthrough.py    94.90%    10 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1577       +/-   ##
===========================================
+ Coverage   48.03%   91.13%   +43.10%     
===========================================
  Files         371        3      -368     
  Lines       39850      327    -39523     
  Branches     6734       38     -6696     
===========================================
- Hits        19140      298    -18842     
+ Misses      19993       28    -19965     
+ Partials      717        1      -716     
Flag                             Coverage Δ
checkbox-ng                      ?
checkbox-support                 ?
contrib-provider-ce-oem          ?
provider-base                    ?
provider-certification-client    ?
provider-certification-server    ?
provider-genio                   ?
provider-gpgpu                   91.13% <94.90%> (+7.34%) ⬆️
provider-iiotg                   ?
provider-resource                ?
provider-sru                     ?
release-tools                    ?

Flags with carried forward coverage won't be shown.

@pedro-avalos pedro-avalos marked this pull request as ready for review November 5, 2024 01:35

@Hook25 Hook25 left a comment

This is a big step in the right direction, imo, toward cleaning up the virtualization-oriented tests. I have a few suggestions here and there on how I would change it further; see if they make sense.

Resolved review threads: providers/gpgpu/bin/gpu_passthrough.py (9), providers/gpgpu/units/test-plan.pxu (1)

@pedro-avalos commented

Hm, the nvidia-persistenced.service is not starting up for the VM test.

@pedro-avalos pedro-avalos requested a review from Hook25 November 13, 2024 14:23
Hook25 previously approved these changes Nov 14, 2024

@Hook25 Hook25 left a comment

Consider adding these to the bootstrap include section; they won't be run only if the template doesn't expand.

Resolved review threads: providers/gpgpu/units/test-plan.pxu (1)

@Hook25 Hook25 merged commit 7c6b994 into main Nov 14, 2024
43 checks passed
@Hook25 Hook25 deleted the add-lxd-gpu-tests branch November 14, 2024 15:02
eugene-yujinwu pushed a commit to eugene-yujinwu/checkbox that referenced this pull request Dec 31, 2024
* Add initial gpu_passthrough program

The LXD and LXDVM classes may be useful in checkbox-support, as other tests may be able to benefit from these classes.

* Add initial coverage tests

* Fix parse_args function

* Add Checkbox units

* Fix typo in unit

* Fix typo in LXD.launch

* Force delete in cleanup

Otherwise the test is not necessarily idempotent

* Ensure insert_images is called

* Fix nvidia repo url

* Fix nvidia pinfile url

* Fix symlink name

* Pass NVIDIA runtime at launch

* Format gpu_passthrough.py

* Don't use dataclasses

I guess these are newer than I thought...

* Remove unsupported typehints

* Document that parameters can be overwritten

* Use cached property

* Rewrite run function

* Add retry decorator to download_image

* Add type hint to launch

* Rename jobs to gpgpu-passthrough

* Update tests

* Install mixbench snap

* Make LXD and LXDVM context managers

* Move setup to GPU_VENDORS fields

* Update tests

* Fix tests

* Fix shlex join bug

* No sudo needed

Instance should be running as root

* Make script a little more verbose

* init_lxd is part of __enter__

* Don't just sleep for system to be up

* Ensure nvidia capabilities are passed through

This ensures mixbench is able to find the right CUDA libraries from the host.

* Make type hints more accurate

* image_alias not image

* Update tests

* Update tox file requirements

* Add libsystemd-dev to tox workflow

* shlex.join available in 3.8+

* Fix launch tests

* Add more log messages

* Increase wait for VM retries

* Wait for VM to be up after adding GPU

* Add wait until running function

* Add test

* Remove unused properties

* Install gpgpu drivers on LXD vm

This should at the very least help speed up the setup process since X11 is not needed in the vm

* Add debug messages to run

* Install linux-generic

* remove todo message

* Add options= to make the line clearer

* Use default storage size for VM

This is not needed anymore since CUDA Toolkit is not being installed on the
VM

* Auto-connect request granted

* Compatibility with jammy and prior

* Remove LXDVM passthrough test

This is not working as intended, will add in a separate PR

* Ensure nvidia driver is present

* Add units to bootstrap_include
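
Two of the commits above ("Add retry decorator to download_image" and "Don't just sleep for system to be up" / "Add wait until running function") describe a common pattern: retry an operation a bounded number of times, and poll for readiness instead of sleeping for a fixed period. The sketch below shows one way this could look. It is an assumption-laden illustration, not the code merged in this PR: the names retry and wait_until_running are reused from the commit messages, but their signatures, defaults, and the is_running probe argument are hypothetical.

```python
import functools
import time
from typing import Callable


def retry(attempts: int, delay: float) -> Callable:
    """Retry the wrapped function until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    print(
                        "Attempt {} failed ({}); retrying".format(attempt, exc)
                    )
                    time.sleep(delay)

        return wrapper

    return decorator


def wait_until_running(
    is_running: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
) -> None:
    """Poll a caller-supplied probe instead of sleeping for a fixed time."""

    @retry(attempts, delay)
    def check() -> None:
        if not is_running():
            raise RuntimeError("instance is not running yet")

    check()
```

The probe is left to the caller on purpose: for a container or VM it might run a trivial command inside the instance and report whether it succeeded. Polling returns as soon as the instance is ready and fails with a clear error once the attempts are exhausted, which a fixed sleep cannot do.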