Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for the NVIDIA GPU tests #433

Merged
merged 2 commits into from
Jun 16, 2023

Conversation

sayanchowdhury
Copy link
Member

@sayanchowdhury sayanchowdhury commented May 3, 2023

Add tests for the NVIDIA GPU tests

Testing done

  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, kernel modules, etc.

}

func verifyNvidiaInstallation(c cluster.TestCluster) {
m := c.Machines()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With

Suggested change
m := c.Machines()[0]
if kola.AzureOptions.Size != "Standard_NC6s_v3" {
c.Skip("skipping due to wrong instance size")
}
m := c.Machines()[0]

I think we could do this in scripts:

diff --git a/ci-automation/vendor-testing/azure.sh b/ci-automation/vendor-testing/azure.sh
index 17081b3598..c57210d651 100755
--- a/ci-automation/vendor-testing/azure.sh
+++ b/ci-automation/vendor-testing/azure.sh
@@ -74,7 +74,7 @@ query_kola_tests() {
     kola list --platform=azure --filter "${@}"
 }
 
-other_instance_types=()
+other_instance_types=('Standard_NC6s_v3')
 if [[ "${CIA_ARCH}" = 'amd64' ]]; then
     other_instance_types+=('V1')
 fi
@@ -85,6 +85,6 @@ run_kola_tests_on_instances \
     "${CIA_FIRST_RUN}" \
     "${other_instance_types[@]}" \
     '--' \
-    'cl.internet' \
+    'cl.internet' 'cl.misc.nvidia' \
     '--' \
     "${@}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to keep this check, we could even move it to the SkipFunc (as kola.AzureOptions.Size should be accessible)

@sayanchowdhury sayanchowdhury force-pushed the sayan/add-nvidia-test branch 2 times, most recently from 811dc81 to 3def6fb Compare June 13, 2023 12:46
@sayanchowdhury sayanchowdhury changed the title wip: add tests for the nvidia gpu tests Add tests for the nvidia gpu tests Jun 16, 2023
@sayanchowdhury sayanchowdhury changed the title Add tests for the nvidia gpu tests Add tests for the NVIDIA GPU tests Jun 16, 2023
@sayanchowdhury sayanchowdhury requested review from a team, pothos, jepio and tormath1 June 16, 2023 02:52
// This test is to test the NVIDIA installation, limited to AZURE for now
Platforms: []string{"azure"},
Architectures: []string{"amd64"},
Flags: []register.Flag{register.NoEnableSelinux},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance to track the avc message? (for issue tracking and/or documentation purpose?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I will need to search the logs, if it's not there then I can trigger a local/Jenkins build to track the avc messages for tracking on a issue.

Copy link
Contributor

@tormath1 tormath1 Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should see it in the journal logs (before disabling the SELinux flags)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant searching the journal logs if they are not deleted on bincache.

Comment on lines +41 to +44
out, err := c.SSH(m, "systemctl is-active nvidia.service")
if !bytes.Contains(out, []byte("inactive")) {
return fmt.Errorf("nvidia.service: %q: %v", out, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, you could do this way:

Suggested change
out, err := c.SSH(m, "systemctl is-active nvidia.service")
if !bytes.Contains(out, []byte("inactive")) {
return fmt.Errorf("nvidia.service: %q: %v", out, err)
}
_, err := c.SSH(m, "systemctl --quiet is-active nvidia.service")
return err

we just want to assert if the unit is active or not, if the unit is active with --quiet it will simply return 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also _ = c.MustSSH(…) which fails the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct but in this case we don't want to fail the test as we are in a retry loop no?

Signed-off-by: Sayan Chowdhury <schowdhury@microsoft.com>
Signed-off-by: Sayan Chowdhury <schowdhury@microsoft.com>
@sayanchowdhury sayanchowdhury force-pushed the sayan/add-nvidia-test branch from d04c5d0 to 4cd37d7 Compare June 16, 2023 13:16
@sayanchowdhury sayanchowdhury merged commit 4771cb7 into flatcar-master Jun 16, 2023
@sayanchowdhury sayanchowdhury deleted the sayan/add-nvidia-test branch June 16, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants