Add GPU specific Tags (red-hat-data-services#1938)
* add NVIDIA-GPUs tag to all tests carrying the Resources-GPU tag

* add AMD GPU tag to vLLM tests

* remove an incorrect comment
bdattoma committed Oct 17, 2024
1 parent 4950bcc commit 16c0b68
Showing 17 changed files with 61 additions and 61 deletions.
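With the new tags in place, GPU-specific subsets can be selected at run time with Robot Framework's `--include` option (e.g. `robot --include NVIDIA-GPUs <suite dir>`; tag names combine with `AND`/`OR`/`NOT`). The tag-matching rule these selections rely on can be sketched in Python — the helper name is hypothetical, and this is a simplification (Robot's real matcher also ignores spaces and underscores in tag names):

```python
def matches_include(test_tags, required_tags):
    """True when a test carries every tag of an ANDed --include expression.

    Robot Framework matches tags case-insensitively, so both sides are
    normalized before comparison (sketch only).
    """
    normalized = {t.lower() for t in test_tags}
    return all(r.lower() in normalized for r in required_tags)

# A test updated by this commit now matches an NVIDIA-only selection...
print(matches_include(["Sanity", "Resources-GPU", "NVIDIA-GPUs"],
                      ["Resources-GPU", "NVIDIA-GPUs"]))              # True
# ...while a test without the new tag is filtered out.
print(matches_include(["Sanity", "Resources-GPU"], ["NVIDIA-GPUs"]))  # False
```

This is why the commit adds `NVIDIA-GPUs` alongside, rather than instead of, `Resources-GPU`: existing selections keep working, and runs can now additionally narrow by accelerator vendor.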
@@ -53,7 +53,7 @@ Verify Notebook Controller Deployment
 Verify GPU Operator Deployment    # robocop: disable
     [Documentation]    Verifies Nvidia GPU Operator is correctly installed
     [Tags]    Sanity    Tier1
-    ...    Resources-GPU    # Not actually needed, but we first need to enable operator install by default
+    ...    Resources-GPU    NVIDIA-GPUs    # Not actually needed, but we first need to enable operator install by default
     ...    ODS-1157
     # Before GPU Node is added to the cluster
@@ -84,7 +84,7 @@ Verify Notebook Tolerations Are Applied To Workbenches
 Verify User Can Add GPUs To Workbench
     [Documentation]    Verifies user can add GPUs to an already started workbench
     [Tags]    Tier1    Sanity
-    ...    ODS-2013    Resources-GPU
+    ...    ODS-2013    Resources-GPU    NVIDIA-GPUs
     Launch Data Science Project Main Page
     Create Workbench    workbench_title=${WORKBENCH_TITLE_GPU}    workbench_description=${EMPTY}
     ...    prj_title=${PRJ_TITLE}    image_name=${NB_IMAGE_GPU}    deployment_size=Small
@@ -108,7 +108,7 @@ Verify User Can Add GPUs To Workbench
 Verify User Can Remove GPUs From Workbench
     [Documentation]    Verifies user can remove GPUs from an already started workbench
     [Tags]    Tier1    Sanity
-    ...    ODS-2014    Resources-GPU
+    ...    ODS-2014    Resources-GPU    NVIDIA-GPUs
     Launch Data Science Project Main Page
     Create Workbench    workbench_title=${WORKBENCH_TITLE_GPU}    workbench_description=${EMPTY}
     ...    prj_title=${PRJ_TITLE}    image_name=${NB_IMAGE_GPU}    deployment_size=Small
@@ -11,7 +11,7 @@ Resource    ../../../Resources/Page/OCPDashboard/Pods/Pods.robot
 Library    JupyterLibrary
 Suite Setup    Spawner Suite Setup
 Suite Teardown    End Web Test
-Test Tags    Resources-GPU
+Test Tags    Resources-GPU    NVIDIA-GPUs
 
 
 *** Variables ***
@@ -22,43 +22,43 @@ Verify CUDA Image Can Be Spawned With GPU
     [Documentation]    Spawns CUDA image with 1 GPU and verifies that the GPU is
     ...    not available for other users.
     [Tags]    Sanity
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1141    ODS-346    ODS-1359
     Pass Execution    Passing tests, as suite setup ensures that image can be spawned
 
 Verify CUDA Image Includes Expected CUDA Version
     [Documentation]    Checks CUDA version
     [Tags]    Sanity
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1142
     Verify Installed CUDA Version    ${EXPECTED_CUDA_VERSION}
 
 Verify PyTorch Library Can See GPUs In Minimal CUDA
     [Documentation]    Installs PyTorch and verifies it can see the GPU
     [Tags]    Sanity
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1144
     Verify Pytorch Can See GPU    install=True
 
 Verify Tensorflow Library Can See GPUs In Minimal CUDA
     [Documentation]    Installs Tensorflow and verifies it can see the GPU
     [Tags]    Sanity
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1143
     Verify Tensorflow Can See GPU    install=True
 
 Verify Cuda Image Has NVCC Installed
     [Documentation]    Verifies NVCC Version in Minimal CUDA Image
     [Tags]    Sanity
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-483
     ${nvcc_version} =    Run Cell And Get Output    input=!nvcc --version
     Should Not Contain    ${nvcc_version}    /usr/bin/sh: nvcc: command not found
 
 Verify Previous CUDA Notebook Image With GPU
     [Documentation]    Runs a workload after spawning the N-1 CUDA Notebook
     [Tags]    Tier2    LiveTesting
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2128
     [Setup]    N-1 CUDA Setup
     Spawn Notebook With Arguments    image=${NOTEBOOK_IMAGE}    size=Small    gpus=1    version=previous
@@ -90,7 +90,7 @@ Verify CUDA Image Suite Setup
     # This will fail in case there are two nodes with the same number of GPUs
     # Since the overall available number won't change even after 1 GPU is assigned
     # However I can't think of a better way to execute this check, under the assumption that
-    # the Resources-GPU tag will always ensure there is 1 node with 1 GPU on the cluster.
+    # the Resources-GPU will always ensure there is 1 node with 1 GPU on the cluster.
     ${maxNo} =    Find Max Number Of GPUs In One Node
     ${maxSpawner} =    Fetch Max Number Of GPUs In Spawner Page
     # Need to continue execution even on failure or the whole suite will be failed
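The check above compares `Find Max Number Of GPUs In One Node` against the spawner dropdown. The node-side half of that comparison reduces to a max over per-node allocatable GPU counts; a minimal sketch, assuming the counts come from each node's allocatable `nvidia.com/gpu` quantity (the helper name and node data are illustrative):

```python
def max_gpus_in_one_node(allocatable_gpus_by_node):
    """Largest per-node GPU count; 0 when no node reports GPUs.

    Keys are node names, values the node's allocatable "nvidia.com/gpu"
    quantity (Kubernetes reports these as strings).
    """
    return max((int(v) for v in allocatable_gpus_by_node.values()), default=0)

# Hypothetical cluster: the dropdown should show 2, the single-node maximum,
# not 3, the cluster-wide total.
nodes = {"gpu-worker-0": "2", "gpu-worker-1": "1", "cpu-worker-0": "0"}
print(max_gpus_in_one_node(nodes))  # 2
```

This also illustrates the caveat in the comments: once one GPU is assigned, the cluster total drops but the per-node maximum reported here may not, so the assertion only holds when no two nodes share the same count.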
@@ -49,7 +49,7 @@ Verify Tensorboard Is Accessible
 Verify PyTorch Image Can Be Spawned With GPU
     [Documentation]    Spawns PyTorch image with 1 GPU
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1145
     Clean Up Server
     Stop JupyterLab Notebook Server
@@ -60,28 +60,28 @@ Verify PyTorch Image Can Be Spawned With GPU
 Verify PyTorch Image Includes Expected CUDA Version
     [Documentation]    Checks CUDA version
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1146
     Verify Installed CUDA Version    ${EXPECTED_CUDA_VERSION}
 
 Verify PyTorch Library Can See GPUs In PyTorch Image
     [Documentation]    Verifies PyTorch can see the GPU
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1147
     Verify Pytorch Can See GPU
 
 Verify PyTorch Image GPU Workload
     [Documentation]    Runs a workload on GPUs in PyTorch image
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1148
     Run Repo And Clean    https://github.com/lugi0/notebook-benchmarks    notebook-benchmarks/pytorch/fgsm_tutorial.ipynb
 
 Verify Previous PyTorch Notebook Image With GPU
     [Documentation]    Runs a workload after spawning the N-1 PyTorch Notebook
     [Tags]    Tier2    LiveTesting
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2129
     [Setup]    N-1 PyTorch Setup
     Spawn Notebook With Arguments    image=${NOTEBOOK_IMAGE}    size=Small    gpus=1    version=previous
@@ -50,36 +50,36 @@ Verify Tensorboard Is Accessible
 Verify Tensorflow Image Can Be Spawned With GPU
     [Documentation]    Spawns Tensorflow image with 1 GPU
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1151
     Close Previous Server
     Spawn Notebook With Arguments    image=${NOTEBOOK_IMAGE}    size=Small    gpus=1
 
 Verify Tensorflow Image Includes Expected CUDA Version
     [Documentation]    Checks CUDA version
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1152
     Verify Installed CUDA Version    ${EXPECTED_CUDA_VERSION}
 
 Verify Tensorflow Library Can See GPUs In Tensorflow Image
     [Documentation]    Verifies Tensorflow can see the GPU
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1153
     Verify Tensorflow Can See GPU
 
 Verify Tensorflow Image GPU Workload
     [Documentation]    Runs a workload on GPUs in Tensorflow image
     [Tags]    Tier1
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-1154
     Run Repo And Clean    https://github.com/lugi0/notebook-benchmarks    notebook-benchmarks/tensorflow/GPU-no-warnings.ipynb
 
 Verify Previous Tensorflow Notebook Image With GPU
     [Documentation]    Runs a workload after spawning the N-1 Tensorflow Notebook
     [Tags]    Tier2    LiveTesting
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2130
     [Setup]    N-1 Tensorflow Setup
     Spawn Notebook With Arguments    image=${NOTEBOOK_IMAGE}    size=Small    gpus=1    version=previous
@@ -22,7 +22,7 @@ Verify Number Of Available GPUs Is Correct
     [Documentation]    Verifies that the number of available GPUs in the
     ...    Spawner dropdown is correct; i.e., it should show the maximum
     ...    number of GPUs available in a single node.
-    [Tags]    Sanity    Resources-2GPUS
+    [Tags]    Sanity    Resources-2GPUS    NVIDIA-GPUs
     ...    ODS-1256
     ${maxNo} =    Find Max Number Of GPUs In One Node
     ${maxSpawner} =    Fetch Max Number Of GPUs In Spawner Page
@@ -31,7 +31,7 @@ Verify Number Of Available GPUs Is Correct
 Verify Two Servers Can Be Spawned
     [Documentation]    Spawns two servers requesting 1 GPU each, and checks
     ...    that both can schedule and are scheduled on different nodes.
-    [Tags]    Sanity    Resources-2GPUS
+    [Tags]    Sanity    Resources-2GPUS    NVIDIA-GPUs
     ...    ODS-1257
     Spawn Notebook With Arguments    image=${NOTEBOOK_IMAGE}    size=Small    gpus=1
     ${serial_first} =    Get GPU Serial Number
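`Get GPU Serial Number` presumably reads the serial from inside the notebook pod (e.g. via `nvidia-smi --query-gpu=serial --format=csv,noheader`); comparing the two servers' serials then shows whether they landed on distinct physical GPUs. A sketch of that comparison — the helper name and serial values are hypothetical:

```python
def on_distinct_gpus(serial_first, serial_second):
    """Two equal serials would mean both servers share one physical GPU."""
    return serial_first.strip() != serial_second.strip()

# Made-up serials for illustration; trailing newline mimics raw CLI output.
print(on_distinct_gpus("0324217055555\n", "0324217066666"))  # True
print(on_distinct_gpus("0324217055555", "0324217055555"))    # False
```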
@@ -31,7 +31,7 @@ Run Training operator ODH test base LoRA use case
 # Run Training operator ODH test base QLoRA use case
 #    [Documentation]    Run Go ODH tests for Training operator base QLoRA use case
 #    [Tags]    RHOAIENG-13142
-#    ...    Resources-GPU
+#    ...    Resources-GPU    NVIDIA-GPUs
 #    ...    Tier1
 #    ...    DistributedWorkloads
 #    ...    Training
@@ -25,7 +25,7 @@ Run TestKueueRayCpu ODH test
 
 Run TestKueueRayGpu ODH test
     [Documentation]    Run Go ODH test: TestKueueRayGpu
-    [Tags]    Resources-GPU
+    [Tags]    Resources-GPU    NVIDIA-GPUs
     ...    Tier1
     ...    DistributedWorkloads
     ...    Training
@@ -43,7 +43,7 @@ Run TestRayTuneHPOCpu ODH test
 
 Run TestRayTuneHPOGpu ODH test
     [Documentation]    Run Go ODH test: TestMnistRayTuneHpoGpu
-    [Tags]    Resources-GPU
+    [Tags]    Resources-GPU    NVIDIA-GPUs
     ...    Tier1
     ...    DistributedWorkloads
     ...    Training
@@ -62,7 +62,7 @@ Run TestKueueCustomRayCpu ODH test
 Run TestKueueCustomRayGpu ODH test
     [Documentation]    Run Go ODH test: TestKueueCustomRayGpu
     [Tags]    RHOAIENG-10013
-    ...    Resources-GPU
+    ...    Resources-GPU    NVIDIA-GPUs
     ...    Tier1
     ...    DistributedWorkloads
     ...    Training
@@ -25,7 +25,7 @@ ${RUNTIME_NAME}=    Model Serving GPU Test
 *** Test Cases ***
 Verify GPU Model Deployment Via UI    # robocop: off=too-long-test-case,too-many-calls-in-test-case
     [Documentation]    Test the deployment of an openvino_ir model on a model server with GPUs attached
-    [Tags]    Sanity    Resources-GPU
+    [Tags]    Sanity    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2214
     Clean All Models Of Current User
     Open Data Science Projects Home Page
@@ -57,7 +57,7 @@ Verify GPU Model Deployment Via UI    # robocop: off=too-long-test-case,too-many-calls-in-test-case
 
 Test Inference Load On GPU
     [Documentation]    Test the inference load on the GPU after sending random requests to the endpoint
-    [Tags]    Sanity    Resources-GPU
+    [Tags]    Sanity    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2213
     ${url}=    Get Model Route Via UI    ${MODEL_NAME}
     Send Random Inference Request    endpoint=${url}    no_requests=100
@@ -104,7 +104,7 @@ Verify Multiple Projects With Same Model (OVMS on Kserve)
 
 Verify GPU Model Deployment Via UI (OVMS on Kserve)    # robocop: off=too-long-test-case,too-many-calls-in-test-case
     [Documentation]    Test the deployment of an openvino_ir model on a model server with GPUs attached
-    [Tags]    Tier1    Resources-GPU
+    [Tags]    Tier1    Resources-GPU    NVIDIA-GPUs
     ...    ODS-2630    ODS-2631    ProductBug    RHOAIENG-3355
     ${requests}=    Create Dictionary    nvidia.com/gpu=1
     ${limits}=    Create Dictionary    nvidia.com/gpu=1
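The two `Create Dictionary` calls above build identical `nvidia.com/gpu` requests and limits. That mirrors the container spec being assembled: `nvidia.com/gpu` is a Kubernetes extended resource, which cannot be overcommitted, so the request and limit carry the same value. A small sketch of that stanza (the function name is illustrative):

```python
def gpu_resources(count=1, resource_name="nvidia.com/gpu"):
    """Container resources stanza for attaching GPUs to a model server.

    Extended resources such as nvidia.com/gpu cannot be overcommitted,
    so requests and limits carry the same value.
    """
    quantity = str(count)
    return {
        "requests": {resource_name: quantity},
        "limits": {resource_name: quantity},
    }

print(gpu_resources())
# {'requests': {'nvidia.com/gpu': '1'}, 'limits': {'nvidia.com/gpu': '1'}}
```

Swapping `resource_name` (e.g. to `amd.com/gpu`) is what makes vendor-specific tags like the ones added in this commit worth separating.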
@@ -339,7 +339,7 @@ Verify User Can Set Requests And Limits For A Model    # robocop: off=too-long-test-case
 Verify Model Can Be Served And Query On A GPU Node    # robocop: off=too-long-test-case,too-many-calls-in-test-case
     [Documentation]    Basic tests for preparing, deploying and querying a LLM model on GPU node
     ...    using Kserve and Caikit+TGIS runtime
-    [Tags]    Sanity    ODS-2381    Resources-GPU
+    [Tags]    Sanity    ODS-2381    Resources-GPU    NVIDIA-GPUs
     [Setup]    Set Project And Runtime    namespace=singlemodel-gpu
     ${test_namespace}=    Set Variable    singlemodel-gpu
     ${model_name}=    Set Variable    flan-t5-small-caikit
@@ -151,7 +151,7 @@ Verify User Can Set Requests And Limits For A Model Using The UI    # robocop: off=too-long-test-case
 Verify Model Can Be Served And Query On A GPU Node Using The UI    # robocop: off=too-long-test-case
     [Documentation]    Basic tests for preparing, deploying and querying a LLM model on GPU node
     ...    using Kserve and Caikit+TGIS runtime
-    [Tags]    Sanity    ODS-2523    Resources-GPU
+    [Tags]    Sanity    ODS-2523    Resources-GPU    NVIDIA-GPUs
     [Setup]    Set Up Project    namespace=singlemodel-gpu
     ${test_namespace}=    Set Variable    singlemodel-gpu
     ${model_name}=    Set Variable    flan-t5-small-caikit