Update regression mi300 test to use new runner cluster #19738

yamiyysu · 2025-01-19T23:00:58Z

This PR updated pkgci_regression_test to use a new runner by updating the label to linux-mi300-gpu-1.

ScottTodd

nit: I don't think "Conductor" is necessarily a public-facing name that would make sense for public developers, so the PR title and description could instead say "new runner cluster"

ScottTodd · 2025-01-20T17:27:34Z

.github/workflows/pkgci_regression_test.yml

@@ -46,7 +46,7 @@ jobs:
          - name: amdgpu_rocm_mi300_gfx942
            rocm-chip: gfx942
            backend: rocm
-            runs-on: nodai-amdgpu-mi300-x86-64
+            runs-on: linux-mi300-gpu-1


I'm seeing test failures here. Is the new runner missing some GPU setup?
https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:281

experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py:157:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
experimental/regression_suite/ireers_tools/fixtures.py:74: in iree_run_module
    subprocess.run(exec_args, check=True, capture_output=True, cwd=vmfb.parent)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input = None, capture_output = True, timeout = None, check = True
popenargs = (['iree-run-module', '--device=hip', '--module=/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs/mod...ters=model=/home/runner/_work/iree/iree/artifacts/sdxl_clip/real_weights.irpa', '--expected_f16_threshold=1.0f', ...],)
kwargs = {'cwd': PosixPath('/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs'), 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamd...ating driver for device 'hip'; resolving dependencies for 'compiled_clip'; creating VM context; creating run context\n"
retcode = 1

I don't understand this one. Will solicit Sai's help.
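The `UNAVAILABLE` status above says the HIP runtime library could not be loaded, so one quick check on the new runner is whether the dynamic loader can resolve that library inside the job's environment. A minimal sketch, assuming the short library name `amdhip64` (taken from the `amdhip64.dll` name in the error message); the helper names are hypothetical and not part of IREE:

```python
import ctypes.util
from typing import Optional


def find_shared_library(short_name: str) -> Optional[str]:
    """Return whatever the platform loader resolves for short_name, or None."""
    return ctypes.util.find_library(short_name)


def hip_runtime_visible() -> bool:
    # Assumption: "amdhip64" is the short name of the HIP runtime library,
    # as suggested by the 'amdhip64.dll' name in the error message.
    return find_shared_library("amdhip64") is not None


if __name__ == "__main__":
    print("HIP runtime visible to the loader:", hip_runtime_visible())
```

The result depends on the loader configuration of the process that runs it, so this only says something useful when executed in the same container or shell as the failing test step.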

ScottTodd · 2025-01-20T17:33:22Z

.github/workflows/pkgci_regression_test.yml

@@ -46,7 +46,7 @@ jobs:
          - name: amdgpu_rocm_mi300_gfx942
            rocm-chip: gfx942
            backend: rocm
-            runs-on: nodai-amdgpu-mi300-x86-64


The current self-hosted runners have a persistent file cache whose location is set by the IREE_TEST_FILES environment variable, which the source code reads here:

iree/experimental/regression_suite/ireers_tools/artifacts.py

Lines 40 to 43 in a64d713

@functools.cache
def get_artifact_root_dir() -> Path:
    root_path = os.getenv("IREE_TEST_FILES", default=str(Path.cwd())) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()
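To make the fallback behavior concrete, here is a small standalone sketch of the same lookup. It is an uncached variant with an injectable environment mapping, written only for illustration: `artifact_root_dir` and the `/tmp/iree_cache_example` path are hypothetical and do not appear in the IREE repository.

```python
import os
from pathlib import Path
from typing import Mapping


def artifact_root_dir(environ: Mapping[str, str] = os.environ) -> Path:
    """Uncached variant of the quoted lookup, taking the environment as input."""
    root_path = environ.get("IREE_TEST_FILES", str(Path.cwd())) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()


if __name__ == "__main__":
    # Variable unset: artifacts land under the current working directory.
    print(artifact_root_dir({}))
    # Variable set (made-up path): artifacts land under the shared cache root.
    print(artifact_root_dir({"IREE_TEST_FILES": "/tmp/iree_cache_example"}))
```

With the variable unset, the root falls back to `<cwd>/artifacts`. That is consistent with the `/home/runner/_work/iree/iree/artifacts/...` download paths in the log from this PR's run.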

In the test run on this PR, I see that files are not yet cached, resulting in significant time spent downloading 4GB+ files (28 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:212

Sun, 19 Jan 2025 23:14:51 GMT INFO ireers_tools.artifacts:artifacts.py:198 Downloading 'inference_output.0.bin' (128.00 KiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/inference_output.0.bin'
Sun, 19 Jan 2025 23:14:51 GMT INFO ireers_tools.artifacts:artifacts.py:198 Downloading 'real_weights.irpa' (4.78 GiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/real_weights.irpa'
Sun, 19 Jan 2025 23:19:01 GMT **************************************************************

Compare that with a run on the current runners (8 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12839500128/job/35807030987#step:6:212

Sat, 18 Jan 2025 01:32:11 GMT INFO ireers_tools.artifacts:artifacts.py:191 Skipping 'inference_output.0.bin' download (128.00 KiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:18 GMT INFO ireers_tools.artifacts:artifacts.py:191 Skipping 'real_weights.irpa' download (4.78 GiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:19 GMT **************************************************************
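The "local MD5 hash matches" messages suggest a check along the following lines. This is a sketch of the general technique only, not the actual code in `artifacts.py`; `md5_of_file` and `needs_download` are hypothetical names.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GiB weights never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path: Path, expected_md5: str) -> bool:
    """Return True when the file is absent or its MD5 differs from the expected one."""
    if not local_path.is_file():
        return True
    return md5_of_file(local_path) != expected_md5.lower()
```

On a runner with a warm cache, only the hashing cost is paid. On a cold runner every file takes the download path, which is the difference between the two logs above.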

Are these new runners in the cluster persistent? Can we use a cache on the runners? Can we prepopulate the cache prior to running real test jobs?

Thanks. I was under the impression that this workflow doesn't use a local cache, so I was doing a quick test to see whether the migration would just work. These runners are persistent, so I'll probably copy the local files over and mount the local cache so that it's accessible from the workflow container.
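For the "copy the local files" step, a prepopulation pass could look roughly like this. It is a sketch under stated assumptions: `prepopulate_cache` is a hypothetical helper, both directory arguments are placeholders for wherever the old cache and the new mount actually live, and the size comparison is only a cheap skip heuristic rather than a hash check.

```python
import shutil
from pathlib import Path
from typing import List


def prepopulate_cache(source_root: Path, cache_root: Path) -> List[Path]:
    """Copy files from source_root into cache_root, skipping same-size files.

    Returns the list of destination paths that were actually written.
    """
    copied: List[Path] = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        dst = cache_root / src.relative_to(source_root)
        # Cheap heuristic: skip when a same-size file is already in place.
        if dst.is_file() and dst.stat().st_size == src.stat().st_size:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Running it a second time copies nothing, so it could be repeated safely before real test jobs start using the runner.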


Update regression mi300 test to use Conductor

5e9b1ce

yamiyysu requested a review from ScottTodd as a code owner January 19, 2025 23:00

ScottTodd requested changes Jan 20, 2025


yamiyysu changed the title ~~Update regression mi300 test to use Conductor~~ Update regression mi300 test to use new runner cluster Jan 27, 2025

yamiyysu mentioned this pull request Jan 27, 2025

Migrate workflows to OSSCI nod-ai/shark-ai#793

Open

Set up IREE_TEST_FILES env var

8b61c34
