Update regression mi300 test to use new runner cluster #19738

Open
wants to merge 2 commits into
base: main

yamiyysu
Contributor

@yamiyysu yamiyysu commented Jan 19, 2025

This PR updates pkgci_regression_test to use a new runner by changing the runs-on label to linux-mi300-gpu-1.

@yamiyysu yamiyysu requested a review from ScottTodd as a code owner January 19, 2025 23:00
Member

@ScottTodd ScottTodd left a comment

nit: I don't think "Conductor" is necessarily a public-facing name that would make sense for public developers, so the PR title and description could instead say "new runner cluster"

@@ -46,7 +46,7 @@ jobs:
 - name: amdgpu_rocm_mi300_gfx942
   rocm-chip: gfx942
   backend: rocm
-  runs-on: nodai-amdgpu-mi300-x86-64
+  runs-on: linux-mi300-gpu-1
Member

I'm seeing test failures here. Is the new runner missing some GPU setup?
https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:281

experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py:157: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
experimental/regression_suite/ireers_tools/fixtures.py:74: in iree_run_module
    subprocess.run(exec_args, check=True, capture_output=True, cwd=vmfb.parent)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = True, timeout = None, check = True
popenargs = (['iree-run-module', '--device=hip', '--module=/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs/mod...ters=model=/home/runner/_work/iree/iree/artifacts/sdxl_clip/real_weights.irpa', '--expected_f16_threshold=1.0f', ...],)
kwargs = {'cwd': PosixPath('/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs'), 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamd...ating driver for device 'hip'; resolving dependencies for 'compiled_clip'; creating VM context; creating run context\n"
retcode = 1
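
One quick way to check whether the runner image can see the HIP runtime at all is to try loading it from Python before running the suite. This is a minimal sketch, not part of the test suite: `can_load_shared_library` is a hypothetical helper, and the base name `amdhip64` is taken from the error above. The exact file name and library search path on the new runners are not known here, and IREE's own loader may search differently, so a `False` result is only a hint that the ROCm/HIP userspace is not installed or not on the loader path.

```python
import ctypes
import ctypes.util


def can_load_shared_library(base_name: str) -> bool:
    """Return True if the dynamic loader can find and open `base_name`.

    Hypothetical diagnostic helper, not part of IREE or the test suite.
    """
    # find_library maps a base name such as "m" to a platform-specific
    # file name, or returns None if nothing is found.
    resolved = ctypes.util.find_library(base_name)
    if resolved is None:
        return False
    try:
        ctypes.CDLL(resolved)
    except OSError:
        # The file was located but could not be opened (for example,
        # because one of its own dependencies is missing).
        return False
    return True


if __name__ == "__main__":
    # "amdhip64" is the base name from the error message above.
    print("HIP runtime loadable:", can_load_shared_library("amdhip64"))
```

Running this inside the same container as the failing job would show whether the problem is the image contents or the way the GPU is exposed to the container.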

Contributor Author

I don't understand this one. Will solicit Sai's help.

@@ -46,7 +46,7 @@ jobs:
 - name: amdgpu_rocm_mi300_gfx942
   rocm-chip: gfx942
   backend: rocm
-  runs-on: nodai-amdgpu-mi300-x86-64
Member

The current self-hosted runners have a persistent file cache located by the IREE_TEST_FILES environment variable, used in the source code here:

@functools.cache
def get_artifact_root_dir() -> Path:
    root_path = os.getenv("IREE_TEST_FILES", default=str(Path.cwd())) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()
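
To make the fallback behavior concrete, here is a standalone sketch of the same lookup. It is an illustration only: `resolve_artifact_root` is a hypothetical name, it takes the environment as a plain dict so both cases can be shown side by side, and it drops the `functools.cache` decorator for that reason.

```python
import os
from pathlib import Path


def resolve_artifact_root(env: dict) -> Path:
    """Mirror the lookup above, reading from `env` instead of os.environ.

    Hypothetical helper for illustration; not the repository's code.
    """
    root_path = env.get("IREE_TEST_FILES", str(Path.cwd())) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()


if __name__ == "__main__":
    # Variable unset: artifacts land under the current working directory.
    print(resolve_artifact_root({}))
    # Variable set: artifacts land under the given directory
    # ("/path/to/cache" is a placeholder, not a real runner path).
    print(resolve_artifact_root({"IREE_TEST_FILES": "/path/to/cache"}))
```

When the variable is unset, the root falls back to `<cwd>/artifacts`, which would be consistent with download paths that sit inside the workflow checkout directory.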

In the test run on this PR, I see that files are not yet cached, resulting in significant time spent downloading 4GB+ files (28 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:212

Sun, 19 Jan 2025 23:14:51 GMT INFO     ireers_tools.artifacts:artifacts.py:198   Downloading 'inference_output.0.bin' (128.00 KiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/inference_output.0.bin'
Sun, 19 Jan 2025 23:14:51 GMT INFO     ireers_tools.artifacts:artifacts.py:198   Downloading 'real_weights.irpa' (4.78 GiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/real_weights.irpa'
Sun, 19 Jan 2025 23:19:01 GMT **************************************************************

Compare that with a run on the current runners (8 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12839500128/job/35807030987#step:6:212

Sat, 18 Jan 2025 01:32:11 GMT INFO     ireers_tools.artifacts:artifacts.py:191   Skipping 'inference_output.0.bin' download (128.00 KiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:18 GMT INFO     ireers_tools.artifacts:artifacts.py:191   Skipping 'real_weights.irpa' download (4.78 GiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:19 GMT **************************************************************

Are these new runners in the cluster persistent? Can we use a cache on the runners? Can we prepopulate the cache prior to running real test jobs?
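
For context on why a persistent cache matters, the "local MD5 hash matches" lines in the second log suggest a check along these lines. This is a minimal sketch under that assumption, not the actual code in artifacts.py; `md5_of_file` and `needs_download` are hypothetical names.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GiB files are not read into memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path: Path, expected_md5: str) -> bool:
    """Return True if the file is missing or its MD5 differs from the expected one."""
    if not local_path.is_file():
        return True
    return md5_of_file(local_path) != expected_md5.lower()
```

With a check like this, a prepopulated cache directory turns each multi-GiB download into a local hash computation, which fits the difference between the two job times above.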

Contributor Author

@yamiyysu yamiyysu Jan 27, 2025

Thanks. I was under the impression that this workflow didn't have a local cache, so I ran a quick test to see whether the migration would just work. These runners are persistent, so I'll probably copy the local files over and set up a way to mount the local cache so that it's accessible from the workflow container.

.github/workflows/pkgci_regression_test.yml (review thread resolved)
@yamiyysu yamiyysu changed the title from "Update regression mi300 test to use Conductor" to "Update regression mi300 test to use new runner cluster" Jan 27, 2025