Update regression mi300 test to use new runner cluster #19738

Open: wants to merge 2 commits into base: main
2 changes: 1 addition & 1 deletion .github/workflows/pkgci_regression_test.yml
@@ -46,7 +46,7 @@ jobs:
           - name: amdgpu_rocm_mi300_gfx942
             rocm-chip: gfx942
             backend: rocm
-            runs-on: nodai-amdgpu-mi300-x86-64
Member

The current self-hosted runners have a persistent file cache whose location is set by the IREE_TEST_FILES environment variable, which the source code reads here:

@functools.cache
def get_artifact_root_dir() -> Path:
    root_path = os.getenv("IREE_TEST_FILES", default=str(Path.cwd())) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()
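To make the fallback behavior concrete, here is a minimal sketch of the same logic with the environment and working directory passed in explicitly. The helper name `resolve_artifact_root` and the `/shared/cache` location are hypothetical, chosen only for illustration; the workspace path is the one that appears in this PR's logs.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_artifact_root(env: Mapping[str, str], cwd: Path) -> Path:
    """Hypothetical stand-in for get_artifact_root_dir() with explicit inputs."""
    # Fall back to the working directory when IREE_TEST_FILES is not set.
    root_path = env.get("IREE_TEST_FILES", str(cwd)) + "/artifacts"
    return Path(os.path.expanduser(root_path)).resolve()


workspace = Path("/home/runner/_work/iree/iree")

# Variable unset: artifacts land under the checkout directory.
print(resolve_artifact_root({}, workspace))

# Variable set (hypothetical location): artifacts land under that directory.
print(resolve_artifact_root({"IREE_TEST_FILES": "/shared/cache"}, workspace))
```

If the variable is unset on a runner, downloads go to an `artifacts` directory under the working directory, which would be consistent with the `/home/runner/_work/iree/iree/artifacts/...` paths in the log from this PR.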

In the test run on this PR, I see that files are not yet cached, resulting in significant time spent downloading 4GB+ files (28 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:212

Sun, 19 Jan 2025 23:14:51 GMT INFO     ireers_tools.artifacts:artifacts.py:198   Downloading 'inference_output.0.bin' (128.00 KiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/inference_output.0.bin'
Sun, 19 Jan 2025 23:14:51 GMT INFO     ireers_tools.artifacts:artifacts.py:198   Downloading 'real_weights.irpa' (4.78 GiB) to '/home/runner/_work/iree/iree/artifacts/sdxl_unet_fp16/real_weights.irpa'
Sun, 19 Jan 2025 23:19:01 GMT **************************************************************

Compare that with a run on the current runners (8 minutes total job time):

https://github.com/iree-org/iree/actions/runs/12839500128/job/35807030987#step:6:212

Sat, 18 Jan 2025 01:32:11 GMT INFO     ireers_tools.artifacts:artifacts.py:191   Skipping 'inference_output.0.bin' download (128.00 KiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:18 GMT INFO     ireers_tools.artifacts:artifacts.py:191   Skipping 'real_weights.irpa' download (4.78 GiB) - local MD5 hash matches
Sat, 18 Jan 2025 01:32:19 GMT **************************************************************

Are these new runners in the cluster persistent? Can we use a cache on the runners? Can we prepopulate the cache prior to running real test jobs?
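For context, the "local MD5 hash matches" log lines suggest a check along the following lines. This is a minimal sketch, not the actual `ireers_tools` code: the function names `md5_of_file` and `needs_download` are hypothetical, and the expected hash is assumed to be available from the remote file's metadata.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GiB weights are never read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path: Path, expected_md5: str) -> bool:
    """Return True when the local copy is missing or its hash differs."""
    if not local_path.is_file():
        return True
    return md5_of_file(local_path) != expected_md5.lower()
```

With a check like this, a prepopulated cache only helps if the files survive between jobs and sit in the directory the tests actually look in.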

Contributor Author

@yamiyysu, Jan 27, 2025

Thanks. I was under the impression that this workflow didn't have a local cache, so I ran a quick test to see whether the migration would just work. These runners are persistent, so I'll probably copy the local files over and mount the local cache so that it's accessible from the workflow container.

ScottTodd marked this conversation as resolved.
+            runs-on: linux-mi300-gpu-1
Member

I'm seeing test failures on the new runner. Is some GPU setup missing?
https://github.com/iree-org/iree/actions/runs/12857925207/job/35846521901?pr=19738#step:6:281

experimental/regression_suite/shark-test-suite-models/sdxl/test_clip.py:157: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
experimental/regression_suite/ireers_tools/fixtures.py:74: in iree_run_module
    subprocess.run(exec_args, check=True, capture_output=True, cwd=vmfb.parent)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input = None, capture_output = True, timeout = None, check = True
popenargs = (['iree-run-module', '--device=hip', '--module=/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs/mod...ters=model=/home/runner/_work/iree/iree/artifacts/sdxl_clip/real_weights.irpa', '--expected_f16_threshold=1.0f', ...],)
kwargs = {'cwd': PosixPath('/home/runner/_work/iree/iree/model_output_artifacts/sdxl_clip_vmfbs'), 'stderr': -1, 'stdout': -1}
process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamd...ating driver for device 'hip'; resolving dependencies for 'compiled_clip'; creating VM context; creating run context\n"
retcode = 1

Contributor Author

I don't understand this one. Will solicit Sai's help.
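One way to narrow this down: the stderr above reports an UNAVAILABLE status that names the HIP runtime library, which suggests the dynamic loader could not open it inside the job environment. Below is a minimal sketch of a probe that could be run on the runner. The helper `can_load_shared_library` is hypothetical, and the library stem `amdhip64` is an assumption taken from the `amdhip64.dll` name in the log. `ctypes.util.find_library` uses its own search rules and does not reproduce IREE's lookup, so treat the result as a hint only.

```python
import ctypes
import ctypes.util


def can_load_shared_library(name: str) -> bool:
    """Return True if the platform loader can locate and open the named library."""
    path = ctypes.util.find_library(name)
    if path is None:
        # Not found on the loader's default search path.
        return False
    try:
        ctypes.CDLL(path)
    except OSError:
        # Found, but loading failed (for example, a missing dependency).
        return False
    return True


if __name__ == "__main__":
    # Assumed library stem; adjust if the runner's ROCm install differs.
    print("HIP runtime loadable:", can_load_shared_library("amdhip64"))
```

If the probe reports that the library is not loadable inside the workflow container but is loadable on the host, the container is probably missing the ROCm libraries or the search path that points to them.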

     env:
       PACKAGE_DOWNLOAD_DIR: ${{ github.workspace }}/.packages
       IREE_TEST_PATH_EXTENSION: ${{ github.workspace }}/build_tools/pkgci/external_test_suite