Fix flaky newly migrated MI300 workflow #19955

Open
yamiyysu opened this issue Feb 11, 2025 · 2 comments
Assignees: yamiyysu
Labels: infrastructure (Relating to build systems, CI, or testing)

Comments

@yamiyysu
Contributor

The newly migrated mi_300 workflow [fails 20% of the time](https://github.com/iree-org/iree/actions/runs/13144343574/job/36679240037) because apt install cannot reach the remote host. I propose moving the dependencies into the runner scale set so that we can:

  • remove the apt install step from every run, which increases robustness
  • remove docker-in-docker, which requires the runner scale set to run in privileged mode, where it can access all GPUs

This depends on first creating a new runner scale set and then migrating this workflow to it.
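
Until the new runner scale set exists, one possible stopgap is to retry the install step with exponential backoff. The following is only a minimal sketch, not something the workflow currently does: the helper name `run_with_retries`, its parameters, and the package name in the usage comment are all hypothetical.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff on a non-zero exit code.

    Returns the number of attempts used; raises RuntimeError if all fail.
    `runner` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, check=False)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed (exit {result.returncode}); "
                  f"retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")


# Hypothetical usage; the real package list is not shown in this issue:
# run_with_retries(["sudo", "apt-get", "install", "-y", "some-package"])
```

Retries only mask the flakiness. They do not remove the per-run network dependency, which is what moving the dependencies into the runner scale set is meant to address.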

yamiyysu self-assigned this Feb 11, 2025
@ScottTodd
Member

Do we know why the network is unstable? Can we fix that too? I doubt this will be the last time someone tries to install a package on these machines.

ScottTodd added the infrastructure (Relating to build systems, CI, or testing) label Feb 11, 2025
@yamiyysu
Contributor Author

We don't know. I haven't dug deeper to see whether it is client-side (Conductor), server-side, or somewhere in between. We are having Conductor networking issues in general, though.
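
As a first step toward narrowing this down, a small probe could record which stage of a connection fails. This is a rough sketch under stated assumptions: the function name `probe` and the host in the usage comment are hypothetical, and the result only separates name resolution failures from TCP connection failures. It does not by itself prove which side is at fault.

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Classify where a connection attempt to host:port fails.

    Returns "dns" if name resolution fails, "connect" if resolution works
    but no TCP connection can be established, and "ok" otherwise.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            continue
    return "connect"


# Hypothetical usage; substitute the mirror host that apt actually contacts:
# print(probe("archive.example.org", 80))
```

Running the same probe from a runner and from a machine outside Conductor at around the same time would give a comparison point: failures seen only on the runner would suggest the client side or the path from it.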


2 participants