-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict CK solvers to only run on MI100, MI200 and MI300 #2533
Conversation
…th is_ck_whitelist
@bghimireamd can this PR replace #2528 ? |
@bghimireamd @junliume Why CK is restricted to MI ASICs only? I do not see explanation in #2458, see https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2458/files#r1396497729 |
CI testing should not be a random process. If some test is not intended for some GPU, the test should be disabled for that GPU in tests/CMakeLists.txt or in the source of relevant gtest.
Can you please provide a list of that tests and/or an example of failure? |
@junliume Yes |
I think its because currently CK team is writing kernels focusing only for MI ASICs. May be in future CK team will expand to other GPU series. |
@atamazov Yes, we can do it in gtest or cmake or in IsApplicable of the solver. Benefit of placing in Potentially all the solver that are using CK solvers but are not restricted to MI ASICs are example of failure. eg: PR #2481 was failing because the CK solver was not restricted to MI ASIC. |
@bghimireamd Currently CK is supporting non-xDLOPs GPUs, e.g. MI50 and MI60, Navi21. Disabling CK for these will likely lead to performance regressions for these targets.
If some CK solver added in #2481 is not working, then we need to restrict the applicability of CK solvers introduced in that PR. And tests need to be restricted too, see #2481 (review). Otherwise the tests will continue to fail. |
@bghimireamd @junliume Let me repeat some important question. It is important because I suspect that our Jenkinsfile has some flaws.
|
FYI here is the review thread that is likely related to the uncertainty mentioned above: #1693 (comment) |
@atamazov the PR CI now solely perform on @bghimireamd please connect with @zjing14 , I believe CK also supports |
@bghimireamd @junliume The CK that we currently use supports gfx906 and gfx1030 for sure. |
@atamazov The owner of the PR is not able to test unsupported GPUs because Jenkinsfile randomly selects supported GPUs. Is that one of the flaw? |
You can simply inspect IsApplicable of the existing convolution solvers. WRT layernorm, yes, CK team can help. |
It is related. But random selection happens only during smoke testing. After smoke testing, our CI should perform full testing on all supported GPUs, and the "full test" stages must detect all the remaining problems. I suspect there is some issue in the full testing stages.
Maybe it is better to manually test CK-based layernorm solvers on different nodes and set their applicability according to the results (and adjust the applicability of the tests). 💡 Please note that the tests must ensure 100% coverage of functionality. In other words, the applicability of solvers should not be wider than what we test in our CI. Right away (after #2458) the above requirement means that the applicability of CK-based layernorm solvers should be limited to MI100, 200 and 300. I recommending doing this ASAP. |
@junliume @bghimireamd @JehandadKhan #2458 disables some CK-based CONV and BN solvers for MI50/60/gfx906, Navi2x, Navi3x. As as mentioned above, this may lead to regressions. Are we going to resolve this problem? Or do we no longer need CK solvers for MI50/60, Navi2x, Navi3x? I just need to understand what is going on. |
A sidenote - it's not enough just to limit solvers applicability. There is a bunch of CK tests with explicitly enabled Navi2x nodes, they may start failing randomly. |
Yes, but what I am saying at #2533 (comment) is that it seems highly questionable that CK should be disabled for Navi2x/3x and gfx906. |
@junliume I take my words back. Sorry for confusion. |
Currently tests are passing when Jenkins randomly choses MI series GPU. If Jenkins chooses GPU other then MI series, the tests starts failing. The reason is the CK solvers are currently only verified(tests pass) in MI series. This PR is to restrict all the CK solvers to only run on MI100, MI200 and MI300.
For this I replaced
is_ck_supported_hardware
withis_ck_whitelist
.