Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict CK solvers to only run on MI100, MI200 and MI300 #2533

Closed
wants to merge 2 commits into from

Conversation

bghimireamd
Copy link
Contributor

@bghimireamd bghimireamd commented Nov 16, 2023

Currently tests are passing when Jenkins randomly choses MI series GPU. If Jenkins chooses GPU other then MI series, the tests starts failing. The reason is the CK solvers are currently only verified(tests pass) in MI series. This PR is to restrict all the CK solvers to only run on MI100, MI200 and MI300.

For this I replaced is_ck_supported_hardware with is_ck_whitelist.

@junliume
Copy link
Collaborator

@bghimireamd can this PR replace #2528 ?

@atamazov
Copy link
Contributor

@bghimireamd @junliume Why CK is restricted to MI ASICs only? I do not see explanation in #2458, see https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2458/files#r1396497729

@atamazov
Copy link
Contributor

@bghimireamd

Currently tests are passing when Jenkins randomly choses MI series GPU.

CI testing should not be a random process. If some test is not intended for some GPU, the test should be disabled for that GPU in tests/CMakeLists.txt or in the source of relevant gtest.

If Jenkins chooses GPU other then MI series, the tests starts failing.

Can you please provide a list of that tests and/or an example of failure?

@bghimireamd
Copy link
Contributor Author

@bghimireamd can this PR replace #2528 ?

@junliume Yes

@bghimireamd
Copy link
Contributor Author

@bghimireamd @junliume Why CK is restricted to MI ASICs only? I do not see explanation in #2458, see https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2458/files#r1396497729

I think its because currently CK team is writing kernels focusing only for MI ASICs. May be in future CK team will expand to other GPU series.

@bghimireamd
Copy link
Contributor Author

bghimireamd commented Nov 17, 2023

@bghimireamd

Currently tests are passing when Jenkins randomly choses MI series GPU.

CI testing should not be a random process. If some test is not intended for some GPU, the test should be disabled for that GPU in tests/CMakeLists.txt or in the source of relevant gtest.

If Jenkins chooses GPU other then MI series, the tests starts failing.

Can you please provide a list of that tests and/or an example of failure?

@atamazov Yes, we can do it in gtest or cmake or in IsApplicable of the solver. Benefit of placing in IsApplicable is maintainability, we just have to edit is_ck_whitelist function if CK expands there GPU support in future. Disadvantage is now we have these checks in each solver which might increase CPU cycle but not sure how significant it is.

Potentially all the solver that are using CK solvers but are not restricted to MI ASICs are example of failure. eg: PR #2481 was failing because the CK solver was not restricted to MI ASIC.

@atamazov
Copy link
Contributor

atamazov commented Nov 17, 2023

@bghimireamd Currently CK is supporting non-xDLOPs GPUs, e.g. MI50 and MI60, Navi21. Disabling CK for these will likely lead to performance regressions for these targets.

PR #2481 was failing because the CK solver was not restricted to MI ASIC.

If some CK solver added in #2481 is not working, then we need to restrict the applicability of CK solvers introduced in that PR. And tests need to be restricted too, see #2481 (review). Otherwise the tests will continue to fail.

@atamazov
Copy link
Contributor

@junliume How is it possible that #2481 passed CI?

@atamazov
Copy link
Contributor

@bghimireamd @junliume Let me repeat some important question. It is important because I suspect that our Jenkinsfile has some flaws.

CI testing should not be a random process...

If Jenkins chooses GPU other then MI series, the tests starts failing.

Can you please provide a list of the failing tests and/or an example log of the failure (a copy of log from CI)?

@atamazov
Copy link
Contributor

@bghimireamd @junliume

Currently tests are passing when Jenkins randomly choses MI series GPU.

FYI here is the review thread that is likely related to the uncertainty mentioned above: #1693 (comment)

@junliume
Copy link
Collaborator

junliume commented Nov 17, 2023

@junliume How is it possible that #2481 passed CI?

@atamazov the PR CI now solely perform on gfx908/gf90a with limited smoke test/static code check on anyGPU. I see the risk while since we are getting quite a few gfx90a in our CI pool, I need to figure out a way not making legacy ASIC the bottleneck :(

@bghimireamd please connect with @zjing14 , I believe CK also supports gfx110x so not completely only on MI series.

@atamazov
Copy link
Contributor

@bghimireamd @junliume The CK that we currently use supports gfx906 and gfx1030 for sure.

@bghimireamd
Copy link
Contributor Author

bghimireamd commented Nov 19, 2023

@bghimireamd @junliume Let me repeat some important question. It is important because I suspect that our Jenkinsfile has some flaws.

CI testing should not be a random process...

If Jenkins chooses GPU other then MI series, the tests starts failing.

Can you please provide a list of the failing tests and/or an example log of the failure (a copy of log from CI)?

@atamazov The owner of the PR is not able to test unsupported GPUs because Jenkinsfile randomly selects supported GPUs. Is that one of the flaw?

@bghimireamd
Copy link
Contributor Author

@atamazov @junliume Correcting my earlier statement. All solvers in CK are supported for MI series. But some CK solvers support GPUs other then MI series. Let me find out if there is a list of Solvers and its respective supported GPUs from CK teams.

@atamazov
Copy link
Contributor

@bghimireamd

Let me find out if there is a list of Solvers and its respective supported GPUs from CK teams.

You can simply inspect IsApplicable of the existing convolution solvers. WRT layernorm, yes, CK team can help.

@atamazov
Copy link
Contributor

atamazov commented Nov 20, 2023

@bghimireamd

The owner of the PR is not able to test unsupported GPUs because Jenkinsfile randomly selects supported GPUs. Is that one of the flaw?

It is related. But random selection happens only during smoke testing. After smoke testing, our CI should perform full testing on all supported GPUs, and the "full test" stages must detect all the remaining problems. I suspect there is some issue in the full testing stages.

You can simply inspect IsApplicable of the existing convolution solvers. WRT layernorm, yes, CK team can help.

Maybe it is better to manually test CK-based layernorm solvers on different nodes and set their applicability according to the results (and adjust the applicability of the tests).

💡 Please note that the tests must ensure 100% coverage of functionality. In other words, the applicability of solvers should not be wider than what we test in our CI.

Right away (after #2458) the above requirement means that the applicability of CK-based layernorm solvers should be limited to MI100, 200 and 300. I recommending doing this ASAP.

/cc @junliume @JehandadKhan

@atamazov
Copy link
Contributor

@junliume @bghimireamd @JehandadKhan #2458 disables some CK-based CONV and BN solvers for MI50/60/gfx906, Navi2x, Navi3x. As as mentioned above, this may lead to regressions.

Are we going to resolve this problem? Or do we no longer need CK solvers for MI50/60, Navi2x, Navi3x? I just need to understand what is going on.

@CAHEK7
Copy link
Contributor

CAHEK7 commented Dec 15, 2023

A sidenote - it's not enough just to limit solvers applicability. There is a bunch of CK tests with explicitly enabled Navi2x nodes, they may start failing randomly.
If we decide to go with this PR, the tests should be fixed accordingly.

@atamazov
Copy link
Contributor

@CAHEK7

A sidenote - it's not enough just to limit solvers applicability. There is a bunch of CK tests with explicitly enabled Navi2x nodes, they may start failing...

Yes, but what I am saying at #2533 (comment) is that it seems highly questionable that CK should be disabled for Navi2x/3x and gfx906.

@atamazov
Copy link
Contributor

atamazov commented Dec 22, 2023

@junliume We can remove ON_HOLD I guess?

@junliume
Copy link
Collaborator

junliume commented Dec 22, 2023

@junliume We can remove ON_HOLD I guess?

@atamazov By closing out this PR?

@atamazov
Copy link
Contributor

@junliume I take my words back. Sorry for confusion.

@junliume junliume closed this May 14, 2024
@junliume junliume deleted the restrict_ck_solver_to_MI_series branch June 25, 2024 04:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants