Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabled TEGroupedMLP test. #22

Merged
merged 3 commits into from
Nov 21, 2024
Merged

Conversation

wenchenvincent
Copy link
Collaborator

No description provided.

@gurpreet-dhami
Copy link
Collaborator

@wenchenvincent , @lcskrishna
Do we want to remove the original markers e.g. "internal" ?
At some point I wanted to run all internal tests as well and mark the failing ones for further debugging ?
I suggest to keep the original markers as it is. And add additional markers ( e.g. test_on_rocm ) for the tests like this ?
It is easier for subsequent IFU updates as well.

@wenchenvincent
Copy link
Collaborator Author

@wenchenvincent , @lcskrishna Do we want to remove the original markers e.g. "internal" ? At some point I wanted to run all internal tests as well and mark the failing ones for further debugging ? I suggest to keep the original markers as it is. And add additional markers ( e.g. test_on_rocm ) for the tests like this ? It is easier for subsequent IFU updates as well.

@gurpreet-dhami We should keep most of the internal markers. For this one, I wanted to enable the specific tests for TEGroupedMLP to make sure that we test it. I like your idea of adding additional markers instead of removing the internal markers. I will make the change accordingly.

@wenchenvincent wenchenvincent force-pushed the enable_te_grouped_mlp_test branch from 9d03142 to b1829e5 Compare November 20, 2024 18:42
@wenchenvincent
Copy link
Collaborator Author

@gurpreet-dhami @lcskrishna I refactored the test markers. Could you please review the change?

@gurpreet-dhami
Copy link
Collaborator

@wenchenvincent : Change looks good to me. I hope the tests are passing for you.
I saw that CI failed on another unrelated test http://ml-ci-internal.amd.com:8080/job/Megatron-LM/job/Megatron-LM/view/change-requests/job/PR-22/3/console
tests/unit_tests/transformer/test_transformer_block.py tests/unit_tests/transformer/test_transformer_block.py ........................FFFFFFFF........Fatal Python error: Aborted

I ran again. lets see how it goes.

@lcskrishna
Copy link
Collaborator

@wenchenvincent : Change looks good to me. I hope the tests are passing for you. I saw that CI failed on another unrelated test http://ml-ci-internal.amd.com:8080/job/Megatron-LM/job/Megatron-LM/view/change-requests/job/PR-22/3/console tests/unit_tests/transformer/test_transformer_block.py tests/unit_tests/transformer/test_transformer_block.py ........................FFFFFFFF........Fatal Python error: Aborted

I ran again. lets see how it goes.

Seems like UTs are failing with this one in another PR as well. @gurpreet-dhami Can you take a look at this one.

@gurpreet-dhami
Copy link
Collaborator

For some reason, CI is failing on your PR on transformer_block tests.

It is passing on rocm_dev branch

passing on rocm_dev - http://ml-ci-internal.amd.com:8080/job/Megatron-LM/job/Megatron-LM/job/rocm_dev/19/

failing on TEgrouped gemm PR: http://ml-ci-internal.amd.com:8080/job/Megatron-LM/job/Megatron-LM/view/change-requests/job/PR-22/4/

@gurpreet-dhami gurpreet-dhami merged commit b0d08df into rocm_dev Nov 21, 2024
0 of 3 checks passed
@gurpreet-dhami gurpreet-dhami deleted the enable_te_grouped_mlp_test branch November 21, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants