
[MetaSchedule] Upstream the leftover changes #10689

Merged

Conversation

junrushao (Member)

This PR upstreams the leftover changes from the MetaSchedule upstreaming process, for performance and stability.

junrushao (Member, Author)

CC: @spectrometerHBH @zxybazh

junrushao force-pushed the feature/2022-03-19/meta-schedule-perf-fix branch 5 times, most recently from a1191cd to 053a2e7 on March 20, 2022 at 17:17
junrushao force-pushed the feature/2022-03-19/meta-schedule-perf-fix branch from 053a2e7 to edce105 on March 20, 2022 at 17:22
comaniac (Contributor) left a comment


LGTM. Just a small nit.

@@ -93,6 +91,7 @@ def main():
cache_line_bytes=64,
max_shared_memory_per_block=int(ARGS.target.attrs["max_shared_memory_per_block"]),
max_threads_per_block=int(ARGS.target.attrs["max_threads_per_block"]),
max_local_memory_per_block=12345678,
comaniac (Contributor)


nit: Have a comment saying this is useless but just a placeholder.
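For instance, the line could carry a short inline note along these lines (the wording here is only a suggestion, not the committed text):

    max_local_memory_per_block=12345678,  # arbitrary placeholder; the value is not used on this target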

comaniac merged commit cced5bc into apache:main on Mar 22, 2022
comaniac (Contributor)

Thanks @junrushao1994

while (true) {
  std::vector<int64_t> result = SamplePerfectTile(rand_state, extent, n_splits);
  if (result.back() <= max_innermost_factor) {
    return result;
  }
}
masahi (Member) commented on Mar 28, 2022


I found an interesting issue with this change. After this PR, my VNNI tuning test hangs at https://gist.github.com/masahi/3579be9b2b9506106270eb2217746e74#file-vnni_dense_tir-py-L164, i.e. the second call in

def schedule_rule_batch_matmul_vnni(sch: tir.Schedule, bmm_block):
    sch_copy = sch.copy()
    schedule_batch_matmul(bmm_block, None, True, sch, layout_trans_compute_root=False)
    schedule_batch_matmul(bmm_block, None, True, sch_copy, layout_trans_compute_root=True)  # <- Hangs at sample_perfect_tile
    return [sch, sch_copy]

If I remove the second schedule_batch_matmul, everything is ok. Apparently, during the second call to sample_perfect_tile, SamplePerfectTile above keeps returning 1024 when max_innermost_factor is 128 and extent is 1024. Does this ring any bell?

junrushao (Member, Author)


Oh oops! So this PR basically loops until the innermost factor satisfies a certain constraint, with the assumption that there is always a valid case where the factor is 1, since 1 <= max_innermost_factor always holds.

To make sure I understand this correctly: in your particular case, extent = 1024, max_innermost_factor = 128, and what is n_splits? And it causes this method to hang forever? If so, @zxybazh or I could probably reproduce this bug and get it fixed :-)
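For illustration, here is a minimal Python sketch of the retry pattern quoted above (not TVM's actual implementation; the helper names are made up). The outer loop terminates only if successive draws actually differ, i.e. only if the random state advances between calls:

import random

def sample_perfect_tile(rng, extent, n_splits):
    # Hypothetical stand-in: draw n_splits factors whose product equals extent.
    factors = []
    remaining = extent
    for _ in range(n_splits - 1):
        divisors = [d for d in range(1, remaining + 1) if remaining % d == 0]
        f = rng.choice(divisors)
        factors.append(f)
        remaining //= f
    factors.append(remaining)
    return factors

def sample_with_innermost_limit(rng, extent, n_splits, max_innermost_factor):
    # Mirrors the C++ retry loop quoted above: resample until the innermost
    # factor fits. If the RNG never advances and the first draw puts 1024 in
    # the innermost slot, this spins forever.
    while True:
        result = sample_perfect_tile(rng, extent, n_splits)
        if result[-1] <= max_innermost_factor:
            return result

print(sample_with_innermost_limit(random.Random(0), 1024, 2, 128))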

masahi (Member)


Yes, extent = 1024, max_innermost_factor = 128 and n_splits=2.

It hangs on the second call to schedule_batch_matmul, where SamplePerfectTile keeps returning 1024. The following alternative definition works fine:

def schedule_rule_batch_matmul_vnni(sch: tir.Schedule, bmm_block):
    schedule_batch_matmul(bmm_block, None, True, sch, layout_trans_compute_root=False)
    return [sch]

I think the issue has to be related to the use of rand_state, since this is the only stateful input to SamplePerfectTile(rand_state, extent, n_splits). Note that I'm applying the same schedule on two copies of sch. Maybe when I make a copy of the schedule via sch_copy = sch.copy(), the RNG engine is not copied?
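A minimal repro sketch of this hypothesis (the workload, the seed, and the extents below are illustrative, not the original test):

from tvm import tir
from tvm.script import tir as T

@T.prim_func
def elemwise(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (1024,), "float32")
    B = T.match_buffer(b, (1024,), "float32")
    for i in T.serial(0, 1024):
        with T.block("B"):
            vi = T.axis.S(1024, i)
            B[vi] = A[vi] + 1.0

sch = tir.Schedule(elemwise, seed=42)
sch_copy = sch.copy()

(i,) = sch.get_loops(sch.get_block("B"))
(i_copy,) = sch_copy.get_loops(sch_copy.get_block("B"))

tiles = sch.sample_perfect_tile(i, n=2, max_innermost_factor=128)
tiles_copy = sch_copy.sample_perfect_tile(i_copy, n=2, max_innermost_factor=128)

# If copy() forks the random state, the two schedules sample independently;
# if the copy keeps a default/stale state, its draws follow a fixed sequence.
print([sch.get(v) for v in tiles])
print([sch_copy.get(v) for v in tiles_copy])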

junrushao (Member, Author) commented on Mar 28, 2022


😱 seems not!!!!!

https://github.com/junrushao1994/tvm/blob/673355b2a423dfbb8f5e311cfd74503ae91fbba9/src/tir/schedule/concrete_schedule.cc#L189

We didn't set the new random state by:

n->rand_state_ = self->ForkSeed(); // (also we need to change the Copy method to a non-const one)

and thus the state is likely always some default value... I'm very surprised that we didn't find this out in previous performance alignment.

Would you mind sending a quick PR fixing this dangerous behavior? Thanks a lot!

masahi (Member)


I did that in commit f9c655a, but it didn't fix the hang problem. I'll look into it more this week.

masahi (Member)


Fixed in #10806

Member


Hi Masa, thanks for the quick fix. Just want to confirm with you: is the hang problem fixed after #10806?

masahi (Member)


Yes, it was fixed in #10806.

pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
* [MetaSchedule] Upstream the leftover changes

* Update driver_api.cc