Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pointwise scheduler fails to validate reference tv #3513

Merged
merged 77 commits into from
Dec 18, 2024

Conversation

jjsjann123
Copy link
Collaborator

@jjsjann123 jjsjann123 commented Dec 2, 2024

Fixes: #3512

When picking reference tv, pointwise scheduler fails to validate that the transformation on reference tv can be safely propagated to all outputs in the fusion. The issue occurs when an IterDomain that's not in the reference tv is merged with another dimension in the output tv, preventing the merge on reference tv to be propagated to the target.

This PR adds an optional check areAllOutputIdsMappedTo in nvfuser::pointwise_utils::DomainMap::isValidReference

The added check in this PR checks that all source producer IterDomain producing the IterDomain on outputs are covered by reference tv. This is safe for pointwise scheduler, since the scheduler checks that there's no reversible view present in the fusion.

The check is optional and is disabled by transpose scheduler, where the reference_tv is not supposed to cover the entire fusion, but rather a subset of fusion IO tensors. We should extent that in future PRs.

@jjsjann123
Copy link
Collaborator Author

!test

@jjsjann123 jjsjann123 changed the title Pw scheduler reference find patch [SMOKE TEST] Pw scheduler reference find patch Dec 2, 2024
@jjsjann123
Copy link
Collaborator Author

!test

@jjsjann123
Copy link
Collaborator Author

🤞

@jjsjann123 jjsjann123 changed the title [SMOKE TEST] Pw scheduler reference find patch pointwise scheduler fails to validate reference tv Dec 5, 2024
@jjsjann123
Copy link
Collaborator Author

!test

@jjsjann123 jjsjann123 requested a review from naoyam December 5, 2024 10:37
@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@naoyam
Copy link
Collaborator

naoyam commented Dec 12, 2024

But now I'm worried that this check might not work for pad, where we go from a broadcast ID to non-broadcast ID... I'm checking that with a test.

Ah, this actually can cause an issue. For example, suppose we pick a tensor as a reference that has a broadcast ID, and that broadcast ID comes from a fusion input tensor. Suppose that broadcast ID is also used by a pad op, generating a non-broadcast ID, and that non-broadcast ID is NOT included in the reference.

More specifically:

t0 = [i0, b1] // fusion input
t1 = [i2, i3, i4, b5] // fusion input

// t1: [b6, b7, i0, b1]
t2 = broadcast(t0, [true, true, false, false]) 

t3 = add(t1, t2) // fusion output


// t4: [i0, i8]
t4 = pad(t0) // fusion output

Here, suppose we choose t3 as a reference. Clearly, we can't schedule the i8 domain of t4. However, since the propagation back from i8 reaches at b1, which is also reachable from t3, the reference validation would miss flagging the invalid reference.

@jjsjann123
Copy link
Collaborator Author

jjsjann123 commented Dec 12, 2024

Clearly, we can't schedule the i8 domain of t4.

I was worried about the same thing and I was thinking about changing the get_source_iter_domains here to add output of resize to also be treated as sources.

But turns out we don't need that.
Transform propagation can actually schedule fusion like that from t3.

You can look at the other example I added for pad (there's a typo in the comment on tv0, I'll fix that). It's very similar to yours
I also just tried you example and the scheduling seems to work. I'll add it to the test as well.

I think the difference between this example and the original issue we had is due to resize. Which links the input ID to the output ID, so propagation of transformation works (I think?!) I was planning to ask you that question tomorrow, since it's getting late when I ran into this.

@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@jjsjann123
Copy link
Collaborator Author

errr... what's with CI 😭

@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@jjsjann123 jjsjann123 requested a review from naoyam December 12, 2024 17:46
@jjsjann123
Copy link
Collaborator Author

jjsjann123 commented Dec 12, 2024

My gut feeling, based on my naive understanding about transform propagation, if we actually have an expandOp, or even just use a resize to link the IDs that are expanded, I feel nvfuser might be able to codegen the original issue #3512 without a problem.

The other scenario is also pretty interesting to me. #3576

NOTE for myself. I think I should go have a look at how transform propagation actually works to verify this for a piece of mind.

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix.

@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@jjsjann123
Copy link
Collaborator Author

!test --diff-bench

@jjsjann123 jjsjann123 merged commit a61139c into main Dec 18, 2024
58 checks passed
@jjsjann123 jjsjann123 deleted the pw_scheduler_reference_find_patch branch December 18, 2024 00:06
jacobhinkle added a commit that referenced this pull request Dec 18, 2024
Fixes: #3512

When picking reference tv, pointwise scheduler fails to validate that
the transformation on reference tv can be safely propagated to all
outputs in the fusion. The issue occurs when an IterDomain that's not in
the reference tv is merged with another dimension in the output tv,
preventing the merge on reference tv to be propagated to the target.

This PR adds an optional check `areAllOutputIdsMappedTo` in
`nvfuser::pointwise_utils::DomainMap::isValidReference`

The added check in this PR checks that all source producer IterDomain
producing the IterDomain on outputs are covered by reference tv. This is
safe for pointwise scheduler, since the scheduler checks that there's no
reversible view present in the fusion.

The check is optional and is disabled by transpose scheduler, where the
reference_tv is not supposed to cover the entire fusion, but rather a
subset of fusion IO tensors. We should extent that in future PRs.

---------

Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pointwise scheduler picks invalid reference TensorView
4 participants