Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add two new tail strategies for update definitions #7949

Merged
merged 10 commits into from
Dec 5, 2023

Conversation

abadams
Copy link
Member

@abadams abadams commented Nov 14, 2023

I've been working on code where I want to vectorize in-place updates on very small images, and our existing loop tail strategies aren't sufficient. ShiftInwards is no good for in-place updates, RoundUp writes beyond the end, and GuardWithIf spends all its time in a scalar loop tail (on archs without predicated loads/stores).

Another effect on small images I see is that you want to partition the loop in x, even in the tail in the loop in y, because you potentially spend a lot of time in the tail of the y loop. I changed Partition::Always to do this.

This PR adds two new tail strategies, along the lines proposed in #7947

Benchmarking results for the tail strategies in the new performance test are as follows (for avx2):

RoundUp 0.00902173
GuardWithIf 0.323874
ShiftInwardsAndBlend 0.0120871
RoundUpAndBlend 0.0157695

This is a test where the output size is not-quite two vectors, so it has one iteration of steady-state and one iteration of tail.

ShiftInwardsAndBlend (the tail strategy I need) is only 1.3x slower than RoundUp, but it's 27x faster than GuardWithIf.

One caveat is that vectorizing with ShiftInwardsAndBlend will do a vector load that overlaps a previous vector store, which stalls for >10 cycles, so you need to take care in the schedule to do enough work before doing the load for the store to make it out of the store buffer. In the test I address this by unrolling the y dimension.

@mcourteaux
Copy link
Contributor

Regarding the potential race conditions. You mentioned in the inline comments that:

Because this is a read - modify - write, this tail strategy cannot be
used on any dimension the stage is parallelized over as it would cause a
race condition.

However, there will be a race condition for a parallel outer loop, and a RoundUpAndBlend inner loop, I believe.

Func f;
f(x, y) = 0;
f.parallel(y).vectorize(x, 4, TailStrategy::RoundUpAndBlend);

For an image of 10x10, the vectorized inner loop will do 3 iterations, touching a total of 12 elements, of which 2 elements are beyond the extent of dim 0, which makes it touch the second scanline of the image, which is written by another thread. This looks like a race condition to me.

I don't know if you considered this. At least I don't see a test for it, nor do the inline comments suggest what will happen. After writing this, isn't this the same for simply RoundUp as well? Sorry if I'm being dumb and just wasting your time, but I'm trying to help catching a potential bug early.

@abadams
Copy link
Member Author

abadams commented Nov 15, 2023

That's actually safe. Bounds inference for allocation sizes is done separately to bounds inference for computation, and both are done separately per dimension. Bounds inference for allocation sizes is done by sniffing what places the code actually loads from and stores to, so in that case, if it's an internal pipeline stage, it will allocate a 12 x 10 region. If it's a pipeline input or output, it'll throw an out-of-bounds access error.

ShiftInwardsAndBlend is more useful for inputs and outputs - RoundUpAndBlend doesn't really apply there. I think I would use RoundUpAndBlend very rarely - probably only when writing to a subregion of a larger buffer using an RDom.

@abadams
Copy link
Member Author

abadams commented Nov 15, 2023

And no need to apologize, I frequently make mistakes, so it's great to have people proactively looking for bugs.

@steven-johnson
Copy link
Contributor

That's actually safe. Bounds inference for allocation sizes is done separately to bounds inference for computation, and both are done separately per dimension. Bounds inference for allocation sizes is done by sniffing what places the code actually loads from and stores to, so in that case, if it's an internal pipeline stage, it will allocate a 12 x 10 region. If it's a pipeline input or output, it'll throw an out-of-bounds access error.

ShiftInwardsAndBlend is more useful for inputs and outputs - RoundUpAndBlend doesn't really apply there. I think I would use RoundUpAndBlend very rarely - probably only when writing to a subregion of a larger buffer using an RDom.

Please capture this exchange in a code comment somewhere

@abadams
Copy link
Member Author

abadams commented Nov 15, 2023

Done.

@abadams
Copy link
Member Author

abadams commented Nov 16, 2023

Failures unrelated (new llvm 18 failures on arm-32, and weird performance results on the new arm-64 buildbot that also occur on main)

@TH3CHARLie TH3CHARLie added the release_notes For changes that may warrant a note in README for official releases. label Nov 17, 2023
@TH3CHARLie
Copy link
Contributor

worth mentioning new strategies in the notes

@@ -100,6 +100,32 @@ enum class TailStrategy {
* instead of a multiple of the split factor as with RoundUp. */
ShiftInwards,

/** Equivalent to ShiftInwards, but protects values that would be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we need to have some kind of tutorial for all of the different variants of TailStrategy: there are a lot of them and some of them are pretty subtle, so it's not really obvious how exactly they work.

No need to block PR on this though, but maybe would be good to create an issue.

}
}

for (auto it = sched.splits().begin(); it != sched.splits().end(); it++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add a similar comment to above that this loop propagates back (thus all of the if conditions and bodies are reversed) from the root to the child vars which of them are connected to parallel var.

@abadams abadams merged commit 17578a1 into main Dec 5, 2023
3 checks passed
@steven-johnson
Copy link
Contributor

It would be good to announce this inside Google with some info on how/when to use it -- do we think the existing comments are enough?

@steven-johnson steven-johnson deleted the abadams/blend_tail_strategies branch December 5, 2023 20:20
vksnk pushed a commit that referenced this pull request Dec 7, 2023
* Add two new tail strategies for update definitions

* Stop printing asm

* Update expected number of partitions for Partition::Always

* Add a comment explaining why the blend safety check is per dimension

* Add serialization support for the new tail strategies

* trigger buildbots

* Add comment

---------

Co-authored-by: Steven Johnson <srj@google.com>
vksnk added a commit that referenced this pull request Dec 19, 2023
* Half-plumbed

* Revert "Half-plumbed"

This reverts commit eb9dd02.

* Interface for double buffer

* Update Provides, Calls and Realizes for double buffering

* Proper sync for double buffering

* Use proper name for the semaphor and use correct initial value

* Rename the class

* Pass expression for index

* Adds storage for double buffering index

* Use a separate index to go through the double buffer

* Failing test

* Better handling of hoisted storage in all of the async-related passes

* New test and clean-up the generated IR

* More tests

* Allow double buffering without async and add corresponding test

* Filter out incorrect double_buffer schedules

* Add tests to the cmake files

* Clean up

* Update the comment

* Clean up

* Clean up

* Update serialization

* complete_x86_target() should enable F16C and FMA when AVX2 is present (#7971)

All known AVX2-enabled architectures definitely have these features.

* Add two new tail strategies for update definitions (#7949)

* Add two new tail strategies for update definitions

* Stop printing asm

* Update expected number of partitions for Partition::Always

* Add a comment explaining why the blend safety check is per dimension

* Add serialization support for the new tail strategies

* trigger buildbots

* Add comment

---------

Co-authored-by: Steven Johnson <srj@google.com>

* Add appropriate mattrs for arm-32 extensions (#7978)

* Add appropriate mattrs for arm-32 extensions

Fixes #7976

* Pull clauses out of if

* Move canonical version numbers into source, not build system (#7980) (#7981)

* Move canonical version numbers into source, not build system (#7980)

* Fixes

* Silence useless "Insufficient parallelism" autoscheduler warning (#7990)

* Add a notebook with a visualization of the aprrox_* functions and their errors (#7974)

* Add a notebook with a visualization of the aprrox_* functions and their errors

* Fix spelling error

* Make narrowing float->int casts on wasm go via wider ints (#7973)

Fixes #7972

* Fix handling of assert statements whose conditions get vectorized (#7989)

* Fix handling of assert statements whose conditions get vectorized

* Fix test name

* Fix all "unscheduled update()" warnings in our code (#7991)

* Fix all "unscheduled update()" warnings in our code

And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose.

* fixes

* fixes

* Update recursive_box_filters.cpp

* Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (#7992)

Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler

* Add a tutorial for async and double_buffer

* Renamed double_buffer to ring_buffer

* ring_buffer() now expects an extent Expr

* Actually use extent for ring_buffer()

* Address some of the comments

* Provide an example of the code structure for producer-consumer async example

* Comments updates

* Fix clang-format and clang-tidy

* Add Python binding for Func::ring_buffer()

* Don't use a separate index for ring buffer + add a new test

* Rename the tests

* Clean up the old name

* Add &

* Move test to the right folder

* Move expr

* Add comments for InjectRingBuffering

* Improve ring_buffer doc

* Fix comments

* Comments

* A better error message

* Mention that extent is expected to be a positive integer

* Add another code structure and explain how the indices for ring buffer are computed

* Expand test comments

* Fix spelling

---------

Co-authored-by: Steven Johnson <srj@google.com>
Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* Add two new tail strategies for update definitions

* Stop printing asm

* Update expected number of partitions for Partition::Always

* Add a comment explaining why the blend safety check is per dimension

* Add serialization support for the new tail strategies

* trigger buildbots

* Add comment

---------

Co-authored-by: Steven Johnson <srj@google.com>
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* Half-plumbed

* Revert "Half-plumbed"

This reverts commit eb9dd02.

* Interface for double buffer

* Update Provides, Calls and Realizes for double buffering

* Proper sync for double buffering

* Use proper name for the semaphor and use correct initial value

* Rename the class

* Pass expression for index

* Adds storage for double buffering index

* Use a separate index to go through the double buffer

* Failing test

* Better handling of hoisted storage in all of the async-related passes

* New test and clean-up the generated IR

* More tests

* Allow double buffering without async and add corresponding test

* Filter out incorrect double_buffer schedules

* Add tests to the cmake files

* Clean up

* Update the comment

* Clean up

* Clean up

* Update serialization

* complete_x86_target() should enable F16C and FMA when AVX2 is present (halide#7971)

All known AVX2-enabled architectures definitely have these features.

* Add two new tail strategies for update definitions (halide#7949)

* Add two new tail strategies for update definitions

* Stop printing asm

* Update expected number of partitions for Partition::Always

* Add a comment explaining why the blend safety check is per dimension

* Add serialization support for the new tail strategies

* trigger buildbots

* Add comment

---------

Co-authored-by: Steven Johnson <srj@google.com>

* Add appropriate mattrs for arm-32 extensions (halide#7978)

* Add appropriate mattrs for arm-32 extensions

Fixes halide#7976

* Pull clauses out of if

* Move canonical version numbers into source, not build system (halide#7980) (halide#7981)

* Move canonical version numbers into source, not build system (halide#7980)

* Fixes

* Silence useless "Insufficient parallelism" autoscheduler warning (halide#7990)

* Add a notebook with a visualization of the aprrox_* functions and their errors (halide#7974)

* Add a notebook with a visualization of the aprrox_* functions and their errors

* Fix spelling error

* Make narrowing float->int casts on wasm go via wider ints (halide#7973)

Fixes halide#7972

* Fix handling of assert statements whose conditions get vectorized (halide#7989)

* Fix handling of assert statements whose conditions get vectorized

* Fix test name

* Fix all "unscheduled update()" warnings in our code (halide#7991)

* Fix all "unscheduled update()" warnings in our code

And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose.

* fixes

* fixes

* Update recursive_box_filters.cpp

* Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (halide#7992)

Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler

* Add a tutorial for async and double_buffer

* Renamed double_buffer to ring_buffer

* ring_buffer() now expects an extent Expr

* Actually use extent for ring_buffer()

* Address some of the comments

* Provide an example of the code structure for producer-consumer async example

* Comments updates

* Fix clang-format and clang-tidy

* Add Python binding for Func::ring_buffer()

* Don't use a separate index for ring buffer + add a new test

* Rename the tests

* Clean up the old name

* Add &

* Move test to the right folder

* Move expr

* Add comments for InjectRingBuffering

* Improve ring_buffer doc

* Fix comments

* Comments

* A better error message

* Mention that extent is expected to be a positive integer

* Add another code structure and explain how the indices for ring buffer are computed

* Expand test comments

* Fix spelling

---------

Co-authored-by: Steven Johnson <srj@google.com>
Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants