Add two new tail strategies for update definitions #7949

abadams · 2023-11-14T23:11:19Z

I've been working on code where I want to vectorize in-place updates on very small images, and our existing loop tail strategies aren't sufficient. ShiftInwards is no good for in-place updates, RoundUp writes beyond the end, and GuardWithIf spends all its time in a scalar loop tail (on archs without predicated loads/stores).

Another effect on small images I see is that you want to partition the loop in x, even in the tail in the loop in y, because you potentially spend a lot of time in the tail of the y loop. I changed Partition::Always to do this.

This PR adds two new tail strategies, along the lines proposed in #7947

Benchmarking results for the tail strategies in the new performance test are as follows (for avx2):

RoundUp 0.00902173
GuardWithIf 0.323874
ShiftInwardsAndBlend 0.0120871
RoundUpAndBlend 0.0157695

This is a test where the output size is not-quite two vectors, so it has one iteration of steady-state and one iteration of tail.

ShiftInwardsAndBlend (the tail strategy I need) is only 1.3x slower than RoundUp, but it's 27x faster than GuardWithIf.

One caveat is that vectorizing with ShiftInwardsAndBlend will do a vector load that overlaps a previous vector store, which stalls for >10 cycles, so you need to take care in the schedule to do enough work before doing the load for the store to make it out of the store buffer. In the test I address this by unrolling the y dimension.

mcourteaux · 2023-11-15T13:39:32Z

Regarding the potential race conditions. You mentioned in the inline comments that:

Because this is a read - modify - write, this tail strategy cannot be
used on any dimension the stage is parallelized over as it would cause a
race condition.

However, there will be a race condition for a parallel outer loop, and a RoundUpAndBlend inner loop, I believe.

Func f;
f(x, y) = 0;
f.parallel(y).vectorize(x, 4, TailStrategy::RoundUpAndBlend);

For an image of 10x10, the vectorized inner loop will do 3 iterations, touching a total of 12 elements, of which 2 elements are beyond the extent of dim 0, which makes it touch the second scanline of the image, which is written by another thread. This looks like a race condition to me.

I don't know if you considered this. At least I don't see a test for it, nor do the inline comments suggest what will happen. After writing this, isn't this the same for simply RoundUp as well? Sorry if I'm being dumb and just wasting your time, but I'm trying to help catching a potential bug early.

abadams · 2023-11-15T16:55:06Z

That's actually safe. Bounds inference for allocation sizes is done separately to bounds inference for computation, and both are done separately per dimension. Bounds inference for allocation sizes is done by sniffing what places the code actually loads from and stores to, so in that case, if it's an internal pipeline stage, it will allocate a 12 x 10 region. If it's a pipeline input or output, it'll throw an out-of-bounds access error.

ShiftInwardsAndBlend is more useful for inputs and outputs - RoundUpAndBlend doesn't really apply there. I think I would use RoundUpAndBlend very rarely - probably only when writing to a subregion of a larger buffer using an RDom.

abadams · 2023-11-15T16:55:57Z

And no need to apologize, I frequently make mistakes, so it's great to have people proactively looking for bugs.

steven-johnson · 2023-11-15T17:13:49Z

That's actually safe. Bounds inference for allocation sizes is done separately to bounds inference for computation, and both are done separately per dimension. Bounds inference for allocation sizes is done by sniffing what places the code actually loads from and stores to, so in that case, if it's an internal pipeline stage, it will allocate a 12 x 10 region. If it's a pipeline input or output, it'll throw an out-of-bounds access error.

ShiftInwardsAndBlend is more useful for inputs and outputs - RoundUpAndBlend doesn't really apply there. I think I would use RoundUpAndBlend very rarely - probably only when writing to a subregion of a larger buffer using an RDom.

Please capture this exchange in a code comment somewhere

abadams · 2023-11-15T17:22:23Z

Done.

abadams · 2023-11-16T17:35:27Z

Failures unrelated (new llvm 18 failures on arm-32, and weird performance results on the new arm-64 buildbot that also occur on main)

TH3CHARLie · 2023-11-17T00:01:56Z

worth mentioning new strategies in the notes

vksnk · 2023-11-27T17:36:57Z

src/Schedule.h

@@ -100,6 +100,32 @@ enum class TailStrategy {
     * instead of a multiple of the split factor as with RoundUp. */
    ShiftInwards,

+    /** Equivalent to ShiftInwards, but protects values that would be


I really think we need to have some kind of tutorial for all of the different variants of TailStrategy: there are a lot of them and some of them are pretty subtle, so it's not really obvious how exactly they work.

No need to block PR on this though, but maybe would be good to create an issue.

vksnk · 2023-11-27T17:55:15Z

src/Func.cpp

+        }
+    }
+
+    for (auto it = sched.splits().begin(); it != sched.splits().end(); it++) {


Might be good to add a similar comment to above that this loop propagates back (thus all of the if conditions and bodies are reversed) from the root to the child vars which of them are connected to parallel var.

steven-johnson · 2023-12-05T18:13:34Z

It would be good to announce this inside Google with some info on how/when to use it -- do we think the existing comments are enough?

* Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com>

* Half-plumbed * Revert "Half-plumbed" This reverts commit eb9dd02. * Interface for double buffer * Update Provides, Calls and Realizes for double buffering * Proper sync for double buffering * Use proper name for the semaphor and use correct initial value * Rename the class * Pass expression for index * Adds storage for double buffering index * Use a separate index to go through the double buffer * Failing test * Better handling of hoisted storage in all of the async-related passes * New test and clean-up the generated IR * More tests * Allow double buffering without async and add corresponding test * Filter out incorrect double_buffer schedules * Add tests to the cmake files * Clean up * Update the comment * Clean up * Clean up * Update serialization * complete_x86_target() should enable F16C and FMA when AVX2 is present (#7971) All known AVX2-enabled architectures definitely have these features. * Add two new tail strategies for update definitions (#7949) * Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com> * Add appropriate mattrs for arm-32 extensions (#7978) * Add appropriate mattrs for arm-32 extensions Fixes #7976 * Pull clauses out of if * Move canonical version numbers into source, not build system (#7980) (#7981) * Move canonical version numbers into source, not build system (#7980) * Fixes * Silence useless "Insufficient parallelism" autoscheduler warning (#7990) * Add a notebook with a visualization of the aprrox_* functions and their errors (#7974) * Add a notebook with a visualization of the aprrox_* functions and their errors * Fix spelling error * Make narrowing float->int casts on wasm go via wider ints (#7973) Fixes #7972 * Fix handling of assert statements whose conditions get vectorized (#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name * Fix all "unscheduled update()" warnings in our code (#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. * fixes * fixes * Update recursive_box_filters.cpp * Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler * Add a tutorial for async and double_buffer * Renamed double_buffer to ring_buffer * ring_buffer() now expects an extent Expr * Actually use extent for ring_buffer() * Address some of the comments * Provide an example of the code structure for producer-consumer async example * Comments updates * Fix clang-format and clang-tidy * Add Python binding for Func::ring_buffer() * Don't use a separate index for ring buffer + add a new test * Rename the tests * Clean up the old name * Add & * Move test to the right folder * Move expr * Add comments for InjectRingBuffering * Improve ring_buffer doc * Fix comments * Comments * A better error message * Mention that extent is expected to be a positive integer * Add another code structure and explain how the indices for ring buffer are computed * Expand test comments * Fix spelling --------- Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>

* Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com>

* Half-plumbed * Revert "Half-plumbed" This reverts commit eb9dd02. * Interface for double buffer * Update Provides, Calls and Realizes for double buffering * Proper sync for double buffering * Use proper name for the semaphor and use correct initial value * Rename the class * Pass expression for index * Adds storage for double buffering index * Use a separate index to go through the double buffer * Failing test * Better handling of hoisted storage in all of the async-related passes * New test and clean-up the generated IR * More tests * Allow double buffering without async and add corresponding test * Filter out incorrect double_buffer schedules * Add tests to the cmake files * Clean up * Update the comment * Clean up * Clean up * Update serialization * complete_x86_target() should enable F16C and FMA when AVX2 is present (halide#7971) All known AVX2-enabled architectures definitely have these features. * Add two new tail strategies for update definitions (halide#7949) * Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com> * Add appropriate mattrs for arm-32 extensions (halide#7978) * Add appropriate mattrs for arm-32 extensions Fixes halide#7976 * Pull clauses out of if * Move canonical version numbers into source, not build system (halide#7980) (halide#7981) * Move canonical version numbers into source, not build system (halide#7980) * Fixes * Silence useless "Insufficient parallelism" autoscheduler warning (halide#7990) * Add a notebook with a visualization of the aprrox_* functions and their errors (halide#7974) * Add a notebook with a visualization of the aprrox_* functions and their errors * Fix spelling error * Make narrowing float->int casts on wasm go via wider ints (halide#7973) Fixes halide#7972 * Fix handling of assert statements whose conditions get vectorized (halide#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name * Fix all "unscheduled update()" warnings in our code (halide#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. * fixes * fixes * Update recursive_box_filters.cpp * Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (halide#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler * Add a tutorial for async and double_buffer * Renamed double_buffer to ring_buffer * ring_buffer() now expects an extent Expr * Actually use extent for ring_buffer() * Address some of the comments * Provide an example of the code structure for producer-consumer async example * Comments updates * Fix clang-format and clang-tidy * Add Python binding for Func::ring_buffer() * Don't use a separate index for ring buffer + add a new test * Rename the tests * Clean up the old name * Add & * Move test to the right folder * Move expr * Add comments for InjectRingBuffering * Improve ring_buffer doc * Fix comments * Comments * A better error message * Mention that extent is expected to be a positive integer * Add another code structure and explain how the indices for ring buffer are computed * Expand test comments * Fix spelling --------- Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>

abadams added 3 commits November 14, 2023 15:03

Add two new tail strategies for update definitions

0504add

Stop printing asm

a6e5b9b

Update expected number of partitions for Partition::Always

07f152a

Add a comment explaining why the blend safety check is per dimension

69fa46c

Add serialization support for the new tail strategies

ccfcb4c

TH3CHARLie added the release_notes For changes that may warrant a note in README for official releases. label Nov 17, 2023

vksnk approved these changes Nov 27, 2023

View reviewed changes

steven-johnson and others added 5 commits November 28, 2023 06:55

Merge branch 'main' into abadams/blend_tail_strategies

1d1eebf

Merge branch 'main' into abadams/blend_tail_strategies

ad128dd

Merge branch 'main' into abadams/blend_tail_strategies

0311797

trigger buildbots

179704e

Add comment

9d064ff

abadams merged commit 17578a1 into main Dec 5, 2023
3 checks passed

steven-johnson deleted the abadams/blend_tail_strategies branch December 5, 2023 20:20

BrewTestBot mentioned this pull request Feb 2, 2024

halide 17.0.0 Homebrew/homebrew-core#161602

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add two new tail strategies for update definitions #7949

Add two new tail strategies for update definitions #7949

abadams commented Nov 14, 2023

mcourteaux commented Nov 15, 2023

abadams commented Nov 15, 2023

abadams commented Nov 15, 2023

steven-johnson commented Nov 15, 2023

abadams commented Nov 15, 2023

abadams commented Nov 16, 2023

TH3CHARLie commented Nov 17, 2023

vksnk Nov 27, 2023

vksnk Nov 27, 2023

steven-johnson commented Dec 5, 2023

Add two new tail strategies for update definitions #7949

Add two new tail strategies for update definitions #7949

Conversation

abadams commented Nov 14, 2023

mcourteaux commented Nov 15, 2023

abadams commented Nov 15, 2023

abadams commented Nov 15, 2023

steven-johnson commented Nov 15, 2023

abadams commented Nov 15, 2023

abadams commented Nov 16, 2023

TH3CHARLie commented Nov 17, 2023

vksnk Nov 27, 2023

Choose a reason for hiding this comment

vksnk Nov 27, 2023

Choose a reason for hiding this comment

steven-johnson commented Dec 5, 2023