-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add two new tail strategies for update definitions #7949
Conversation
Regarding the potential race conditions. You mentioned in the inline comments that:
However, there will be a race condition for a parallel outer loop, and a Func f;
f(x, y) = 0;
f.parallel(y).vectorize(x, 4, TailStrategy::RoundUpAndBlend); For an image of 10x10, the vectorized inner loop will do 3 iterations, touching a total of 12 elements, of which 2 elements are beyond the extent of dim 0, which makes it touch the second scanline of the image, which is written by another thread. This looks like a race condition to me. I don't know if you considered this. At least I don't see a test for it, nor do the inline comments suggest what will happen. After writing this, isn't this the same for simply |
That's actually safe. Bounds inference for allocation sizes is done separately to bounds inference for computation, and both are done separately per dimension. Bounds inference for allocation sizes is done by sniffing what places the code actually loads from and stores to, so in that case, if it's an internal pipeline stage, it will allocate a 12 x 10 region. If it's a pipeline input or output, it'll throw an out-of-bounds access error. ShiftInwardsAndBlend is more useful for inputs and outputs - RoundUpAndBlend doesn't really apply there. I think I would use RoundUpAndBlend very rarely - probably only when writing to a subregion of a larger buffer using an RDom. |
And no need to apologize, I frequently make mistakes, so it's great to have people proactively looking for bugs. |
Please capture this exchange in a code comment somewhere |
Done. |
Failures unrelated (new llvm 18 failures on arm-32, and weird performance results on the new arm-64 buildbot that also occur on main) |
worth mentioning new strategies in the notes |
@@ -100,6 +100,32 @@ enum class TailStrategy { | |||
* instead of a multiple of the split factor as with RoundUp. */ | |||
ShiftInwards, | |||
|
|||
/** Equivalent to ShiftInwards, but protects values that would be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we need to have some kind of tutorial for all of the different variants of TailStrategy: there are a lot of them and some of them are pretty subtle, so it's not really obvious how exactly they work.
No need to block PR on this though, but maybe would be good to create an issue.
} | ||
} | ||
|
||
for (auto it = sched.splits().begin(); it != sched.splits().end(); it++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add a similar comment to above that this loop propagates back (thus all of the if conditions and bodies are reversed) from the root to the child vars which of them are connected to parallel var.
It would be good to announce this inside Google with some info on how/when to use it -- do we think the existing comments are enough? |
* Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com>
* Half-plumbed * Revert "Half-plumbed" This reverts commit eb9dd02. * Interface for double buffer * Update Provides, Calls and Realizes for double buffering * Proper sync for double buffering * Use proper name for the semaphor and use correct initial value * Rename the class * Pass expression for index * Adds storage for double buffering index * Use a separate index to go through the double buffer * Failing test * Better handling of hoisted storage in all of the async-related passes * New test and clean-up the generated IR * More tests * Allow double buffering without async and add corresponding test * Filter out incorrect double_buffer schedules * Add tests to the cmake files * Clean up * Update the comment * Clean up * Clean up * Update serialization * complete_x86_target() should enable F16C and FMA when AVX2 is present (#7971) All known AVX2-enabled architectures definitely have these features. * Add two new tail strategies for update definitions (#7949) * Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com> * Add appropriate mattrs for arm-32 extensions (#7978) * Add appropriate mattrs for arm-32 extensions Fixes #7976 * Pull clauses out of if * Move canonical version numbers into source, not build system (#7980) (#7981) * Move canonical version numbers into source, not build system (#7980) * Fixes * Silence useless "Insufficient parallelism" autoscheduler warning (#7990) * Add a notebook with a visualization of the aprrox_* functions and their errors (#7974) * Add a notebook with a visualization of the aprrox_* functions and their errors * Fix spelling error * Make narrowing float->int casts on wasm go via wider ints (#7973) Fixes #7972 * Fix handling of assert statements whose conditions get vectorized (#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name * Fix all "unscheduled update()" warnings in our code (#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. * fixes * fixes * Update recursive_box_filters.cpp * Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler * Add a tutorial for async and double_buffer * Renamed double_buffer to ring_buffer * ring_buffer() now expects an extent Expr * Actually use extent for ring_buffer() * Address some of the comments * Provide an example of the code structure for producer-consumer async example * Comments updates * Fix clang-format and clang-tidy * Add Python binding for Func::ring_buffer() * Don't use a separate index for ring buffer + add a new test * Rename the tests * Clean up the old name * Add & * Move test to the right folder * Move expr * Add comments for InjectRingBuffering * Improve ring_buffer doc * Fix comments * Comments * A better error message * Mention that extent is expected to be a positive integer * Add another code structure and explain how the indices for ring buffer are computed * Expand test comments * Fix spelling --------- Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
* Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com>
* Half-plumbed * Revert "Half-plumbed" This reverts commit eb9dd02. * Interface for double buffer * Update Provides, Calls and Realizes for double buffering * Proper sync for double buffering * Use proper name for the semaphor and use correct initial value * Rename the class * Pass expression for index * Adds storage for double buffering index * Use a separate index to go through the double buffer * Failing test * Better handling of hoisted storage in all of the async-related passes * New test and clean-up the generated IR * More tests * Allow double buffering without async and add corresponding test * Filter out incorrect double_buffer schedules * Add tests to the cmake files * Clean up * Update the comment * Clean up * Clean up * Update serialization * complete_x86_target() should enable F16C and FMA when AVX2 is present (halide#7971) All known AVX2-enabled architectures definitely have these features. * Add two new tail strategies for update definitions (halide#7949) * Add two new tail strategies for update definitions * Stop printing asm * Update expected number of partitions for Partition::Always * Add a comment explaining why the blend safety check is per dimension * Add serialization support for the new tail strategies * trigger buildbots * Add comment --------- Co-authored-by: Steven Johnson <srj@google.com> * Add appropriate mattrs for arm-32 extensions (halide#7978) * Add appropriate mattrs for arm-32 extensions Fixes halide#7976 * Pull clauses out of if * Move canonical version numbers into source, not build system (halide#7980) (halide#7981) * Move canonical version numbers into source, not build system (halide#7980) * Fixes * Silence useless "Insufficient parallelism" autoscheduler warning (halide#7990) * Add a notebook with a visualization of the aprrox_* functions and their errors (halide#7974) * Add a notebook with a visualization of the aprrox_* functions and their errors * Fix spelling error * Make narrowing float->int casts on wasm go via wider ints (halide#7973) Fixes halide#7972 * Fix handling of assert statements whose conditions get vectorized (halide#7989) * Fix handling of assert statements whose conditions get vectorized * Fix test name * Fix all "unscheduled update()" warnings in our code (halide#7991) * Fix all "unscheduled update()" warnings in our code And also fix the Mullapudi scheduler to explicitly touch all update stages. This allows us to mark this warning as an error if we so choose. * fixes * fixes * Update recursive_box_filters.cpp * Silence useless 'Outer dim vectorization of var' warning in Mullapudi… (halide#7992) Silence useless 'Outer dim vectorization of var' warning in Mullapudi scheduler * Add a tutorial for async and double_buffer * Renamed double_buffer to ring_buffer * ring_buffer() now expects an extent Expr * Actually use extent for ring_buffer() * Address some of the comments * Provide an example of the code structure for producer-consumer async example * Comments updates * Fix clang-format and clang-tidy * Add Python binding for Func::ring_buffer() * Don't use a separate index for ring buffer + add a new test * Rename the tests * Clean up the old name * Add & * Move test to the right folder * Move expr * Add comments for InjectRingBuffering * Improve ring_buffer doc * Fix comments * Comments * A better error message * Mention that extent is expected to be a positive integer * Add another code structure and explain how the indices for ring buffer are computed * Expand test comments * Fix spelling --------- Co-authored-by: Steven Johnson <srj@google.com> Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>
I've been working on code where I want to vectorize in-place updates on very small images, and our existing loop tail strategies aren't sufficient. ShiftInwards is no good for in-place updates, RoundUp writes beyond the end, and GuardWithIf spends all its time in a scalar loop tail (on archs without predicated loads/stores).
Another effect on small images I see is that you want to partition the loop in x, even in the tail in the loop in y, because you potentially spend a lot of time in the tail of the y loop. I changed Partition::Always to do this.
This PR adds two new tail strategies, along the lines proposed in #7947
Benchmarking results for the tail strategies in the new performance test are as follows (for avx2):
This is a test where the output size is not-quite two vectors, so it has one iteration of steady-state and one iteration of tail.
ShiftInwardsAndBlend (the tail strategy I need) is only 1.3x slower than RoundUp, but it's 27x faster than GuardWithIf.
One caveat is that vectorizing with ShiftInwardsAndBlend will do a vector load that overlaps a previous vector store, which stalls for >10 cycles, so you need to take care in the schedule to do enough work before doing the load for the store to make it out of the store buffer. In the test I address this by unrolling the y dimension.