[Bug] Long lowering time after #13217 (#13508)
Comments
Hmm, strange, the new flag …
Thank you, and I'm seeing the same behavior with this example. Using f5a102c, the lowering step runs in 0.18s, while using 101e3a4 (just after PR #13217) the lowering step runs in 11.78s, the same ~50x difference in performance that you're seeing. This definitely shouldn't be the case since, as @masahi pointed out, the additional analysis is disabled by default. I'm investigating it.
Looks like the performance degradation is from … It looks like a quick fix may be to disable the … I'm continuing to investigate, to see if this should be disabled, or if something else is wrong with the simplifications. The lowered TIR has a lot of expressions that I would expect to be simplified. For example, that first …
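The shape of this blow-up is easy to reproduce in miniature. The sketch below is purely illustrative (none of these names are TVM's, and the extension behavior is a made-up stand-in): a simplifier that, with an opt-in extension enabled, revisits both operands of every boolean node a second time does exponentially more work on deeply nested `or` chains, which matches the slowdown reported here.

```python
def make_nested_or(depth):
    """Build a nested boolean expression of the given depth:
    ((...(v0 or v1) or v2) ... or v_depth)."""
    expr = ("var", 0)
    for i in range(1, depth + 1):
        expr = ("or", expr, ("var", i))
    return expr

def count_simplify_calls(expr, extensions=False):
    """Count simplifier visits. With `extensions` enabled, each `or` node
    re-analyzes both children a second time (e.g. trying to prove one side
    redundant under the negation of the other), doubling the work per level."""
    calls = 1
    if expr[0] == "var":
        return calls
    _, lhs, rhs = expr
    calls += count_simplify_calls(lhs, extensions)
    calls += count_simplify_calls(rhs, extensions)
    if extensions:
        # The hypothetical extension: revisit both subtrees again.
        calls += count_simplify_calls(lhs, extensions)
        calls += count_simplify_calls(rhs, extensions)
    return calls

print(count_simplify_calls(make_nested_or(10)))                   # 21 visits (linear)
print(count_simplify_calls(make_nested_or(10), extensions=True))  # 4093 visits (exponential)
```

With the flag off, the visit count is linear in the expression size; with it on, it is roughly 2^depth, which is why an analysis that is cheap per-node can still dominate total lowering time.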
@Lunderberg your analysis makes a lot of sense. After removing … In addition, just my two cents about the simplification (correct me if I'm wrong): low-level compilers (e.g., nvcc, llvm) should be capable of simplifying this expression by themselves, so you might not see any performance improvement even if you apply this simplification at the TIR level.
Ideally we should keep the simplifier lightweight. In this case, disabling …
During a `tir.Simplify` pass, these extensions were conditionally enabled based on the `PassContext`. Prior to this commit, they were enabled by default in the `tir.RemoveNoOp` pass, as the simplified expressions were only used to prove/disprove a no-op, and did not appear in the output TIR. However, this caused performance issues for some nested boolean expressions. This PR disables the analyzer extensions for the analyzer used by `tir.RemoveNoOp`. The extensions are still used internally by `ControlFlowGraph`, including during the data-flow analysis used if `tir.transform.RemoveNoOpConfig.use_dataflow_analysis` is enabled, so the opt-in data-dependent no-op removals are unaffected. Related to issue apache#13508.
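The pass described in the commit message can be sketched in a few lines. This is a hypothetical miniature, not TVM's implementation: `is_no_op`, `remove_no_ops`, the tuple-based statement encoding, and the `use_extensions` flag are all invented for illustration; the flag stands in for the opt-in analysis the commit message discusses.

```python
def is_no_op(stmt, use_extensions=False):
    """Return True if `stmt` provably has no observable effect."""
    kind = stmt[0]
    if kind == "evaluate":
        # A bare expression with no side effects does nothing.
        return True
    if kind == "if":
        # ("if", condition, body): a branch whose condition is trivially
        # false can never execute its body.
        _, cond, _body = stmt
        if cond is False:
            return True
        if use_extensions:
            # A more expensive, opt-in analysis could go here, e.g.
            # proving the condition unsatisfiable from known value ranges.
            pass
    return False

def remove_no_ops(stmts, use_extensions=False):
    """Drop every statement the analyzer proves to be a no-op."""
    return [s for s in stmts if not is_no_op(s, use_extensions)]

prog = [
    ("store", "A", 0, 1),                        # has an effect: kept
    ("evaluate", 42),                            # pure expression: dropped
    ("if", False, [("store", "A", 1, 2)]),       # dead branch: dropped
]
print(remove_no_ops(prog))  # [('store', 'A', 0, 1)]
```

The design point the commit makes is that the proof used to *remove* a statement never appears in the output, so the prover can afford to be either cheap (default) or thorough (opt-in) without changing the emitted TIR beyond which statements survive.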
@comaniac @tqchen I've submitted #13524, which disables the use of the simplifier extensions by …
I was curious about this, and did some benchmarks after modifying … Since there was such a benefit, I'm going to clean up and PR those changes to …
Thanks for the fix and investigation. Apparently the LLVM backend isn't aware of this transform. If possible, could you also benchmark on GPU to test nvcc? Based on the benchmark results, I agree that we should include this transform once we make it reasonably lightweight.
Testing on the GPU, with both the CUDA and Vulkan backends (nvidia-driver-470 on Ubuntu 21.04), it shows a pretty similar effect. It isn't quite as dramatic, only 50x slower instead of 1000x, but it's still quite a large effect. Both GPU tests were done with the same compute definition, but with …

The specific fix here (the int vs. float indexing) wasn't on the transformation side, but a change to the topi operator. The nice thing is that it can be a lot more general, and can convert floating point numbers to integer ratios (e.g. the …
Prior to this commit, floating point expressions were used to map between different-sized pixel arrays. These floating point expressions are less aggressively optimized by `RewriteSimplifier`, which can prevent some optimizations. This was first noticed during investigation into issue apache#13508. Benchmarks of `topi.image.resize` showed 1000x and 50x performance improvements using the LLVM and CUDA backends, respectively, by using integer expressions instead of floating point. This performance improvement is partly driven by enabling `tir.transform.VectorizeLoops` to recognize vectorizable indices, where the round-trip through floating point previously prevented that optimization.
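The integer-ratio idea in the commit message can be sketched concretely. This is an illustrative sketch, not the actual `topi.image.resize` code: it assumes resize maps each output pixel `i` to a source pixel through a scale factor, and contrasts the floating-point form with an equivalent pure-integer form.

```python
def src_index_float(i, src_size, dst_size):
    """Index mapping via a floating point scale: the index expression
    round-trips through float, which is opaque to integer simplifiers
    and vectorizers."""
    scale = src_size / dst_size
    return int(i * scale)

def src_index_int(i, src_size, dst_size):
    """Same mapping expressed as an exact integer ratio:
    floor(i * src_size / dst_size) in pure integer arithmetic."""
    return (i * src_size) // dst_size

# Both produce the same indices, but only the integer form stays inside
# integer arithmetic that a pass like VectorizeLoops can reason about.
for i in range(64):
    assert src_index_float(i, 100, 64) == src_index_int(i, 100, 64)
```

The integer form is also more general than it looks: any finite-precision float scale can be rewritten as such a ratio, so the rewrite applies beyond this one operator.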
Can we close this?
Yeah, this particular issue has been resolved. Closed.
Expected behavior
The lowering time of the given case should be around 10 seconds.
Actual behavior
The lowering time is more than 550 seconds.
Environment
Any environment with commit 101e3a4 (#13217) or later.
Steps to reproduce
The script:
Here are also the lowered IRs without and with this commit:
Without this commit:
With this commit:
The IRs are pretty much identical, so the slowdown may be due to a change in the lowering passes.
cc @Lunderberg @masahi