[perf] Fix Taichi CPU backend compile parameter to match Numba's performance. #7731
Conversation
Per our offline discussion, one major concern is that these FastMath changes might induce performance regressions on CPU backends, which our perf bot does not actively monitor yet.
Can we add some descriptions or profiling results (with demos under python/taichi/examples) to demonstrate the influence on CPU performance?
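As a rough illustration of the kind of CPU profiling being asked for, here is a minimal sketch (the kernel, problem size, and timing harness are my own assumptions, not code from this PR; the reduction-into-a-local-variable pattern follows common Taichi usage):

```python
import time
import taichi as ti

ti.init(arch=ti.cpu)

N = 10_000_000

@ti.kernel
def leibniz_pi() -> ti.f32:
    # Leibniz series: a compute-bound loop that benefits from auto-vectorization.
    s = 0.0
    for i in range(N):
        s += (1.0 - 2.0 * (i % 2)) / (2 * i + 1)
    return 4.0 * s

leibniz_pi()  # warm-up run so JIT compilation is excluded from the timing

t0 = time.perf_counter()
for _ in range(10):
    leibniz_pi()
print(f"avg kernel time: {(time.perf_counter() - t0) / 10 * 1e3:.2f} ms")
```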
/benchmark
@feisuzhu This PR changes CPU performance, so the effect might not be observable with /benchmark.
LGTM!
Based on your test, it seems that the performance of mpm88 has not been affected by this PR. That's expected: I only added some `fast_math` flags during codegen, which currently only fixes the case where the pi benchmark fails to be automatically vectorized. I hope this clarifies the situation! Let me know if you have any other questions.
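For context, Taichi already exposes `fast_math` as a user-facing option at initialization (on by default); this PR is about actually threading that setting through to the CPU codegen. A minimal sketch of the user-side switch:

```python
import taichi as ti

# fast_math defaults to True; with this PR the flag now actually reaches
# the CPU codegen, allowing LLVM to auto-vectorize floating-point loops.
ti.init(arch=ti.cpu, fast_math=True)
```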
Here's this PR's performance: (screenshot of benchmark output)
Here's master's performance: (screenshot of benchmark output)
Issue: #7442
Brief Summary
In this issue, Numba is an order of magnitude faster than Taichi due to the absence of automatic vectorization. The root cause is that the `fast_flag` was not passed through correctly. To solve this problem, `fast_flag` is now added to the initialization of the CPU codegen, and Numba and Taichi now show comparable performance.
Here's the perf comparison:
numba: 13052.542478 MFlops
taichi (master): 6544.274409 MFlops
taichi (this PR): 12778.240179 MFlops
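The shape of the comparison in issue #7442 is roughly the following sketch (the kernel body and `N` are illustrative assumptions on my part; the MFlops figures above come from the author's run, not from this snippet). With fast-math IR, LLVM auto-vectorizes the Numba loop; before this PR the equivalent Taichi loop stayed scalar on CPU:

```python
import numba
import taichi as ti

N = 100_000_000

@numba.njit(fastmath=True)
def pi_numba():
    # Numba emits fast-math LLVM IR here, so the loop auto-vectorizes.
    s = 0.0
    for i in range(N):
        s += (1.0 - 2.0 * (i % 2)) / (2 * i + 1)
    return 4.0 * s

ti.init(arch=ti.cpu, default_fp=ti.f64)

@ti.kernel
def pi_taichi() -> ti.f64:
    # Before this PR, fast_math never reached the CPU codegen, so this
    # loop was left scalar and ran at roughly half the Numba speed.
    s = 0.0
    for i in range(N):
        s += (1.0 - 2.0 * (i % 2)) / (2 * i + 1)
    return 4.0 * s

print(pi_numba(), pi_taichi())
```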