
[perf] Fix Taichi CPU backend compile parameter to pair performance with Numba. #7731

Merged
merged 7 commits into taichi-dev:master on Apr 13, 2023

Conversation

zxlbig
Contributor

@zxlbig zxlbig commented Apr 4, 2023

Issue: #7442

Brief Summary

In that issue, Numba is an order of magnitude faster than Taichi because Taichi's CPU code is not auto-vectorized.
The root cause is that the fast_flag was not passed correctly.

To fix this, fast_flag is now passed when the CPU codegen is initialized, and Numba and Taichi now show comparable performance.
Here's the perf comparison:
numba: 13052.542478 MFlops
taichi (master): 6544.274409 MFlops
taichi (this PR): 12778.240179 MFlops
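For context, here is a minimal, runnable sketch of the kind of benchmark being compared (the exact kernel lives in issue #7442; the Leibniz-series kernel and the iteration count below are illustrative assumptions, not the original script):

```python
import taichi as ti

# fast_math defaults to True; this PR makes the CPU codegen actually receive the
# flag, which allows LLVM to relax FP semantics and auto-vectorize the reduction.
ti.init(arch=ti.cpu, fast_math=True)

@ti.kernel
def calc_pi(n: ti.i32) -> ti.f32:
    total = 0.0
    for i in range(n):  # top-level range-for is parallelized; += is reduced atomically
        sign = ti.select(i % 2 == 0, 1.0, -1.0)
        total += sign / (2.0 * i + 1.0)
    return 4.0 * total

print(calc_pi(10_000_000))  # ≈ 3.14
```

Without fast-math, strict IEEE semantics forbid reassociating the floating-point sum, which is what blocks vectorization of a reduction loop like this one.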

@netlify

netlify bot commented Apr 4, 2023

Deploy Preview for docsite-preview ready!

Name Link
🔨 Latest commit 4313338
🔍 Latest deploy log https://app.netlify.com/sites/docsite-preview/deploys/642fd48dacd1ec00080d1a60
😎 Deploy Preview https://deploy-preview-7731--docsite-preview.netlify.app

@CLAassistant

CLAassistant commented Apr 4, 2023

CLA assistant check
All committers have signed the CLA.

@zxlbig zxlbig marked this pull request as draft April 4, 2023 04:57
@turbo0628 turbo0628 changed the title [Perf] CPU Optimize Taichi to achieve the same performance as Numba (single-threaded) for computing pi. [perf] Fix Taichi CPU backend compile parameter to pair performance with Numba. Apr 4, 2023
@turbo0628 turbo0628 requested a review from jim19930609 April 4, 2023 05:18
Contributor

@jim19930609 jim19930609 left a comment


Per our offline discussion, one of the major concerns is that these FastMath changes might cause performance regressions on CPU backends, which our perf bot does not actively monitor yet.

Can we add some descriptions or profiling results (with demos under python/taichi/examples) to demonstrate the impact on CPU performance?

@feisuzhu
Contributor

feisuzhu commented Apr 6, 2023

/benchmark

@turbo0628
Member

@feisuzhu This PR changes CPU performance, so the effect might not be observable with /benchmark.

Member

@turbo0628 turbo0628 left a comment


LGTM!

@zxlbig
Contributor Author

zxlbig commented Apr 10, 2023

Per our offline discussion, one of the major concerns is that these FastMath changes might cause performance regressions on CPU backends, which our perf bot does not actively monitor yet.

Can we add some descriptions or profiling results (with demos under python/taichi/examples) to demonstrate the impact on CPU performance?

Based on your test, it seems that the performance of mpm88 is not affected by this PR. That is because I only added some fast_math flags during codegen, which for now only fixes the case where the pi kernel fails to auto-vectorize.

I hope this helps clarify the situation! Let me know if you have any other questions.
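On the regression concern above: fast_math is an existing ti.init option (enabled by default), so a user who does see CPU accuracy or performance regressions from the relaxed floating-point semantics can opt out per program. A minimal sketch:

```python
import taichi as ti

# Opt out of fast-math if strict IEEE floating-point behavior is required;
# this also gives up the relaxed-FP transforms that enable the vectorization above.
ti.init(arch=ti.cpu, fast_math=False)
```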

@zxlbig zxlbig marked this pull request as ready for review April 10, 2023 00:48
@zxlbig
Contributor Author

zxlbig commented Apr 10, 2023


Here's this PR's performance:

========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 83.67%   0.046 s     50x |    0.884     0.915     0.982 ms] substep_c74_0_kernel_3_range_for
[ 10.62%   0.006 s     50x |    0.102     0.116     0.200 ms] substep_c74_0_kernel_2_range_for
[  3.21%   0.002 s     50x |    0.022     0.035     0.100 ms] substep_c74_0_kernel_1_range_for
[  2.30%   0.001 s     50x |    0.013     0.025     0.088 ms] substep_c74_0_kernel_0_range_for
[  0.20%   0.000 s      1x |    0.108     0.108     0.108 ms] init_c76_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:   0.055 s   number of results: 5
=========================================================================

Here's the performance on master:

=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 83.88%   0.045 s     50x |    0.884     0.907     0.958 ms] substep_c74_0_kernel_3_range_for
[ 10.39%   0.006 s     50x |    0.101     0.112     0.176 ms] substep_c74_0_kernel_2_range_for
[  3.17%   0.002 s     50x |    0.022     0.034     0.083 ms] substep_c74_0_kernel_1_range_for
[  2.42%   0.001 s     50x |    0.013     0.026     0.069 ms] substep_c74_0_kernel_0_range_for
[  0.14%   0.000 s      1x |    0.076     0.076     0.076 ms] init_c76_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:   0.054 s   number of results: 5
=========================================================================
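For reference, here is a minimal sketch of how tables like the ones above can be produced with Taichi's built-in kernel profiler. The field and dummy kernel are made up for illustration (the real numbers come from running the mpm88 example), and the printing call assumes the ti.profiler API of current Taichi releases:

```python
import taichi as ti

# Enable the kernel profiler so per-kernel min/avg/max timings are recorded.
ti.init(arch=ti.cpu, kernel_profiler=True)

x = ti.field(ti.f32, shape=1_000_000)

@ti.kernel
def substep():
    for i in x:
        x[i] = ti.sin(x[i]) + 1.0

for _ in range(50):
    substep()

# Print the per-kernel timing table (count mode, as in the output above).
ti.profiler.print_kernel_profiler_info('count')
```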

@zxlbig zxlbig merged commit 4eea1ec into taichi-dev:master Apr 13, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this pull request May 13, 2023