
[perf] Fix Taichi CPU backend compile parameter to pair performance with Numba. #7731

Merged
merged 7 commits into taichi-dev:master on Apr 13, 2023

Conversation

zxlbig
Contributor

@zxlbig zxlbig commented Apr 4, 2023

Issue: #7442

Brief Summary

In that issue, Numba is an order of magnitude faster than Taichi because Taichi's CPU code is not auto-vectorized.
The root cause is that the fast_flag was not passed correctly.

To fix this, fast_flag is now passed when the CPU codegen is initialized, and Numba and Taichi now show comparable performance.
Here's the perf comparison:
numba: 13052.542478 MFlops
taichi (master): 6544.274409 MFlops
taichi (this PR): 12778.240179 MFlops
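For context, here is a minimal, runnable sketch of the kind of benchmark being compared (the exact kernel lives in issue #7442; the Leibniz-series kernel and the iteration count below are illustrative assumptions, not the original script):

```python
import taichi as ti

# fast_math defaults to True; this PR makes the CPU codegen actually receive the
# flag, which allows LLVM to relax FP semantics and auto-vectorize the reduction.
ti.init(arch=ti.cpu, fast_math=True)

@ti.kernel
def calc_pi(n: ti.i32) -> ti.f32:
    total = 0.0
    for i in range(n):  # top-level range-for is parallelized; += is reduced atomically
        sign = ti.select(i % 2 == 0, 1.0, -1.0)
        total += sign / (2.0 * i + 1.0)
    return 4.0 * total

print(calc_pi(10_000_000))  # ≈ 3.14
```

Without fast-math, strict IEEE semantics forbid reassociating the floating-point sum, which is what blocks vectorization of a reduction loop like this one.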

@netlify

netlify bot commented Apr 4, 2023

Deploy Preview for docsite-preview ready!

Name Link
🔨 Latest commit 4313338
🔍 Latest deploy log https://app.netlify.com/sites/docsite-preview/deploys/642fd48dacd1ec00080d1a60
😎 Deploy Preview https://deploy-preview-7731--docsite-preview.netlify.app

@CLAassistant

CLAassistant commented Apr 4, 2023

CLA assistant check
All committers have signed the CLA.

@zxlbig zxlbig marked this pull request as draft April 4, 2023 04:57
@turbo0628 turbo0628 changed the title [Perf] CPU Optimize Taichi to achieve the same performance as Numba (single-threaded) for computing pi. [perf] Fix Taichi CPU backend compile parameter to pair performance with Numba. Apr 4, 2023
@turbo0628 turbo0628 requested a review from jim19930609 April 4, 2023 05:18
Contributor

@jim19930609 jim19930609 left a comment


Per our offline discussion, one of the major concerns is that these FastMath changes might cause performance regressions on CPU backends, which our perf bot does not actively monitor yet.

Can we add some descriptions or profiling results (with demos under python/taichi/examples) to demonstrate the impact on CPU performance?

@feisuzhu
Contributor

feisuzhu commented Apr 6, 2023

/benchmark

@turbo0628
Member

@feisuzhu This PR changes CPU performance, so the effect might not be observable with /benchmark.

Member

@turbo0628 turbo0628 left a comment


LGTM!

@zxlbig
Contributor Author

zxlbig commented Apr 10, 2023

Per our offline discussion, one of the major concerns is that these FastMath changes might cause performance regressions on CPU backends, which our perf bot does not actively monitor yet.

Can we add some descriptions or profiling results (with demos under python/taichi/examples) to demonstrate the impact on CPU performance?

Based on your test, it seems that the performance of mpm88 is not affected by this PR. That is because I only added some fast_math flags during codegen, which for now only fixes the case where the pi kernel fails to auto-vectorize.

I hope this helps clarify the situation! Let me know if you have any other questions.
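On the regression concern above: fast_math is an existing ti.init option (enabled by default), so a user who does see CPU accuracy or performance regressions from the relaxed floating-point semantics can opt out per program. A minimal sketch:

```python
import taichi as ti

# Opt out of fast-math if strict IEEE floating-point behavior is required;
# this also gives up the relaxed-FP transforms that enable the vectorization above.
ti.init(arch=ti.cpu, fast_math=False)
```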

@zxlbig zxlbig marked this pull request as ready for review April 10, 2023 00:48
@zxlbig
Contributor Author

zxlbig commented Apr 10, 2023


Here's this PR's performance:

========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 83.67%   0.046 s     50x |    0.884     0.915     0.982 ms] substep_c74_0_kernel_3_range_for
[ 10.62%   0.006 s     50x |    0.102     0.116     0.200 ms] substep_c74_0_kernel_2_range_for
[  3.21%   0.002 s     50x |    0.022     0.035     0.100 ms] substep_c74_0_kernel_1_range_for
[  2.30%   0.001 s     50x |    0.013     0.025     0.088 ms] substep_c74_0_kernel_0_range_for
[  0.20%   0.000 s      1x |    0.108     0.108     0.108 ms] init_c76_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:   0.055 s   number of results: 5
=========================================================================

Here's the performance on master:

=========================================================================
Kernel Profiler(count, default) @ X64 
=========================================================================
[      %     total   count |      min       avg       max   ] Kernel name
-------------------------------------------------------------------------
[ 83.88%   0.045 s     50x |    0.884     0.907     0.958 ms] substep_c74_0_kernel_3_range_for
[ 10.39%   0.006 s     50x |    0.101     0.112     0.176 ms] substep_c74_0_kernel_2_range_for
[  3.17%   0.002 s     50x |    0.022     0.034     0.083 ms] substep_c74_0_kernel_1_range_for
[  2.42%   0.001 s     50x |    0.013     0.026     0.069 ms] substep_c74_0_kernel_0_range_for
[  0.14%   0.000 s      1x |    0.076     0.076     0.076 ms] init_c76_0_kernel_0_range_for
-------------------------------------------------------------------------
[100.00%] Total execution time:   0.054 s   number of results: 5
=========================================================================
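For reference, here is a minimal sketch of how tables like the ones above can be produced with Taichi's built-in kernel profiler. The field and dummy kernel are made up for illustration (the real numbers come from running the mpm88 example), and the printing call assumes the ti.profiler API of current Taichi releases:

```python
import taichi as ti

# Enable the kernel profiler so per-kernel min/avg/max timings are recorded.
ti.init(arch=ti.cpu, kernel_profiler=True)

x = ti.field(ti.f32, shape=1_000_000)

@ti.kernel
def substep():
    for i in x:
        x[i] = ti.sin(x[i]) + 1.0

for _ in range(50):
    substep()

# Print the per-kernel timing table (count mode, as in the output above).
ti.profiler.print_kernel_profiler_info('count')
```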

@zxlbig zxlbig merged commit 4eea1ec into taichi-dev:master Apr 13, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this pull request May 13, 2023