[opt] Add pass eliminate_immutable_local_vars #6926

Merged on Dec 21, 2022 (6 commits)

@strongoier strongoier commented Dec 19, 2022

Issue: #6933

Brief Summary

There are many redundant copies of local vars in the initial IR:

  <[Tensor (3, 3) f32]> $128 = [$103, $106, $109, $112, $115, $118, $121, $124, $127]
  $129 : local store [$100 <- $128]
  <[Tensor (3, 3) f32]> $130 = alloca
  $131 = local load [$100]
  $132 : local store [$130 <- $131]
  <[Tensor (3, 3) f32]> $133 = alloca
  $134 = local load [$130]
  $135 : local store [$133 <- $134]
  <[Tensor (3, 3) f32]> $136 = alloca
  $137 = local load [$133]
  $138 : local store [$136 <- $137]
// In fact, `$128` can be used wherever `$136` is loaded.

These copies come from many places; one of the main sources is the pass-by-value convention of ti.func. As a result, the number of instructions is unnecessarily large, which significantly slows down compilation.
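
To illustrate how by-value argument passing could produce a chain like the one above, here is a toy Python model. It is only a sketch, not Taichi's actual lowering: the function `emit_by_value_copies` and the textual IR format are made up for illustration, and it assumes each nested call level copies its argument into a fresh local.

```python
from itertools import count


def emit_by_value_copies(arg_var, depth, fresh):
    """Emit toy IR for `depth` nested by-value calls that forward one argument.

    Hypothetical helper: every call level allocates a new local, loads the
    caller's variable, and stores the loaded value into the new local.
    """
    lines = []
    src = arg_var
    for _ in range(depth):
        dst = f"${next(fresh)}"      # the callee's private copy
        loaded = f"${next(fresh)}"   # value read from the caller's variable
        store_id = next(fresh)       # id of the store statement itself
        lines.append(f"{dst} = alloca")
        lines.append(f"{loaded} = local load [{src}]")
        lines.append(f"${store_id} : local store [{dst} <- {loaded}]")
        src = dst
    return lines, src


# Three nested levels starting from $100 give nine statements,
# and the innermost copy ends up in a fourth variable.
lines, final_var = emit_by_value_copies("$100", 3, count(130))
for line in lines:
    print(line)
print("innermost copy lives in", final_var)
```

Under this model every extra level of nesting adds three statements per argument, none of which changes the value being passed.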

My solution is to identify and eliminate such redundant instructions up front, so that all later passes take a much smaller number of instructions as input. These redundant local vars are essentially immutable: they are assigned only once and are only loaded after that assignment. This PR adds an optimization pass, eliminate_immutable_local_vars, and runs it as the first pass.
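
The idea can be sketched on a toy straight-line IR in Python. This is not the code added in this PR (the real pass works on Taichi's IR); the `Stmt` class, its fields, and the op names are assumptions made for illustration, and control flow is ignored entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stmt:
    name: str                 # SSA-style name, e.g. "$130"
    op: str                   # "value", "alloca", "load", "store" or "use"
    a: Optional[str] = None   # load: variable read; store: variable written; use: operand
    b: Optional[str] = None   # store: value being stored


def eliminate_immutable_local_vars(stmts):
    """Toy version: drop locals that are stored exactly once before any load."""
    allocas = {s.name for s in stmts if s.op == "alloca"}
    store_count, stored_value, loaded_too_early = {}, {}, set()
    for s in stmts:
        if s.op == "store" and s.a in allocas:
            store_count[s.a] = store_count.get(s.a, 0) + 1
            stored_value[s.a] = s.b
        elif s.op == "load" and s.a in allocas and s.a not in store_count:
            loaded_too_early.add(s.a)
    immutable = {v for v in allocas
                 if store_count.get(v, 0) == 1 and v not in loaded_too_early}

    replacement = {}  # eliminated load -> value it stands for

    def resolve(x):
        while x in replacement:
            x = replacement[x]
        return x

    out = []
    for s in stmts:
        if s.op in ("alloca", "store") and (s.name in immutable or s.a in immutable):
            continue  # the variable and its single store disappear
        if s.op == "load" and s.a in immutable:
            replacement[s.name] = resolve(stored_value[s.a])
            continue  # later users read the stored value directly
        out.append(Stmt(s.name, s.op,
                        resolve(s.a) if s.a else None,
                        resolve(s.b) if s.b else None))
    return out


# The chain from the summary, plus a final load and a user of that load.
chain = [
    Stmt("$128", "value"),
    Stmt("$100", "alloca"), Stmt("$129", "store", "$100", "$128"),
    Stmt("$130", "alloca"), Stmt("$131", "load", "$100"), Stmt("$132", "store", "$130", "$131"),
    Stmt("$133", "alloca"), Stmt("$134", "load", "$130"), Stmt("$135", "store", "$133", "$134"),
    Stmt("$136", "alloca"), Stmt("$137", "load", "$133"), Stmt("$138", "store", "$136", "$137"),
    Stmt("$139", "load", "$136"), Stmt("$140", "use", "$139"),
]
for s in eliminate_immutable_local_vars(chain):
    print(s)
```

In this sketch the whole chain collapses to the original value and its user, which now reads `$128` directly. A variable that is stored more than once is left untouched.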

(P.S. This PR also fixes the type checking of MatrixExpression and LocalLoadStmt so that the pass works properly.)

Let's study the effects in two cases: #6933 and voxel-rt2 (https://github.com/taichi-dev/voxel-rt2/blob/main/example7.py).

First, let's compare the number of instructions after the scalarization pass (which runs immediately after the first pass).

| Kernel | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| test (#6933) | 45859 | 26452 | 42% |
| spatial_GRIS (voxel-rt2) | 48519 | 17713 | 63% |

Then, let's compare the total time of compile().

| Case | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| #6933 | 20.622s | 8.550s | 59% |
| voxel-rt2 | 27.676s | 9.495s | 66% |
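
For reference, the "Rate of decrease" columns are consistent with computing (before - after) / before and rounding to the nearest percent. A small Python check (the helper name `rate_of_decrease` is just for illustration):

```python
def rate_of_decrease(before, after):
    """Relative reduction, as a fraction of the 'before' value."""
    return (before - after) / before


# (before, after) pairs copied from the two tables above.
measurements = {
    "test (#6933), instructions": (45859, 26452),
    "spatial_GRIS (voxel-rt2), instructions": (48519, 17713),
    "#6933, compile() seconds": (20.622, 8.550),
    "voxel-rt2, compile() seconds": (27.676, 9.495),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {rate_of_decrease(before, after):.0%} decrease")
```

Put differently, the compile() times correspond to speedups of roughly 2.4x for #6933 and 2.9x for voxel-rt2.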

@strongoier strongoier added the full-ci Run complete set of CI tests label Dec 19, 2022
@bobcao3 bobcao3 left a comment

LGTM

@strongoier strongoier merged commit 19fce81 into taichi-dev:master Dec 21, 2022
@strongoier strongoier deleted the add-eli-pass branch December 21, 2022 03:14
quadpixels pushed a commit to quadpixels/taichi that referenced this pull request May 13, 2023
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>