Support JIT Offline Cache for Taichi #4401

Open
14 of 16 tasks
PGZXB opened this issue Feb 28, 2022 · 19 comments
@PGZXB
Contributor

PGZXB commented Feb 28, 2022

Solution

Workflow (on llvm backends)

... → Python code → Taichi AST → Serialize the AST to a string (the key) → Hash the key → Look up the offline cache file by the hashed key:

  • Found: Load cache data from disk → Create kernel from cache → Run kernel
  • Not found: (continue compiling) ... → Get llvm::Module + offloaded_task_list → Cache them → Run kernel → ... → (before exiting) Dump cache data to disk
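
The workflow above can be sketched in a few lines of plain Python. This is only an illustration of the lookup logic, not Taichi's C++ implementation: the `OfflineCache` class, the JSON file layout, and `fake_compile` are all hypothetical stand-ins, and the "compiled kernel" is just a dictionary.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class OfflineCache:
    """Toy disk cache keyed by a hash of the kernel's serialized AST."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.pending = {}  # compiled in this run, not yet written to disk

    @staticmethod
    def make_key(ast_text):
        # Serialize the AST to a string, then hash it.
        return hashlib.sha256(ast_text.encode("utf-8")).hexdigest()

    def load_or_compile(self, ast_text, compile_fn):
        key = self.make_key(ast_text)
        if key in self.pending:
            return self.pending[key], "memory"
        path = self.cache_dir / f"{key}.json"
        if path.exists():
            # Found: load cached data from disk.
            return json.loads(path.read_text()), "disk"
        # Not found: compile, remember the result, dump it later.
        compiled = compile_fn(ast_text)
        self.pending[key] = compiled
        return compiled, "compiled"

    def dump(self):
        # Called before exiting: write everything compiled in this run.
        for key, compiled in self.pending.items():
            (self.cache_dir / f"{key}.json").write_text(json.dumps(compiled))
        self.pending.clear()


def fake_compile(ast_text):
    # Hypothetical stand-in for producing llvm::Module + offloaded_task_list.
    return {"module": ast_text.upper(), "tasks": ["task_0"]}


def demo():
    with tempfile.TemporaryDirectory() as d:
        first = OfflineCache(d)
        _, source_1 = first.load_or_compile("kernel f: print", fake_compile)
        first.dump()
        second = OfflineCache(d)  # simulates a later process
        data, source_2 = second.load_or_compile("kernel f: print", fake_compile)
        return source_1, source_2, data
```

In `demo()`, the first cache object has to compile, while the second one (standing in for a later run of the same program) finds the file on disk and skips compilation.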

Todo & Memo

  • Support for cpu
  • Support for cuda
  • Add ASTKeyGenerator to generate the key of a Taichi AST instead of using IRPrinter; it will hold more information than IRPrinter.
  • Fix bugs where changes to some global variables do not trigger recompilation (maybe let the results of IRPrinter and Expression::serialize hold more information).
  • Fix IRPrinter to generate the offline cache key more correctly
  • Add tests
  • Account for changes to the compile config
  • Track unused cache files and delete them. The current implementation causes a "cache-file leak"
  • Implement the binary ticache file format
  • Run with multiple threads/processes
  • Support on vulkan
  • Support on opengl
  • Support on metal
  • Refactor (see Support JIT Offline Cache for Taichi #4401 (comment) and Refactor kernel compilation #7002)

- [ ] Support on dx11
- [ ] Support on dx12
- [ ] Handle hash collisions
- [ ] Allow setting/unsetting offline_cache per kernel (optional)
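
One possible way to address the "cache-file leak" item is a size-bounded cleanup pass that deletes the least recently used files first. The sketch below is an assumption about how such a policy could look, not the policy Taichi uses: the `clean_cache` function, the `.tic` extension, and the use of file modification time as "last used" are all hypothetical (a real loader would have to touch the file on every cache hit for this to work).

```python
import os
import tempfile
from pathlib import Path


def clean_cache(cache_dir, max_bytes):
    """Delete least-recently-used cache files until the total size fits."""
    files = sorted(Path(cache_dir).glob("*.tic"), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:  # oldest first
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path.name)
    return removed


def demo():
    with tempfile.TemporaryDirectory() as d:
        for i, name in enumerate(["a.tic", "b.tic", "c.tic"]):
            path = Path(d) / name
            path.write_bytes(b"x" * 100)
            # Give each file a distinct, increasing "last used" timestamp.
            os.utime(path, (1000 + i, 1000 + i))
        removed = clean_cache(d, max_bytes=150)
        remaining = sorted(p.name for p in Path(d).iterdir())
        return removed, remaining
```

With three 100-byte files and a 150-byte budget, the two oldest files are removed and only the most recently used one remains.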

Usage

The feature is enabled by default, so there is no need to set offline_cache=True explicitly.

import taichi as ti

# ti.init(arch=ti.cpu, offline_cache=True)
ti.init(arch=ti.cpu)

@ti.kernel
def f():
    print("Hello ticache")

f()

Supported backends

See #4401 (comment)

For more, see Offline Cache

Potential Bugs

@PGZXB PGZXB added the discussion (Welcome discussion!) and llvm (LLVM backend) labels Feb 28, 2022
@PGZXB PGZXB self-assigned this Feb 28, 2022
@bobcao3
Collaborator

bobcao3 commented Mar 1, 2022

Requesting that this be extended to all backends, considering the huge number of users on Macs or non-CUDA laptops.

@PGZXB
Contributor Author

PGZXB commented Mar 1, 2022

> Requesting that this be extended to all backends, considering the huge number of users on Macs or non-CUDA laptops.

Yes, implementing the feature for all backends is my goal.

@PGZXB
Contributor Author

PGZXB commented Mar 10, 2022

#4401 (comment)

@bobcao3
Collaborator

bobcao3 commented Mar 14, 2022

I argue strongly against this solution. We have profiles showing that LLVM codegen takes about 30% of the entire JIT codegen time, so it would be much wiser to spend time figuring out AST->CHI-IR caching first.

@bobcao3
Collaborator

bobcao3 commented Mar 14, 2022

Two-stage caching gives a major JIT time boost to all backends. I'd argue it is also a lot cleaner to implement than one-stage caching for each individual backend, which will cause maintenance problems down the line.
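
To make the two-stage idea concrete, here is a minimal sketch in plain Python. It assumes a first stage that caches the AST -> IR step once for all backends and a second stage that caches IR -> backend code per backend. The `TwoStageCache` class, the lambda "compilers", and the in-memory dictionaries (standing in for cache files) are all hypothetical and only illustrate the sharing behavior.

```python
import hashlib


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TwoStageCache:
    def __init__(self, frontend, codegens):
        self.frontend = frontend      # AST text -> IR text
        self.codegens = codegens      # backend name -> (IR text -> code)
        self.ir_cache = {}            # stage 1: shared by all backends
        self.code_cache = {}          # stage 2: one entry per (IR, backend)
        self.stats = {"frontend": 0, "codegen": 0}

    def get(self, ast_text, backend):
        ast_key = digest(ast_text)
        if ast_key not in self.ir_cache:
            self.stats["frontend"] += 1
            self.ir_cache[ast_key] = self.frontend(ast_text)
        ir_text = self.ir_cache[ast_key]
        code_key = (digest(ir_text), backend)
        if code_key not in self.code_cache:
            self.stats["codegen"] += 1
            self.code_cache[code_key] = self.codegens[backend](ir_text)
        return self.code_cache[code_key]


def demo():
    cache = TwoStageCache(
        frontend=lambda ast: f"ir({ast})",
        codegens={
            "llvm": lambda ir: f"llvm<{ir}>",
            "spirv": lambda ir: f"spirv<{ir}>",
        },
    )
    cache.get("f", "llvm")
    cache.get("f", "spirv")        # reuses the stage-1 IR
    code = cache.get("f", "llvm")  # full hit, nothing is recompiled
    return code, cache.stats
```

In `demo()`, the frontend runs once even though two backends request the kernel, which is the benefit a shared first stage would bring to every backend.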

@PGZXB
Contributor Author

PGZXB commented Mar 14, 2022

> Two-stage caching gives a major JIT time boost to all backends. I'd argue it is also a lot cleaner to implement than one-stage caching for each individual backend, which will cause maintenance problems down the line.
>
> I argue strongly against this solution. We have profiles showing that LLVM codegen takes about 30% of the entire JIT codegen time, so it would be much wiser to spend time figuring out AST->CHI-IR caching first.

Step by step. This may be a temporary solution. We don't have CHI IR serialization yet. Once CHI IR serialization is implemented, a two-level cache may be better, especially with multiple backends...

P.S. I think CHI IR serialization is very important for standardizing CHI IR. That requires a feasible, efficient, standard solution (add more adjectives to stress the importance), like LLVM IR, IL, Java bytecode, or Intel asm, which is not easy...

@k-ye
Member

k-ye commented Mar 14, 2022

Is there a middle ground we can find? E.g. how easy is it for us to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:

  1. At first, they can only benefit from the caching behavior for CUDA/CPU backends
  2. Then after release X, they find out the caching is working for all the backends automatically.

In addition, IMHO the complexity still comes from the cache key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.

@PGZXB
Contributor Author

PGZXB commented Mar 14, 2022

> Is there a middle ground we can find? E.g. how easy is it for us to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:
>
>   1. At first, they can only benefit from the caching behavior for CUDA/CPU backends
>   2. Then after release X, they find out the caching is working for all the backends automatically.

The (new) implementation of the offline cache is transparent. All the logic is on the C++ side. The frontend only sees the offline_cache: bool and offline_cache_file_path: str options. Once we have serialization and deserialization of CHI IR, caching CHI IR will be simple to implement. Maybe doing it after standardizing CHI IR is better. After release X, users can still use it by simply setting options, without any migration cost. A multilevel cache may also be an option (even a better one), since running the backend language directly is fastest.

@PGZXB
Contributor Author

PGZXB commented Mar 14, 2022

> In addition, IMHO the complexity still comes from the cache key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.

Can't agree more. Because Taichi's kernels depend on global variables/states, generating a key that uniquely identifies a kernel is difficult, and it is the key to implementing kernel caching. At present, before we have a standardized, (de)serializable CHI IR, dumping, loading and running the backend language is simpler than doing the same for CHI IR, because the backend languages have mature/standard solutions.
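
The dependence on global state can be illustrated with a small sketch: if the key is built only from the kernel's source, changing a captured global silently reuses a stale cache entry, so the key has to cover everything that can change the compiled result. The `cache_key` helper, the `FACTOR` global, and the config fields below are hypothetical and chosen only for illustration; they are not Taichi's actual key format.

```python
import hashlib
import json


def cache_key(kernel_source, captured_globals, compile_config):
    """Hash everything that can change the compiled result."""
    payload = json.dumps(
        {
            "source": kernel_source,
            "globals": captured_globals,
            "config": compile_config,
        },
        sort_keys=True,  # make the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


SOURCE = "def scale(x): return x * FACTOR"
CONFIG = {"arch": "cpu", "opt_level": 1}  # hypothetical config fields

key_a = cache_key(SOURCE, {"FACTOR": 2}, CONFIG)
key_b = cache_key(SOURCE, {"FACTOR": 3}, CONFIG)  # global changed
key_c = cache_key(SOURCE, {"FACTOR": 2}, {"opt_level": 1, "arch": "cpu"})  # same config, reordered
key_d = cache_key(SOURCE, {"FACTOR": 2}, {"arch": "cuda", "opt_level": 1})  # config changed
```

Here `key_a` and `key_c` are equal, while `key_b` and `key_d` differ from `key_a`, so a changed global or a changed compile config leads to a cache miss instead of a stale hit.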

P.S. The overhead of generating the key is something we should consider. Python -> Taichi AST -> CHI IR -> Backend lang. From left to right:

  • overhead of generating the key ↑,
  • overhead of loading & running the offline cache file ↓,
  • difficulty of generating a cache key that uniquely identifies a kernel ↓

ailzhang pushed a commit that referenced this issue Sep 26, 2022
Issue: #4401 
* Fix a potential bug in metal AOT
* Prepare for implementing offline cache on metal

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@PGZXB
Contributor Author

PGZXB commented Oct 8, 2022

Supported or not

| Backend | Supported or not | Overhead (running Cornell Box) |
| --- | --- | --- |
| CPU | | 393.25ms |
| CUDA | | 882.426ms |
| Vulkan | | 218.030ms |
| OpenGL | | |
| Metal | | |
| AMDGPU | | |
| Microsoft DirectX 11 | | |
| Microsoft DirectX 12 | | N/A |

P.S.

  1. The "overhead" is the time spent on loading cached compiled data and converting it to a callable object.
  2. Testing environment (used for the CPU, CUDA, OpenGL and Vulkan measurements):
    • OS: Windows 11, CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz 1.61 GHz, RAM: 16GB
  3. ⏩: Work in progress
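
As a rough illustration of what "overhead" means in note 1, the sketch below times only the two steps named there: loading cached data from disk and converting it into a callable. It is not how the numbers in the table were obtained; `measure_load_overhead`, the JSON cache file, and the toy `scale` kernel are assumptions made for the example.

```python
import json
import tempfile
import time
from pathlib import Path


def measure_load_overhead(cache_file, make_callable):
    """Return (callable, milliseconds spent loading and converting)."""
    start = time.perf_counter()
    data = json.loads(Path(cache_file).read_text())  # load cached data
    kernel = make_callable(data)                      # convert to a callable
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return kernel, elapsed_ms


def demo():
    with tempfile.TemporaryDirectory() as d:
        cache_file = Path(d) / "kernel.json"
        cache_file.write_text(json.dumps({"scale": 3}))
        kernel, elapsed_ms = measure_load_overhead(
            cache_file, lambda data: (lambda x: x * data["scale"])
        )
        return kernel(2), elapsed_ms
```

The measured time excludes running the kernel itself, which matches the definition of overhead given above.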

PGZXB added a commit that referenced this issue Oct 10, 2022
Issue: #4401

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PGZXB added a commit that referenced this issue Oct 13, 2022
Issue: #6263, #4401

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PGZXB added a commit that referenced this issue Oct 14, 2022
Issue: #6263, #4401

Co-authored-by: Yi Xu <xy_xuyi@foxmail.com>
@PGZXB PGZXB removed the llvm (LLVM backend) and vulkan (Vulkan backend) labels Nov 5, 2022
PGZXB added a commit that referenced this issue Nov 7, 2022
Issue: #4401

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PGZXB added a commit that referenced this issue Dec 19, 2022
Issue: #4401, #6614

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PGZXB added a commit to PGZXB/taichi that referenced this issue Dec 19, 2022
Issue: taichi-dev#4401, taichi-dev#6614

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PENGUINLIONG pushed a commit that referenced this issue Feb 23, 2023
Issue: #7002, #4401

### Brief Summary
This PR:
1. Introduced `KernelCompilationManager` to unify the implementation of the
Offline Cache;
2. Used `KernelCompilationManager` to re-implement JIT and the Offline Cache on gfx
backends (vulkan, metal, dx11, opengl);
3. Removed the `gfx::CacheManager`.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023

Issue: taichi-dev#4401

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#4401, taichi-dev#6614

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#7002, taichi-dev#4401
