Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Infra] Add "ti.profiler" (PythonProfiler) for intuitive Python-scope profiling #1493

Merged
merged 18 commits into from
Jul 31, 2020

Conversation

archibate
Copy link
Collaborator

Related issue = #

[Click here for the format server]


See misc/mpm99_timed.py for usage example.

@archibate archibate added enhancement Make existing things or codebases better large changeset The PR contains a large changeset and reviewing may take times python Python engineering related labels Jul 14, 2020
@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #1493 into master will decrease coverage by 1.01%.
The diff coverage is 34.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1493      +/-   ##
==========================================
- Coverage   67.53%   66.51%   -1.02%     
==========================================
  Files          40       41       +1     
  Lines        5630     5794     +164     
  Branches      982     1001      +19     
==========================================
+ Hits         3802     3854      +52     
- Misses       1660     1766     +106     
- Partials      168      174       +6     
Impacted Files Coverage Δ
python/taichi/misc/timing.py 33.65% <33.65%> (ø)
python/taichi/lang/snode.py 93.69% <100.00%> (ø)
python/taichi/misc/__init__.py 100.00% <100.00%> (ø)
python/taichi/misc/gui.py 23.12% <0.00%> (-2.24%) ⬇️
python/taichi/lang/impl.py 87.82% <0.00%> (-1.59%) ⬇️
python/taichi/lang/ops.py 92.51% <0.00%> (-0.82%) ⬇️
python/taichi/lang/__init__.py 80.20% <0.00%> (+<0.01%) ⬆️
python/taichi/lang/kernel.py 82.82% <0.00%> (+0.04%) ⬆️
python/taichi/lang/expr.py 89.08% <0.00%> (+0.12%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df57ba8...c2bbec5. Read the comment docs.

docs/profiler.rst Outdated Show resolved Hide resolved
docs/profiler.rst Outdated Show resolved Hide resolved
ti.profiler.print()


2. ``ti.profiler()``, this one makes our code cleaner:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. ``ti.profiler()``, this one makes our code cleaner:
2. Besides, ``ti.profiler()``can make our code cleaner:

IMO

10.10ms | 10.10ms | 10.10ms | 1x | 10.10ms | A


3. ``@ti.profiler.timed``, this one is very intuitive when profiling kernels:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explain this decorator more? It is still a little confused maybe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Will do this once later this day.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version done! Please ff2 leave a review.

compute()
ti.print_profile_info()
Set ``warmup=5`` for example, will **discard** the first 5 record entries.
I.e. discard the kernel compile time and possible TLB and cache misses on start up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
I.e. discard the kernel compile time and possible TLB and cache misses on start up.
I.e. discard the kernel compile time and possible TLB(Translation Lookaside Buffer) and cache misses on start up.

IMO

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance caring people should be OK with this term, and btw we use a space between brackets and chars:

Suggested change
I.e. discard the kernel compile time and possible TLB and cache misses on start up.
I.e. discard the kernel compile time and possible TLB (Translation Lookaside Buffer) and cache misses on start up.

docs/profiler.rst Outdated Show resolved Hide resolved
@Rullec
Copy link
Contributor

Rullec commented Jul 15, 2020

LGTMig + few nits

Co-authored-by: Xudong Feng <xudongfeng1996@gmail.com>
@archibate archibate requested a review from Rullec July 15, 2020 08:44
Copy link
Member

@yuanming-hu yuanming-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think it would be very helpful to enable profiling in Python. However, maybe it makes more sense to simply do a pybind11 binding of the C++ ScopedProfiler.

docs/profiler.rst Outdated Show resolved Hide resolved
docs/profiler.rst Show resolved Hide resolved
return len(self.rec)


class PythonProfiler:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should reuse the C++ ScopedProfiler here? The code in this file is mostly a Python rewrite of that C++ class.

Also note that if a timed Python function calls any other timed C++ function, it would be nice to have a hierarchy containing both Python and C++ function calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hierarchy

Oh? Is that? Not sure about how ScopedProfiler works, I'll dig deeper once I've time.
My concern is that will invocation into pybind11 time cost make its result inaccurate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My another concern is: ScopedProfiler is default on and will record profiling data in C++ side. Will this confuse users who only want to test a simple Python side kernel?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that will invocation into pybind11 time cost make its result inaccurate?

That should be fine. If they really want accuracy directly using time.time() would be a better choice.

My another concern is: ScopedProfiler is default on and will record profiling data in C++ side. Will this confuse users who only want to test a simple Python side kernel?

That's a good point. Maybe we can make ScopeProfiler off by default and add an option to turn it on?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, like ti.init(cpp_scoped_profiler=True), but maybe iapr.

Copy link
Member

@yuanming-hu yuanming-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates - I still suggest we reuse the C++ ScopedProfiler via Pybind11.

@archibate
Copy link
Collaborator Author

archibate commented Jul 23, 2020

Thanks for the updates - I still suggest we reuse the C++ ScopedProfiler via Pybind11.

I see, but the existing base of ScoepdProfiler:

[Profiler thread 140527766734656]
      7.832  s clone_runtime_module          [2 x   3.916  s]
          7.745  s 98.89%  compile_runtime_bitcode [2 x   3.873  s]                          
          0.031  s  0.39%  module_from_bitcode_file [2 x  15.250 ms]                         
          0.034  s  0.43%  clone module          [2 x  17.017 ms]                            
          0.022  s  0.28%  link_module_with_cuda_libdevice [1 x  22.153 ms]                  
             17.608 ms 79.48%  module_from_bitcode_file [1 x  17.608 ms]                     
              4.545 ms 20.52%  [unaccounted]
      2.223  s run                           [2 x   1.111  s]
          0.000  s  0.01%  generate_types        [68 x   1.893 us]                           
          0.000  s  0.02%  generate_child_accessors [2 x 197.530 us]                         
             35.763 us  9.05%  generate_refine_coordinates [2 x  17.881 us]                  
            357.151 us 90.40%  generate_child_accessors [22 x  16.234 us]                    
                189.304 us 53.00%  generate_refine_coordinates [22 x   8.605 us]             
                 92.983 us 26.03%  generate_child_accessors [44 x   2.113 us]                
                 74.863 us 20.96%  [unaccounted]
              2.146 us  0.54%  [unaccounted]
          0.028  s  1.27%  clone_struct_module   [2 x  14.061 ms]
          0.007  s  0.30%  eliminate_unused_functions [2 x   3.305 ms]                       
          0.161  s  7.24%  global_optimize_module_cpu [1 x 160.948 ms]                       
              3.071 ms  1.91%  llvm_function_pass    [1 x   3.071 ms]                        
            155.340 ms 96.52%  llvm_module_pass      [1 x 155.340 ms]                        
              2.537 ms  1.58%  [unaccounted]
          0.519  s 23.35%  compile_module_to_ptx [1 x 518.989 ms]
              2.424 ms  0.47%  llvm_function_pass    [1 x   2.424 ms]                        
            458.300 ms 88.31%  llvm_module_pass      [1 x 458.300 ms]                        
             58.265 ms 11.23%  [unaccounted]
          1.508  s 67.83%  [unaccounted]
      1.398  s compile                       [9 x 155.305 ms]
          0.771  s 55.16%  compile_to_offloads   [9 x  85.669 ms]                            
             74.019 ms  9.60%  lower                 [9 x   8.224 ms]                        
              8.493 ms  1.10%  typecheck             [36 x 235.928 us]                       
              6.049 ms  0.78%  verify                [117 x  51.702 us]                      
            647.648 ms 84.00%  full_simplify         [27 x  23.987 ms]                       
                 16.458 ms  2.54%  extract_constant      [45 x 365.734 us]                   
                  0.637 ms  0.10%  unreachable_code_elimination [45 x  14.151 us]            
                  0.730 ms  0.11%  binary_op_simplify    [45 x  16.223 us]                   
                 93.519 ms 14.44%  constant_fold         [45 x   2.078 ms]                   
                     81.938 ms 87.62%  compile               [2 x  40.969 ms]                
                          0.066 ms  0.08%  compile_to_offloads   [2 x  33.021 us]            
                              3.815 us  5.78%  lower                 [2 x   1.907 us]        
                              4.053 us  6.14%  typecheck             [2 x   2.027 us]        
                              6.199 us  9.39%  verify                [4 x   1.550 us]        
                             46.968 us 71.12%  offload               [2 x  23.484 us]        
                                  3.815 us  8.12%  typecheck             [4 x 953.674 ns]    
                                 43.154 us 91.88%  [unaccounted]
                              5.007 us  7.58%  [unaccounted]                                 
                         81.866 ms 99.91%  compile               [2 x  40.933 ms]
                             81.863 ms 100.00%  cuda codegen         [2 x  40.932 ms]        
                                 44.087 ms 53.85%  clone_struct_module   [2 x  22.043 ms]    
                                  0.117 ms  0.14%  emit_to_module        [2 x  58.413 us]    
                                 10.965 ms 13.39%  eliminate_unused_functions [2 x   5.482 ms]                                                                                            
                                 24.785 ms 30.28%  compile_module_to_ptx [2 x  12.392 ms]    
                                      0.398 ms  1.61%  llvm_function_pass    [2 x 199.080 us]
                                     11.201 ms 45.19%  llvm_module_pass      [2 x   5.600 ms]
                                     13.186 ms 53.20%  [unaccounted]
                                  1.910 ms  2.33%  [unaccounted]
                     11.580 ms 12.38%  [unaccounted]
                 11.993 ms  1.85%  die                   [135 x  88.835 us]
                  6.206 ms  0.96%  alg_simp              [45 x 137.912 us]
                111.358 ms 17.19%  simplify              [45 x   2.475 ms]
                     75.719 ms 68.00%  typecheck             [282 x 268.506 us]
                     35.639 ms 32.00%  [unaccounted]
                250.476 ms 38.67%  whole_kernel_cse      [33 x   7.590 ms]
                156.155 ms 24.11%  cfg_optimization      [33 x   4.732 ms]
                    113.316 ms 72.57%  store_to_load_forwarding [38 x   2.982 ms]
                         38.301 ms 33.80%  reaching_definition_analysis [38 x   1.008 ms]
                         75.015 ms 66.20%  [unaccounted]
                     38.137 ms 24.42%  dead_store_elimination [38 x   1.004 ms]
                         29.400 ms 77.09%  live_variable_analysis [38 x 773.693 us]
                          8.737 ms 22.91%  [unaccounted]
                      2.646 ms  1.69%  die                   [33 x  80.181 us]
                      2.056 ms  1.32%  [unaccounted]
              0.125 ms  0.02%  flag_access           [27 x   4.627 us]
             13.170 ms  1.71%  offload               [9 x   1.463 ms]
                  1.913 ms 14.52%  typecheck             [18 x 106.255 us]
                 11.257 ms 85.48%  [unaccounted]
              8.824 ms  1.14%  cfg_optimization      [9 x 980.457 us]
                  3.574 ms 40.50%  store_to_load_forwarding [9 x 397.073 us]
                      2.032 ms 56.87%  reaching_definition_analysis [9 x 225.809 us]
                      1.541 ms 43.13%  [unaccounted]
                  4.133 ms 46.84%  dead_store_elimination [9 x 459.221 us]
                      2.942 ms 71.19%  live_variable_analysis [9 x 326.898 us]
                      1.191 ms 28.81%  [unaccounted]
                  0.522 ms  5.92%  die                   [9 x  58.015 us]
                  0.595 ms  6.75%  [unaccounted]
              5.275 ms  0.68%  make_thread_local     [9 x 586.086 us]
                  0.711 ms 13.49%  typecheck             [9 x  79.049 us]
                  4.563 ms 86.51%  [unaccounted]
              0.709 ms  0.09%  make_block_local      [9 x  78.811 us]
                673.294 us 94.92%  typecheck             [9 x  74.810 us]
                 36.001 us  5.08%  [unaccounted]
              0.038 ms  0.00%  remove_range_assumption [9 x   4.212 us]
              1.158 ms  0.15%  die                   [9 x 128.640 us]
              3.621 ms  0.47%  demote_atomics        [9 x 402.371 us]
                  1.185 ms 32.72%  typecheck             [9 x 131.660 us]
                  2.436 ms 67.28%  [unaccounted]
          0.627  s 44.83%  compile               [9 x  69.623 ms]
            626.596 ms 100.00%  cuda codegen         [9 x  69.622 ms]
                208.417 ms 33.26%  clone_struct_module   [9 x  23.157 ms]
                  1.904 ms  0.30%  emit_to_module        [9 x 211.610 us]
                 52.255 ms  8.34%  eliminate_unused_functions [9 x   5.806 ms]
                354.518 ms 56.58%  compile_module_to_ptx [9 x  39.391 ms]
                      7.532 ms  2.12%  llvm_function_pass    [9 x 836.902 us]
                    292.388 ms 82.47%  llvm_module_pass      [9 x  32.488 ms]
                     54.598 ms 15.40%  [unaccounted]
                  9.502 ms  1.52%  [unaccounted]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

can make user confused if they only want to measure a little kernel, with no knowledge about these low-level details.
That's why I want a PythonProfiler. pybind11'ing ScopedProfiler is a good idea, but we still need a PythonProfiler given that there's no way to turn off our ScopedProfiler in these TI_AUTO_PROF, let's keep ScopedProfiler specific for devs, and PythonProfiler for end-users, WDYT?

@yuanming-hu
Copy link
Member

Thanks - as I mentioned previously, we should introduce a flag to disable the C++ part of the existing profiler by default, in the following PR.

Co-authored-by: Yuanming Hu <yuanming-hu@users.noreply.github.com>
@archibate archibate requested a review from yuanming-hu July 30, 2020 07:09
@archibate archibate merged commit 7574254 into taichi-dev:master Jul 31, 2020
yuanming-hu added a commit that referenced this pull request Aug 1, 2020
yuanming-hu pushed a commit that referenced this pull request Aug 1, 2020
@FantasyVR FantasyVR mentioned this pull request Aug 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Make existing things or codebases better large changeset The PR contains a large changeset and reviewing may take times python Python engineering related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants