Optionally enable threading via FastBroadcast.jl #1508

Merged · 20 commits · Oct 25, 2021

Conversation

@ranocha (Member) commented Oct 22, 2021

This is a first draft of threaded parallelism in explicit RK methods, using the nice work of @chriselrod in YingboMa/FastBroadcast.jl#19. As discussed in #1423, I added an option `thread` to some RK methods with default value `False()`. If set to `True()`, `@..` will use threads (from Polyester.jl).
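
For example, opting in looks roughly like this (a minimal sketch; the toy problem is mine, not from this PR, and I qualify `True()` with the module name since `True`/`False` come from Static.jl and may or may not be re-exported, depending on the package version):

```julia
using OrdinaryDiffEq

# Illustrative toy problem: du/dt = -u
f(u, p, t) = -u
prob = ODEProblem(f, [1.0, 2.0], (0.0, 1.0))

# Default: serial broadcasting (thread = False())
sol_serial = solve(prob, RDPK3SpFSAL49())

# Opt in to threaded broadcasting via Polyester.jl
sol_threaded = solve(prob, RDPK3SpFSAL49(thread = OrdinaryDiffEq.True()))
```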

Using 946d55a, I get the following results with 4 threads on an AMD Ryzen Threadripper 3990X 64-Core Processor (second run, after compilation):

julia> using Trixi

julia> begin
           redirect_stdout(devnull) do
               trixi_include(joinpath(examples_dir(), "structured_3d_dgsem", "elixir_euler_source_terms.jl"), 
                             polydeg=5, cells_per_dimension = (8, 8, 8), tspan=(0.0, 0.0))
           end
           ode = remake(ode, tspan=(0.0, 1.0))
           
           @time redirect_stdout(devnull) do
               solve(ode, RDPK3SpFSAL49(), abstol=1.0e-7, reltol=1.0e-7,
                     save_everystep=false, callback=summary_callback)
           end; summary_callback()
           
           sleep(0.5)
           
           @time redirect_stdout(devnull) do
               solve(ode, RDPK3SpFSAL49(thread=True()), abstol=1.0e-7, reltol=1.0e-7,
                     save_everystep=false, callback=summary_callback)
           end; summary_callback()
       end
  2.607690 seconds (14.30 k allocations: 56.529 MiB, 9.35% gc time, 0.55% compilation time)
 ─────────────────────────────────────────────────────────────────────────────
           Trixi.jl                   Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:           2.34s / 60.1%           18.2MiB / 7.38%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 rhs!                    516    1.41s   100%  2.72ms   1.34MiB  100%   2.67KiB
   volume integral       516    576ms  41.0%  1.12ms    226KiB  16.4%     448B
   interface flux        516    393ms  27.9%   761μs    185KiB  13.5%     368B
   source terms          516    266ms  18.9%   516μs    379KiB  27.5%     752B
   reset ∂u/∂t           516   98.1ms  6.98%   190μs     0.00B  0.00%    0.00B
   surface integral      516   50.7ms  3.61%  98.3μs    202KiB  14.6%     400B
   Jacobian              516   13.3ms  0.94%  25.7μs    169KiB  12.3%     336B
   ~rhs!~                516   8.79ms  0.63%  17.0μs    216KiB  15.7%     428B
   boundary flux         516    145μs  0.01%   281ns     0.00B  0.00%    0.00B
 ─────────────────────────────────────────────────────────────────────────────

  1.694267 seconds (16.97 k allocations: 56.705 MiB, 1.07% compilation time)
 ─────────────────────────────────────────────────────────────────────────────
           Trixi.jl                   Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:           1.67s / 78.9%           18.3MiB / 7.33%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 rhs!                    516    1.31s   100%  2.55ms   1.34MiB  100%   2.67KiB
   volume integral       516    573ms  43.5%  1.11ms    226KiB  16.4%     448B
   interface flux        516    391ms  29.8%   758μs    185KiB  13.5%     368B
   source terms          516    265ms  20.1%   513μs    379KiB  27.5%     752B
   surface integral      516   50.5ms  3.84%  97.9μs    202KiB  14.6%     400B
   reset ∂u/∂t           516   15.4ms  1.17%  29.8μs     0.00B  0.00%    0.00B
   Jacobian              516   13.2ms  1.00%  25.6μs    169KiB  12.3%     336B
   ~rhs!~                516   6.88ms  0.52%  13.3μs    216KiB  15.7%     428B
   boundary flux         516   37.6μs  0.00%  72.9ns     0.00B  0.00%    0.00B
 ─────────────────────────────────────────────────────────────────────────────

There are two interesting observations for me here.

  1. Multiple threads reduce the time to solution significantly.
  2. Using threads inside the RK steps also reduces the runtime of the first threaded part of our RHS evaluation in Trixi.jl (reset ∂u/∂t). The threads from Polyester.jl are already spinning, so there is less latency in subsequent multithreaded parts. See also trixi-framework/Trixi.jl#924 (Implement a fast multithreaded way to reset du).
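
The second observation can be illustrated with a minimal sketch using Polyester.jl's `@batch` (the function name here is illustrative, not Trixi.jl's actual implementation):

```julia
using Polyester: @batch

# Reset du in parallel. Polyester reuses a pool of spinning worker
# threads, so repeated calls incur very little scheduling latency --
# which is why the threaded RK stages above also speed up "reset ∂u/∂t".
function reset_du!(du)
    @batch for i in eachindex(du)
        du[i] = zero(eltype(du))
    end
    return du
end

du = rand(10^6)
reset_du!(du)
```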

TODO

  • Check whether it can be beneficial to optionally enable threads in calculate_residuals! from DiffEqBase.jl
    • At least it doesn't seem to hurt and might be slightly beneficial, so I enabled it.
  • If so, update DiffEqBase.jl, set new compat bounds here, and import the broadcasting stuff from there instead of FastBroadcast.jl
  • Extend to more algorithms... (maybe also other PRs later)

@ranocha (Member Author) commented Oct 22, 2021

I will wait until SciML/DiffEqBase.jl#711 is merged and Chris' comment above is resolved before pushing any new changes here (to avoid duplicated work).

@ChrisRackauckas (Member)

Yeah, this looks like something to accept.

@ranocha (Member Author) commented Oct 22, 2021

Okay. I will wait until a new version of DiffEqBase.jl is released and update the compat bounds accordingly. I will also extend this to a few more methods.

@ranocha (Member Author) commented Oct 22, 2021

I enabled threaded broadcasting and stage/step limiters for

  • 3Sp and 3SpFSAL low-storage methods
  • 2N low-storage methods
  • Tsit5
  • BS3
  • SSPRK methods (SSPRK22, SSPRK33, SSPRK53, SSPRK53_2N1, SSPRK53_2N2, SSPRK53_H, SSPRK63, SSPRK73, SSPRK83, SSPRK43, SSPRK432, SSPRK932, SSPRK54, SSPRK104)

Left for another PR:

  • SSP RK MS methods (SSPRKMSVS43, SSPRKMSVS32): they already have limiters
  • All other methods (without limiters so far)
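
The extended methods all take the same keyword; a sketch with an in-place RHS (toy problem and function names are mine):

```julia
using OrdinaryDiffEq

# Illustrative in-place RHS: du/dt = -u
function rhs!(du, u, p, t)
    @. du = -u
    return nothing
end
prob = ODEProblem(rhs!, rand(1000), (0.0, 1.0))

# The same thread option on other methods extended in this PR
solve(prob, Tsit5(thread = OrdinaryDiffEq.True()))
solve(prob, BS3(thread = OrdinaryDiffEq.True()))
solve(prob, SSPRK43(thread = OrdinaryDiffEq.True()))
```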

@ranocha ranocha marked this pull request as ready for review October 22, 2021 15:04
@ChrisRackauckas (Member)

Test failures look real.

@ranocha (Member Author) commented Oct 25, 2021

I fixed the failing tests and CI looks good to me. Coverage decreased slightly, but I cannot view the report for src/algorithms.jl on Codecov.

@ChrisRackauckas (Member)

For algorithms like Tsit5, there is a separate dispatch for `Array` that is used to reduce compile time. We'd need to make sure not to dispatch there when `thread = True()` for this to work. I guess that could be a follow-up, though.

@ranocha (Member Author) commented Oct 25, 2021

Tests pass 🥳

@ChrisRackauckas ChrisRackauckas merged commit 94a9b06 into SciML:master Oct 25, 2021
@ChrisRackauckas (Member)

Can you open an issue about extending this to other methods? Could be a good GSoC starter/test project.

@ranocha (Member Author) commented Oct 25, 2021

> Can you open an issue about extending this to other methods? Could be a good GSoC starter/test project.

#1511

@ranocha (Member Author) commented Oct 25, 2021

Would you mind making a new release of OrdinaryDiffEq.jl?

@ChrisRackauckas (Member)

Donezo
