The hybrid OpenMP/MPI branch (#1628) uses various OpenMP directives to parallelize the computation (e.g. #pragma omp parallel for).
More recent versions of OpenMP (circa 2018) support offloading the same computation onto hardware accelerators (e.g. GPUs) with very little modification to the same compiler directives. We would just have to make sure that data meant to stay on the accelerator actually stays there long enough to amortize the cost of host-device communication.
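As a rough sketch (using a placeholder 1-D update loop, not Meep's actual stepping code), offloading an existing parallel loop with OpenMP 4.5-style directives is mostly additive:

```cpp
// Placeholder 1-D E-field update, not Meep's actual stepping code.
// Today this loop would carry "#pragma omp parallel for"; adding the
// "target teams distribute" clauses offloads it to the device.
void step_e(double *ex, const double *hy, double coeff, int n) {
  #pragma omp target teams distribute parallel for \
          map(tofrom: ex[0:n]) map(to: hy[0:n])
  for (int i = 1; i < n; ++i)
    ex[i] += coeff * (hy[i] - hy[i - 1]);
}
```

Written this way, though, the map clauses copy the arrays to and from the device on every call, which is exactly the communication hit mentioned above; the field arrays would instead need to live in a persistent device data region for the duration of the run.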
For example, we could create a function called run_until(n) that continuously timesteps for n steps without any interruptions (currently, run(until=n) calls back into Python on every iteration). All of the timestepping, DFT accumulation, etc. can be performed on the accelerator; even the convergence checks can be performed there. The main benefit of using an accelerator for FDTD, of course, would be the extremely high memory bandwidth (FDTD is generally memory-bound, not compute-bound).
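A minimal sketch of what that could look like (run_until, the field names, and the convergence criterion here are all hypothetical placeholders, not existing Meep API):

```cpp
// Hypothetical run_until(n): keep the whole stepping loop, and even the
// convergence test, inside one persistent device data region.
void run_until(double *ex, double *hy, double coeff, int n, int steps,
               double tol) {
  #pragma omp target data map(tofrom: ex[0:n], hy[0:n])
  {
    for (int t = 0; t < steps; ++t) {
      // E-field update; the H-field update, sources, DFT accumulation, etc.
      // would follow the same pattern and reuse the already-mapped arrays.
      #pragma omp target teams distribute parallel for
      for (int i = 1; i < n; ++i)
        ex[i] += coeff * (hy[i] - hy[i - 1]);

      // Device-side convergence check: only the scalar `energy` is
      // transferred back to the host each step, never the field arrays.
      double energy = 0.0;
      #pragma omp target teams distribute parallel for \
              reduction(+ : energy) map(tofrom: energy)
      for (int i = 0; i < n; ++i)
        energy += ex[i] * ex[i];
      if (energy < tol) break;
    }
  }
}
```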
In the past, pursuing hardware acceleration was rather unattractive, as it required custom kernels written against a proprietary API. While directive-level shortcuts have existed for a long time (e.g. OpenACC), there wasn't enough motivation to justify the time sink. However, since we are already experimenting with OpenMP, it might be worth extending (or at least exploring) the functionality to also support basic accelerators.
An alternative approach is to use a framework like Kokkos, which supports many different backends but is both data- and compute-explicit. This would potentially work much better than relying on a single OpenMP library shipped with a particular compiler.
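To make that concrete (purely illustrative; the field layout and names are invented), with Kokkos the same placeholder update is written once against Kokkos::View and parallel_for, the backend (CUDA, HIP, SYCL, OpenMP, serial) is chosen when Kokkos is built, and data placement is explicit through the View's memory space:

```cpp
#include <Kokkos_Core.hpp>

// Same placeholder E-field update, expressed with Kokkos. Views live in the
// default execution space's memory (device memory for GPU backends), so data
// placement is explicit rather than implied by compiler directives.
void step_e(Kokkos::View<double *> ex, Kokkos::View<const double *> hy,
            double coeff, int n) {
  Kokkos::parallel_for(
      "step_e", Kokkos::RangePolicy<>(1, n), KOKKOS_LAMBDA(const int i) {
        ex(i) += coeff * (hy(i) - hy(i - 1));
      });
}

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double *> ex("ex", n), hy("hy", n);  // zero-initialized
    for (int t = 0; t < 100; ++t)
      step_e(ex, hy, 0.5, n);  // fields stay resident on the device
    Kokkos::fence();           // wait for asynchronous kernels to finish
  }
  Kokkos::finalize();
  return 0;
}
```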