Adding target option #62
base: main
Conversation
@michel2323 thanks for adding this. I think we need to discuss offline as the changes remove JACC's public API portability across vendors for the same code. Am I seeing this right?
I wouldn't say so? In what way? At some point you have to pick a backend, but the same is true for OpenMP. In Julia you can do this at runtime.
If your code only uses one backend, say CUDA, you could have a fixed backend set once at the top of your code.
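A minimal sketch of that idea (the `const` alias and the fallback are my illustration; the backend-as-first-argument call form follows this PR):

```julia
using JACC, CUDA

# pick the backend once; everything below stays vendor-agnostic
const backend = CUDA.functional() ? CUDABackend() : ThreadsBackend()

# axpy, N, alpha, x, y as in the example further below
JACC.parallel_for(backend, N, axpy, alpha, x, y)
```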
Ah, I see what you mean, maybe: the array types. So in the case where backend = CUDABackend(), the arrays live on the device. For @PhilipFackler this would also make it easier if there's a struct with mixed host and device types. He would only have to define a single Adapt rule for it.
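For illustration, a minimal sketch of such a rule (the struct and field names are hypothetical):

```julia
using Adapt

# a struct mixing plain host data and device arrays (hypothetical example)
struct Fields{T,A}
    dt::T   # scalar host data, passed through unchanged
    u::A    # arrays that should follow the chosen backend
    v::A
end

# a single rule is enough: adapt each array field to the target
Adapt.adapt_structure(to, f::Fields) =
    Fields(f.dt, adapt(to, f.u), adapt(to, f.v))

# then e.g. fields_d = adapt(CUDABackend(), fields) moves u and v to the GPU
```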
This is mostly a corner case that very rarely comes up, so we should focus on portable code across different vendors. I agree it's nice to have, but enforcing a specific back end in the public API should be optional (maybe via a macro? see the sketch below) for corner cases, not the rule. The back end selection follows Preferences.jl just like MPIPreferences.jl, so user code calling JACC (like the code in the tests) doesn't need to be touched.
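If it were a macro, one could imagine something like this (entirely hypothetical, just to make the idea concrete):

```julia
# hypothetical: rewrite `@on_backend b f(args...)` into `f(b, args...)`,
# so the backend argument stays out of the portable call site
macro on_backend(backend, call)
    call isa Expr && call.head === :call ||
        error("@on_backend expects a function call")
    return esc(Expr(:call, call.args[1], backend, call.args[2:end]...))
end

# usage: @on_backend CUDABackend() JACC.parallel_for(N, axpy, alpha, x, y)
```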
The argument would be a variable, so the back end can still be selected at runtime.
For those cases, the user should rely on regular Julia Arrays and the CPU (host) if it's not worth porting; JACC is targeted specifically at performance-portable code pieces.
That's what JACCPreferences sets, but via LocalPreferences; see this line. The less vendor/system information is exposed to the targeted users (domain scientists), the better.
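For reference, a sketch of the user-facing side of that (assuming the current JACCPreferences interface):

```julia
# run once per project; writes `backend = "cuda"` to LocalPreferences.toml,
# analogous to MPIPreferences.use_system_binary()
import JACC.JACCPreferences
JACCPreferences.set_backend("cuda")
```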
Let me add an example code:

```julia
# code in a setup.jl, or run by the user before running their code
using CUDA

if CUDA.functional()
    backend = CUDABackend()
else
    backend = ThreadsBackend()
end

# application code using JACC, which is the same across all vendors
using JACC
using Adapt  # for `adapt` (import added here for completeness)

function axpy(i, alpha, x, y)
    if i <= length(x)
        @inbounds x[i] += alpha * y[i]
    end
end

# illustrative data, not part of the original snippet
N = 1_000_000
alpha = 2.0
x = ones(Float64, N)
y = ones(Float64, N)

x = adapt(backend, x)
y = adapt(backend, y)
for i in 1:11
    @time JACC.parallel_for(backend, N, axpy, alpha, x, y)
end

# copy back to the host
x = adapt(ThreadsBackend(), x)
y = adapt(ThreadsBackend(), y)
```

So the difference is whether to set preferences or to set the backend in a setup script like this.
The difference between MPI and the GPU backends is that MPI has the same API across all implementations and the same array types are passed in. For the GPUs that's different.
Yeah, that's the goal of JACC. Users should not interact with back ends (at most minimally, like it's done today with Preferences). "JACC-aware" MPI would be a noble goal, though.
Another stab at it. This defines a default backend that switches when a GPU package is loaded:

```julia
using JACC
# prints "Default backend is ThreadsBackend()"
println_default_backend()

using CUDA
# prints "Default backend is CUDABackend()"
println_default_backend()
```

And then there are parallel methods that pass this down. Of course, if multiple GPU packages are loaded by the user, this will pick whatever extension was compiled last. Sorry, I really don't know how else to resolve the precompilation issue.
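A rough sketch of how such an extension mechanism could look (module and variable names are made up):

```julia
# in JACC itself: a runtime-mutable default backend
const DEFAULT_BACKEND = Ref{Any}(ThreadsBackend())
default_backend() = DEFAULT_BACKEND[]
println_default_backend() = println("Default backend is ", default_backend())

# in a weak-dependency extension, compiled only when CUDA.jl is loaded
module JACCCUDAExt
using JACC, CUDA
__init__() = JACC.DEFAULT_BACKEND[] = CUDABackend()
end
```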
And now with Preferences support too. So the breaking change is that there is no JACC.Array type anymore.
@michel2323 thanks, see discussion in #53. I am asking @PhilipFackler how to reproduce the error as it's not showing in the current CI. I'd rather keep the public API as simple as possible, since back ends can be handled internally and weak dependencies should provide the desired separation.
Ideally users should not deal with any detail in the code other than memory allocation and the parallel function calls.
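In other words, the portable version would stay as terse as something like this (a sketch using JACC's existing API; axpy, alpha, and N as in the example above):

```julia
using JACC  # back end chosen via preferences, not in the code

x = JACC.Array(ones(Float64, N))  # allocation is the only device-aware step
y = JACC.Array(ones(Float64, N))
JACC.parallel_for(N, axpy, alpha, x, y)
```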
This adds a target option to the parallel function calls. For CUDA:
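```julia
# sketch of the new call form with an explicit CUDA target,
# reusing the axpy example from the discussion above
using JACC, CUDA
JACC.parallel_for(CUDABackend(), N, axpy, alpha, x, y)
```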
The GPU packages provide these backends. JACC then defines ThreadsBackend() in addition to those. Doing it this way should resolve the precompilation error, while also resolving #56. In addition, there is no need to set preferences anymore, and the various backends can be used concurrently in one code. There is also no need for a JACC.Array type. This tries to imitate the target offload pragma of OpenMP.

@PhilipFackler Let me know if there are any further issues with this solution.
Edit: These backends are also used by KernelAbstractions (except ThreadsBackend(), of course), so it would now be easy to write, for example, some GPU kernels in KA that don't require backend-specific functionality.
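For example, a small KA kernel sharing the same backend object might look like this (a sketch, not code from this PR):

```julia
using KernelAbstractions

@kernel function axpy_kernel!(x, alpha, @Const(y))
    i = @index(Global)
    @inbounds x[i] += alpha * y[i]
end

# `backend` is the same CUDABackend()/ROCBackend()/... used for the JACC calls
axpy_kernel!(backend, 256)(x, alpha, y; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```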