Rethinking the programming model #143
I'm working on some of the necessary CUDAdrv improvements over at JuliaGPU/CUDAdrv.jl#133.
Part of the challenge is that only on very modern Linux systems (with HMM) can arbitrary host memory be made accessible from the GPU at all.
Is there even a version of Linux & CUDA where this works? Sure, HMM is merged in 4.14, but it doesn't work on CUDA 10 + Linux 4.19. Furthermore, it's not like unified memory is a magic bullet. Workloads that flip between CPU and GPU will still be about as slow as they are under the current approach.
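Concretely, the pattern in question looks roughly like the sketch below. The `Mem.alloc(Mem.Unified, ...)` spelling is an assumption about the CUDAdrv API (it would wrap `cuMemAllocManaged`), so treat the names as illustrative: the point is only that even with one buffer visible from both sides, every flip between host and device access pays for page migration.

```julia
using CuArrays, CUDAdrv
using CUDAdrv: Mem

# One managed ("unified") buffer, viewed as both a host Array and a CuArray.
# The allocation call is assumed API, sketched from cuMemAllocManaged.
n   = 1024
buf = Mem.alloc(Mem.Unified, n * sizeof(Float32))

host   = unsafe_wrap(Array,   convert(Ptr{Float32},   buf), n)
device = unsafe_wrap(CuArray, convert(CuPtr{Float32}, buf), n)

host .= 1f0              # pages are resident on the host
device .= device .+ 1f0  # first device touch migrates them over
host[1]                  # ...and a single host read migrates them back

# A workload that alternates like this spends its time paging rather than
# computing, which is the "not a magic bullet" caveat above.
Mem.free(buf)
```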
Widely-available HMM definitely seems like the major blocker. I think it's worth exploring whether some workarounds are possible. For example, we could swap out Julia's default array type for one backed by unified memory. If the major downside to this approach is that we have a little extra work to turn slow code into failures/warnings, that seems like an OK position to be in.
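To make the failures/warnings part concrete, here is a rough sketch in plain Julia (no real unified-memory backing; `UnifiedArray` is made up for illustration): the idea, in the spirit of GPUArrays' `allowscalar`, is that accidental element-wise CPU fallbacks become loud instead of silently slow.

```julia
# `UnifiedArray` stands in for an array type backed by unified memory; plain
# Array storage is used here just to keep the sketch self-contained.
struct UnifiedArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}   # placeholder for unified-memory storage
end

Base.size(A::UnifiedArray) = size(A.data)
Base.IndexStyle(::Type{<:UnifiedArray}) = IndexLinear()

# Scalar reads are the slow path we want to surface: warn once (this could
# just as well throw, if failures turn out to be the better default).
function Base.getindex(A::UnifiedArray, i::Int)
    @warn "scalar indexing of a GPU-shared array from the CPU; this will be slow" maxlog=1
    A.data[i]
end
Base.setindex!(A::UnifiedArray, v, i::Int) = (A.data[i] = v; v)

# Whole-array operations keep their fast path by dispatching straight to the
# underlying storage (plain BLAS here, CUBLAS in the real thing).
Base.:*(A::UnifiedArray{T,2}, B::UnifiedArray{T,2}) where {T} =
    UnifiedArray(A.data * B.data)
```

With this in place, `A * B` on two wrapped matrices stays silent and fast, while a stray `A[1]` deep inside some library code shows up as a warning pointing at the offending call.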
Except that those cases would become very hard to spot. As soon as some shared pointer leaks (which wouldn't be limited to code inside the scoped block), you end up with silently slow CPU accesses far away from wherever the leak happened. Isn't the higher abstraction level much more suited for capturing inputs and uploading them to the GPU? I haven't been following Flux.jl, but I think I greatly prefer improving it as opposed to betting on unified memory (performance cost: unknown) and hoping we don't make things even harder to reason about.
I think that's where we need some empirical testing, to see how likely this really is to trip people up. My feeling is that while those cases are possible, they are going to be much less common than just running a few simple matmuls in a clearly scoped block, which is going to work fine and have far fewer hazards than the current model. The cost of running the experiment seems low for the potential gains, and we can decide whether to bet the farm on it later.

FWIW, what I'm proposing is also significantly different from the CUDA C unified programming model, where CPU and GPU kernels can be pretty freely mixed, and closer to what we have now. Kernels don't have to be allowed outside a clearly scoped block.

Improving Flux is obviously preferable, but I basically think we've hit a wall there. You put conversions in a bunch of places, and if it's slightly wrong you go out of memory or get an obscure error. The TensorFlow-style approach takes control of that for you, at a very high cost to usability (that's why we're here, after all). Unified memory is the only way I can see to get the best of all worlds, though of course I'm very open to other suggestions.
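Spelled out, the scoped-block model might look something like this; `ongpu`, `togpu` and `tocpu` are made-up names, not CuArrays API. Inputs are captured and uploaded at the block boundary, the body only ever sees CuArrays, and results come back as ordinary Arrays, so nothing device-flavoured leaks past the block.

```julia
using CuArrays

# Hypothetical helpers: move floating-point arrays to the device and back,
# and leave everything else untouched.
togpu(x::AbstractArray{<:AbstractFloat}) = CuArray(x)
togpu(x) = x
tocpu(x::CuArray) = Array(x)
tocpu(x) = x

# Run `f` with its array arguments uploaded for the duration of the block.
function ongpu(f, args...)
    out = f(map(togpu, args)...)
    out isa Tuple ? map(tocpu, out) : tocpu(out)
end

W, x = rand(Float32, 128, 128), rand(Float32, 128)

y = ongpu(W, x) do W, x
    # a few simple matmuls in a clearly scoped block, as described above
    tanh.(W * x)
end

y isa Array   # true: the device arrays never escape the block
```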
My issue title was misleading and unclear; unified memory is kind of beside the point here, it's just one implementation of a better CUDA programming model (and possibly not the best one). We discussed this a bit today and came to the conclusion that prototyping this as a simple compiler pass is the right way to try it out. There are various other things (e.g. better array abstractions in Base) that we may need for the full story, but that's a start. I may get time to prototype something soon. Anyone interested in hacking on this is welcome to reach out and I can help with that too.
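It isn't the compiler pass itself, but a macro-level toy like the one below gives a feel for the rewrite such a pass would do: inside a block, array constructors are rerouted to device equivalents. `@gpu` and the particular set of swapped functions are invented for illustration, assuming `CuArrays.rand`/`CuArrays.zeros`/`CuArrays.ones` as the device-side counterparts.

```julia
using CuArrays
using MacroTools: postwalk

# Map CPU-side constructors to their (assumed) CuArrays equivalents.
const GPU_SWAPS = Dict(
    :zeros => :(CuArrays.zeros),
    :ones  => :(CuArrays.ones),
    :rand  => :(CuArrays.rand),
)

# Walk the expression and swap any matching symbol; a real pass would work on
# typed IR rather than surface syntax, but the flavour is the same.
macro gpu(ex)
    esc(postwalk(x -> x isa Symbol && haskey(GPU_SWAPS, x) ? GPU_SWAPS[x] : x, ex))
end

@gpu begin
    W = rand(Float32, 32, 32)   # becomes CuArrays.rand(Float32, 32, 32)
    b = zeros(Float32, 32)      # becomes CuArrays.zeros(Float32, 32)
    W * W .+ b                  # runs on the device
end
```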
Duplicating FluxML/Flux.jl#706 here so that the right people can see it. I think the GPU maintainers generally agree that this is a good idea (please say if not) but we haven't written it down anywhere yet. Ideally we can work out some forward path for putting some effort into this.