Introduce a virtual buffer pointer mapping mechanism for OCL memory obj #4421
Conversation
Introduce a virtual buffer pointer mapping mechanism for OCL memory object. We introduce a virtual buffer pointer mapping mechanism here to unify the CUDA code path and OCL code path into one style. Thus we can remove a lot of duplicate code in the layer implementations.
@naibaf7 This PR is from the original clcaffe implementation and was ported by @wujunkai166. Please help to review. Thanks.
@gongzg
@gongzg At the very least, for this PR's approach to be taken into consideration, it should use separate address spaces for each device, using the device class (a device* is passed to and available in every layer) as the abstraction, and not manage the maps/decoding of memory objects globally.
@naibaf7 Thanks for the comments. The current implementation does lack consideration of multiple OCL devices, and the comment about memory space is also valid. If we switch to the polymorphic virtual pointer class you proposed above, with a separate memory space for each device, then I think we no longer have the address space limitation, right? The advantage of this solution is also obvious in the changes to the layer code: most of the GPU code path converges to the same code for CUDA and OCL devices. I agree with you that this is a big change, and we'd better gather more people's comments.
@gongzg We can also reduce the amount of code when using polymorphic virtual pointers. In fact, it would even be possible to go single-kernel-source for CUDA and OpenCL and launch all the kernels with polymorphic pointer resolution, argument unrolling & NVRTC (on CUDA).
@naibaf7 Thanks for the comments. Here is my understanding of them: If we keep the ClState mechanism, the mapping addresses for OpenCL memory objects and their offsets should be separated per device rather than sharing one global common space as in the current implementation. If we don't use the ClState mechanism, a polymorphic structure (128 bit) for cl_mem + offset can be used, encoding the cl_mem object in its upper 64 bits and the offset in its lower 64 bits. You suggest this approach since it doesn't rely on ClState, which may impose limitations on the memory address space. Is my understanding correct? If yes, I can try to implement a version using the polymorphic structure you suggested. If not, please point out the parts I misunderstood. Thanks.
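For illustration only, a minimal sketch of such a 128-bit cl_mem + offset structure could look like the following. The names are hypothetical and this is not code from the PR; on a 64-bit platform the "upper 64 bits" are simply the cl_mem handle and the "lower 64 bits" the element offset:

```cpp
// Hypothetical sketch of the 128-bit polymorphic pointer discussed above:
// a cl_mem handle plus an element offset, with pointer-style arithmetic.
#include <CL/cl.h>
#include <cstdint>

struct ocl_vptr {
  cl_mem mem;          // OpenCL memory object handle ("upper" 64 bits)
  std::uint64_t off;   // element offset into the buffer ("lower" 64 bits)

  // (A + n) keeps the same cl_mem and only advances the offset.
  ocl_vptr operator+(std::uint64_t n) const { return ocl_vptr{mem, off + n}; }
};

// A kernel wrapper would then pass the two halves separately, e.g.:
//   clSetKernelArg(kernel, 0, sizeof(cl_mem), &p.mem);
//   cl_int off = static_cast<cl_int>(p.off);
//   clSetKernelArg(kernel, 1, sizeof(cl_int), &off);
```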
Yes, that sounds right. Note that when implementing the 128-bit method (32 + 32 bit on 32-bit platforms), nVidia's Dtype* also needs a wrapper with a shared interface so that we still get the benefit of reducing the amount of code necessary :)
We should get some more opinions, though, before going with either a virtual address space per device + ClState or virtual 128-bit smart pointers...
@naibaf7 Who do you think we should invite for further discussion on this?
@gongzg
/cc @pansk
This table seems interesting. How does HIP handle memory?
@bhack If we wrap, we can offer dereferencing & direct on-memory operations, but it requires a heavier 128-bit polymorphic struct/class to do so. If we use virtual pointers, we can keep 32/64-bit pointers throughout the library for CPU, CUDA and OpenCL alike, but the memory space won't be flat, and it can be a bit flimsy when interpreting/casting the pointers. Also, we would need data structures (such as in this PR) to handle memory allocation & virtual pointer resolution. This raises issues with multi-device support, threading (concurrent allocation), and confusion between dereferencing CUDA or CPU pointers (which is possible) and OpenCL virtual pointers (where it never will be).
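To make the virtual-pointer side of that trade-off concrete, here is a rough, illustrative sketch (not the PR's actual ClState code) of what such a resolution table does: fake addresses are handed out at allocation time and later decoded back into (cl_mem, offset). A single global table like this is exactly where the multi-device and concurrent-allocation concerns come from.

```cpp
// Illustrative sketch of a virtual-pointer registry, assuming a single global
// table (which is what makes multi-device support and thread safety tricky).
#include <CL/cl.h>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

class VirtualBufferTable {
 public:
  // Reserve a virtual range for a freshly created cl_mem and return its base.
  void* Register(cl_mem mem, std::size_t bytes) {
    void* base = reinterpret_cast<void*>(next_);
    ranges_[next_] = Range{mem, bytes};
    next_ += bytes;
    return base;
  }

  // Decode any address inside a registered range back into (cl_mem, offset).
  // Assumes p actually lies inside a registered range; a real implementation
  // would validate this and lock around both operations.
  std::pair<cl_mem, std::size_t> Resolve(const void* p) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto it = ranges_.upper_bound(addr);
    --it;
    return {it->second.mem, addr - it->first};
  }

 private:
  struct Range { cl_mem mem; std::size_t bytes; };
  std::uintptr_t next_ = 0x1000;            // arbitrary non-null start address
  std::map<std::uintptr_t, Range> ranges_;  // base virtual address -> buffer
};
```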
I am using https://github.com/CNugteren/CLCudaAPI, a C++11 interface to either CUDA or OpenCL. It's just a single header file, so it is easy and light-weight. However, it might not be evolved enough to deal with the issues you are facing? It's still quite young, but I am using it in my own projects.
Nice... It seems we are trying to solve almost the same problems (viennacl, easycl, etc.) ;). @CNugteren Does it have some similarity with http://libocca.org?
@CNugteren @bhack @naibaf7 @gongzg The motivation for this work is that we found a lot of places in Caffe that need to apply an offset to a base address obtained from gpu_data() and pass the result to other functions. In CUDA, device memory is allocated and returned as a pointer to a basic type (float, double, etc.), so pointer arithmetic directly yields the offset address. In OpenCL, however, gpu_data() returns a pointer to a memory object, so arithmetic on that pointer gives an incorrect result. That's why we can observe a lot of places with #ifdef USE_CUDA / OpenCL code patterns (see the sketch after this comment). The purpose of the work in this PR is to eliminate this kind of difference and make the OpenCL code path as clean as the CUDA one. By utilizing the data structure described above, the two code paths can be merged into a single statement: in the OpenCL version, (A + offset) is a virtual address that is decoded into (cl_mem mem_A, int offset) inside func(). Therefore, this work makes the code cleaner and removes a lot of differences between the OpenCL and CUDA versions, while the original CUDA path code is not changed. This is a little different from some of the work described in the comments above: we aim at reducing the difference between the CUDA and OpenCL implementations in Caffe under the condition that the original CUDA path code won't be changed.
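As a concrete sketch of that pattern and of what it collapses into; the names (func, func_ocl, A, offset) are hypothetical stand-ins rather than actual Caffe code, and the backend macros follow the existing USE_CUDA/USE_GREENTEA convention:

```cpp
// Hypothetical stand-ins, not actual Caffe code: func() takes a CUDA-style
// Dtype* while func_ocl() needs the cl_mem and the offset passed separately.
#include <CL/cl.h>

template <typename Dtype> void func(const Dtype* A);
template <typename Dtype> void func_ocl(cl_mem A, int offset);

template <typename Dtype>
void example_layer_call(const Dtype* A, int offset) {
  // Without virtual pointers: every call site needs two branches, because
  // pointer arithmetic is not valid on a cl_mem handle.
#ifdef USE_CUDA
  func(A + offset);
#endif
#ifdef USE_GREENTEA
  func_ocl(reinterpret_cast<cl_mem>(const_cast<Dtype*>(A)), offset);
#endif

  // With the virtual buffer pointer mapping, both backends reduce to one
  // statement; on OpenCL, (A + offset) is a virtual address that func()
  // decodes back into (cl_mem mem_A, int offset) internally.
  func(A + offset);
}
```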
Yeah, I understand what this PR does, and I think the idea is good; in terms of the code, only some multi-device and thread-safety features are missing, otherwise it's really nice. I think we would rather want to modify the CUDA path as well. Going that way, and swapping static CUDA compilation for NVRTC, we could even eliminate the kernel launch code duplication. Yes, this would move the code a bit further away from Caffe-master, but it would open up more flexibility for additional backends, and we could hide the implementation details (flat Dtype* versus cl_mem) completely. Kernel launch code, math functions (BLAS) and abstract convolution interfaces (cuDNN, libDNN) could all be called from a polymorphic device abstraction. Also, note that CUDA brings a flat, unified address space; direct memory access is still not possible if we use the virtual pointers in this PR, whereas if we abstract cl_mem and Dtype* in a polymorphic wrapper, that difference can be hidden behind a shared interface. What do you think about that?
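To make the NVRTC part of that idea a bit more concrete, here is a rough sketch (not code from the PR; error handling, CUDA context setup, and the kernel source string itself are omitted) of compiling a shared kernel-source string at runtime on the CUDA side:

```cpp
// Sketch: runtime-compile a kernel string with NVRTC and load it via the
// CUDA driver API. A real implementation would check every return code and
// cache the compiled module per device.
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

CUfunction CompileAndLoad(const std::string& source, const char* kernel_name) {
  // Compile the shared source string to PTX.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source.c_str(), "caffe_kernels.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_50"};
  nvrtcCompileProgram(prog, 1, opts);

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // Load the PTX and fetch the kernel handle (requires an existing context).
  CUmodule module;
  cuModuleLoadDataEx(&module, ptx.data(), 0, nullptr, nullptr);
  CUfunction kernel;
  cuModuleGetFunction(&kernel, module, kernel_name);
  return kernel;
}
```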
@naibaf7 I think you are describing exactly what CLCudaAPI does. It uses NVRTC, it has a device class and launch code, and it abstracts memory objects using a templated buffer class.
@CNugteren
@naibaf7 Of course, it would be nice to extend CLCudaAPI with more features, and I will probably be able to help you there, although my main focus (next to a full-time job) is on the CLBlast BLAS library. I am not sure that porting Caffe is a good idea: I have seen so many forks of Caffe that I am not sure what the chances are that this one will become mainstream. Also, are Caffe (and its users) happy to require C++11 and drop support for GCC < 4.7 and Visual Studio < 2015?
@CNugteren I'll do some tests with CLCudaAPI and see how it works :) |
Virtual GPU pointers implemented. |