
Introduce a virtual buffer pointer mapping mechanism for OCL memory obj #4421

Closed
wants to merge 1 commit into from

Conversation

gongzg commented Jul 6, 2016

We introduce a virtual buffer pointer mapping mechanism here to unify the CUDA code path and the OCL code path into one style. This lets us remove a lot of duplicate code in the layer implementations.

gongzg commented Jul 6, 2016

@naibaf7 This PR is from the original clcaffe implementation and was ported by @wujunkai166. Please help review it. Thanks.

naibaf7 commented Jul 6, 2016

@gongzg
This code path reserves a fixed part of the Dtype* pointer's address range in OpenCL to decode the buffer offset, and another part to look up the cl_mem, right?
If so, this is a solution I cannot accept into the OpenCL branch (as discussed earlier in #2610 and #2195), as it limits the address space too much.

naibaf7 commented Jul 6, 2016

@gongzg
I would have preferred looking into a solution that wraps the cl_mem + offset pair in a polymorphic class (taking up 128 bits: a 64 bit handle plus a 64 bit offset), with the CUDA counterpart wrapping just a 64 bit pointer.
Furthermore, this polymorphic virtual pointer class could implement methods such as direct memory access, increment and decrement, and it would not rely on the ClState, which I also find problematic for multi-device support (that support should not be broken, but the current implementation in this PR will break it).

At the very least, for this PR's approach to be taken into consideration, there should be a separate address space for each device, using the device class (a device* is passed to and available in every layer) as an abstraction, instead of managing the maps and decoding of memory objects globally.
But even then I am not particularly in favor of such a solution; it would be good to get feedback on what others think of it.
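As a rough, hedged sketch of that wrapper idea (class and method names below are hypothetical illustrations, not code from this PR or the OpenCL branch), the polymorphic virtual pointer could look something like this:

```cpp
#include <cstddef>
#include <CL/cl.h>

// Abstract device pointer: hides whether the backend is CUDA or OpenCL.
template <typename Dtype>
class vptr {
 public:
  virtual ~vptr() {}
  virtual void increment(std::ptrdiff_t n) = 0;  // pointer arithmetic
  virtual void decrement(std::ptrdiff_t n) = 0;
};

// CUDA/CPU flavor: wraps a flat 64 bit pointer, arithmetic is trivial.
template <typename Dtype>
class cuda_vptr : public vptr<Dtype> {
 public:
  explicit cuda_vptr(Dtype* p) : ptr_(p) {}
  void increment(std::ptrdiff_t n) override { ptr_ += n; }
  void decrement(std::ptrdiff_t n) override { ptr_ -= n; }
  Dtype* raw() const { return ptr_; }
 private:
  Dtype* ptr_;
};

// OpenCL flavor: wraps a cl_mem handle plus an element offset (128 bits
// total); kernel launch code passes mem() and offset() as separate arguments.
template <typename Dtype>
class ocl_vptr : public vptr<Dtype> {
 public:
  ocl_vptr(cl_mem mem, std::size_t off) : mem_(mem), off_(off) {}
  void increment(std::ptrdiff_t n) override { off_ += n; }
  void decrement(std::ptrdiff_t n) override { off_ -= n; }
  cl_mem mem() const { return mem_; }
  std::size_t offset() const { return off_; }
 private:
  cl_mem mem_;
  std::size_t off_;
};
```

Layer code would then operate on vptr<Dtype> regardless of the backend, and the address space limitation disappears because the OpenCL side never has to squeeze a cl_mem plus offset into a single pointer-sized value.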

gongzg commented Jul 7, 2016

@naibaf7 Thanks for the comments. The current implementation does indeed lack consideration for multiple OCL devices, and the memory space comment is also valid. If we switch to the polymorphic virtual pointer class you propose above, with a separate memory space for each device, then we no longer have the address space limitation issue, right? The advantage of this solution is also very obvious from the changes in the layer code: most of the GPU code path converges to the same code for CUDA and OCL devices. And I agree with you that this is a big change, so we had better gather more people's comments.

naibaf7 commented Jul 7, 2016

@gongzg
Yes, no limitation on the memory address space. And it is quite important to have that, because of 32 bit devices (mainly a concern on Android and other embedded platforms).

We can also reduce the amount of code by using polymorphic virtual pointers. In fact, it would even be possible to go single-kernel-source for CUDA and OpenCL and launch all the kernels with polymorphic pointer resolution, argument unrolling and NVRTC (on CUDA).
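For illustration only (this is not code from the PR or the branch; the helper name and compile flag are assumptions), runtime-compiling a shared kernel source on the CUDA side with NVRTC could look roughly like this:

```cpp
#include <nvrtc.h>
#include <string>

// Compile a CUDA C kernel source string to PTX at runtime; the PTX can then
// be loaded with cuModuleLoadData() and launched with cuLaunchKernel().
// Error handling is omitted for brevity.
std::string compile_to_ptx(const char* kernel_source) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernel_source, "caffe_kernels.cu",
                     0, nullptr, nullptr);

  const char* opts[] = {"--gpu-architecture=compute_50"};
  nvrtcCompileProgram(prog, 1, opts);

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);

  nvrtcDestroyProgram(&prog);
  return ptx;
}
```

On the OpenCL side, the corresponding (suitably abstracted) kernel source would instead go through clCreateProgramWithSource and clBuildProgram, which is what makes the single-kernel-source idea plausible.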

wujunkai166 commented

@naibaf7 Thanks for the comments. Here is my understanding of them:

If we still use the ClState mechanism, the mapping addresses for OpenCL memory objects and their offsets should be separated per device rather than sharing one global common space as in the current implementation.

If we do not use the ClState mechanism, a polymorphic 128 bit structure for cl_mem + offset can be used, encoding the cl_mem object in its high 64 bits and the offset in its low 64 bits. You suggest this approach since it does not rely on the ClState, which may limit the memory address space.

Is my understanding correct? If so, I can try to implement a version using the polymorphic structure you suggested. If not, please point out the parts I have misunderstood.

Thanks.

naibaf7 commented Jul 11, 2016

@wujunkai166

Yes, that sounds right. Note that when implementing the 128 bit method (32 + 32 bits on 32 bit platforms), nVidia's Dtype* also needs a wrapper with a shared interface so that we still get the benefit of reducing the amount of code necessary :)

naibaf7 commented Jul 11, 2016

We should get some more opinions, though, before going with either a virtual address space per device plus ClState, or virtual 128 bit smart pointers...

gongzg commented Jul 18, 2016

@naibaf7 Who do you think we should invite for further discussion on this?

naibaf7 commented Jul 18, 2016

@gongzg
Maybe @karlrupp, @bhack, @benoitsteiner, @CNugteren and @edgarriba, as they are also dealing with projects involving both CUDA and OpenCL... it would be nice to hear some opinions from them.

bhack commented Jul 18, 2016

/cc @pansk

bhack commented Jul 18, 2016

This table seems interesting. How does HIP handle memory?

naibaf7 commented Jul 18, 2016

@bhack
HIP is similar to CUDA in terms of memory pointers. That is not an option for us, so we need to either decode from virtual pointers or wrap the memory in a struct...

If we wrap, we can offer dereferencing and direct on-memory operations, but that requires a heavier 128 bit polymorphic struct/class.

If we use virtual pointers, we can keep 32/64 bit pointers throughout the library for CPU, CUDA and OpenCL alike, but the memory space won't be flat, and interpreting/casting the pointers can be a bit flimsy. We would also need data structures (such as in this PR) to handle memory allocation and virtual pointer resolution. That causes issues with multi-device support, threading (concurrent allocation), and confusion between dereferencing CUDA or CPU pointers (possible) and OpenCL virtual pointers (never possible).

CNugteren commented

I am using https://github.com/CNugteren/CLCudaAPI, a C++11 interface to either CUDA or OpenCL. It's just a single header file, so it is easy and light-weight. However, it might not be mature enough yet to deal with the issues you are facing: it's still quite young, but I am using it in my own projects.

bhack commented Jul 19, 2016

Nice... it seems we are trying to solve almost the same problems (ViennaCL, EasyCL, etc.) ;). @CNugteren, does it have some similarity with http://libocca.org?

wujunkai166 commented

@CNugteren @bhack @naibaf7 @gongzg
The work described in this PR uses an additional data structure (ClState) to map tuples of the form (cl_mem obj, int obj_buffer_size) to pointers allocated in a virtual address space. The data structure stores all of these mappings for later lookup. Basically, when the OpenCL path is chosen in Caffe, every call to A->gpu_data() returns a virtual address that is mapped to (cl_mem mem_A, int sizeA).

The motivation for this work is that many places in Caffe need to apply an offset to a base address obtained from gpu_data() and pass the result as a parameter to other functions. In CUDA, device memory is allocated and returned as a pointer to a basic type (float, double, etc.), so pointer arithmetic can easily produce the offset address. In OpenCL, however, gpu_data() returns a pointer that refers to a memory object, so arithmetic on that pointer gives an incorrect result. That's why we can observe a lot of places with the following code pattern:

#ifdef USE_CUDA
  func(..., A + offset, ...);
#endif  // USE_CUDA
#ifdef USE_GREENTEA
  func(..., A, offset, ...);
#endif  // USE_GREENTEA

The purpose of the work in this PR is to eliminate this kind of difference and make the OpenCL version of the code as clean as the CUDA version. By utilizing the data structure described above, the two code paths can be merged into a single statement:
func(..., A + offset, ...);

Here, in the OpenCL version, (A + offset) is a virtual address, and it is decoded into (cl_mem mem_A, int offset) inside func().

Therefore, this work makes the code cleaner and removes many of the differences between the OpenCL and CUDA versions. Moreover, the original CUDA path code is not changed at all. That is a little different from some of the approaches described in the comments above: we aim to reduce the differences between the CUDA and OpenCL implementations in Caffe under the condition that the original CUDA path code is not changed.
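For readers unfamiliar with the mechanism, here is a minimal, hypothetical sketch of such a virtual-pointer table (the names are illustrative and this is much simpler than the actual ClState in the PR; thread safety and multiple devices are ignored):

```cpp
#include <CL/cl.h>
#include <cstdint>
#include <map>

// Each registered buffer occupies a range [base, base + size) of fake
// addresses; decoding finds the containing range.
struct Allocation {
  cl_mem mem;
  size_t size;
};

class VirtualMapper {
 public:
  // Hand out a fake base address inside a reserved range for a cl_mem.
  void* map(cl_mem mem, size_t size) {
    void* base = reinterpret_cast<void*>(next_);
    table_[next_] = {mem, size};
    next_ += size;
    return base;
  }

  // Decode any virtual address back into (cl_mem, byte offset).
  bool decode(const void* vaddr, cl_mem* mem, size_t* offset) const {
    uintptr_t addr = reinterpret_cast<uintptr_t>(vaddr);
    auto it = table_.upper_bound(addr);  // first base strictly above addr
    if (it == table_.begin()) return false;
    --it;                                // candidate containing allocation
    if (addr >= it->first + it->second.size) return false;
    *mem = it->second.mem;
    *offset = addr - it->first;
    return true;
  }

 private:
  std::map<uintptr_t, Allocation> table_;
  uintptr_t next_ = 0x40000000u;  // start of the reserved range (illustrative)
};
```

This also makes the earlier concern concrete: the reserved range must fit inside the ordinary pointer address space, which is especially tight on 32 bit devices.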

naibaf7 commented Jul 25, 2016

@wujunkai166

Yeah, I understand what this PR does, and I think the idea is good. In terms of the code, only some multi-device and thread safety features are missing; otherwise it's really nice.

I think we would rather want to modify the CUDA path as well. Going that way, and swapping static CUDA compilation for NVRTC, we could even eliminate the kernel launch code duplication. Yes, this would move the code a bit further away from Caffe-master, but it would open up more flexibility for additional backends, and we could hide the implementation details (flat Dtype* versus cl_mem) completely.

Kernel launch code, math functions (BLAS) and abstract convolution interfaces (cuDNN, libDNN) could all be called from a polymorphic device class (which already exists, but is currently only used to get device specific information such as memory type and workgroup sizes) that would come in three flavors for now: CPU, OpenCL and CUDA. That would be easily extensible with further backends.

Also, note that CUDA brings a flat, unified address space. Direct memory access is still not possible if we use the virtual pointers in this PR. If we abstract cl_mem and Dtype* in a polymorphic std::shared_ptr<>-like class though, we can also offer that feature, at the cost of wrapping the CUDA Dtype* and CPU Dtype* memory.

What do you think about that?
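A hedged sketch of what that direct access could look like behind a shared interface (the names below are hypothetical, not from the PR; the CUDA flavor relies on unified virtual addressing, while the OpenCL flavor has to stage the read through the command queue):

```cpp
#include <cstddef>
#include <CL/cl.h>
#include <cuda_runtime.h>

// Hypothetical shared interface: reading one element works on every backend,
// even though only CUDA/CPU memory sits in a flat address space.
template <typename Dtype>
class dev_mem {
 public:
  virtual ~dev_mem() {}
  virtual Dtype read(size_t i) const = 0;
};

template <typename Dtype>
class cuda_mem : public dev_mem<Dtype> {
 public:
  explicit cuda_mem(const Dtype* dev_ptr) : ptr_(dev_ptr) {}
  Dtype read(size_t i) const override {
    Dtype value;
    // With unified virtual addressing, a plain Dtype* identifies the memory
    // and cudaMemcpyDefault infers the copy direction.
    cudaMemcpy(&value, ptr_ + i, sizeof(Dtype), cudaMemcpyDefault);
    return value;
  }
 private:
  const Dtype* ptr_;
};

template <typename Dtype>
class ocl_mem : public dev_mem<Dtype> {
 public:
  ocl_mem(cl_command_queue queue, cl_mem mem) : queue_(queue), mem_(mem) {}
  Dtype read(size_t i) const override {
    Dtype value;
    // cl_mem is opaque, so the element has to be read back explicitly.
    clEnqueueReadBuffer(queue_, mem_, CL_TRUE, i * sizeof(Dtype),
                        sizeof(Dtype), &value, 0, nullptr, nullptr);
    return value;
  }
 private:
  cl_command_queue queue_;
  cl_mem mem_;
};
```

The cost mentioned above is visible here: the flat CUDA (and CPU) pointer has to be wrapped just so that the OpenCL case can hide its extra indirection behind the same interface.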

CNugteren commented Jul 25, 2016

@naibaf7 I think you are describing exactly what CLCudaAPI does. It uses NVRTC, it has a device class and launch code, and it abstracts memory objects using std::shared_ptr. Maybe it's not complete yet for your purposes (e.g. unified address space), but it could be a good starting point.

naibaf7 commented Jul 25, 2016

@CNugteren
Should we look into collaborating on CLCudaAPI and port Caffe onto it once it's complete enough?
We'd probably have to integrate some heavy machinery from Caffe to make it work (e.g. the standard math kernels, the BLAS interfaces (ViennaCL BLAS, clBLAS, CLBlast) and a unified virtual address space). Is that something you would like to have in CLCudaAPI, or rather not?

CNugteren commented

@naibaf7 Of course, it would be nice to extend CLCudaAPI with more features, and I will probably be able to help you there, although my main focus (next to a full-time job) is on the CLBlast BLAS library. I am not sure whether porting Caffe is a good idea: I have seen so many forks of Caffe, and I am not sure what the chances are that this one will become mainstream. Also, are Caffe (and its users) happy to require C++11 and drop support for GCC < 4.7 and Visual Studio < 2015?

naibaf7 commented Jul 26, 2016

@CNugteren
The OpenCL branch (https://github.com/BVLC/caffe/tree/opencl) will likely never replace or be merged into the main branch, but at the pace BVLC/caffe evolves now (slowly enough that all changes can be translated to OpenCL Caffe), it would be reasonable to go ahead and make the backend more modern and flexible than it is now. C++11, GCC 4.8, OpenCL 1.1 and CUDA 7.5 (optional) are already dependencies for OpenCL Caffe anyway.

I'll do some tests with CLCudaAPI and see how it works :)

naibaf7 commented Dec 16, 2017

Virtual GPU pointers implemented.

naibaf7 closed this Dec 16, 2017