This repository has been archived by the owner on May 27, 2021. It is now read-only.

Support for loading values through the texture cache (__ldg) #158

Merged · 17 commits · Mar 16, 2018

Conversation

@maleadt (Member) commented Feb 5, 2018

This branch packs a couple of improvements, as well as initial support for __ldg, for loading values through the texture cache. See the CUDA docs for more info; in short, the texture cache is faster than the global cache (which caches global loads automatically), but it is non-coherent, which means the array must be read-only for the entire duration of the kernel.
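For readers unfamiliar with the intrinsic, this is roughly what __ldg looks like on the CUDA C++ side (a minimal hypothetical kernel for illustration, not code from this PR):

```cuda
// Minimal sketch of __ldg in CUDA C++ (kernel and names made up).
// __ldg(&in[i]) loads through the read-only (texture) cache; because that
// cache is non-coherent, no thread may write `in` while the kernel runs.
__global__ void scale(float *out, const float *in, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * __ldg(&in[i]);  // cached, non-coherent read
}
```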

This PR only features the compiler support, no real front-end yet. Ideas/suggestions? A keyword argument to getindex? A constness typevar for CuArray/CuDeviceArray? Even though we'd need proper alias analysis (ref JuliaLang/julia#25890), this should already be usable in, e.g., non-mutating broadcast.

cc @vchuravy @MikeInnes

@MikeInnes (Contributor)
Perhaps something like @cuda kernel(y, readonly(x))? And readonly can create a simple wrapper and/or set a type-level flag that overrides getindex.
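For comparison, CUDA C++ solves the same front-end problem with pointer qualifiers: marking a kernel parameter `const ... __restrict__` promises the compiler the data is read-only and unaliased, so it may emit texture-cache loads automatically, without an explicit __ldg call (a minimal sketch with made-up names, not code from this PR):

```cuda
// With `const float *__restrict__`, nvcc knows `in` is read-only and does
// not alias `out`, so it is free to lower the plain load below to
// ld.global.nc, i.e. the same texture-cache load __ldg produces explicitly.
__global__ void axpy(float *__restrict__ out,
                     const float *__restrict__ in, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] += a * in[i];  // may compile to a non-coherent cached load
}
```

A readonly(x) wrapper in Julia would play the same role as these qualifiers: a type-level promise that lets getindex pick the cached load path.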

@vchuravy (Member) commented Feb 5, 2018

Yeah I think I am going to try out the readonly(x) -> Const(DeviceArray) way and see how it feels.

@cdsousa commented Mar 13, 2018

Hi, let me ask this: will this change make it possible to do something like the tex2D function?
(I'm really missing texture access with (free) interpolation in CUDAnative.jl kernels.)

@maleadt (Member, Author) commented Mar 13, 2018

Sorry, no. This just exposes the texture cache, proper texture memory isn't supported yet.

@cdsousa commented Mar 13, 2018 via email

@maleadt (Member, Author) commented Mar 16, 2018

Sorry, missed your reply.

> interpolation by hand, with performance comparable to texture fetches using tex2D

Hard to answer; I don't know how this interpolation is performed in hardware (I can't find much on the web). From a quick glance, tex2D is compiled to a compiler intrinsic called __tex2Dgather, which is turned into the PTX instruction tex.2d, which in turn generates the SASS instruction TEX. Assuming this uses a hardware sampler, you won't get all of its performance benefit just from loading through the texture memory cache as implemented in this PR. But if your workload is memory-bound, this PR gets you most of the way.

For my own reference (if I ever have time to implement this feature):

```cuda
__global__ void tex2D_kernel(float* output, cudaTextureObject_t texObj)
{
    output[0] = tex2D<float>(texObj, 1, 2);
}
```

The generated PTX:

```
.visible .entry _Z12tex2D_kernelPfy(
        .param .u64 _Z12tex2D_kernelPfy_param_0,
        .param .u64 _Z12tex2D_kernelPfy_param_1
)
{
        .reg .f32       %f<7>;
        .reg .b64       %rd<4>;


        ld.param.u64    %rd1, [_Z12tex2D_kernelPfy_param_0];
        ld.param.u64    %rd2, [_Z12tex2D_kernelPfy_param_1];
        cvta.to.global.u64      %rd3, %rd1;
        mov.f32         %f1, 0f40000000;
        mov.f32         %f2, 0f3F800000;
        tex.2d.v4.f32.f32       {%f3, %f4, %f5, %f6}, [%rd2, {%f2, %f1}];
        st.global.f32   [%rd3], %f3;
        ret;
}
```

And the resulting SASS:

```
        /*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */
        /*0010*/                   MOV32I R4, 0x3f800000;              /* 0x18fe000000011de2 */
        /*0018*/                   MOV32I R5, 0x40000000;              /* 0x1900000000015de2 */
        /*0020*/                   TEX.NODEP.P R4, R4, 0x52, 2D, 0x1;  /* 0x80104052fc411f06 */
        /*0028*/                   MOV R2, c[0x0][0x140];              /* 0x2800400500009de4 */
        /*0030*/                   MOV R3, c[0x0][0x144];              /* 0x280040051000dde4 */
        /*0038*/                   TEXDEPBAR 0x0;                      /* 0xf000000000001de6 */
                                                                       /* 0x200000000002f047 */
        /*0048*/                   ST.E [R2], R4;                      /* 0x9400000000211c85 */
        /*0050*/                   EXIT;                               /* 0x8000000000001de7 */
        /*0058*/                   BRA 0x58;                           /* 0x4003ffffe0001de7 */
        /*0060*/                   NOP;                                /* 0x4000000000001de4 */
        /*0068*/                   NOP;                                /* 0x4000000000001de4 */
        /*0070*/                   NOP;                                /* 0x4000000000001de4 */
        /*0078*/                   NOP;                                /* 0x4000000000001de4 */
```
@cdsousa commented Mar 16, 2018

Thanks for the thoughtful answer. I'm playing with the ldg branch myself, though I'm still dealing with other, unrelated issues. I'll post any results I get.

@maleadt (Member, Author) commented Mar 16, 2018

Great. Since I'm not too happy with the way this hooks into CuDeviceArray (I think I want a buffer type, with a more generic device array type on top of that), I'll reduce the scope of this PR and merge that. You'll then only be able to use this through unsafe_cached_load, but that should be fine for performance-sensitive code. Or maybe I can add an ldg function for now.
