WMMA uses generic addressing #548

Closed
thomasfaingnaert opened this issue Feb 9, 2020 · 4 comments · Fixed by #557

thomasfaingnaert commented Feb 9, 2020

Not sure why I didn't notice this before...

using CUDAnative, CuArrays;

function kernel()
    # statically-allocated 16x16 tile of shared memory
    shmem = @cuStaticSharedMem(Float32, (16, 16))
    conf = WMMA.Config{16, 16, 16, Float32}

    # fill an accumulator fragment with zeros, and store it to shared memory
    d_frag = WMMA.fill_c(Float32(0), conf)
    WMMA.store_d(pointer(shmem), d_frag, 16, WMMA.ColMajor, conf)

    return
end

CUDAnative.code_llvm(kernel, ())
CUDAnative.code_ptx(kernel, ())

gives:

;  @ REPL[2]:2 within `kernel'
define void @julia_kernel_1() {
top:
;  @ REPL[2]:6 within `kernel'
; ┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:587 within `store_d'
; │┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:597 within `macro expansion'
; ││┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:158 within `llvm_wmma_store_d_col_m16n16k16_shared_stride_f32'
     call void @llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8(i8* addrspacecast (i8 addrspace(3)* bitcast ([256 x float] addrspace(3)* @shmem__static_shmem_253 to i8 addrspace(3)*) to i8*), float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, i32 16)
; └└└
;  @ REPL[2]:8 within `kernel'
  ret void
}

On the LLVM side, everything looks fine: the p3 variant of the intrinsic, i.e. the shared-memory overload, is used. Now look at the generated PTX:

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	julia_kernel_2  // -- Begin function julia_kernel_2
// shmem__static_shmem_253 has been demoted
                                        // @julia_kernel_2
.visible .func julia_kernel_2()
{
	.reg .f32 	%f<2>;
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<3>;
	// demoted variable
	.shared .align 32 .b8 shmem__static_shmem_253[1024];
// %bb.0:                               // %top
	mov.u64 	%rd1, shmem__static_shmem_253;
	cvta.shared.u64 	%rd2, %rd1;
	mov.u32 	%r1, 16;
	mov.f32 	%f1, 0f00000000;
	wmma.store.d.sync.aligned.col.m16n16k16.f32 	[%rd2],{%f1, %f1, %f1, %f1, %f1, %f1, %f1, %f1}, %r1;
	ret;
}
                                        // -- End function

However, the NVPTX backend selected a wmma.store.d.sync.aligned.col.m16n16k16.f32 plus a cvta.shared instead of just using a wmma.store.d.sync.aligned.col.m16n16k16.shared.f32.
This seems to happen because of the addrspacecast to i8*: even though the intrinsic has argument type i8 addrspace(3)*, the call passes a generic i8*. This confuses the NVPTX backend, which consequently selects the generic WMMA variant.
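
For reference, here is a minimal sketch of what the @eval-generated wrapper effectively boils down to (simplified, not the actual CUDAnative source; argument names are illustrative). The ccall signature can only express a generic Ref{Float32}, so codegen has no way to spell addrspace(3) and has to cast the shared-memory pointer back to the generic address space before the call:

function store_d_shared(dst::Ptr{Float32},
                        d1::Float32, d2::Float32, d3::Float32, d4::Float32,
                        d5::Float32, d6::Float32, d7::Float32, d8::Float32,
                        stride::Int32)
    # `Ref{Float32}` below denotes a *generic* pointer; there is no way to
    # express `addrspace(3)` here, hence the addrspacecast in the IR above.
    ccall("llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8", llvmcall,
          Cvoid,
          (Ref{Float32},
           Float32, Float32, Float32, Float32,
           Float32, Float32, Float32, Float32,
           Int32),
          dst, d1, d2, d3, d4, d5, d6, d7, d8, stride)
    return
end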

To fix this, I should either make sure that ccall(..., llvmcall, ...) respects address spaces, or extend Ref{T} with an address space parameter, i.e. Ref{T, A}.

vchuravy commented Feb 9, 2020

Or fix the LLVM backend? You will probably get the same addrspacecasts after inlining.

thomasfaingnaert commented Feb 10, 2020

Or fix the LLVM backend? You will probably get the same addrspacecasts after inlining.

The problem isn't so much the addrspacecasts themselves, but rather that we call the intrinsic with type i8* instead of i8 addrspace(3)*.

At first glance, solving this in the backend will be tricky: the p3i8 suffix is removed fairly early, and instruction selection uses the type associated with the SDValue of the first argument to determine the correct instruction to use. Besides, from LLVM's perspective, we are generating invalid IR.

I think the cleanest solution is to add a pointer type annotated with the address space, ASPtr{T, AS}, to Julia for use with ccall(..., llvmcall, ...) (changing Ref{T} to take an extra parameter may be too breaking a change).
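
A rough sketch of what that could look like. This is hypothetical: ASPtr{T, AS} does not exist yet, and the exact lowering would be up to the Julia-side change; AS 3 is NVPTX's shared-memory address space.

function store_d_shared(dst::ASPtr{Float32, 3},
                        d1::Float32, d2::Float32, d3::Float32, d4::Float32,
                        d5::Float32, d6::Float32, d7::Float32, d8::Float32,
                        stride::Int32)
    # With the address space encoded in the pointer type, codegen can pass
    # `i8 addrspace(3)*` to the intrinsic directly, and the NVPTX backend
    # should select the .shared WMMA variant without a cvta.
    ccall("llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8", llvmcall,
          Cvoid,
          (ASPtr{Float32, 3},
           Float32, Float32, Float32, Float32,
           Float32, Float32, Float32, Float32,
           Int32),
          dst, d1, d2, d3, d4, d5, d6, d7, d8, stride)
    return
end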

Alternatively, instead of using ccall in CUDAnative, we could switch to @generated functions and LLVM.jl, but I think the other approach is cleaner.

@maleadt What are your thoughts on this?

maleadt commented Feb 10, 2020

Besides, from LLVM's perspective, we are generating invalid IR.

This is the real problem: LLVM might refuse to validate our IR at any point because of it. I've been wanting some way of emitting pointers in specific address spaces for a long time, but Ref{T,AS} doesn't seem like a great solution. Some way to attach attributes to values would be better, but I don't think we have the infrastructure for that. So for now, maybe a non-exported ASPtr type would be a good way to experiment with the functionality we get from it?

vchuravy commented

Hm, I see... So the declared type of the intrinsic argument in ccall is Ptr{Void}, and during codegen Julia inserts the addrspacecast to i8*. Then later on we fold the intermediate addrspacecasts and end up with an almost-correct invocation.

I suspect that, short-term, the generated function will get you there the quickest. I was thinking about maybe fixing this up in codegen, but since the declared types are a mismatch, we would indeed need some representation of address spaces at the Julia level.

bors bot added a commit that referenced this issue Feb 23, 2020
557: Use AddrSpacePtr to call WMMA intrinsics r=vchuravy a=thomasfaingnaert

Closes #548 
Depends on JuliaLang/julia#34760

Co-authored-by: Thomas Faingnaert <thomas.faingnaert@hotmail.com>
bors bot closed this as completed in #557 (commit 2f60116) on Feb 24, 2020