WMMA uses generic addressing #548

Closed
thomasfaingnaert opened this issue Feb 9, 2020 · 4 comments · Fixed by #557

thomasfaingnaert commented Feb 9, 2020

Not sure why I didn't notice this before...

using CUDAnative, CuArrays;

function kernel()
    # statically-allocated 16x16 tile of shared memory
    shmem = @cuStaticSharedMem(Float32, (16, 16))
    conf = WMMA.Config{16, 16, 16, Float32}

    # fill an accumulator fragment with zeros, and store it to shared memory
    d_frag = WMMA.fill_c(Float32(0), conf)
    WMMA.store_d(pointer(shmem), d_frag, 16, WMMA.ColMajor, conf)

    return
end

CUDAnative.code_llvm(kernel, ())
CUDAnative.code_ptx(kernel, ())

gives:

;  @ REPL[2]:2 within `kernel'
define void @julia_kernel_1() {
top:
;  @ REPL[2]:6 within `kernel'
; ┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:587 within `store_d'
; │┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:597 within `macro expansion'
; ││┌ @ /home/tfaingna/.julia/dev/CUDAnative/src/device/cuda/wmma.jl:158 within `llvm_wmma_store_d_col_m16n16k16_shared_stride_f32'
     call void @llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8(i8* addrspacecast (i8 addrspace(3)* bitcast ([256 x float] addrspace(3)* @shmem__static_shmem_253 to i8 addrspace(3)*) to i8*), float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, float 0.000000e+00, i32 16)
; └└└
;  @ REPL[2]:8 within `kernel'
  ret void
}

On the LLVM side, everything looks fine: the p3 variant of the intrinsic, i.e. the shared-memory overload, is used. Now look at the generated PTX:

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	julia_kernel_2  // -- Begin function julia_kernel_2
// shmem__static_shmem_253 has been demoted
                                        // @julia_kernel_2
.visible .func julia_kernel_2()
{
	.reg .f32 	%f<2>;
	.reg .b32 	%r<2>;
	.reg .b64 	%rd<3>;
	// demoted variable
	.shared .align 32 .b8 shmem__static_shmem_253[1024];
// %bb.0:                               // %top
	mov.u64 	%rd1, shmem__static_shmem_253;
	cvta.shared.u64 	%rd2, %rd1;
	mov.u32 	%r1, 16;
	mov.f32 	%f1, 0f00000000;
	wmma.store.d.sync.aligned.col.m16n16k16.f32 	[%rd2],{%f1, %f1, %f1, %f1, %f1, %f1, %f1, %f1}, %r1;
	ret;
}
                                        // -- End function

However, the NVPTX backend selected a wmma.store.d.sync.aligned.col.m16n16k16.f32 plus a cvta.shared instead of just using a wmma.store.d.sync.aligned.col.m16n16k16.shared.f32.
This seems to happen because of the addrspacecast to i8*: even though the intrinsic has argument type i8 addrspace(3)*, the call passes a generic i8*. This confuses the NVPTX backend, which consequently selects the generic WMMA variant.
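
For reference, here is a minimal sketch of what the @eval-generated wrapper effectively boils down to (simplified, not the actual CUDAnative source; argument names are illustrative). The ccall signature can only express a generic Ref{Float32}, so codegen has no way to spell addrspace(3) and has to cast the shared-memory pointer back to the generic address space before the call:

function store_d_shared(dst::Ptr{Float32},
                        d1::Float32, d2::Float32, d3::Float32, d4::Float32,
                        d5::Float32, d6::Float32, d7::Float32, d8::Float32,
                        stride::Int32)
    # `Ref{Float32}` below denotes a *generic* pointer; there is no way to
    # express `addrspace(3)` here, hence the addrspacecast in the IR above.
    ccall("llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8", llvmcall,
          Cvoid,
          (Ref{Float32},
           Float32, Float32, Float32, Float32,
           Float32, Float32, Float32, Float32,
           Int32),
          dst, d1, d2, d3, d4, d5, d6, d7, d8, stride)
    return
end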

To fix this, I should either make sure that ccall(..., llvmcall, ...) respects address spaces, or extend Ref{T} with an address space parameter, i.e. Ref{T, A}.

vchuravy commented Feb 9, 2020

Or fix the LLVM backend? You will probably get the same addrspacecasts after inlining.

thomasfaingnaert commented Feb 10, 2020

Or fix the LLVM backend? You will probably get the same addrspacecasts after inlining.

The problem isn't so much the addrspacecasts themselves, but rather that we call the intrinsic with type i8* instead of i8 addrspace(3)*.

At first glance, solving this in the backend will be tricky: the p3i8 suffix is removed fairly early, and instruction selection uses the type associated with the SDValue of the first argument to determine the correct instruction to use. Besides, from LLVM's perspective, we are generating invalid IR.

I think the cleanest solution is to add a pointer type annotated with the address space, ASPtr{T, AS}, to Julia for use with ccall(..., llvmcall, ...) (changing Ref{T} to take an extra parameter may be too breaking a change).
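
A rough sketch of what that could look like. This is hypothetical: ASPtr{T, AS} does not exist yet, and the exact lowering would be up to the Julia-side change; AS 3 is NVPTX's shared-memory address space.

function store_d_shared(dst::ASPtr{Float32, 3},
                        d1::Float32, d2::Float32, d3::Float32, d4::Float32,
                        d5::Float32, d6::Float32, d7::Float32, d8::Float32,
                        stride::Int32)
    # With the address space encoded in the pointer type, codegen can pass
    # `i8 addrspace(3)*` to the intrinsic directly, and the NVPTX backend
    # should select the .shared WMMA variant without a cvta.
    ccall("llvm.nvvm.wmma.m16n16k16.store.d.col.stride.f32.p3i8", llvmcall,
          Cvoid,
          (ASPtr{Float32, 3},
           Float32, Float32, Float32, Float32,
           Float32, Float32, Float32, Float32,
           Int32),
          dst, d1, d2, d3, d4, d5, d6, d7, d8, stride)
    return
end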

Alternatively, instead of using ccall in CUDAnative, we could switch to @generated functions and LLVM.jl, but I think the other approach is cleaner.

@maleadt What are your thoughts on this?

maleadt commented Feb 10, 2020

Besides, from LLVM's perspective, we are generating invalid IR.

This is the real problem: LLVM might refuse to validate our IR at any point because of it. I've been wanting some way of emitting pointers in specific address spaces for a long time, but Ref{T,AS} doesn't seem like a great solution. Some way to attach attributes to values would be better, but I don't think we have the infrastructure for that. So for now, maybe a non-exported ASPtr type would be a good way to experiment with the functionality we get from it?

vchuravy commented

Hm, I see... So the declared type of the intrinsic argument in ccall is Ptr{Void}, and during codegen Julia inserts the addrspacecast to i8*. Then later on we fold the intermediate addrspacecasts and end up with an almost-correct invocation.

I suspect that, short-term, the generated function will get you there the quickest. I was thinking about maybe fixing this up in codegen, but since the declared types are a mismatch, we would indeed need some representation of address spaces at the Julia level.

bors bot added a commit that referenced this issue Feb 23, 2020
557: Use AddrSpacePtr to call WMMA intrinsics r=vchuravy a=thomasfaingnaert

Closes #548 
Depends on JuliaLang/julia#34760

Co-authored-by: Thomas Faingnaert <thomas.faingnaert@hotmail.com>
bors bot closed this as completed in #557 (commit 2f60116) on Feb 24, 2020