
cuTENSOR wrappers #330

Merged
31 commits merged into master from ksh/tensor on Sep 5, 2019

Conversation

kshyatt
Contributor

@kshyatt kshyatt commented Apr 29, 2019

You'll need to put libcutensor.so somewhere it can be found. I ran all these tests separately, but we should check that they actually work when integrated into CuArrays too.

@kshyatt kshyatt requested a review from maleadt April 29, 2019 15:38
@kshyatt
Contributor Author

kshyatt commented Apr 29, 2019

This also includes some high-level type wrapping, so you can just add tensors and the like. We can integrate things like mapslices and broadcasting later, imo.
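For illustration, a rough sketch of that high-level interface (treat the exact constructor and operations as assumptions; the authoritative definitions are in this PR): a CuTensor pairs a CuArray with mode labels, and + dispatches to the element-wise cuTENSOR routines.

using CuArrays
using CuArrays.CUTENSOR: CuTensor

# wrap two 4x4 Float32 CuArrays with mode labels 'a' and 'b'
A = CuTensor(CuArrays.rand(Float32, 4, 4), ['a', 'b'])
B = CuTensor(CuArrays.rand(Float32, 4, 4), ['a', 'b'])

C = A + B            # element-wise addition, performed on the GPU
collect(C.data)      # bring the result back to the host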

@maleadt
Member

maleadt commented Apr 30, 2019

Great! I'll have a closer look as soon as I have some time.

cc @Jutho @springer13

const cutensorContext = Cvoid
const cutensorHandle_t = Ptr{cutensorContext}

const cudaDataType_t = UInt32

@@ -0,0 +1,315 @@
using CUDAdrv: CuDefaultStream, CuStream, CuStream_t

function cudaDataType(T::DataType)
Member

@maleadt maleadt Apr 30, 2019

This would make sense to move into CUDAapi too, as it's probably used by CUBLAS too.

Contributor Author

@kshyatt kshyatt Apr 30, 2019

We aren't using it in CUBLAS yet because we don't have tests or anything for mixed-precision gemm.
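For reference, a minimal sketch of what such a Julia-type-to-cudaDataType mapping can look like (the enum values are those of CUDA's library_types.h; the actual function body in this PR may differ):

# map a Julia element type onto the corresponding cudaDataType_t value
function cudaDataType(T::DataType)
    T == Float32    && return cudaDataType_t(0)  # CUDA_R_32F
    T == Float64    && return cudaDataType_t(1)  # CUDA_R_64F
    T == Float16    && return cudaDataType_t(2)  # CUDA_R_16F
    T == ComplexF32 && return cudaDataType_t(4)  # CUDA_C_32F
    T == ComplexF64 && return cudaDataType_t(5)  # CUDA_C_64F
    T == ComplexF16 && return cudaDataType_t(6)  # CUDA_C_16F
    throw(ArgumentError("no cudaDataType equivalent for $T"))
end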

@Jutho
Contributor

Jutho commented Apr 30, 2019

Looks like we've been doing a bit of double work. Anyway, great. I won't have much time until the semester is over (three more weeks), so it's good to see this being done. I guess it then wouldn't cost too much time to finalize and release my CuTensorOperations.jl package, which builds on top of CuTensor and TensorOperations.jl.

@maleadt
Member

maleadt commented Apr 30, 2019

Looks like we've been doing a bit of double work.

FYI, this is based on the latest versions of maleadt/CUTENSOR.jl and Jutho/CuTensor.jl.

@Jutho
Contributor

Jutho commented Apr 30, 2019

Ah, ok. I didn't look at it in detail, so I indeed didn't appreciate that :-).

@Jutho
Contributor

Jutho commented Apr 30, 2019

However, cutensor has undergone some changes since then, and I haven't had time to update the Julia wrappers. I think it's mostly the contraction routine, which now accepts a work buffer as an argument. @springer13 can comment.

@maleadt
Member

maleadt commented May 2, 2019

Pushed some fixes, but I can't seem to get this to work (CUDA and CUBLAS errors).
The errors reproduce with the samples that are shipped with cutensor:

$ ./elementwise_binary
Total memory: 0.18 GiB
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
cuTensor: 13880.85 GB/s
memcpy: 166.82 GB/s

@kshyatt which versions did you use? I'm testing CUTENSOR 0.1.10 with CUDA 10.1.
@springer13 are there specific version requirements? (if so, it would be nice to have a cutensorGetProperty to query and enforce a version).

@maleadt maleadt changed the title CuTensor wrappers WIP: cuTENSOR wrappers May 2, 2019
@maleadt
Member

maleadt commented May 16, 2019

Updated to CUTENSOR 0.1.14 on CUDA 10.1, same error:

$ ./elementwise_binary
Total memory: 0.18 GiB
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
cuTensor: 13356.83 GB/s
memcpy: 161.90 GB/s

@Jutho
Contributor

Jutho commented May 16, 2019

I will look at this as soon as I've finished reading your thesis (almost done) :-).

@springer13

Hi all, sorry for being late to this conversation (I need to update my mail address). It is great to see all this progress.
The latest cuTENSOR version requires SM60 (or later) as well as the CUDA 10.1 toolkit and the 418.xx driver. We'll provide more builds in the future.
@maleadt I assume that you'd like cutensorGetProperty() to behave similarly to its cuBLASLt counterpart? You could look into the cutensor.h header; it has three defines that encode the current version. Is this enough for now?
Please let me know if you can resolve the CUDA_ERROR.

@maleadt
Member

maleadt commented May 19, 2019

SM60 is the missing one, I'll test on a more recent GPU soon.

@maleadt I assume that you'd like cutensorGetProperty() to behave similar to its cuBLASLt counter-part?

Yes. I assumed that the libraryPropertyType_t mechanism would be standard across CUDA libraries (and it's implemented by many already).

You could look into the cutensor.h header it has three defines that encode the current version. Is this enough for now?

No, since we don't use the header and just call directly into the library, we need to be able to query that kind of information at runtime.
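For illustration, a minimal sketch of what such a header-less query could look like from the Julia side, assuming cuTENSOR gained a cutensorGetProperty() modeled on the standard libraryPropertyType mechanism (MAJOR_VERSION = 0, MINOR_VERSION = 1, PATCH_LEVEL = 2); the function name and signature are assumptions, since no such entry point exists in cuTENSOR 0.1.x:

const libcutensor = "libcutensor"

# hypothetical: query a single libraryPropertyType value from the library
function property(prop::Integer)
    value = Ref{Cint}()
    status = ccall((:cutensorGetProperty, libcutensor), Cint,
                   (Cint, Ptr{Cint}), prop, value)
    status == 0 || error("cutensorGetProperty failed with status $status")
    return value[]
end

version() = VersionNumber(property(0), property(1), property(2))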

@springer13

Okay thanks, I'll put this on my TODO list. We should include this as part of our next minor release.
Please let me know if you need anything else regarding the API.

@Jutho
Contributor

Jutho commented May 20, 2019

I have everything running (after some trouble upgrading the CUDA toolkit). Only the elementwiseTrinary test fails. Not sure what is going on there: just in the case of simple matrices, with A and C nonzero and B zero it yields the correct result for D = A + B + C, but with B nonzero a different result is obtained. With only B nonzero, and A and C zero, D contains entries of order 10^{-25} (in Float32).

I don't see an immediate problem with the function definitions, though; it seems the B array is passed along correctly.

Of course, even when the above is fixed, there need to be many more tests (different eltypes etc.). I'll make some time for this this week.
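For @springer13, a minimal sketch of the failing case (the wrapper name and argument order follow this PR's elementwiseTrinary!, but treat the exact signature, and whether the operator constants are exported, as assumptions):

using CuArrays
using CuArrays.CUTENSOR
using CuArrays.CUTENSOR: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_ADD

m, n = 16, 16
A = CuArray(rand(Float32, m, n))
B = CuArray(zeros(Float32, m, n))   # an all-zero B gives the correct D; a nonzero B does not
C = CuArray(rand(Float32, m, n))
D = similar(C)
modes = ['a', 'b']

# D = 1*A + 1*B + 1*C, element-wise, with all unary operators set to identity
CUTENSOR.elementwiseTrinary!(1f0, A, modes, CUTENSOR_OP_IDENTITY,
                             1f0, B, modes, CUTENSOR_OP_IDENTITY,
                             1f0, C, modes, CUTENSOR_OP_IDENTITY,
                             D, modes, CUTENSOR_OP_ADD, CUTENSOR_OP_ADD)

collect(D) ≈ collect(A) .+ collect(B) .+ collect(C)   # false when B is nonzero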

@kshyatt
Contributor Author

kshyatt commented May 20, 2019 via email

@kshyatt
Contributor Author

kshyatt commented May 22, 2019

Added some more test types and commented out (for now) the weird trinary failure. I'll try to add tests combining integer and float types in addition, etc.

@Jutho
Contributor

Jutho commented May 22, 2019

That's great. I think that in my earlier explorations with the cutensor library I never tested or even used/wrapped the trinary function. Could it be that there really is an error in the implementation in the CUTENSOR library itself, @springer13?

@maleadt
Member

maleadt commented May 23, 2019

And we have CI! It fails, though: error 18 (unknown CUDA error) with CuArrays.jl and error 20 (insufficient driver version) when running the CUDA C samples, even though I do have 418.xx and CUDA 10.1... @springer13?

# ./contraction
Total memory: 0.45 GiB
Error: CUTENSOR_STATUS_INSUFFICIENT_DRIVER in line 156

# ./elementwise_binary 
Total memory: 0.18 GiB
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
cuTensor: 76534.50 GB/s
memcpy: 810052.77 GB/s
terminate called after throwing an instance of 'cutensor::InternalError'
  what():  cublasDestroy failed.

Aborted (core dumped)
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

$ nvidia-smi
Thu May 23 11:52:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+

CUTENSOR 0.1.14

@maleadt
Member

maleadt commented May 23, 2019

EDIT: I'm stupid, I was using the wrong Docker runtime. @springer13, it might be useful if it threw a better error when CUDA isn't available, though 🙂

$ docker run --rm -it -v $(pwd):/cutensor nvidia/cuda:10.1-devel
# should have used docker --runtime=nvidia

cutensor/samples# CPATH=../include LIBRARY_PATH=../lib make -j10

cutensor/samples# LD_LIBRARY_PATH=../../../lib ./contraction
Total memory: 0.45 GiB
Error: CUTENSOR_STATUS_INSUFFICIENT_DRIVER in line 156

# other samples crash more spectacularly

@Jutho
Contributor

Jutho commented May 28, 2019

@kshyatt, did you also observe that the elementwiseBinary tests fail specifically for the ComplexF16 case, and work for all other eltypes?

@springer13

@Jutho Could you please provide me with an example input that I could use to reproduce the trinary issue?
@maleadt I'll double-check how other CUDA libraries report errors in the absence of a CUDA runtime and report errors accordingly.

@Jutho
Contributor

Jutho commented May 30, 2019

I've updated the elementwiseBinary test to cover more operations (both unary and binary) and a varying number of indices N=2:5. Let me know what you think; if you all agree I will generalise the other tests in a similar fashion.

@springer13, so far I've seen errors in elementwiseBinary with mixed element types and with CUTENSOR_OP_SQRT when the elements are complex. Is this to be expected?

I've also changed CharUnion, used in the argument type for specifying the modes, to accept arbitrary Integers. I don't know if this is fair or common; the original type in cutensor is Int32. I would certainly appreciate accepting at least that, and not only characters.
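For concreteness, a small sketch of what that mode handling amounts to (the type and helper names here are placeholders, not necessarily those in the PR): cuTENSOR takes modes as int32_t, so both Chars and arbitrary Integers can be funneled through Int32.

# hypothetical helper: accept mode labels as Chars or Integers and convert
# them to the Int32 values the cuTENSOR C API expects
const ModeType = AbstractVector{<:Union{Char, Integer}}

as_modes(modes::ModeType) = Cint[Cint(m) for m in modes]

as_modes(['a', 'b', 'c'])   # Int32[97, 98, 99]
as_modes([1, 2, 3])         # Int32[1, 2, 3]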

@maleadt
Member

maleadt commented May 31, 2019

Looks like CI is working! The failures are unrelated, and have started occurring after upgrading CUDA for this PR. I'll look into that.

@springer13

@Jutho What type of errors? I'd say that a NOT_SUPPORTED status would be expected, since we have not instantiated all the different data-type combinations (for instance, sqrt of complex is not supported).

@maleadt maleadt changed the title WIP: cuTENSOR wrappers cuTENSOR wrappers Sep 5, 2019
@maleadt
Member

maleadt commented Sep 5, 2019

I have removed the Libdl fiddling; it didn't seem necessary on my system. Maybe something has changed with the cutensor build process? Either way, @Jutho, you originally added that; care to check whether it still works on your system? Besides that, this should be good to go, I think.

@maleadt
Member

maleadt commented Sep 5, 2019

CI looks good, let's merge this!

@maleadt maleadt merged commit 06943a9 into master Sep 5, 2019
@bors bors bot deleted the ksh/tensor branch September 5, 2019 13:46