
cuTENSOR wrappers #330

Merged
31 commits merged into master from ksh/tensor on Sep 5, 2019

Conversation

kshyatt
Contributor

@kshyatt kshyatt commented Apr 29, 2019

You'll need to put libcutensor.so somewhere it can be found. I ran all these tests separately, but we should check that they actually work when integrated into CuArrays too.

@kshyatt kshyatt requested a review from maleadt April 29, 2019 15:38
@kshyatt
Contributor Author

kshyatt commented Apr 29, 2019

This also includes some high-level type wrapping, so you can just add tensors and the like. We can integrate things like mapslices and broadcasting later, imo.
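For illustration, a rough sketch of that high-level interface (treat the exact constructor and operations as assumptions; the authoritative definitions are in this PR): a CuTensor pairs a CuArray with mode labels, and + dispatches to the element-wise cuTENSOR routines.

using CuArrays
using CuArrays.CUTENSOR: CuTensor

# wrap two 4x4 Float32 CuArrays with mode labels 'a' and 'b'
A = CuTensor(CuArrays.rand(Float32, 4, 4), ['a', 'b'])
B = CuTensor(CuArrays.rand(Float32, 4, 4), ['a', 'b'])

C = A + B            # element-wise addition, performed on the GPU
collect(C.data)      # bring the result back to the host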

@maleadt
Member

maleadt commented Apr 30, 2019

Great! I'll have a closer look as soon as I have some time.

cc @Jutho @springer13

const cutensorContext = Cvoid
const cutensorHandle_t = Ptr{cutensorContext}

const cudaDataType_t = UInt32

@@ -0,0 +1,315 @@
using CUDAdrv: CuDefaultStream, CuStream, CuStream_t

function cudaDataType(T::DataType)
Member

@maleadt maleadt Apr 30, 2019

This would make sense to move into CUDAapi too, as it's probably used by CUBLAS too.

Contributor Author

@kshyatt kshyatt Apr 30, 2019

We aren't using it in CUBLAS yet because we don't have tests or anything for mixed-precision gemm.
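For reference, a minimal sketch of what such a Julia-type-to-cudaDataType mapping can look like (the enum values are those of CUDA's library_types.h; the actual function body in this PR may differ):

# map a Julia element type onto the corresponding cudaDataType_t value
function cudaDataType(T::DataType)
    T == Float32    && return cudaDataType_t(0)  # CUDA_R_32F
    T == Float64    && return cudaDataType_t(1)  # CUDA_R_64F
    T == Float16    && return cudaDataType_t(2)  # CUDA_R_16F
    T == ComplexF32 && return cudaDataType_t(4)  # CUDA_C_32F
    T == ComplexF64 && return cudaDataType_t(5)  # CUDA_C_64F
    T == ComplexF16 && return cudaDataType_t(6)  # CUDA_C_16F
    throw(ArgumentError("no cudaDataType equivalent for $T"))
end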

@Jutho
Contributor

Jutho commented Apr 30, 2019

Looks like we've been doing a bit of double work. Anyway, great. I won't have much time until the semester is over (three more weeks), so it's good to see this being done. I guess it then wouldn't cost too much time to finalize and release my CuTensorOperations.jl package, which builds on top of CuTensor and TensorOperations.jl.

@maleadt
Member

maleadt commented Apr 30, 2019

Looks like we've been doing a bit of double work.

FYI, this is based on the latest versions of maleadt/CUTENSOR.jl and Jutho/CuTensor.jl.

@Jutho
Contributor

Jutho commented Apr 30, 2019

Ah, ok. I didn't look at it in detail, so I indeed didn't appreciate that :-).

@Jutho
Contributor

Jutho commented Apr 30, 2019

However, cutensor has undergone some changes since then, and I haven't had time to update the Julia wrappers. I think it's mostly the contraction routine, which now accepts a work buffer as an argument. @springer13 can comment.

@maleadt
Member

maleadt commented May 2, 2019

Pushed some fixes, but I can't seem to get this to work (CUDA and CUBLAS errors).
The errors reproduce with the samples that are shipped with cutensor:

$ ./elementwise_binary
Total memory: 0.18 GiB
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
cuTensor: 13880.85 GB/s
memcpy: 166.82 GB/s

@kshyatt which versions did you use? I'm testing CUTENSOR 0.1.10 with CUDA 10.1.
@springer13 are there specific version requirements? (if so, it would be nice to have a cutensorGetProperty to query and enforce a version).

@maleadt maleadt changed the title CuTensor wrappers WIP: cuTENSOR wrappers May 2, 2019
@maleadt
Member

maleadt commented May 16, 2019

Updated to CUTENSOR 0.1.14 on CUDA 10.1, same error:

$ ./elementwise_binary
Total memory: 0.18 GiB
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
ERROR: CUTENSOR_STATUS_CUDA_ERROR
cuTensor: 13356.83 GB/s
memcpy: 161.90 GB/s

@Jutho
Contributor

Jutho commented May 16, 2019

I will look at this as soon as I've finished reading your thesis (almost done) :-).

@springer13

Hi all, sorry for being late to this conversation (I need to update my mail address). It is great to see all this progress.
The latest cuTENSOR version requires SM60 (or later) as well as the CUDA 10.1 toolkit and the 418.xx driver. We'll provide more builds in the future.
@maleadt I assume that you'd like cutensorGetProperty() to behave similarly to its cuBLASLt counterpart? You could look into the cutensor.h header; it has three defines that encode the current version. Is this enough for now?
Please let me know if you can resolve the CUDA_ERROR.

@maleadt
Member

maleadt commented May 19, 2019

SM60 is the missing one, I'll test on a more recent GPU soon.

@maleadt I assume that you'd like cutensorGetProperty() to behave similar to its cuBLASLt counter-part?

Yes. I assumed that the libraryPropertyType_t mechanism would be standard across CUDA libraries (and it's implemented by many already).

You could look into the cutensor.h header it has three defines that encode the current version. Is this enough for now?

No, since we don't use the header and just call directly into the library, we need to be able to query that kind of information at runtime.
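For illustration, a minimal sketch of what such a header-less query could look like from the Julia side, assuming cuTENSOR gained a cutensorGetProperty() modeled on the standard libraryPropertyType mechanism (MAJOR_VERSION = 0, MINOR_VERSION = 1, PATCH_LEVEL = 2); the function name and signature are assumptions, since no such entry point exists in cuTENSOR 0.1.x:

const libcutensor = "libcutensor"

# hypothetical: query a single libraryPropertyType value from the library
function property(prop::Integer)
    value = Ref{Cint}()
    status = ccall((:cutensorGetProperty, libcutensor), Cint,
                   (Cint, Ptr{Cint}), prop, value)
    status == 0 || error("cutensorGetProperty failed with status $status")
    return value[]
end

version() = VersionNumber(property(0), property(1), property(2))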

@springer13

Okay thanks, I'll put this on my TODO list. We should include this as part of our next minor release.
Please let me know if you need anything else regarding the API.

@Jutho
Contributor

Jutho commented May 20, 2019

I have everything running (after some trouble upgrading the CUDA toolkit). Only the elementwiseTrinary test fails. Not sure what is going on there: just in the case of simple matrices, with A and C nonzero and B zero it yields the correct result for D = A + B + C, but with B nonzero a different result is obtained. With only B nonzero, and A and C zero, D contains entries of order 10^{-25} (in Float32).

I don't see an immediate problem with the function definitions, though; it seems the B array is passed along correctly.

Of course, even when the above is fixed, there need to be many more tests (different eltypes etc.). I'll make some time for this this week.
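For @springer13, a minimal sketch of the failing case (the wrapper name and argument order follow this PR's elementwiseTrinary!, but treat the exact signature, and whether the operator constants are exported, as assumptions):

using CuArrays
using CuArrays.CUTENSOR
using CuArrays.CUTENSOR: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_ADD

m, n = 16, 16
A = CuArray(rand(Float32, m, n))
B = CuArray(zeros(Float32, m, n))   # an all-zero B gives the correct D; a nonzero B does not
C = CuArray(rand(Float32, m, n))
D = similar(C)
modes = ['a', 'b']

# D = 1*A + 1*B + 1*C, element-wise, with all unary operators set to identity
CUTENSOR.elementwiseTrinary!(1f0, A, modes, CUTENSOR_OP_IDENTITY,
                             1f0, B, modes, CUTENSOR_OP_IDENTITY,
                             1f0, C, modes, CUTENSOR_OP_IDENTITY,
                             D, modes, CUTENSOR_OP_ADD, CUTENSOR_OP_ADD)

collect(D) ≈ collect(A) .+ collect(B) .+ collect(C)   # false when B is nonzero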

@kshyatt
Contributor Author

kshyatt commented May 20, 2019 via email

@kshyatt
Contributor Author

kshyatt commented May 22, 2019

Added some more test types and commented out (for now) the weird trinary failure. I'll try to add tests combining integer and float types in addition, etc.

@Jutho
Contributor

Jutho commented May 22, 2019

That's great. I think that in my earlier explorations with the cutensor library I never tested or even used/wrapped the trinary function. Could it be that there really is an error in the implementation in the CUTENSOR library itself, @springer13?

@maleadt
Member

maleadt commented May 23, 2019

And we have CI! It fails, though: error 18 (unknown CUDA error) with CuArrays.jl and error 20 (insufficient driver version) when running the CUDA C samples, even though I do have 418.xx and CUDA 10.1... @springer13?

# ./contraction
Total memory: 0.45 GiB
Error: CUTENSOR_STATUS_INSUFFICIENT_DRIVER in line 156

# ./elementwise_binary 
Total memory: 0.18 GiB
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
CUTENSOR ERROR: some argument is NULL.
ERROR: CUTENSOR_STATUS_INVALID_VALUE
cuTensor: 76534.50 GB/s
memcpy: 810052.77 GB/s
terminate called after throwing an instance of 'cutensor::InternalError'
  what():  cublasDestroy failed.

Aborted (core dumped)
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

$ nvidia-smi
Thu May 23 11:52:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+

CUTENSOR 0.1.14

@maleadt
Member

maleadt commented May 23, 2019

EDIT: I'm stupid, I was using the wrong Docker runtime. @springer13, it might be useful if it threw a better error when CUDA isn't available, though 🙂

$ docker run --rm -it -v $(pwd):/cutensor nvidia/cuda:10.1-devel
# should have used docker --runtime=nvidia

cutensor/samples# CPATH=../include LIBRARY_PATH=../lib make -j10

cutensor/samples# LD_LIBRARY_PATH=../../../lib ./contraction
Total memory: 0.45 GiB
Error: CUTENSOR_STATUS_INSUFFICIENT_DRIVER in line 156

# other samples crash more spectacularly

@Jutho
Contributor

Jutho commented May 28, 2019

@kshyatt, did you also observe that the elementwiseBinary tests fail specifically for the ComplexF16 case, and work for all other eltypes?

@springer13

@Jutho Could you please provide me with an example input that I could use to reproduce the trinary issue?
@maleadt I'll double-check how other CUDA libraries report errors in the absence of a CUDA runtime and report errors accordingly.

@Jutho
Contributor

Jutho commented May 30, 2019

I've updated the elementwiseBinary test to cover more operations (both unary and binary) and a varying number of indices N=2:5. Let me know what you think; if you all agree I will generalise the other tests in a similar fashion.

@springer13, so far I've seen errors in elementwiseBinary with mixed element types and with CUTENSOR_OP_SQRT when the elements are complex. Is this to be expected?

I've also changed CharUnion, used in the argument type for specifying the modes, to accept arbitrary Integers. I don't know if this is fair or common; the original type in cutensor is Int32. I would certainly appreciate accepting at least that, and not only characters.
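For concreteness, a small sketch of what that mode handling amounts to (the type and helper names here are placeholders, not necessarily those in the PR): cuTENSOR takes modes as int32_t, so both Chars and arbitrary Integers can be funneled through Int32.

# hypothetical helper: accept mode labels as Chars or Integers and convert
# them to the Int32 values the cuTENSOR C API expects
const ModeType = AbstractVector{<:Union{Char, Integer}}

as_modes(modes::ModeType) = Cint[Cint(m) for m in modes]

as_modes(['a', 'b', 'c'])   # Int32[97, 98, 99]
as_modes([1, 2, 3])         # Int32[1, 2, 3]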

@maleadt
Member

maleadt commented May 31, 2019

Looks like CI is working! The failures are unrelated, and have started occurring after upgrading CUDA for this PR. I'll look into that.

@springer13

@Jutho What type of errors? I'd say that a NOT_SUPPORTED status would be expected, since we have not instantiated all the different data-type combinations (for instance, sqrt of complex is not supported).

@maleadt maleadt changed the title WIP: cuTENSOR wrappers cuTENSOR wrappers Sep 5, 2019
@maleadt
Member

maleadt commented Sep 5, 2019

I have removed the Libdl fiddling; it didn't seem necessary on my system. Maybe something has changed with the cutensor build process? Either way, @Jutho, you originally added that; care to check whether it still works on your system? Besides that, this should be good to go, I think.

@maleadt
Member

maleadt commented Sep 5, 2019

CI looks good, let's merge this!

@maleadt maleadt merged commit 06943a9 into master Sep 5, 2019
@bors bors bot deleted the ksh/tensor branch September 5, 2019 13:46