adds Sort, revamps TensorMath, adds masked* operations #120
Conversation
Also has a multi-GPU copy performance enhancement by @adamlerer
also fyi re: @adamlerer's change, that's actually a bug fix too. Both the old and new non-contiguous copy kernels were missing the synchronization semantics implied by cudaMemcpyAsync() when copying between GPUs; cudaMemcpyAsync() actually synchronizes with both the src and dst default streams, whereas a kernel launch only gets inserted into the stream of the GPU that it is running on.
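As a rough illustration of that point, here is a minimal, hypothetical sketch (not the PR's actual code) of how a cross-GPU copy kernel can be given ordering guarantees similar to what cudaMemcpyAsync() provides, by bridging the two devices' default streams with events. The kernel, the function name, and the assumption that peer access (or UVA) is enabled are all illustrative.

#include <cuda_runtime.h>

// Illustrative copy kernel; assumes peer access or UVA so that `src`
// (allocated on the source GPU) is addressable from the destination GPU.
__global__ void copyKernel(float* dst, const float* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];
}

// cudaMemcpyAsync() on the default stream orders itself against BOTH devices'
// default streams; a plain kernel launch is only ordered on the GPU it runs on.
// Events can recover the missing ordering.
void crossGpuKernelCopy(int srcDev, int dstDev,
                        const float* src, float* dst, int n) {
  cudaEvent_t srcReady, copyDone;

  // 1. Mark the point at which the source GPU has finished producing `src`.
  cudaSetDevice(srcDev);
  cudaEventCreateWithFlags(&srcReady, cudaEventDisableTiming);
  cudaEventRecord(srcReady, 0 /* default stream */);

  // 2. On the destination GPU, wait for that point, then run the copy kernel.
  cudaSetDevice(dstDev);
  cudaStreamWaitEvent(0, srcReady, 0);
  copyKernel<<<(n + 255) / 256, 256>>>(dst, src, n);
  cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);
  cudaEventRecord(copyDone, 0);

  // 3. Make later work on the source GPU's default stream wait for the copy,
  //    so `src` is not overwritten while the copy is still reading it.
  cudaSetDevice(srcDev);
  cudaStreamWaitEvent(0, copyDone, 0);

  cudaEventDestroy(srcReady);
  cudaEventDestroy(copyDone);
}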
Thanks guys. That looks really impressive! I just have two questions from a first glance:
I believe when I looked at the SASS while working on this (and on other […]). For really small tensors, note that if the tensor has less than 4096 […]. I will look into 1-16384ish element tensors, inspect the SASS again and do […].

On Thu, Mar 19, 2015 at 9:56 PM, Dominik Grewe notifications@github.com
Oh, and another comment about sort(). Some caveats about the implementation: it handles sort along any dimension (contiguous or not) as long as the size of each slice is 1-2048 elements, so a sort(2) on a CudaTensor(128, 128, 128, 128) will work. I set that aside to work on other things; it can be fixed sooner or later if important.
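For context, here is a minimal, hypothetical sketch of the one-block-per-slice idea (this is not the PR's actual kernel): one thread block loads one contiguous slice into shared memory, sorts it there, and writes it back, which is why the per-slice size is bounded by what a single block can hold. The odd-even transposition sort is just a simple stand-in for the real sorting network.

__global__ void sortSliceKernel(float* data, int sliceSize) {
  // One block sorts one contiguous slice of `sliceSize` elements
  // (sliceSize <= 2048 so the whole slice fits in shared memory).
  extern __shared__ float buf[];
  float* slice = data + blockIdx.x * sliceSize;

  for (int i = threadIdx.x; i < sliceSize; i += blockDim.x) {
    buf[i] = slice[i];
  }
  __syncthreads();

  // Odd-even transposition sort: sliceSize phases of disjoint compare/swaps.
  for (int phase = 0; phase < sliceSize; ++phase) {
    for (int i = threadIdx.x; 2 * i + (phase & 1) + 1 < sliceSize; i += blockDim.x) {
      int idx = 2 * i + (phase & 1);
      if (buf[idx] > buf[idx + 1]) {
        float tmp = buf[idx]; buf[idx] = buf[idx + 1]; buf[idx + 1] = tmp;
      }
    }
    __syncthreads();
  }

  for (int i = threadIdx.x; i < sliceSize; i += blockDim.x) {
    slice[i] = buf[i];
  }
}

// Launched as, e.g.:
//   sortSliceKernel<<<numSlices, 256, sliceSize * sizeof(float)>>>(d, sliceSize);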
*nb: pointwise functions allocate no memory except in the exceedingly weird case where you are writing into a tensor with multiple indices that map to the same element, say an unfold-ed tensor, which I argue should just be an error in Torch (unless you are updating just a single element). cunn had a test for this. Instead, I follow the old behavior: both arguments are made contiguous, the operation is performed, and then the result is written back into the overlapping-index tensor using the copy kernel again. The only reason you got the result you expected before is due to several properties: in the case where all source values are the same for a single output element, you get what you expect; in the case where the values differ, the result depends upon the implementation and execution order of the non-contiguous copy kernel.
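To make that concrete, here is a tiny hypothetical kernel (not cutorch code) showing why a pointwise write into aliased output indices is ill-defined on the GPU: when several logical indices map to the same storage element, the surviving value depends on which thread's store lands last, and CUDA does not order stores across threads.

// Pretend the output tensor has stride 0 along the applied dimension, so
// every logical index i aliases out[0]. The final value of out[0] is
// whichever in[i] + 1 happens to be written last -- an unordered race.
__global__ void addIntoAliasedOutput(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[0] = in[i] + 1.0f;
  }
}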
This PR only seems to have unit tests for non-contiguous multi-GPU copy. It should include tests for masked*, sort and most of the other use cases that were added.
Perf relative to the old Thrust implementation for 1-d contiguous tensors. apply1 here is add(1.0), apply2 is a:add(b), apply3 is torch.cmul(c, a, b). The new code is faster for large tensors but slower for small tensors, to a larger degree than desired (though everything is in the microsecond range, and typical noise is 1-5 us anyway). I'll work on perf for small tensors.
Perf relative to the old newContiguous/Thrust/copy-to-non-contiguous/free code for non-contiguous 2-d tensors with transpose(1, 2) applied to all arguments; same operations as above. (As an optimization, I could detect that for apply1 the layout is contiguous after some permutation, and that for apply2/3 all arguments have the same size/stride arrays and are contiguous after some permutation, and do the application in the reordered view; that would map these cases to the contiguous path. For now this forces the non-contiguous path for both old and new code.)
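For readers unfamiliar with the apply1/2/3 naming, the gist of the non-contiguous path is sketched below under illustrative names (this is not THC's actual API): every thread maps its linear element index to a storage offset through the tensor's sizes and strides, so transposed or otherwise strided tensors can be processed in place, without the old newContiguous/copy round trip.

#define MAX_DIMS 8

// Illustrative layout descriptor; THC's real structures differ.
struct TensorLayout {
  int sizes[MAX_DIMS];
  int strides[MAX_DIMS];
  int dims;
};

// Convert a dense linear index into a storage offset for an arbitrary layout.
__device__ int linearIndexToOffset(int linearIndex, const TensorLayout& t) {
  int offset = 0;
  for (int d = t.dims - 1; d >= 0; --d) {
    int coord = linearIndex % t.sizes[d];
    linearIndex /= t.sizes[d];
    offset += coord * t.strides[d];
  }
  return offset;
}

// apply2-style kernel, a = a + b, for possibly transposed layouts.
__global__ void apply2Add(float* a, TensorLayout aLayout,
                          const float* b, TensorLayout bLayout,
                          int totalElements) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < totalElements;
       i += gridDim.x * blockDim.x) {
    a[linearIndexToOffset(i, aLayout)] += b[linearIndexToOffset(i, bLayout)];
  }
}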
#ifndef TH_CUDA_TENSOR_SORT_INC
#define TH_CUDA_TENSOR_SORT_INC

#include "deeplearning/torch/cutorch/lib/THC/THCTensor.h"
Fix path.
const IndexType startLinearIndex =
  getStartLinearIndex<IndexType>(blockId);

if (blockId == getLastLinearBlockId<IndexType>()) {
for massive tensors with >536 million elements, virtual (with stride 0) or physical, this logic is wrong, fixing...
What's the status of this PR? Would be great to get this in.
Force-pushed from f004da6 to daf2c51.
Tomorrow. Writing tests.
added some tests and fixes for review. will squash the commits before merging.
-- contiguous, no result tensor, cuda mask
local x = torch.randn(n_row, n_col):float()
local mask = torch.DoubleTensor():rand(n_row*n_col):mul(2):floor():byte():resize(n_row,n_col)
Nit: This can be shortened to
torch.DoubleTensor(n_row, n_col):uniform():round():byte()
Okay, TensorCopy was implemented according to the docs, but the docs were apparently wrong, so I removed it. I wrote a simple unit test for sort to start with, but sort does not seem to agree with the CPU implementation in some cases. To not delay this PR any further, I'll factor Sort out of this PR and merge it tomorrow if everyone's okay. cc: @dominikgrewe
okay, ready for squash & merge now. unit-tested and good to go:
cutorch.withDevice(dstDevice, function() cutorch.synchronize() end)

local t2_max = t2:max()
assert(t2_max == 1, "bad copy, transposeSrc= " .. transposeSrc ..
Should this be tester:assert? There are a couple of other places that simply call assert.
Just one small comment on the testing code; otherwise this looks good to me! Thanks a lot for breaking up the monolithic THC.cu. That was overdue :)
…elect and maskedFill operations (and tests). Also adds generic Reduce and Apply kernels that can be reused.
adds Sort, revamps TensorMath, adds masked* operations
fixed the asserts and squashed. merged. thanks for reviewing.
Implements #70:
Adds TensorApply and TensorReduce kernels that work on both contiguous and non-contiguous tensors, and maintain equal performance for contiguous cases (see the reduce sketch below).
This PR hugely improves performance for non-contiguous tensors.
Implemented completely by @wickedfoo
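As a rough picture of what a generic reduce kernel does, here is a hedged sketch under the assumption of a 1-d contiguous float tensor; the names and the atomicAdd-based combine step are illustrative, not THC's actual TensorReduce.

// Each block reduces a grid-stride chunk of the input with a shared-memory
// tree reduction, then one atomicAdd per block combines the partial sums.
// Assumes blockDim.x is a power of two and compute capability >= 2.0.
__global__ void reduceSum(const float* in, float* out, int n) {
  extern __shared__ float partial[];

  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += gridDim.x * blockDim.x) {
    sum += in[i];
  }
  partial[threadIdx.x] = sum;
  __syncthreads();

  // Tree reduction within the block.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      partial[threadIdx.x] += partial[threadIdx.x + stride];
    }
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    atomicAdd(out, partial[0]);
  }
}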
Having some build issues, looking into it