Bug in WarpScan? #112
NVCC compiles the code twice, once for the host and once for the device, and __CUDA_ARCH__ is defined differently during those two passes: for the host it is undefined, and for the device it is whatever the compute capability is. I think the root of your problem is that you are using the WarpScan template instance as specialized on the host to specify the amount of dynamic shared memory for the kernel. My suggestion is to either:
(a) Use statically allocated shared memory instead of dynamic, or
(b) Instantiate the WarpScan template on the host using all three parameters, e.g., "WarpScan<int, 32, 350>". But you will have to make sure the last one (PTX_ARCH) actually matches the compile string.
Let me know if you have any problems after fixing your bug.
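A minimal sketch of option (b), assuming the code is built with `-arch=sm_35`; the kernel name, block layout, and sizes below are illustrative, not taken from the original report:

```c++
#include <cub/cub.cuh>

constexpr int WARPS_PER_BLOCK = 4;            // assumed block layout
// Pin PTX_ARCH explicitly (350 here); it must agree with -arch=sm_35,
// otherwise the host-side sizeof(TempStorage) may differ from the device's.
using WarpScanT = cub::WarpScan<int, 32, 350>;

__global__ void scan_kernel()
{
    // Dynamic shared memory: one TempStorage per warp.
    extern __shared__ WarpScanT::TempStorage temp_storage[];
    int warp_id = threadIdx.x / 32;
    int x = threadIdx.x;
    WarpScanT(temp_storage[warp_id]).InclusiveSum(x, x);
}

int main()
{
    // Size the dynamic allocation with the SAME specialization the kernel
    // uses, so both compilation passes agree on sizeof(TempStorage).
    size_t smem = WARPS_PER_BLOCK * sizeof(WarpScanT::TempStorage);
    scan_kernel<<<1, 32 * WARPS_PER_BLOCK, smem>>>();
    cudaDeviceSynchronize();
    return 0;
}
```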
Thanks @dumerrill for your answer! See this minimal example, in which I have removed the PTX_ARCH mismatch issue.

```c++
#include <cub/cub.cuh>
#include <cstdio>

constexpr int tpp = 2; // logical warp size (threads per logical warp)

template<int tpp>
__global__ void test(int N){
    using WarpScan = cub::WarpScan<int, tpp>;
    // One TempStorage per logical warp (here N/tpp == tpp == 2).
    __shared__ typename WarpScan::TempStorage temp_storage[tpp];
    int input = threadIdx.x;
    int inclusive_output = 0;
    int warp_aggregate = 0;
    int warp_id = threadIdx.x / tpp;
    __syncthreads();
    WarpScan(temp_storage[warp_id]).InclusiveSum(input, inclusive_output, warp_aggregate);
    __syncthreads();
    int lane = threadIdx.x % tpp;   // lane within the logical warp
    int delta = tpp - lane - 1;     // distance to the logical warp's last lane
    // The output is the expected one only if this line is uncommented:
    //warp_aggregate = __shfl_down(inclusive_output, delta, 32);
    printf("threadIdx.x: %d - input = %d - inclusive_output = %d - warp_aggregate = %d\n",
           threadIdx.x, input, inclusive_output, warp_aggregate);
}

int main(){
    int N = 4;
    test<tpp><<<1, N>>>(N);
    cudaDeviceSynchronize();
}
```
I would expect InclusiveSum to give inclusive_output = {0, 1, 2, 5} for threads 0-3, with warp_aggregate = 1 for the first logical warp (threads 0-1) and 5 for the second (threads 2-3).
But what I get (unless I uncomment the __shfl_down line, which fixes it) is warp_aggregate = 1 for all threads, which is not what I expected. But maybe this is how it is intended to behave!
I have been having similar issues with WarpScan for logical warp sizes < 32, using a reproducer similar to the one above. I think I have narrowed the issue down to the ShuffleIndex used to obtain the warp aggregate value, which seems consistent with what @RaulPPelaez is fixing with the shuffle down. In my test case, going into the ShuffleIndex every thread holds the expected value, but afterwards all threads have the value from the last thread of the first logical warp, not from their own.
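For illustration (this is not CUB's internal code), a broadcast of the last lane's value that stays within the logical warp can be written with the plain shuffle intrinsic, whose `width` argument makes the source-lane index relative to each logical-warp segment rather than to the physical warp:

```c++
// Sketch, assuming LOGICAL_WIDTH is a power of two that divides 32.
// Because of the width argument, source lane (LOGICAL_WIDTH - 1) refers to
// the last lane of the caller's own logical warp, not of the physical warp.
template<int LOGICAL_WIDTH>
__device__ int broadcast_last_lane(int value)
{
    // Pre-CUDA 9 intrinsic, matching the toolkit used in this thread;
    // on newer toolkits this would be __shfl_sync(0xffffffff, ...).
    return __shfl(value, LOGICAL_WIDTH - 1, LOGICAL_WIDTH);
}
```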
Additionally, if I replace line 553:
The issue was not setting up the shfl constant properly. Refactored the shfl scans and reductions to always treat lane_id as relative to the logical warp (not the physical one).
Thanks guys, you're right, we had a bug where we weren't setting the PTX shuffle constant properly, and it didn't show up in tests because we were only testing one subwarp instead of several. A fix is in master and is currently being QA'd for an imminent bugfix release. (As you guys probably know, you want to use CUB scans instead of writing your own via __shfl() because, by going down to PTX, we can leverage the predicate output from shfl to avoid additional instructions.)
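As a rough illustration of that last point (a sketch of the general technique, not CUB's actual source): the PTX shfl instruction can return a predicate saying whether the source lane was in range, so a scan step can guard its add with that predicate instead of computing a separate bounds check. The `shfl_c` operand packs a lane clamp and a segment mask that together define the logical-warp boundary, which is the constant the bug concerned.

```c++
// Illustrative only: one shuffle-up step of an inclusive scan, guarded by
// the shfl predicate. Valid for sm_30-sm_60 targets (CUDA 8 era, as in this
// thread); Volta+ requires the shfl.sync form with a member mask.
__device__ int scan_step_shfl_up(int x, int offset, int shfl_c)
{
    int result;
    asm volatile(
        "{\n"
        "  .reg .s32 r0;\n"
        "  .reg .pred p;\n"
        "  shfl.up.b32 r0|p, %1, %2, %3;\n"  // r0 = value from (lane - offset), p = source in range
        "  @p add.s32 r0, r0, %1;\n"          // add only when the source lane was valid
        "  mov.s32 %0, r0;\n"
        "}\n"
        : "=r"(result)
        : "r"(x), "r"(offset), "r"(shfl_c));
    return result;
}
```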
Fixed in v1.8.0
I am using WarpScan::InclusiveSum and I am noticing something unexpected.
When I specialize the WarpScan template for PTX_ARCH > 210, the warp_aggregate value is wrong, no matter the arch I compile for (I tried a GTX 980 and a 750 Ti) or the CUB version used (> 1.5.4). I have only tested with CUDA 8.0.
Check this sample:
With a logical warp size of 2 (tpp), I would expect the following output:
And that is indeed the output when the WarpScan template is specialized with PTX_ARCH <= 210 and/or the code is compiled with
nvcc -std=c++11 -arch=sm_20 WarpScan.cu
However, if I specialize for PTX_ARCH > 210 and/or compile with -arch=sm_52 (anything above 21, actually), I get the following result:
That is not only different but incorrect.
Also, the code won't even compile when the virtual warp size is 1 for arch <= 21, and the results are incorrect otherwise (I want tpp=1 because in that case the code should be equivalent to another kernel I want to reproduce and improve).
This is weird, because if anything, I would expect WarpScan to fail when compiled for an incorrect architecture (I run on sm_35 and sm_52 GPUs), but it is the other way around!
Diving into the code, I guess the behavior comes down to whether the shared-memory or the __shfl version of WarpScan is used.
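If that reading is right (assumed from the observed behavior, not verified against the sources), the two paths can be compared from user code simply by pinning PTX_ARCH:

```c++
#include <cub/cub.cuh>

constexpr int tpp = 2;
// Assumed mapping, inferred from the behavior above: PTX_ARCH <= 210 selects
// the shared-memory implementation, while PTX_ARCH >= 300 selects the
// __shfl-based one, which is where the wrong warp_aggregate appears.
using WarpScanSmem = cub::WarpScan<int, tpp, 210>;
using WarpScanShfl = cub::WarpScan<int, tpp, 350>;
```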
Am I doing something weird here?
Is there something I am not understanding about the behavior of this utility and this is expected?
Thanks!
EDIT:
Upon further testing, I am seeing that with arch > 210, warp_aggregate takes the value of the sum of the first logical warp. For example:
EDIT 2:
I managed to bypass the issue by adding the following snippet:
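The snippet itself is not shown above; a sketch of what such a bypass could look like, based on the commented-out __shfl_down line in the reproducer earlier in this thread (an assumption, not necessarily the exact code referred to here):

```c++
// Assumed reconstruction of the bypass: re-derive warp_aggregate by shuffling
// the last lane's inclusive sum down to every lane of the logical warp.
template<int tpp>
__device__ int recover_warp_aggregate(int inclusive_output)
{
    int lane  = threadIdx.x % tpp;   // lane within the logical warp
    int delta = tpp - lane - 1;      // distance to the logical warp's last lane
    // Pre-CUDA 9 intrinsic, as used elsewhere in this thread; on newer
    // toolkits this would be __shfl_down_sync(0xffffffff, ...).
    return __shfl_down(inclusive_output, delta, 32);
}
```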