
faster cluster, now does not requires to know number of modules #68

Merged

Conversation

@VinInn commented May 30, 2018

addresses #66 as well

@cmsbot commented May 30, 2018

A new Pull Request was created by @VinInn (Vincenzo Innocente) for CMSSW_10_2_X_Patatrack.

It involves the following packages:

RecoLocalTracker/SiPixelClusterizer

@cmsbot, @fwyzard can you please review it and eventually sign? Thanks.

cms-bot commands are listed here

@VinInn (Author) commented May 30, 2018

[innocent@vinzen0 src]$ nvprof ../test/slc7_amd64_gcc700/gpuClustering_t
==13715== NVPROF is profiling process 13715, command: ../test/slc7_amd64_gcc700/gpuClustering_t
created 175025 digis in 8956 clusters
CUDA countModules kernel launch with 2000 blocks of 256 threads
CUDA findModules kernel launch with 2000 blocks of 256 threads
found 897 Modules active
debug 0 4
found 8956 8956 clusters
==13715== Profiling application: ../test/slc7_amd64_gcc700/gpuClustering_t
==13715== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   89.81%  2.0436ms         1  2.0436ms  2.0436ms  2.0436ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, unsigned int*, int)
                    4.79%  108.96us         5  21.792us  1.1200us  52.706us  [CUDA memcpy DtoH]
                    4.40%  100.13us         5  20.026us     928ns  28.225us  [CUDA memcpy HtoD]
                    0.92%  21.025us         1  21.025us  21.025us  21.025us  gpuClustering::countModules(unsigned short const *, unsigned int*, int*, int)
                    0.08%  1.7280us         2     864ns     736ns     992ns  [CUDA memset]
      API calls:   97.28%  156.68ms         9  17.409ms  4.1290us  156.17ms  cudaMalloc
                    1.57%  2.5271ms        10  252.71us  10.710us  2.0620ms  cudaMemcpy
                    0.51%  816.14us        96  8.5010us     280ns  364.63us  cuDeviceGetAttribute
                    0.40%  636.78us         9  70.752us  5.5600us  141.66us  cudaFree
                    0.11%  169.62us         1  169.62us  169.62us  169.62us  cuDeviceTotalMem
                    0.05%  77.869us         1  77.869us  77.869us  77.869us  cuDeviceGetName
                    0.03%  54.060us        10  5.4060us     430ns  47.290us  cudaGetDevice
                    0.03%  51.349us         2  25.674us  14.759us  36.590us  cudaLaunchKernel
                    0.02%  33.740us         2  16.870us  5.7000us  28.040us  cudaMemset
                    0.00%  3.8800us         1  3.8800us  3.8800us  3.8800us  cuDeviceGetPCIBusId
                    0.00%  2.2100us         3     736ns     340ns  1.4900us  cuDeviceGetCount
                    0.00%  1.2900us         2     645ns     400ns     890ns  cuDeviceGet
                    0.00%     920ns         1     920ns     920ns     920ns  cudaGetDeviceCount
[innocent@vinzen0 src]$ nvprof $CMSSW_RELEASE_BASE/test/slc7_amd64_gcc700/gpuClustering_t
==13757== NVPROF is profiling process 13757, command: /data/vin/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_2_0_pre4_Patatrack/test/slc7_amd64_gcc700/gpuClustering_t
created 175025 digis in 8956 clusters
CUDA countModules kernel launch with 2000 blocks of 256 threads
found 897 Modules active
CUDA findModules kernel launch with 897 blocks of 256 threads
debug 0 4
found 8956 8956 clusters
==13757== Profiling application: /data/vin/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_2_0_pre4_Patatrack/test/slc7_amd64_gcc700/gpuClustering_t
==13757== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   95.32%  4.7209ms         1  4.7209ms  4.7209ms  4.7209ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, unsigned int*, int)
                    2.21%  109.29us         5  21.857us  1.1210us  52.835us  [CUDA memcpy DtoH]
                    2.02%  99.973us         5  19.994us     896ns  28.290us  [CUDA memcpy HtoD]
                    0.42%  20.929us         1  20.929us  20.929us  20.929us  gpuClustering::countModules(unsigned short const *, unsigned int*, int*, int)
                    0.03%  1.6960us         2     848ns     704ns     992ns  [CUDA memset]
      API calls:   95.62%  160.72ms         9  17.857ms  6.8100us  160.02ms  cudaMalloc
                    3.14%  5.2846ms        10  528.46us  13.630us  4.8848ms  cudaMemcpy
                    0.49%  825.37us        96  8.5970us     270ns  369.65us  cuDeviceGetAttribute
                    0.46%  781.52us         9  86.835us  9.3700us  183.67us  cudaFree
                    0.10%  170.69us         1  170.69us  170.69us  170.69us  cuDeviceTotalMem
                    0.05%  90.320us         2  45.160us  22.980us  67.340us  cudaLaunchKernel
                    0.05%  90.310us         1  90.310us  90.310us  90.310us  cuDeviceGetName
                    0.04%  60.399us        10  6.0390us     710ns  49.739us  cudaGetDevice
                    0.03%  46.109us         2  23.054us  8.1200us  37.989us  cudaMemset
                    0.00%  3.8100us         1  3.8100us  3.8100us  3.8100us  cuDeviceGetPCIBusId
                    0.00%  2.3000us         3     766ns     270ns  1.6000us  cuDeviceGetCount
                    0.00%  1.2600us         2     630ns     290ns     970ns  cuDeviceGet
                    0.00%     820ns         1     820ns     820ns     820ns  cudaGetDeviceCount
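
The relevant difference between the two runs is in the launch lines: the release build first copies the module count back to the host and launches the clustering kernel with 897 blocks (one per active module), while this PR launches a fixed grid of MaxNumModules (2000) blocks, so the host no longer needs to know the number of modules before the launch, and findClus itself drops from about 4.7 ms to about 2.0 ms. Below is a minimal, self-contained sketch of the launch-configuration change only; the kernels are toy stand-ins with simplified argument lists, not the real gpuClustering code, and MaxNumModules = 2000 is assumed from the 2000-block launch in the log above.

// Sketch only: toy kernels illustrating the launch-configuration change,
// not the actual gpuClustering::countModules / findClus.
#include <cuda_runtime.h>

constexpr unsigned int MaxNumModules = 2000;  // assumed upper bound on active modules

// Toy countModules: counts module boundaries into moduleStart[0] on the device.
__global__ void countModules(const unsigned short* id, unsigned int* moduleStart, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i == 0 || id[i] != id[i - 1])
    atomicAdd(moduleStart, 1u);
}

// Toy findClus: one block per *possible* module; blocks beyond the number of
// active modules have nothing to do and return immediately.
__global__ void findClus(const unsigned int* moduleStart, int n) {
  if (blockIdx.x >= moduleStart[0]) return;
  // ... per-module clustering would go here ...
}

int main() {
  const int n = 175025;  // number of digis, as in the log above
  unsigned short* d_id = nullptr;
  unsigned int* d_moduleStart = nullptr;
  cudaMalloc((void**)&d_id, n * sizeof(unsigned short));
  cudaMalloc((void**)&d_moduleStart, sizeof(unsigned int));
  cudaMemset(d_id, 0, n * sizeof(unsigned short));
  cudaMemset(d_moduleStart, 0, sizeof(unsigned int));

  const int threadsPerBlock = 256;
  const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
  countModules<<<blocksPerGrid, threadsPerBlock>>>(d_id, d_moduleStart, n);

  // Old pattern: blocking copy of the module count, then launch nModules blocks.
  //   unsigned int nModules = 0;
  //   cudaMemcpy(&nModules, d_moduleStart, sizeof(unsigned int), cudaMemcpyDeviceToHost);
  //   findClus<<<nModules, threadsPerBlock>>>(d_moduleStart, n);

  // New pattern: fixed grid, no host round trip needed before the launch.
  findClus<<<MaxNumModules, threadsPerBlock>>>(d_moduleStart, n);

  cudaDeviceSynchronize();
  cudaFree(d_moduleStart);
  cudaFree(d_id);
  return 0;
}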

@VinInn changed the title from "faster cluster, now do not requires to know number of modules" to "faster cluster, now does not requires to know number of modules" May 30, 2018
@makortel

Thanks!

@fwyzard left a comment

auto niter = (msize-first)/blockDim.x;
assert(niter<10);
for (int i=0; i<niter+1; ++i) jmax[i]=msize;
// for (int k=0; k<niter+1; ++k) jmin[k]=first+k*blockDim.x+1;
for (int k=0; k<niter+1; ++k) jmax[k]=msize;

while (go) {

I know it was like this before, but I'm wondering: is it actually safe to use __syncthreads() inside a while loop?
Isn't it a problem if some threads run a different number of loops, or take different branches (i.e. continue)?

@VinInn (Author):

cuda-memcheck did not detect any race

@VinInn (Author):

All threads go through the while loop together.
The continue is inside the for loop, and there is no __syncthreads() in the for loop.
So there is no divergence apart from the for loop.
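
A minimal sketch of the pattern being described (a hypothetical kernel, not the PR's findClus): __syncthreads() is only safe if every thread of the block reaches it the same number of times, so the while condition is driven by a shared flag that is identical for all threads, and any continue stays inside an inner for loop that contains no barrier.

// Sketch of the barrier pattern discussed above, not the actual clustering kernel.
__global__ void uniformWhileLoop(int* data, int n) {
  __shared__ bool go;                // one flag shared by the whole block
  __shared__ int changed;

  if (threadIdx.x == 0) go = true;
  __syncthreads();                   // everyone sees the initial value of 'go'

  while (go) {                       // 'go' is identical for all threads, so the
                                     // whole block runs the same number of iterations
    if (threadIdx.x == 0) changed = 0;
    __syncthreads();

    // Divergence is confined to this inner loop: 'continue' skips an iteration,
    // but there is no __syncthreads() anywhere inside the loop body.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
      if (data[i] < 0) continue;
      if (data[i] > 0) { data[i] -= 1; atomicAdd(&changed, 1); }
    }
    __syncthreads();                 // reached by every thread on every while iteration

    if (threadIdx.x == 0) go = (changed != 0);
    __syncthreads();                 // everyone sees the updated 'go' before re-testing it
  }
}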

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

// if (h_clus[i]==i) seeds.push_back(i); // only if no renumbering
}

std::cout << "found " << std::accumulate(nclus,nclus+MaxNumModules,0) << ' ' << clids.size() << " clusters" << std::endl;

remove?

d_clus.get(),
d_debug.get(),
n
);

cuda::memory::copy(&nModules,d_moduleStart.get(),sizeof(uint32_t));
std::cout << "found " << nModules << " Modules active" << std::endl;

remove?

I don't mind keeping the cout since this is just a test file, not one used in production.

@VinInn (Author):

it's a test!

d_clus.get(),
d_debug.get(),
n
);

cuda::memory::copy(&nModules,d_moduleStart.get(),sizeof(uint32_t));

is this async?

threadsPerBlock = 256;
blocksPerGrid = MaxNumModules; //nModules;



std::cout

remove?

// Launch CUDA Kernels


int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
std::cout

remove?

@VinInn (Author):

this is a test; it is normal to print something!

uint32_t nModules=0;
cuda::memory::copy(d_moduleStart.get(),&nModules,sizeof(uint32_t));


cuda::memory::copy(d_id.get(), h_id.get(), size16);
cuda::memory::copy(d_x.get(), h_x.get(), size16);
cuda::memory::copy(d_y.get(), h_y.get(), size16);
cuda::memory::copy(d_adc.get(), h_adc.get(), size8);

are these copies async?

cuda::memory::copy(d_id.get(), h_id.get(), size16);
cuda::memory::copy(d_x.get(), h_x.get(), size16);
cuda::memory::copy(d_y.get(), h_y.get(), size16);
cuda::memory::copy(d_adc.get(), h_adc.get(), size8);
cuda::memory::device::zero(d_debug.get(),size32);
cuda::memory::device::zero(d_debug.get(),size32);

Is this a CUDA memset to zero? If yes, is there an async version that takes a stream?
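
For reference on the async questions above: with the plain CUDA runtime API the stream-ordered counterparts are cudaMemcpyAsync and cudaMemsetAsync (the host buffer should be pinned for the copy to be truly asynchronous). A minimal sketch using the plain runtime calls rather than the cuda::memory wrappers used in the test:

// Sketch: stream-ordered copy and memset with the plain CUDA runtime API.
#include <cuda_runtime.h>

int main() {
  const size_t n = 1024;
  const size_t size16 = n * sizeof(unsigned short);
  const size_t size32 = n * sizeof(int);

  unsigned short* h_id = nullptr;
  unsigned short* d_id = nullptr;
  int* d_debug = nullptr;
  cudaMallocHost((void**)&h_id, size16);  // pinned host memory, needed for async copies
  cudaMalloc((void**)&d_id, size16);
  cudaMalloc((void**)&d_debug, size32);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Asynchronous HtoD copy, ordered on 'stream'.
  cudaMemcpyAsync(d_id, h_id, size16, cudaMemcpyHostToDevice, stream);

  // Asynchronous memset to zero on the same stream
  // (the stream-ordered equivalent of zeroing d_debug in the test).
  cudaMemsetAsync(d_debug, 0, size32, stream);

  // Kernels queued on 'stream' would run after the copy and memset complete.
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(d_debug);
  cudaFree(d_id);
  cudaFreeHost(h_id);
  return 0;
}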

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

@fwyzard left a comment

I think we can leave the synchronous memcpy/memset and the cout in a .../test file.

fwyzard pushed a commit that referenced this pull request Feb 12, 2021
Co-authored-by: Anders <aryd@cern.ch>