
faster cluster, now does not requires to know number of modules #68

Merged

Conversation

@VinInn commented May 30, 2018

addresses #66 as well

@cmsbot commented May 30, 2018

A new Pull Request was created by @VinInn (Vincenzo Innocente) for CMSSW_10_2_X_Patatrack.

It involves the following packages:

RecoLocalTracker/SiPixelClusterizer

@cmsbot, @fwyzard can you please review it and eventually sign? Thanks.

cms-bot commands are listed here

@VinInn (Author) commented May 30, 2018

[innocent@vinzen0 src]$ nvprof ../test/slc7_amd64_gcc700/gpuClustering_t
==13715== NVPROF is profiling process 13715, command: ../test/slc7_amd64_gcc700/gpuClustering_t
created 175025 digis in 8956 clusters
CUDA countModules kernel launch with 2000 blocks of 256 threads
CUDA findModules kernel launch with 2000 blocks of 256 threads
found 897 Modules active
debug 0 4
found 8956 8956 clusters
==13715== Profiling application: ../test/slc7_amd64_gcc700/gpuClustering_t
==13715== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   89.81%  2.0436ms         1  2.0436ms  2.0436ms  2.0436ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, unsigned int*, int)
                    4.79%  108.96us         5  21.792us  1.1200us  52.706us  [CUDA memcpy DtoH]
                    4.40%  100.13us         5  20.026us     928ns  28.225us  [CUDA memcpy HtoD]
                    0.92%  21.025us         1  21.025us  21.025us  21.025us  gpuClustering::countModules(unsigned short const *, unsigned int*, int*, int)
                    0.08%  1.7280us         2     864ns     736ns     992ns  [CUDA memset]
      API calls:   97.28%  156.68ms         9  17.409ms  4.1290us  156.17ms  cudaMalloc
                    1.57%  2.5271ms        10  252.71us  10.710us  2.0620ms  cudaMemcpy
                    0.51%  816.14us        96  8.5010us     280ns  364.63us  cuDeviceGetAttribute
                    0.40%  636.78us         9  70.752us  5.5600us  141.66us  cudaFree
                    0.11%  169.62us         1  169.62us  169.62us  169.62us  cuDeviceTotalMem
                    0.05%  77.869us         1  77.869us  77.869us  77.869us  cuDeviceGetName
                    0.03%  54.060us        10  5.4060us     430ns  47.290us  cudaGetDevice
                    0.03%  51.349us         2  25.674us  14.759us  36.590us  cudaLaunchKernel
                    0.02%  33.740us         2  16.870us  5.7000us  28.040us  cudaMemset
                    0.00%  3.8800us         1  3.8800us  3.8800us  3.8800us  cuDeviceGetPCIBusId
                    0.00%  2.2100us         3     736ns     340ns  1.4900us  cuDeviceGetCount
                    0.00%  1.2900us         2     645ns     400ns     890ns  cuDeviceGet
                    0.00%     920ns         1     920ns     920ns     920ns  cudaGetDeviceCount
[innocent@vinzen0 src]$ nvprof $CMSSW_RELEASE_BASE/test/slc7_amd64_gcc700/gpuClustering_t
==13757== NVPROF is profiling process 13757, command: /data/vin/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_2_0_pre4_Patatrack/test/slc7_amd64_gcc700/gpuClustering_t
created 175025 digis in 8956 clusters
CUDA countModules kernel launch with 2000 blocks of 256 threads
found 897 Modules active
CUDA findModules kernel launch with 897 blocks of 256 threads
debug 0 4
found 8956 8956 clusters
==13757== Profiling application: /data/vin/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_2_0_pre4_Patatrack/test/slc7_amd64_gcc700/gpuClustering_t
==13757== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   95.32%  4.7209ms         1  4.7209ms  4.7209ms  4.7209ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, unsigned int*, int)
                    2.21%  109.29us         5  21.857us  1.1210us  52.835us  [CUDA memcpy DtoH]
                    2.02%  99.973us         5  19.994us     896ns  28.290us  [CUDA memcpy HtoD]
                    0.42%  20.929us         1  20.929us  20.929us  20.929us  gpuClustering::countModules(unsigned short const *, unsigned int*, int*, int)
                    0.03%  1.6960us         2     848ns     704ns     992ns  [CUDA memset]
      API calls:   95.62%  160.72ms         9  17.857ms  6.8100us  160.02ms  cudaMalloc
                    3.14%  5.2846ms        10  528.46us  13.630us  4.8848ms  cudaMemcpy
                    0.49%  825.37us        96  8.5970us     270ns  369.65us  cuDeviceGetAttribute
                    0.46%  781.52us         9  86.835us  9.3700us  183.67us  cudaFree
                    0.10%  170.69us         1  170.69us  170.69us  170.69us  cuDeviceTotalMem
                    0.05%  90.320us         2  45.160us  22.980us  67.340us  cudaLaunchKernel
                    0.05%  90.310us         1  90.310us  90.310us  90.310us  cuDeviceGetName
                    0.04%  60.399us        10  6.0390us     710ns  49.739us  cudaGetDevice
                    0.03%  46.109us         2  23.054us  8.1200us  37.989us  cudaMemset
                    0.00%  3.8100us         1  3.8100us  3.8100us  3.8100us  cuDeviceGetPCIBusId
                    0.00%  2.3000us         3     766ns     270ns  1.6000us  cuDeviceGetCount
                    0.00%  1.2600us         2     630ns     290ns     970ns  cuDeviceGet
                    0.00%     820ns         1     820ns     820ns     820ns  cudaGetDeviceCount
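
The relevant difference between the two runs is in the launch lines: the release build first copies the module count back to the host and launches the clustering kernel with 897 blocks (one per active module), while this PR launches a fixed grid of MaxNumModules (2000) blocks, so the host no longer needs to know the number of modules before the launch, and findClus itself drops from about 4.7 ms to about 2.0 ms. Below is a minimal, self-contained sketch of the launch-configuration change only; the kernels are toy stand-ins with simplified argument lists, not the real gpuClustering code, and MaxNumModules = 2000 is assumed from the 2000-block launch in the log above.

// Sketch only: toy kernels illustrating the launch-configuration change,
// not the actual gpuClustering::countModules / findClus.
#include <cuda_runtime.h>

constexpr unsigned int MaxNumModules = 2000;  // assumed upper bound on active modules

// Toy countModules: counts module boundaries into moduleStart[0] on the device.
__global__ void countModules(const unsigned short* id, unsigned int* moduleStart, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i == 0 || id[i] != id[i - 1])
    atomicAdd(moduleStart, 1u);
}

// Toy findClus: one block per *possible* module; blocks beyond the number of
// active modules have nothing to do and return immediately.
__global__ void findClus(const unsigned int* moduleStart, int n) {
  if (blockIdx.x >= moduleStart[0]) return;
  // ... per-module clustering would go here ...
}

int main() {
  const int n = 175025;  // number of digis, as in the log above
  unsigned short* d_id = nullptr;
  unsigned int* d_moduleStart = nullptr;
  cudaMalloc((void**)&d_id, n * sizeof(unsigned short));
  cudaMalloc((void**)&d_moduleStart, sizeof(unsigned int));
  cudaMemset(d_id, 0, n * sizeof(unsigned short));
  cudaMemset(d_moduleStart, 0, sizeof(unsigned int));

  const int threadsPerBlock = 256;
  const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
  countModules<<<blocksPerGrid, threadsPerBlock>>>(d_id, d_moduleStart, n);

  // Old pattern: blocking copy of the module count, then launch nModules blocks.
  //   unsigned int nModules = 0;
  //   cudaMemcpy(&nModules, d_moduleStart, sizeof(unsigned int), cudaMemcpyDeviceToHost);
  //   findClus<<<nModules, threadsPerBlock>>>(d_moduleStart, n);

  // New pattern: fixed grid, no host round trip needed before the launch.
  findClus<<<MaxNumModules, threadsPerBlock>>>(d_moduleStart, n);

  cudaDeviceSynchronize();
  cudaFree(d_moduleStart);
  cudaFree(d_id);
  return 0;
}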

@VinInn changed the title from "faster cluster, now do not requires to know number of modules" to "faster cluster, now does not requires to know number of modules" May 30, 2018
@makortel

Thanks!

@fwyzard left a comment

auto niter = (msize-first)/blockDim.x;
assert(niter<10);
for (int i=0; i<niter+1; ++i) jmax[i]=msize;
// for (int k=0; k<niter+1; ++k) jmin[k]=first+k*blockDim.x+1;
for (int k=0; k<niter+1; ++k) jmax[k]=msize;

while (go) {

I know it was like this before, but I'm wondering: is it actually safe to use __syncthreads() inside a while loop?
Isn't it a problem if some threads run a different number of loops, or take different branches (i.e. continue)?

@VinInn (Author):

cuda-memcheck did not detect any race

@VinInn (Author):

All threads go through the while loop together.
The continue is inside the for loop, and there is no __syncthreads() in the for loop.
So there is no divergence apart from the for loop.
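
A minimal sketch of the pattern being described (a hypothetical kernel, not the PR's findClus): __syncthreads() is only safe if every thread of the block reaches it the same number of times, so the while condition is driven by a shared flag that is identical for all threads, and any continue stays inside an inner for loop that contains no barrier.

// Sketch of the barrier pattern discussed above, not the actual clustering kernel.
__global__ void uniformWhileLoop(int* data, int n) {
  __shared__ bool go;                // one flag shared by the whole block
  __shared__ int changed;

  if (threadIdx.x == 0) go = true;
  __syncthreads();                   // everyone sees the initial value of 'go'

  while (go) {                       // 'go' is identical for all threads, so the
                                     // whole block runs the same number of iterations
    if (threadIdx.x == 0) changed = 0;
    __syncthreads();

    // Divergence is confined to this inner loop: 'continue' skips an iteration,
    // but there is no __syncthreads() anywhere inside the loop body.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
      if (data[i] < 0) continue;
      if (data[i] > 0) { data[i] -= 1; atomicAdd(&changed, 1); }
    }
    __syncthreads();                 // reached by every thread on every while iteration

    if (threadIdx.x == 0) go = (changed != 0);
    __syncthreads();                 // everyone sees the updated 'go' before re-testing it
  }
}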

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

// if (h_clus[i]==i) seeds.push_back(i); // only if no renumbering
}

std::cout << "found " << std::accumulate(nclus,nclus+MaxNumModules,0) << ' ' << clids.size() << " clusters" << std::endl;

remove?

d_clus.get(),
d_debug.get(),
n
);

cuda::memory::copy(&nModules,d_moduleStart.get(),sizeof(uint32_t));
std::cout << "found " << nModules << " Modules active" << std::endl;

remove?

I don't mind keeping the cout since this is just a test file, not one used in production.

@VinInn (Author):

it's a test!

d_clus.get(),
d_debug.get(),
n
);

cuda::memory::copy(&nModules,d_moduleStart.get(),sizeof(uint32_t));

is this async?

threadsPerBlock = 256;
blocksPerGrid = MaxNumModules; //nModules;



std::cout

remove?

// Launch CUDA Kernels


int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
std::cout

remove?

@VinInn (Author):

this is a test; it is normal to print something!

uint32_t nModules=0;
cuda::memory::copy(d_moduleStart.get(),&nModules,sizeof(uint32_t));


cuda::memory::copy(d_id.get(), h_id.get(), size16);
cuda::memory::copy(d_x.get(), h_x.get(), size16);
cuda::memory::copy(d_y.get(), h_y.get(), size16);
cuda::memory::copy(d_adc.get(), h_adc.get(), size8);

are these copies async?

cuda::memory::copy(d_id.get(), h_id.get(), size16);
cuda::memory::copy(d_x.get(), h_x.get(), size16);
cuda::memory::copy(d_y.get(), h_y.get(), size16);
cuda::memory::copy(d_adc.get(), h_adc.get(), size8);
cuda::memory::device::zero(d_debug.get(),size32);
cuda::memory::device::zero(d_debug.get(),size32);

Is this a CUDA memset to zero? If yes, is there an async version that takes a stream?
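
For reference on the async questions above: with the plain CUDA runtime API the stream-ordered counterparts are cudaMemcpyAsync and cudaMemsetAsync (the host buffer should be pinned for the copy to be truly asynchronous). A minimal sketch using the plain runtime calls rather than the cuda::memory wrappers used in the test:

// Sketch: stream-ordered copy and memset with the plain CUDA runtime API.
#include <cuda_runtime.h>

int main() {
  const size_t n = 1024;
  const size_t size16 = n * sizeof(unsigned short);
  const size_t size32 = n * sizeof(int);

  unsigned short* h_id = nullptr;
  unsigned short* d_id = nullptr;
  int* d_debug = nullptr;
  cudaMallocHost((void**)&h_id, size16);  // pinned host memory, needed for async copies
  cudaMalloc((void**)&d_id, size16);
  cudaMalloc((void**)&d_debug, size32);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Asynchronous HtoD copy, ordered on 'stream'.
  cudaMemcpyAsync(d_id, h_id, size16, cudaMemcpyHostToDevice, stream);

  // Asynchronous memset to zero on the same stream
  // (the stream-ordered equivalent of zeroing d_debug in the test).
  cudaMemsetAsync(d_debug, 0, size32, stream);

  // Kernels queued on 'stream' would run after the copy and memset complete.
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaFree(d_debug);
  cudaFree(d_id);
  cudaFreeHost(h_id);
  return 0;
}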

@cmsbot commented Jun 4, 2018

Pull request #68 was updated. @cmsbot, @fwyzard can you please check and sign again.

@fwyzard left a comment

I think we can leave the synchronous memcpy/memset and the cout in a .../test file.

fwyzard pushed a commit that referenced this pull request Feb 12, 2021
Co-authored-by: Anders <aryd@cern.ch>