
Use cooperative groups to populate Associations (Histograms) in Pixel Patatrack #35713

Open · wants to merge 33 commits into master
Conversation

@VinInn (Contributor) commented Oct 18, 2021

In this PR I wish to share code that uses cooperative groups to reduce the number of kernels used to populate "Histograms" (actually OneToMany Associations) in Patatrack.

In unit tests (single thread) the gain in speed is noticeable (even in just the prefix scan).
In standard multi-thread, multi-stream workflows a loss in throughput can easily be observed if the maximum number of blocks is allocated. Some fine tuning of the number of blocks allocated to each kernel (even just one block?) makes this PR at least as fast as the standard multi-kernel implementation.

More comments inline.

The code is currently "configured" to run with cooperative groups; of course the actual PR can be merged with the standard multi-kernel implementation as the default.

namespace cms {
namespace cuda {

template <template <CountOrFill> typename Func, typename Histo, typename... Args>
VinInn (author) commented on the diff:

This is not used (yet?). It may make the syntax more complex, not simpler.

@cmsbuild commented:
+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35713/26023

  • This PR adds an extra 152KB to the repository

  • There are other open pull requests which might conflict with the changes you have proposed

@cmsbuild commented Oct 18, 2021

A new Pull Request was created by @VinInn (Vincenzo Innocente) for master.

It involves the following packages:

  • HeterogeneousCore/CUDAUtilities (heterogeneous)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoPixelVertexing/PixelTriplets (reconstruction)
  • RecoPixelVertexing/PixelVertexFinding (reconstruction)

@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @dgulhan, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

auto kernel = fillManyFromVectorCoopKernel<Histo, T>;
auto nblocks = (totSize + nthreads - 1) / nthreads;
assert(nblocks > 0);
auto nOnes = view.size();
VinInn (author) commented on the diff:

OK, a huge stack of boilerplate. It could be partially encapsulated in a "launch" interface as in launch.h.

A contributor replied:

If you want to give it a try, there is launch_cooperative(...) in launch.h.
I don't think I've ever tested it, though.

template <typename T>
inline T __ldg(T const* x) {
  return *x;
}

namespace cooperative_groups {
VinInn (author) commented on the diff:

from @fwyzard's contribution to patatrack-alone

#define GET_COOP_RED_FACT_FROM_ENV

// to drive performance assessment by envvar
#ifdef GET_COOP_RED_FACT_FROM_ENV
VinInn (author) commented on the diff:

Makes life easy. Not supposed to be used in production.

#include <cstdlib>

template <typename F>
inline int maxCoopBlocks(F kernel, int nthreads, int shmem, int device, int redFact = 10) {
VinInn (author) commented on the diff:
To be moved to CUDAService?
MUST be called at most once per job (per device? per kernel?)

A contributor replied:

Does CUDA itself require that it must be called at most once per job, or calling it many times would be slow (either by itself or it causes synchronization)?

Either way, a good question. Probably the best place to cache the values would be in CUDAService. Can the number of threads per block and size of shared memory per block vary between events (in general)?

VinInn (author) replied:

I checked: major slowdown

A contributor replied:

This would go well with a more general development I've been thinking about for a while (and that Abdulla may work on, if he comes to CERN in January): making the launch configuration of each kernel configurable, with a common interface.

@@ -183,6 +184,52 @@ namespace cms {
co[i] += psum[k];
}
}

template <typename T>
__device__ void coopBlockPrefixScan(T const* ici, T* ico, int32_t size, T* ipsum) {
VinInn (author) commented on the diff:

This is really faster than the above (at least if all required blocks are available)

int maxBlocks = maxCoopBlocks(populate, nThreads, 0, 0, 0);
std::cout << "max number of blocks is " << maxBlocks << std::endl;
auto ncoopblocks = std::min(nBlocks, maxBlocks);
auto a1 = v_d.get();
VinInn (author) commented on the diff:

one cannot get a pointer to the return value of .get()

auto env = getenv("COOP_RED_FACT");
int redFactFromEnv = env ? atoi(env) : 0;
if (redFactFromEnv != 0)
redFact = redFactFromEnv;
VinInn (author) commented on the diff:

This is a "global reduction factor" to reduce the number of blocks required to launch a cooperative kernel.
Maybe it should be tuned kernel by kernel: a bit of a mess...

A contributor replied:

I don't understand this: why a "reduction factor"?
If it needs to be tuned kernel by kernel, the effect is the same as setting a hard limit on the number of blocks.

VinInn (author) replied:

Because I hope there will be no need to tune kernel by kernel to get reasonable performance for any kind of workflow and event size/type.

A contributor replied:

By the way, on the T4 and V100, what is the maximum number of blocks reported by CUDA?

using View = caConstants::TupleMultiplicity::View;
View view = {tupleMultiplicity_d, nullptr, nullptr, -1, -1};

int blockSize = 128;
VinInn (author) commented on the diff:

Duplicated boilerplate.
The effort of factorization may be wasted if one decides to get rid of the TupleMultiplicity container and filter on multiplicity in the fit routine...

@VinInn (author) commented Oct 18, 2021

@cmsbuild , please test

cmsbuild commented Nov 6, 2023

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7ae3e5/35635/summary.html
COMMIT: 418d0c4
CMSSW: CMSSW_13_3_X_2023-11-06-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35713/35635/install.sh to create a dev area with all the needed externals and cmssw changes.

This pull request cannot be automatically merged, could you please rebase it?
You can see the log for git cms-merge-topic here: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7ae3e5/35635/git-merge-result

cmsbuild commented Feb 6, 2024

Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 6, 2024
@smuzaffar commented:
ping

@cmsbuild cmsbuild modified the milestones: CMSSW_14_0_X, CMSSW_14_1_X Feb 12, 2024
@cmsbuild commented:
Milestone for this pull request has been moved to CMSSW_14_2_X. Please open a backport if it should also go in to CMSSW_14_1_X.

@antoniovilela commented:
ping (to make bot change milestone)

@cmsbuild cmsbuild modified the milestones: CMSSW_14_1_X, CMSSW_14_2_X Sep 3, 2024
10 participants