Replace CUDA API wrapper memory operations with native CUDA calls #395
Conversation
@@ -4,5 +4,5 @@
 BeamSpotCUDA::BeamSpotCUDA(Data const* data_h, cuda::stream_t<>& stream) {
   data_d_ = cudautils::make_device_unique<Data>(stream);
-  cuda::memory::async::copy(data_d_.get(), data_h, sizeof(Data), stream.id());
+  cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream.id());
We've typically used cudaMemcpyDefault elsewhere (but I'm not against denoting the direction explicitly).
So could I leave the cudaMemcpy with the direction defined?
> So could I leave the cudaMemcpy with the direction defined?

I'm fine with that.
Could you
- fix the spurious lines
- fix the types used in the copies
- not add/remove whitespace and empty lines (unless it is done on purpose)
?
Then, one thing I forgot to ask you earlier: could you wrap every call to cudaMemcpy(...), cudaMemcpyAsync(...), cudaMemset(...), and cudaMemsetAsync(...) in a call to cudaCheck()?
For example
cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream);
should become
cudaCheck(cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream));
To make it available, you may need to add
#include "HeterogeneousCore/CUDAUtilities/interface/cudaCheck.h"
if it was not already there.
-  cuda::memory::copy(nn, LOC_ONGPU(ndof), nv * sizeof(int32_t));
-  cuda::memory::copy(chi2, LOC_ONGPU(chi2), nv * sizeof(float));
+  cudaMemcpy(&nv, LOC_ONGPU(nvFinal), sizeof(uint32_t), cudaMemcpyDeviceToHost);
+  cudaMemcpy(nn, LOC_ONGPU(ndof), nv * sizeof(uint32_t), cudaMemcpyDeviceToHost);
Here uint32_t was originally int32_t.
I verified: only cudaMemcpy(nn, LOC_ONGPU(ndof), nv * sizeof(uint32_t), cudaMemcpyDeviceToHost); was originally int32_t.
Validation summary
Reference release CMSSW_11_0_0_pre7 at 411b633
Validation plots:
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW
Throughput plots: /EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and
@VinInn could you have a look at this PR? The changes should be only technical (moving from the
while there doesn't seem to be any change in the overall tracks:
 }

 template <typename T>
 inline void copyAsync(cudautils::host::unique_ptr<T>& dst,
                       const cudautils::device::unique_ptr<T>& src,
                       cudaStream_t stream) {
   static_assert(std::is_array<T>::value == false, "For array types, use the other overload with the size parameter");
-  cuda::memory::async::copy(dst.get(), src.get(), sizeof(T), stream);
+  cudaCheck(cudaMemcpyAsync(dst.get(), src.get(), sizeof(T), cudaMemcpyHostToDevice, stream));
this is device2host
And "Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior." (*), so specifying the direction explicitly is actually harmful?
Indeed. I think we agreed to remove all explicit directions.
> Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

I thought it was supposed to crash...
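Given that quote from the runtime documentation, one way to avoid any risk of a mislabeled direction is the cudaMemcpyDefault tag mentioned earlier in the thread, which asks the runtime to infer the direction from the pointers themselves (this requires unified virtual addressing). The sketch below is an assumption about how the overload above could look, not code from this PR:

```cuda
// Hedged sketch (not from the PR): the direction tag is inferred by the
// runtime from dst and src, so a device-to-host copy cannot accidentally
// be labeled cudaMemcpyHostToDevice.
template <typename T>
inline void copyAsync(cudautils::host::unique_ptr<T>& dst,
                      const cudautils::device::unique_ptr<T>& src,
                      cudaStream_t stream) {
  static_assert(std::is_array<T>::value == false,
                "For array types, use the other overload with the size parameter");
  cudaCheck(cudaMemcpyAsync(dst.get(), src.get(), sizeof(T), cudaMemcpyDefault, stream));
}
```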
 }

 template <typename T>
 inline void copyAsync(cudautils::host::unique_ptr<T[]>& dst,
                       const cudautils::device::unique_ptr<T[]>& src,
                       size_t nelements,
                       cudaStream_t stream) {
-  cuda::memory::async::copy(dst.get(), src.get(), nelements * sizeof(T), stream);
+  cudaCheck(cudaMemcpyAsync(dst.get(), src.get(), nelements * sizeof(T), cudaMemcpyHostToDevice, stream));
ditto
Validation summary
Reference release CMSSW_11_0_0_pre7 at 411b633
Validation plots:
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
- /RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW
Throughput plots: /EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and
OK, now it looks better.
and all the others show identical or almost identical results.
PR description
This PR is part of #386; it replaces
- cuda::memory::copy() with cudaMemcpy(),
- cuda::memory::async::copy() with cudaMemcpyAsync(),
- cuda::memory::zero() and cuda::memory::set() with cudaMemset(),
- cuda::memory::async::zero() and cuda::memory::async::set() with cudaMemsetAsync().
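To illustrate the four replacements above side by side, here is a hedged before/after sketch using hypothetical names (d_buf for a device buffer, h_buf for a host buffer, n for a byte count, stream for a cudaStream_t); the cudaMemcpyDefault tag and cudaCheck() wrapping follow the review discussion above and are not necessarily what every call site in the PR does:

```cuda
// before: cuda::memory::copy(d_buf, h_buf, n);
cudaCheck(cudaMemcpy(d_buf, h_buf, n, cudaMemcpyDefault));

// before: cuda::memory::async::copy(d_buf, h_buf, n, stream.id());
cudaCheck(cudaMemcpyAsync(d_buf, h_buf, n, cudaMemcpyDefault, stream));

// before: cuda::memory::zero(d_buf, n);  (cuda::memory::set with a value is analogous)
cudaCheck(cudaMemset(d_buf, 0, n));

// before: cuda::memory::async::zero(d_buf, n, stream.id());
cudaCheck(cudaMemsetAsync(d_buf, 0, n, stream));
```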
PR validation
unit tests run