Tkurth/sgbn fixes #1685
Conversation
@eqy, @rmhaskarnvidia please review this PR and/or suggest someone to review. I will also take a look, but I am not familiar with it.
Took a first look, but given the size of this PR I believe @crcrpar should get the final say
```cpp
        strideA[d] = strideA[d + 1] * dimA[d + 1];
    }
    strideA[0] = strideA[2] * dimA[2];

void generateStrides(const int64_t* dimA, int64_t* strideA, int64_t nbDims, cudnnTensorFormat_t filterFormat) {
```
Is it possible to simply use the Tensor's existing `.strides()` rather than relying on another helper function? AFAIK it would respect the NHWC vs. NCHW convention.
This is how it was implemented before, though; I basically based the new implementation on the old one. We could of course use `.strides()` and pass the stride tensors to the routine. Do we want to change that?
```cpp
auto tensor_create = [&tensor_stride, &tensorDims](cudnnDataType_t type,
                                                   int64_t id) {
    return cudnn_frontend::TensorBuilder()
        .setDim(4, tensorDims)
```
Similarly, can we use the existing `.sizes()` instead of creating another `tensorDims` array?
Same thing here: this is how it was implemented previously. We could generate all those shapes and strides in cudnn_gbn and pass them to the planning function.
```cpp
    auto plan = run_batch_norm_forward(tensorDims, perChannelDims, epsilonDims, peerDims, CUDNN_DATA_HALF);
    gbn_plan_cache.insert(std::make_pair(fv, plan));
}
```
It looks like some of the code makes assumptions about the input tensor(s)' memory layout. If so, there should be checks like `is_contiguous(at::MemoryFormat::ChannelsLast)`.
This is done on the Python frontend. That check is here.
```cpp
        .setDim(4, tensorDims)
        .setStrides(4, tensor_stride)
        .setId(id)
        .setAlignment(16)
```
Manually setting alignment without checking the actual tensor address seems dangerous.
The existing code has that too; this is just a refactor of code that is already present:
https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/cudnn_gbn/norm_sample.cpp
This is somewhat urgent, since it is fixing a showstopper bug for MLPerf HPC 3.0. I am fine with rewriting this, but I want to move fast on this. Is there an example of how this should be done?
Wouldn't this require changes to https://github.com/NVIDIA/apex/blob/30a7ad3974b32f7ce68cefabc38374fb4520a35e/apex/contrib/test/cudnn_gbn/test_cudnn_gbn_with_two_gpus.py?
@azrael417 we can defer addressing the issues I brought up to a later PR if @crcrpar is content to merge the fix given the urgency.
rel: #1689
* fixing order of class instantiation and device extraction in mixed precision lamb
* this commit fixes the SGBN graph capture problem by caching the cudnn plan and re-using it
* disentangling the mplamb MR and SGBN MR
* cleaner caching
This PR fixes the single-node group batch norm in APEX to work with CUDA 12.2 and RTC.