
Tkurth/sgbn fixes #1685

Merged: 4 commits into NVIDIA:master, Jul 2, 2023

Conversation

azrael417 (Contributor):

This PR fixes the single node group batch norm in APEX to work with CUDA 12.2 and RTC.

Aidyn-A (Collaborator) commented Jun 22, 2023:

@eqy, @rmhaskarnvidia please review this PR and/or suggest someone to review. I will also take a look, but I am not familiar with cudnn-frontend.

eqy (Contributor) left a comment:

Took a first look, but given the size of this PR I believe @crcrpar should get the final say

void generateStrides(const int64_t* dimA, int64_t* strideA, int64_t nbDims, cudnnTensorFormat_t filterFormat) {
    ...
            strideA[d] = strideA[d + 1] * dimA[d + 1];
        }
        strideA[0] = strideA[2] * dimA[2];
eqy (Contributor):

Is it possible to simply use the Tensor's existing .strides() rather than relying on another helper function? AFAIK it would respect the NHWC vs. NCHW convention.

azrael417 (Contributor, Author):

This is how it was implemented before, though; I basically based the new implementation on the old one. We could of course use .strides() and pass the stride tensors to the routine. Do we want to change that?
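
For illustration only, a minimal sketch (not code from this PR) of reading the dims and strides straight off the ATen tensor, whose metadata already encodes the NHWC vs. NCHW layout, instead of recomputing them with a generateStrides-style helper; dims_of and strides_of are hypothetical names:

    // Hypothetical helpers: copy a tensor's sizes and strides into the
    // int64_t vectors the cudnn-frontend builders expect. A channels-last
    // tensor's .strides() already reflect the NHWC convention.
    #include <ATen/ATen.h>
    #include <cstdint>
    #include <vector>

    std::vector<int64_t> dims_of(const at::Tensor& t) {
      return t.sizes().vec();    // e.g. {N, C, H, W}
    }

    std::vector<int64_t> strides_of(const at::Tensor& t) {
      return t.strides().vec();  // channels-last strides if t is NHWC
    }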

    auto tensor_create = [&tensor_stride, &tensorDims](cudnnDataType_t type,
                                                       int64_t id) {
        return cudnn_frontend::TensorBuilder()
            .setDim(4, tensorDims)
eqy (Contributor):

Similarly, can we use the existing .sizes() instead of creating another tensorDims array?

azrael417 (Contributor, Author):

Same thing here: this is how it was implemented previously. We could generate all those shapes and strides in cudnn_gbn and pass them to the planning function.
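
Following on from the sketch above, and purely as an assumption about how such a refactor could look (not the PR's actual code), the cudnn-frontend tensor descriptor could be built directly from the ATen tensor's metadata; make_cudnn_tensor is a hypothetical name:

    // Hypothetical variant of the tensor_create lambda that takes the ATen
    // tensor itself, so .sizes()/.strides() replace the separate tensorDims
    // and tensor_stride arrays.
    auto make_cudnn_tensor = [](const at::Tensor& t, cudnnDataType_t type, int64_t id) {
        return cudnn_frontend::TensorBuilder()
            .setDim(t.dim(), t.sizes().data())        // e.g. {N, C, H, W}
            .setStrides(t.dim(), t.strides().data())  // channels-last strides if t is NHWC
            .setId(id)
            .setAlignment(16)                         // see the alignment discussion below
            .setDataType(type)
            .build();
    };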

        auto plan = run_batch_norm_forward(tensorDims, perChannelDims, epsilonDims, peerDims, CUDNN_DATA_HALF);
        gbn_plan_cache.insert(std::make_pair(fv, plan));
    }

eqy (Contributor):

It looks like some of the code makes assumptions about the input tensor(s)' memory layout. If so, there should be checks like is_contiguous(at::MemoryFormat::ChannelsLast).

azrael417 (Contributor, Author) commented Jun 29, 2023:

This is done on the Python frontend; that check is here.
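
For illustration, a hedged sketch of what a C++-side guard along the lines eqy suggests could look like; this is an assumption, not the Python-side check the author refers to, and x stands in for the input tensor:

    // Hypothetical guard: fail fast if the input is not channels-last, since
    // the dimension/stride setup assumes an NHWC memory layout.
    #include <torch/extension.h>

    void check_channels_last(const at::Tensor& x) {
      TORCH_CHECK(x.is_contiguous(at::MemoryFormat::ChannelsLast),
                  "group batch norm expects a channels-last (NHWC) input tensor");
    }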

            .setDim(4, tensorDims)
            .setStrides(4, tensor_stride)
            .setId(id)
            .setAlignment(16)
eqy (Contributor):

Manually setting alignment without checking the actual tensor address seems dangerous.

azrael417 (Contributor, Author):

The existing code has that too; this is just a refactor of the code which is already present:

https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/cudnn_gbn/norm_sample.cpp

This is somewhat urgent since it fixes a showstopper bug for MLPerf HPC 3.0. I am fine with rewriting this, but I want to move fast on it. Is there an example of how this should be done?
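
For illustration, a hedged sketch of one way the alignment could be derived from the actual data pointer rather than hard-coded; actual_alignment is a hypothetical helper, not code from this PR or from eqy's reply:

    // Hypothetical helper: pick the largest power-of-two alignment (capped at
    // 16 bytes) that the tensor's data pointer actually satisfies, and pass
    // the result to setAlignment() instead of a literal 16.
    #include <ATen/ATen.h>
    #include <cstdint>

    int64_t actual_alignment(const at::Tensor& t) {
      const auto addr = reinterpret_cast<std::uintptr_t>(t.data_ptr());
      int64_t alignment = 1;
      while (alignment < 16 && addr % (alignment * 2) == 0) {
        alignment *= 2;
      }
      return alignment;
    }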


crcrpar (Collaborator) left a comment.

eqy (Contributor) commented Jul 1, 2023:

@azrael417 we can defer addressing the issues I brought up to a later PR if @crcrpar is content to merge the fix, given the urgency.

crcrpar (Collaborator) commented Jul 2, 2023:

rel: #1689

crcrpar merged commit 8ffc901 into NVIDIA:master on Jul 2, 2023.
yuanzhedong pushed a commit to yuanzhedong/apex that referenced this pull request on Jul 14, 2023:
* fixing order of class instantiation and device extraction in mixed precision lamb

* this commit fixes the SGBN graph capture problem by caching the cudnn plan and re-using it

* disentangling the mplamb MR and SGBN MR

* cleaner caching
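
For illustration, a hedged sketch of the plan-caching idea described in the second commit above: build the cuDNN execution plan once per unique shape/dtype "feature vector" and reuse it on later calls, which keeps the work stable for CUDA graph capture. FeatureVector and get_or_build_plan are assumed names, not the PR's actual definitions:

    // Hypothetical sketch: cache an expensive-to-build plan keyed by a feature
    // vector (dims, strides, dtype encoded as ints); Plan stands in for the
    // PR's actual cudnn-frontend execution plan type.
    #include <cstdint>
    #include <map>
    #include <vector>

    using FeatureVector = std::vector<int64_t>;

    template <typename Plan, typename BuildFn>
    const Plan& get_or_build_plan(std::map<FeatureVector, Plan>& cache,
                                  const FeatureVector& fv,
                                  BuildFn build) {
      auto it = cache.find(fv);
      if (it == cache.end()) {
        // Build once; every later call with the same feature vector hits the cache.
        it = cache.emplace(fv, build()).first;
      }
      return it->second;
    }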