Tkurth/sgbn fixes #1685
Conversation
@eqy, @rmhaskarnvidia please review this PR and/or suggest someone to review. I will also take a look, but I am not familiar with it.
Took a first look, but given the size of this PR I believe @crcrpar should get the final say
```cpp
        strideA[d] = strideA[d + 1] * dimA[d + 1];
    }
    strideA[0] = strideA[2] * dimA[2];

void generateStrides(const int64_t* dimA, int64_t* strideA, int64_t nbDims, cudnnTensorFormat_t filterFormat) {
```
Is it possible to simply use the Tensor's existing `.strides()` rather than relying on another helper function? AFAIK it would respect the NHWC vs. NCHW convention.
This is how it was implemented before, though; I basically based the new implementation on the old one. We could of course use `.strides()` and pass the stride tensors to the routine. Do we want to change that?
```cpp
auto tensor_create = [&tensor_stride, &tensorDims](cudnnDataType_t type,
                                                   int64_t id) {
    return cudnn_frontend::TensorBuilder()
        .setDim(4, tensorDims)
```
Similarly, can we use the existing `.sizes()` instead of creating another `tensorDims` array?
Same thing here: this is how it was implemented previously. We could generate all those shapes and strides in cudnn_gbn and pass them to the planning function.
```cpp
    auto plan = run_batch_norm_forward(tensorDims, perChannelDims, epsilonDims, peerDims, CUDNN_DATA_HALF);
    gbn_plan_cache.insert(std::make_pair(fv, plan));
}
```
It looks like some of the code makes assumptions about the input tensor(s)' memory layout. If so, there should be checks like `is_contiguous(at::MemoryFormat::ChannelsLast)`.
This is done on the Python frontend. That check is here.
```cpp
        .setDim(4, tensorDims)
        .setStrides(4, tensor_stride)
        .setId(id)
        .setAlignment(16)
```
Manually setting alignment without checking the actual tensor address seems dangerous.
The existing code has that too; this is just a refactor of code that is already present:
https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/cudnn_gbn/norm_sample.cpp
This is somewhat urgent, since it is fixing a showstopper bug for MLPerf HPC 3.0. I am fine with rewriting this, but I want to move fast on this. Is there an example of how this should be done?
Wouldn't this require changes to https://github.com/NVIDIA/apex/blob/30a7ad3974b32f7ce68cefabc38374fb4520a35e/apex/contrib/test/cudnn_gbn/test_cudnn_gbn_with_two_gpus.py?
@azrael417 we can defer addressing the issues I brought up to a later PR if @crcrpar is content to merge the fix given the urgency.
rel: #1689
* fixing order of class instantiation and device extraction in mixed precision lamb
* this commit fixes the SGBN graph capture problem by caching the cudnn plan and re-using it
* disentangling the mplamb MR and SGBN MR
* cleaner caching
This PR fixes the single-node group batch norm in APEX to work with CUDA 12.2 and RTC.