
Changes to support TNLRV3 fine-tuning #4639

Merged 4 commits from tix/turing_finetuning into master on Jul 30, 2020

Conversation

@Tixxx Tixxx (Contributor) commented Jul 28, 2020

Description
Changes to support the TNLRv3 fine-tuning task:

  1. Added a gradient op for ReduceLogSumExp; the reference implementation was adapted from PyTorch (a sketch of the gradient math follows this list).
  2. Fixed a bug when passing fp16 inputs to the CudnnReduce kernel: the runtime compute type should be float.
  3. Added sanitization code in the Python frontend to remove redundant states so the saved state matches the PyTorch state dict.
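
For reference, the gradient this op implements follows from y = logsumexp(x): dL/dx = dL/dy * exp(x - y), with y and dL/dy broadcast back over the reduced dimension. The snippet below is a minimal PyTorch sketch of that formula (not the actual ORT kernel code), checked against autograd:

```python
import torch

def reduce_logsumexp_grad(grad_y, x, y, dim, keepdim=False):
    # dL/dx = dL/dy * exp(x - y), broadcasting y and grad_y over the reduced dim.
    if not keepdim:
        grad_y = grad_y.unsqueeze(dim)
        y = y.unsqueeze(dim)
    return grad_y * (x - y).exp()

# Quick check against PyTorch autograd.
x = torch.randn(3, 4, requires_grad=True)
y = torch.logsumexp(x, dim=1)
y.sum().backward()
manual = reduce_logsumexp_grad(torch.ones_like(y), x.detach(), y.detach(), dim=1)
assert torch.allclose(x.grad, manual)
```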

Motivation and Context
To support TNLRv3 fine-tuning.

Added test
Fixed type mismatch when calling the CudnnReduce kernel
Fixed the Python frontend to remove redundant states to match the PyTorch state dict (a sketch of the idea follows below)
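
As a rough illustration of that frontend sanitization, here is a sketch under assumptions; the marker names are hypothetical examples of training-only entries, and this is not the actual ORT frontend code:

```python
def sanitize_state_dict(state_dict, redundant_markers=("_fp16", "Moment_1", "Moment_2")):
    """Illustrative sketch: drop training-only entries (hypothetical marker names)
    so the remaining keys line up with the PyTorch model's state_dict."""
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if not any(marker in name for marker in redundant_markers)
    }
```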
@Tixxx Tixxx requested a review from a team July 28, 2020 05:32
@Tixxx Tixxx added the training and component:training-frontend labels Jul 28, 2020
@Tixxx Tixxx requested a review from SherlockNoMad July 29, 2020 05:01
@SherlockNoMad SherlockNoMad (Contributor) left a comment

PR mostly looks good. Please address the comments and I think it's ready to go.

@Tixxx Tixxx requested a review from SherlockNoMad July 29, 2020 21:23
@Tixxx Tixxx merged commit f90a2d4 into master Jul 30, 2020
@Tixxx Tixxx deleted the tix/turing_finetuning branch July 30, 2020 02:18
thiagocrepaldi pushed a commit that referenced this pull request Aug 31, 2020
#4639 changed the default behavior by removing optimizer state from the
state_dict/checkpoint APIs. The reason for that change was to allow models
trained with ORT to be used for inference in PyTorch, which is an important
feature.

Due to the aforementioned change, when resuming training from a checkpoint,
the optimizer would start with random weights, leading to poor performance.
This behavior would also cause reproducibility issues, as the optimizer
wouldn't be able to resume from its previous state.

This PR adds a boolean flag to the state_dict/save_checkpoint APIs that,
when True (the default), saves both model and optimizer state; when False,
only the model state is kept (a conceptual sketch follows below).
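
A conceptual sketch of the flag's effect; the function and parameter names below are hypothetical and not the actual ORT training API:

```python
# Conceptual sketch only -- function and flag names are hypothetical,
# not the actual ORT training frontend API added by this commit.
def build_state_dict(model_state, optimizer_state, include_optimizer_state=True):
    """Return the state to persist.

    include_optimizer_state=True  -> model weights + optimizer state (resume training)
    include_optimizer_state=False -> model weights only (export for PyTorch inference)
    """
    state = dict(model_state)
    if include_optimizer_state:
        state.update(optimizer_state)
    return state
```

With the flag left at its default of True, resuming from a checkpoint restores the optimizer's previous state; passing False reproduces the model-only behavior originally introduced by this PR for PyTorch inference.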
thiagocrepaldi pushed a commit that referenced this pull request Sep 1, 2020 (same commit message as above)