
Fixing var-seq-len RNN backward() operator #15278

Merged. 5 commits merged into apache:master on Jun 20, 2019.

Conversation

stephenrawls (Contributor)

Description

This PR fixes the problem mentioned here: #15268

NOTE: this is currently in a preliminary state. The backward() pass now runs without crashing and produces a gradient.

HOWEVER, the unit test I added to confirm that the gradient we get matches the one we expect is failing, because the gradients do not match.

Due to the time-sensitive nature of this PR I am creating it early, in case more eyeballs can help.
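
For reference, here is a minimal sketch of the failing case (hypothetical shapes and values; the exact gluon call signature for sequence_length may differ slightly from this):

```python
import mxnet as mx
from mxnet import autograd, gluon

# Sketch of the crash from #15268: backward() through an LSTM given
# per-example sequence lengths. Shapes and values are illustrative.
ctx = mx.gpu()  # the variable-seq-len path goes through cuDNN
layer = gluon.rnn.LSTM(hidden_size=5, use_sequence_length=True)
layer.initialize(ctx=ctx)

data = mx.nd.random.uniform(shape=(11, 10, 4), ctx=ctx)  # (max_len, batch, input)
seq_len = mx.nd.array([10, 7, 10, 11, 8, 3, 5, 11, 6, 2], ctx=ctx)
data.attach_grad()

with autograd.record():
    out = layer(data, sequence_length=seq_len)
out.backward()  # crashed before this fix; now produces data.grad
```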

@szha @roywei

stephenrawls (Contributor, Author)

For perspective, here is the current output of my unit test, showing the magnitude of the difference between what I get and what I expect for one of the gradients:

Error 12.396820 exceeds tolerance rtol=0.010000, atol=0.010000.  Location of maximum error:(26, 4), a=0.780967, b=1.032993
 a: array([[0.23794387, 0.19474988, 0.26953802, 0.2155416 , 0.23550223],
       [0.21725482, 0.16967712, 0.27878642, 0.16552497, 0.19390106],
       [0.2731786 , 0.21481094, 0.28132305, 0.2327928 , 0.2571561 ],...
 b: array([[0.2389039 , 0.19521129, 0.27095696, 0.21666154, 0.23750533],
       [0.21805991, 0.16999641, 0.28004444, 0.16646673, 0.19568215],
       [0.27414015, 0.21513419, 0.2827976 , 0.23384902, 0.2590059 ],...

stephenrawls (Contributor, Author) commented Jun 19, 2019

Just to keep the ticket updated:

I have confirmed the following facts:

  1. If I set every sequence_length entry to the maximum sequence length, then the gradients of the reference net and the var-seq-len net do match.
  2. With cuDNN debug logging turned on, I can confirm that the appropriate "unpacked enabled" version of the cuDNN API is being called, and that the correct seq-len values are passed in. I.e., I set:

export CUDNN_LOGINFO_DBG=1
export CUDNN_LOGDEST_DBG=/home/ec2-user/cudnn.dbg.log

and in the resulting output I see:

I! CuDNN (v7501) function cudnnRNNForwardTrainingEx() called:
...
paddingMode: type=cudnnRNNPaddingMode_t; val=CUDNN_RNN_PADDED_IO_ENABLED (1);
...
i!         seqLengthArray: type=int; val=[10,7,10,11,8,3,5,11,6,2];

This matches the corresponding call to the backward functions, i.e.:

I! CuDNN (v7501) function cudnnRNNBackwardDataEx() called:
...
paddingMode: type=cudnnRNNPaddingMode_t; val=CUDNN_RNN_PADDED_IO_ENABLED (1);
...
seqLengthArray: type=int; val=[10,7,10,11,8,3,5,11,6,2];

And the same for cudnnRNNBackwardWeightsEx().

My suspicion now is that the reference net's gradient is losing floating-point precision because it passes through extra reverse / concat / etc. operations. I am going to consider another way of constructing the reference net for testing the gradient.
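
For context, one plausible shape of such a reverse/concat reference net (an illustrative sketch for the bidirectional case, not necessarily the exact construction used in this PR):

```python
import mxnet as mx

# Illustrative reverse/concat reference for a bidirectional
# var-seq-len LSTM; every extra op in this chain is another place
# for the gradient to pick up floating-point error.
def bidir_reference(data, seq_len, fwd_lstm, bwd_lstm):
    out_f = fwd_lstm(data)  # forward direction on the padded batch
    # backward direction: reverse each sequence up to its true length,
    # run the LSTM, then reverse the outputs back into place
    rev = mx.nd.SequenceReverse(data, sequence_length=seq_len,
                                use_sequence_length=True)
    out_b = bwd_lstm(rev)
    out_b = mx.nd.SequenceReverse(out_b, sequence_length=seq_len,
                                  use_sequence_length=True)
    return mx.nd.concat(out_f, out_b, dim=2)  # (T, N, 2*hidden)
```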

stephenrawls (Contributor, Author)

@roywei @szha

Okay, I think the PR is good now.

The problem was indeed what I speculated before: the cuDNN backward pass was producing the correct gradient, and the reference net was "close but not close enough"; the discrepancy originated in the reference net itself.

I changed the reference net to a dead-simple LSTM that processes each batch element one at a time, so that on each pass the LSTM can size itself to the current input. For the backward pass I set the gradients to accumulate, so that I can compare against the LSTM that uses sequence_length on the full batch.
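
In outline, the comparison looks like this (a minimal sketch with hypothetical names such as ref, data, seq_len, and batch_size; the real unit test lives in the MXNet test suite):

```python
from mxnet import autograd

# `ref` is a plain LSTM sharing parameters with the var-seq-len layer
# under test; accumulate its parameter gradients across per-example
# passes so they are comparable to one batched backward().
ref.collect_params().setattr('grad_req', 'add')

for i in range(batch_size):
    length = int(seq_len[i].asscalar())
    x = data[:length, i:i + 1, :]  # one example, trimmed to its true length
    with autograd.record():
        out = ref(x)
    out.backward()

# the accumulated grads in ref.collect_params() are then checked
# against the grads from a single batched var-seq-len pass
```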

The unit test now exercises the backward pass, and it passes.

stephenrawls (Contributor, Author)

Looks like the test failed due to a tolerance issue. (However, when I tried with the exact same random seed on my compute instance, the test passed.)

Trying again with updated tolerance values of rtol=1e-2, atol=1e-6, which is what some of the other tests in this file use.
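
(For reference, the check is of this form, with hypothetical array names, via mxnet.test_utils:)

```python
from mxnet.test_utils import assert_almost_equal

# compare computed vs. reference gradient with the updated tolerances
assert_almost_equal(grad.asnumpy(), ref_grad.asnumpy(),
                    rtol=1e-2, atol=1e-6)
```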

(Quick aside: how do people usually look at the test log files? The page keeps crashing my browser before I can click on the "get raw text" link.)

szha merged commit 4d96671 into apache:master on Jun 20, 2019.