fixing var-seq-len rnn backward() operator #15278
Conversation
For perspective, this is the current output of my unit test, showing the magnitude of the difference, for one of the gradients, between what I get and what I expect:
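The raw numbers are not reproduced here, but the comparison the test performs can be sketched as follows (the arrays below are made-up stand-ins, not the PR's actual gradients):

```python
import numpy as np

# Hypothetical stand-ins for the cudnn gradient and the reference-net gradient.
grad_cudnn = np.array([0.1234567, -0.9876543, 0.5555555], dtype=np.float32)
grad_ref   = np.array([0.1234569, -0.9876540, 0.5555551], dtype=np.float32)

# Worst-case elementwise difference: the magnitude a failing
# gradient-comparison test typically reports.
max_abs_diff = np.max(np.abs(grad_cudnn - grad_ref))
rel_diff = max_abs_diff / np.max(np.abs(grad_ref))
```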
Just to keep the ticket updated: I have confirmed the following facts:
And I look at the resulting output and see:
And this does match up to the corresponding call to the backward functions, i.e.
And the same for cudnnRNNBackwardWeightsEx(). My suspicion now is that the reference net gradient is losing floating-point precision because it goes through extra reverse / concat / etc. operations. Going to consider another way of constructing the reference net for testing the gradient.
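The precision-loss hypothesis is plausible because float32 addition is not associative: routing the same values through extra reverse/concat stages can change the order in which a reduction accumulates them. A toy illustration of that effect (unrelated to the PR's actual code):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, size=10000).astype(np.float32)

# Sum the same values in two different orders.  float32 rounding makes
# the two results agree only approximately, which is exactly the kind of
# "close but not close enough" discrepancy a reordered reference net
# can introduce.
s_fwd = np.float32(0.0)
for v in x:
    s_fwd += v
s_rev = np.float32(0.0)
for v in x[::-1]:
    s_rev += v

# Neither ordering is exact; compare both against a float64 reference.
s_exact = x.astype(np.float64).sum()
```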
Okay, I think the PR is good now. The problem was indeed what I speculated before: the cudnn backward pass was producing the correct gradient, the reference net was "close but not close enough" to it, and the discrepancy came from the reference net itself. I changed the reference net to be a dead-simple LSTM that processes each batch element one at a time, so that each time the LSTM can size itself appropriately to the current input. For the backward pass I set the gradient parameter to accumulate, so that I can compare against the LSTM using sequence_length with a batch. The unit test now tests the backward pass, and it is successful.
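The testing strategy described above can be sketched with a toy model in plain NumPy (hypothetical names; this is not MXNet's actual test code). The "net" here is a trivial one whose loss is w * sum(valid inputs), so the gradient w.r.t. w is just the sum of the valid timesteps; the point is the structure: compute the gradient per sample with each sequence sized to its true length, accumulate across the batch (the analogue of grad_req='add'), and compare against a masked batched computation (the analogue of use_sequence_length):

```python
import numpy as np

def per_sample_grad(batch, seq_lens):
    # Reference path: process each batch element one at a time, sized to
    # its own sequence length, and accumulate the weight gradient
    # across elements (the analogue of grad_req='add').
    grad_w = 0.0
    for x, n in zip(batch, seq_lens):
        grad_w += x[:n].sum()  # d(w * sum(x_valid)) / dw for this sample
    return grad_w

def batched_grad(batch, seq_lens):
    # Batched path: pad to max length and mask out invalid timesteps
    # (the analogue of running with use_sequence_length=True).
    mask = np.arange(batch.shape[1])[None, :] < np.asarray(seq_lens)[:, None]
    return (batch * mask).sum()

# Two sequences of lengths 3 and 2, padded to the max length 3.
batch = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 0.0]])
seq_lens = [3, 2]
```

With this setup both paths must agree exactly, since the toy model has no reordering of floating-point reductions.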
Looks like the test failed due to a tolerance issue. (I tried with the exact same random seed on my compute instance and the test passed, however.) Trying again with updated tolerance values of (Quick aside: how do people usually look at the test log files? It keeps crashing my browser before I can click on the "get raw text" link.)
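For reference, tolerance-based gradient comparisons of this kind generally use the criterion of numpy.testing.assert_allclose: pass iff |actual - desired| <= atol + rtol * |desired| elementwise. A minimal sketch (the rtol/atol values below are illustrative, not the PR's):

```python
import numpy as np

def grads_match(actual, desired, rtol=1e-3, atol=1e-5):
    # Same elementwise criterion numpy.testing.assert_allclose applies:
    # |actual - desired| <= atol + rtol * |desired|
    return bool(np.all(np.abs(actual - desired) <= atol + rtol * np.abs(desired)))

a = np.array([1.0000, 2.0000], dtype=np.float32)
b = np.array([1.0005, 2.0010], dtype=np.float32)
```

Whether a CI failure like the one above is a real bug or just an overly tight rtol comes down to how the observed differences compare against that threshold.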
Description
This PR fixes the problem mentioned here: #15268
NOTE: this is currently in a preliminary state. The backward() pass now works without crashing and produces a gradient.
HOWEVER, the unit test I added to confirm that the gradient we get matches what we expect is failing because the gradients do not match.
Due to the time-sensitive nature of the PR, I am creating it early in case more eyeballs can help.
@szha @roywei