Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-620]Fix flaky test batchnorm training #11544

Merged
merged 3 commits into from
Jul 10, 2018

Conversation

roywei
Copy link
Member

@roywei roywei commented Jul 3, 2018

Description

Fix issue #8044

tracked at: https://issues.apache.org/jira/browse/MXNET-620

Solution:
Bump up atol value from 1e-4 to 1e-2

The following 3 tests related to batchnorm training can all pass 10000 runs:

tests/python/unittest/test_operator.py:test_batchnorm_training
tests/python/gpu/test_operator_gpu.py:test_batchnorm_training
tests/python/mkl/test_mkldnn.py:test_batchnorm

Finding 1:
Using fp32 with atol=1e-2 or fp64 with atol=1e-3 both works, chose the former one.
Tried different numeric_eps=1e-3, didn't work for all cases

Finding 2:
Reason of failing due to batchnorm result slightly different than result calculated by numeric gradient. Current implementation of batchnorm should be correct.

Finding 3:
Verified the gradient of 3 different batchnorm implementation are close but they are all different than numeric gradient. (CPU, GPU without cuDNN, GPU with cuDNN)

Conclusion

  1. Current batchnorm implementation is correct and result is consistent across all implementations (mkldnn, w/ cudnn, w/o cuddn)
  2. The batchnorm gradient is slightly different than numeric gradient calculated using finite differential method. Difference is less than rtol=0.16. atol=1e-2, it's large comparing to other operators, but acceptable in this case due to large precision loss in batchnorm.
  3. Enable this test with larger atol is better than just disable it. (it could caught fail to fall back when sparse arrays are passed to MKLDNN-enabled operators. #11448 if enabled), it serves more like a sanity check to prevent the gradient results are far too off from numeric calculations.
  4. Batchnorm errors can also be caught in test_batchnorm_with_type where consistency between each implementation is compared.

See comments in #8044 for detailed results.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with MXNET-620created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@roywei
Copy link
Member Author

roywei commented Jul 3, 2018

@haojin2 @eric-haibin-lin could you help take a look? Thanks

@roywei roywei changed the title [MXNET-620]Fix flaky test batchnorm training [MXNET-620][WIP]Fix flaky test batchnorm training Jul 3, 2018
@roywei roywei changed the title [MXNET-620][WIP]Fix flaky test batchnorm training [MXNET-620][WIP][Do not merge]Fix flaky test batchnorm training Jul 3, 2018
@eric-haibin-lin
Copy link
Member

@rahul003

@marcoabreu
Copy link
Contributor

"Enable this test with larger atol is better than just disable it." - Sure. The goal now is to do it properly, dive deep and track down the root causes and possible implications.

@haojin2
Copy link
Contributor

haojin2 commented Jul 5, 2018

@marcoabreu if you take a closer look at "finding 3", we do deliver the same results as the CuDNN version, so the root cause is just that the numeric gradient having a larger error margin compared to the symbolic backward one.

@roywei
Copy link
Member Author

roywei commented Jul 5, 2018

  • Updated conclusion.
  • Removed 'row_sparse' from stypes for the following reasons:
  1. It's failing at testing varying channel axis , this test already has rtol=0.2 and atol=0.01, we cant further increase rtol and atol.
  2. row_sparse is never used and always fallback in this test, the logic is also tested tested at mkl/test_mkldnn.py

@roywei
Copy link
Member Author

roywei commented Jul 5, 2018

@zheng-da Hi, I have modified your unit test for mkldnn batchnorm, increased atol to 1e-2 as its flaky at 1e-3 and 1e-4. Reasons listed in PR description, all batchnorm implementation (mkldnn, cudnn) gradient is slightly off from numeric gradient. Using 1e-2 is more stable and passed 10000 runs.

@roywei roywei changed the title [MXNET-620][WIP][Do not merge]Fix flaky test batchnorm training [MXNET-620]Fix flaky test batchnorm training Jul 5, 2018
@zheng-da
Copy link
Contributor

zheng-da commented Jul 5, 2018

@roywei thanks for fixing the flaky test.

@@ -1527,7 +1526,7 @@ def check_batchnorm_training(stype):
test = mx.symbol.BatchNorm(data, fix_gamma=False, use_global_stats=True, axis=chaxis)
check_numeric_gradient(test, in_location, xmean_std, numeric_eps=1e-2, rtol=0.2, atol=0.01)

stypes = ['row_sparse', 'default']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just curious. does row_sparse make the test flaky?

Copy link
Member Author

@roywei roywei Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without row_sprase, the flakyness come from difference between gradient values of mkldnn/cuddn implementation of batchnorm and numeric calculation from check_numeric_gradients

row_sparse makes it more flaky, and will fail at this part: testing varying channel axis. Which is not in your test_mkddnn:test_batchnorm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't that an actual issue? It means that the two implementations are not equivalent and yield different results, right?

@marcoabreu
Copy link
Contributor

I don't really get the statement about row_sparse being removed because of the mkldnn test. Considering we run without mkldnn, how is that related?

I might be missing something here, but I got the feeling that row_sparse is being removed because it's causing trouble in some environment configurations. Could you please elaborate here?

@haojin2
Copy link
Contributor

haojin2 commented Jul 6, 2018

@marcoabreu We don't have a batch norm implemented for sparse arrays, so 'row_sparse' was there to test if we can fallback correctly. The storage type inference logic for sparse ops share some code paths with the dispatch inference logic for MKLDNN operators, which is actually what we want to test, so that's why it's now moved there.

@marcoabreu
Copy link
Contributor

marcoabreu commented Jul 6, 2018

I see, that makes sense. Where does it fall back to if MKLDNN is not available?

My concern here is that this fallback actually seems to be failing (not generating the expected values) in some cases. This shouldn't happen, right? To me, it seems like we're now removing this problematic test although the problem might lie in whatever algorithm implementation we use in that fallback - and the test actually serving its purpose to reveal that problem.

@roywei
Copy link
Member Author

roywei commented Jul 6, 2018

@marcoabreu You are right, and when row_sparse is not available, it should fallback to default mxnet implementation, it could be mkldnn or not. However, for mkldnn enabled operators, they changed the fallback logic and introduced a bug tracked at #11448. Thus it make sense to moved the fallback check under mkl/test_mkldnn.py, and for each operator impacted, there should be a similar unit test.

@haojin2 is double checking if the fallback values and logic is fixed and correct.
We will see if we need additional tests for fallback logic and value checks .

@marcoabreu
Copy link
Contributor

marcoabreu commented Jul 7, 2018

I think we should leave the tests as they are and there can be additional specific tests for the mkldnn backend.

Our users should not have to care what backend is being selected and the behaviour should always be identical - these high level operator tests are implicitly testing the fallback behavior. This means that however I call a certain operator, row_sparse in this case, I would expect that I get a proper result in any case. It's great that the issue is already tracked, but I'm heavily opposed to removing that operator from our test suite. All operator tests have to be implementation independent and should not differentiate or exclude certain cases. From my point of view, this flaky test is only resolved when the mkldnn implementation has been fixed and when this test passes all the time. Removing it is not an option.

@haojin2
Copy link
Contributor

haojin2 commented Jul 10, 2018

@marcoabreu I think we can merge this first since I've identified the root cause of the inconsistencies between sparse and dense matrices, and will be submitting a fix with corresponding tests soon.

@marcoabreu
Copy link
Contributor

marcoabreu commented Jul 10, 2018

Well the problem is not really solved but the disabling of a certain test has just been narrowed down - instead of disabling the entire suite, this PR now disables a single sub-test.
I don't really feel comfortable merging as-is because the actual root cause is still the row_sparse and that one has not been solved yet.

Right now, the test is disabled and we should leave it like that until the test has been stabilized entirely. I'm happy to merge if we leave the test disabled (re-add the skip annotation) or when you added the mentioned fixes.

@haojin2
Copy link
Contributor

haojin2 commented Jul 10, 2018

@marcoabreu I just submitted a fix for that, and I need the fix for the flakiness of the existing test in this PR in that PR.

@marcoabreu
Copy link
Contributor

I just edited by post, please check the last section

@marcoabreu
Copy link
Contributor

Excellent. Let's submit this PR with the test disabled and you'll take it from there in your PR. How does that sound?

@haojin2
Copy link
Contributor

haojin2 commented Jul 10, 2018

@marcoabreu The test for fallback correctness should not even appear in this file, this file is for operators under normal dense inputs, the tests related with sparse operators should go to test_sparse_operators.py. As this test can pass more than 10000 times under dense inputs I would say the test on dense inputs is stable already.

@marcoabreu
Copy link
Contributor

Very good point about the sparse operator tests, I wasn't aware of them. Thank you!

I will make a last pass and merge if everything is as expected.

@marcoabreu
Copy link
Contributor

Thanks @roywei and @haojin2 for your work on this PR and the detailed explanations!

@marcoabreu marcoabreu merged commit 0bec94b into apache:master Jul 10, 2018
@haojin2
Copy link
Contributor

haojin2 commented Jul 10, 2018

@marcoabreu Thanks for understanding! Please also take a look at my PR when you have time to do so, thanks! @roywei Thanks for your work on this complicated issue!

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* increase atol to 1e-2

* enable test_batchnorm_training

* remove row_sparse as it's tested in another test, increase mkldnn batchnorm test atol to 1e-2
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants