This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI] unix cpu validation Timeout #15880

Open

ChaiBapchya opened this issue Aug 13, 2019 · 17 comments

Labels: CI

Comments
@ChaiBapchya
Contributor

ChaiBapchya commented Aug 13, 2019

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: CI, Build

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [CI]

@marcoabreu added the CI label Aug 13, 2019
@DickJC123
Contributor

test_random.py:test_shuffle is taking a long time to run. I've seen CPU runtimes between 10 and 50 minutes for that test alone. I've developed a fix and piggy-backed it onto a pending PR of mine: #15882.
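For anyone trying to reproduce this locally, here is a minimal sketch of timing that one test in isolation. It assumes an MXNet source checkout run from the repository root, with the nose-style test functions importable from tests/python/unittest and callable directly; this is not the CI tooling, just a local check.

```python
# Minimal sketch: time test_random.py:test_shuffle in isolation.
# Assumes it is run from the MXNet repo root and that the nose-style
# test functions can be called directly (an assumption, not the CI setup).
import sys
import time

sys.path.insert(0, "tests/python/unittest")
import test_random  # noqa: E402

start = time.time()
test_random.test_shuffle()
print("test_shuffle took %.1f s" % (time.time() - start))
```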

@ChaiBapchya
Contributor Author

@pengzhao-intel
Contributor

This is interesting, and we need to figure out whether the increased computation leads to the problem.
@zixuanweeei could you help take a look at the CI?

@zixuanweeei
Contributor

@pengzhao-intel I've seen CPU runtimes of more than 10 minutes across three runs of test_random.py:test_shuffle. There seem to be several existing discussions about the shuffle operator, such as PR #10048, PR #15882, and issue #10277. I will survey them first.

@pengzhao-intel
Contributor

Thanks @zixuanweeei

Could we collect and sort the runtimes of all test cases on the CPU side (CPU, CPU+MKL, CPU+MKLDNN)?
After that, we can see how the runtime changes with a new PR, such as @ChaiBapchya's large tensor PR.
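A rough sketch of how such per-test timings could be collected and sorted for one module is below. The module choice (test_random) and the direct, sequential calls are assumptions for illustration; the CI itself runs these tests through nosetests rather than this way.

```python
# Rough sketch: time every test_* function in one test module and sort by cost.
# The module and the direct calls are assumptions, not the project's tooling.
import inspect
import sys
import time

sys.path.insert(0, "tests/python/unittest")
import test_random  # noqa: E402

timings = []
for name, func in inspect.getmembers(test_random, inspect.isfunction):
    if not name.startswith("test_"):
        continue
    start = time.time()
    try:
        func()
    except Exception as exc:  # keep going so one failing test doesn't hide the rest
        print("skipped %s: %s" % (name, exc))
        continue
    timings.append((time.time() - start, name))

# Print the slowest tests first.
for seconds, name in sorted(timings, reverse=True):
    print("%8.1f s  %s" % (seconds, name))
```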

@zixuanweeei
Contributor

Sure. @pengzhao-intel

BTW, I have disabled the MKLDNN subgraph backend to see whether it impacts the efficiency of the shuffle operator. The results showed that the shuffle operator has the same time cost with and without the MKLDNN subgraph backend.
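As a rough way to reproduce that comparison, one could time the operator directly under each configuration. The sketch below assumes the subgraph backend is selected via the MXNET_SUBGRAPH_BACKEND environment variable set before the process starts, and uses an arbitrary input shape; both are assumptions about the local setup, not a prescribed benchmark.

```python
# Rough sketch: time the shuffle operator under the current backend setting.
# Run once with MXNET_SUBGRAPH_BACKEND=MKLDNN and once with it unset/NONE
# (the variable name and values are assumptions about the local setup).
import time
import mxnet as mx

data = mx.nd.arange(10000000).reshape((10000, 1000))  # arbitrary test shape
mx.nd.waitall()  # make sure setup is finished before timing

start = time.time()
for _ in range(10):
    data = mx.nd.random.shuffle(data)
mx.nd.waitall()  # wait for the asynchronous engine to finish all shuffles
print("10 shuffles took %.2f s" % (time.time() - start))
```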

@zixuanweeei
Contributor

zixuanweeei commented Aug 16, 2019

The fixes from PR #15882 and PR #15922 (they contain the same fix for test_shuffle) have reduced the cost from more than 10 minutes to no more than 2 minutes; it took ~41 s in a local test. The fix to test_shuffle doesn't alter the functionality of the test; it just drops needless equality assertions.

@zixuanweeei
Contributor

From the last comment by @ChaiBapchya, we also found that test_operator.test_convolution_independent_gradients takes too long. That test was run against a library compiled with MKL-DNN, so it will cost even more in a CPU context when MXNet is compiled without MKL-DNN. If PR #15922 works for test_shuffle, we would then move on to reducing the cost of test_operator.test_convolution_independent_gradients.

@aaronmarkham
Contributor

aaronmarkham commented Oct 1, 2019

4 hr timeout on the python3 mkldnn-mkl-cpu test. Why is this test still active? It causes a lot of issues with getting PRs through the pipeline.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16342/2/pipeline/266

@ChaiBapchya
Contributor Author

4 hr timeout again! MKL CPU
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16336/6/pipeline/263

#16336 is a step towards getting conclusive evidence about the perennially slow unit tests. Hopefully we get clarity on this once that PR is merged.

I am leaning towards disabling this test until the timeout issue for MKLDNN is fixed! @aaronmarkham

@ChaiBapchya
Contributor Author
