This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI] unix cpu validation Timeout #15880

Open

ChaiBapchya opened this issue Aug 13, 2019 · 17 comments

Labels: CI

Comments
@ChaiBapchya
Contributor

ChaiBapchya commented Aug 13, 2019

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: CI, Build

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [CI]

@marcoabreu added the CI label Aug 13, 2019
@DickJC123
Contributor

test_random.py:test_shuffle is taking a long time to run. I've seen CPU runtimes between 10 and 50 minutes for that test alone. I've developed a fix and piggy-backed it onto a pending PR of mine: #15882.
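For anyone trying to reproduce this locally, here is a minimal sketch of timing that one test in isolation. It assumes an MXNet source checkout run from the repository root, with the nose-style test functions importable from tests/python/unittest and callable directly; this is not the CI tooling, just a local check.

```python
# Minimal sketch: time test_random.py:test_shuffle in isolation.
# Assumes it is run from the MXNet repo root and that the nose-style
# test functions can be called directly (an assumption, not the CI setup).
import sys
import time

sys.path.insert(0, "tests/python/unittest")
import test_random  # noqa: E402

start = time.time()
test_random.test_shuffle()
print("test_shuffle took %.1f s" % (time.time() - start))
```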

@ChaiBapchya
Contributor Author

@pengzhao-intel
Contributor

This is interesting, and we need to figure out whether the increased computation leads to the problem.
@zixuanweeei could you help take a look at the CI?

@zixuanweeei
Contributor

@pengzhao-intel I've seen CPU runtimes of more than 10 minutes across three runs of test_random.py:test_shuffle. There seem to be several existing discussions about the shuffle operator, such as PR #10048, PR #15882, and issue #10277. I will survey them first.

@pengzhao-intel
Contributor

Thanks @zixuanweeei

Could we collect and sort the runtimes of all test cases on the CPU side (CPU, CPU+MKL, CPU+MKLDNN)?
After that, we can see how the runtime changes with a new PR, such as @ChaiBapchya's large tensor PR.
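A rough sketch of how such per-test timings could be collected and sorted for one module is below. The module choice (test_random) and the direct, sequential calls are assumptions for illustration; the CI itself runs these tests through nosetests rather than this way.

```python
# Rough sketch: time every test_* function in one test module and sort by cost.
# The module and the direct calls are assumptions, not the project's tooling.
import inspect
import sys
import time

sys.path.insert(0, "tests/python/unittest")
import test_random  # noqa: E402

timings = []
for name, func in inspect.getmembers(test_random, inspect.isfunction):
    if not name.startswith("test_"):
        continue
    start = time.time()
    try:
        func()
    except Exception as exc:  # keep going so one failing test doesn't hide the rest
        print("skipped %s: %s" % (name, exc))
        continue
    timings.append((time.time() - start, name))

# Print the slowest tests first.
for seconds, name in sorted(timings, reverse=True):
    print("%8.1f s  %s" % (seconds, name))
```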

@zixuanweeei
Contributor

Sure. @pengzhao-intel

BTW, I have disabled the MKLDNN subgraph backend to see whether it impacts the efficiency of the shuffle operator. The results showed that the shuffle operator has the same time cost with and without the MKLDNN subgraph backend.
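As a rough way to reproduce that comparison, one could time the operator directly under each configuration. The sketch below assumes the subgraph backend is selected via the MXNET_SUBGRAPH_BACKEND environment variable set before the process starts, and uses an arbitrary input shape; both are assumptions about the local setup, not a prescribed benchmark.

```python
# Rough sketch: time the shuffle operator under the current backend setting.
# Run once with MXNET_SUBGRAPH_BACKEND=MKLDNN and once with it unset/NONE
# (the variable name and values are assumptions about the local setup).
import time
import mxnet as mx

data = mx.nd.arange(10000000).reshape((10000, 1000))  # arbitrary test shape
mx.nd.waitall()  # make sure setup is finished before timing

start = time.time()
for _ in range(10):
    data = mx.nd.random.shuffle(data)
mx.nd.waitall()  # wait for the asynchronous engine to finish all shuffles
print("10 shuffles took %.2f s" % (time.time() - start))
```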

@zixuanweeei
Contributor

zixuanweeei commented Aug 16, 2019

The fixes from PR #15882 and PR #15922 (they contain the same fix for test_shuffle) have reduced the cost from more than 10 minutes to no more than 2 minutes; it took ~41 s in a local test. The fix to test_shuffle doesn't alter the functionality of the test; it just drops needless equality assertions.

@zixuanweeei
Contributor

From the last comment by @ChaiBapchya, we also found that test_operator.test_convolution_independent_gradients takes too long. That test was run against a library compiled with MKL-DNN, so it will cost even more in a CPU context when MXNet is compiled without MKL-DNN. If PR #15922 works for test_shuffle, we would then move on to reducing the cost of test_operator.test_convolution_independent_gradients.

@aaronmarkham
Contributor

aaronmarkham commented Oct 1, 2019

4 hr timeout on the python3 mkldnn-mkl-cpu test. Why is this test still active? It causes a lot of issues with getting PRs through the pipeline.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16342/2/pipeline/266

@ChaiBapchya
Contributor Author

4 hr timeout again! MKL CPU
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16336/6/pipeline/263

#16336 is a step towards getting conclusive evidence about the perennially slow unit tests. Hopefully we get clarity on this once that PR is merged.

I am leaning towards disabling this test until the timeout issue for MKLDNN is fixed! @aaronmarkham

@ChaiBapchya
Contributor Author
