GPU tests are unstable #12453
Comments
@lebeg Thanks for reporting this.
This is failing again on a p3.2xlarge GPU instance:

time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn &&
time ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_python3_gpu

Result: ERROR
CI failed with a similar error.
Can we close this for now?
It seems it is still failing from time to time, right?
Can we close this? @szha
I had the same problem in some of my NMT experiments running on multiple GPUs on p3.2xlarge. It ran fine sometimes but failed at other times, and the error was not consistent in where it occurred or what messages it displayed. I tested every part of my code without finding any problems. It could be my fault, but is it possible that the issue is with MXNet? Some of the error messages:
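Intermittent errors that appear in inconsistent places are often a symptom of MXNet's asynchronous execution engine, which reports a GPU failure at whatever call happens to synchronize next rather than at the operator that actually failed. A minimal way to localize the real failure (a sketch, not from this thread; `train.py` is a hypothetical stand-in for the actual training script):

```sh
# Run the MXNet engine synchronously, so exceptions are raised by the
# operator that actually failed rather than by a later, unrelated call.
MXNET_ENGINE_TYPE=NaiveEngine python train.py

# Additionally make each CUDA kernel launch block until it completes,
# so CUDA errors surface at the launch site.
CUDA_LAUNCH_BLOCKING=1 MXNET_ENGINE_TYPE=NaiveEngine python train.py
```

NaiveEngine is much slower than the default threaded engine, so it is for debugging only; both variables can be dropped once the offending operator is identified.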
@jzhou316 thanks for pointing this out. Could you give more info about the environment in which this happened? Is it running on EC2? How difficult do you think it is to reproduce? Is there a way to reproduce it every time? Thanks.
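For gathering that environment information, the diagnosis script shipped in the MXNet repository prints the platform, Python, pip, and MXNet versions in one go (assuming the standard tools/diagnose.py location used by the issue template):

```sh
# Download and run MXNet's environment diagnosis script.
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python
```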
Description
Multiple CI jobs were failing with CUDA memory problems:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline/
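When a CI slave starts throwing CUDA memory errors, one quick triage step (a sketch; assumes shell access to the affected instance and the standard nvidia-smi tool) is to check whether earlier jobs leaked GPU memory or left processes behind:

```sh
# Show current GPU memory usage.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# List the compute processes currently holding GPU memory.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
```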
Message
Log with context