
Update mxnet version to fix CI error #863

Merged: 10 commits merged into dmlc:master on Aug 3, 2019

Conversation

xinyu-intel
Member

No description provided.

@xinyu-intel
Member Author

xinyu-intel commented Jul 13, 2019

@zhreshold Caught a bug after updating to the latest nightly build. It seems a cuDNN op introduced it:

import mxnet as mx
import gluoncv as gcv

ctx = mx.gpu(0)
x = mx.random.uniform(shape=(2, 3, 224, 224), ctx=ctx)
net = gcv.model_zoo.get_model('resnet18_v1b_0.89', pretrained=False)
net.initialize()
net.collect_params().reset_ctx(ctx)
net(x)           # the forward pass on GPU goes through the cuDNN BatchNorm path
mx.nd.waitall()  # block until the asynchronous error surfaces
[22:39:50] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
  File "test_model_zoo.py", line 415, in <module>
    test_imagenet_models()
  File "/home/chenxiny/gluon-cv/tests/unittests/common.py", line 43, in test_wrapper
    orig_test(*args, **kwargs)
  File "test_model_zoo.py", line 123, in test_imagenet_models
    _test_model_list(models, ctx, x)
  File "test_model_zoo.py", line 53, in _test_model_list
    mx.nd.waitall()
  File "/home/chenxiny/mxnet-gpu/python/mxnet/ndarray/ndarray.py", line 166, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/chenxiny/mxnet-gpu/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:39:50] include/mxnet/././tensor_blob.h:290: Check failed: this->shape_.Size() == static_cast<size_t>(shape.Size()) (64 vs. 8) : TBlob.get_with_shape: new and old shape do not match total elements
Stack trace:
  [bt] (0) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f46220c1c4b]
  [bt] (1) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(mshadow::Tensor<mshadow::gpu, 1, float> mxnet::TBlob::get_with_shape<mshadow::gpu, 1, float>(mshadow::Shape<1> const&, mshadow::Stream<mshadow::gpu>*) const+0x1e4) [0x7f4624f96194]
  [bt] (2) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNBatchNormOp<float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x6c5) [0x7f46264c8305]
  [bt] (3) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(void mxnet::op::BatchNormCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x90e) [0x7f46264bd96e]
  [bt] (4) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x372) [0x7f46245a5f22]
  [bt] (5) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(+0x3f06e64) [0x7f4624d27e64]
  [bt] (6) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x636) [0x7f4624d23576]
  [bt] (7) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x150) [0x7f4624d35c70]
  [bt] (8) /home/chenxiny/mxnet-gpu/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f4624d35ef6]

@xinyu-intel xinyu-intel self-assigned this Jul 13, 2019
@xinyu-intel xinyu-intel added the bug and enhancement labels Jul 13, 2019
@zhreshold
Member

I'd say it's a really weird error

@xinyu-intel
Member Author

@zhreshold It looks like resnet18_v1b_0.89 is listed as resnet18_v1b_2.6x in gcv.model_zoo.pretrained_model_list(), so the CI tests this model without using pre-trained weights.
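
A quick way to confirm the naming mismatch described above (a minimal sketch; the resnet18_v1b_2.6x alias is taken from this comment, not verified here):

import gluoncv as gcv

# model names that are registered with pretrained weights
names = gcv.model_zoo.pretrained_model_list()
print('resnet18_v1b_0.89' in names)                         # expected False per the report
print([n for n in names if n.startswith('resnet18_v1b')])   # should show the 'resnet18_v1b_2.6x' alias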

@zhreshold
Member

#871

@zhreshold
Member

@xinyu-intel I am trying to fix all these things together, but I'm getting a new error: http://ci.mxnet.io/blue/organizations/jenkins/gluon-cv/detail/PR-870/11/pipeline

@xinyu-intel
Member Author

@zhreshold Okay, I'll take a look at this bug.

@xinyu-intel
Member Author

@zhreshold It's weird that when I test only this one case, there is no error, but when I test the whole file, it errors out...
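
Roughly the two scenarios, for reference (a hypothetical sketch, not the actual CI invocation; the extra model name in the loop is just an illustrative placeholder):

import mxnet as mx
import gluoncv as gcv

ctx = mx.gpu(0)
x = mx.random.uniform(shape=(2, 3, 224, 224), ctx=ctx)

def run_model(name):
    # build the model, move it to GPU, and run one forward pass
    net = gcv.model_zoo.get_model(name, pretrained=False)
    net.initialize()
    net.collect_params().reset_ctx(ctx)
    net(x)
    mx.nd.waitall()

# running only this one case reportedly passes ...
run_model('resnet18_v1b_0.89')

# ... while looping over several models in one process, as the whole test
# file does, reportedly surfaces the TBlob shape-check failure
for name in ['resnet18_v1b', 'resnet18_v1b_0.89']:
    run_model(name)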

@mli
Member

mli commented Aug 3, 2019

Job PR-863-10 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-863/10/index.html
Code coverage of this PR vs. master: (badge images pr.svg and master.svg)

@mli
Member

mli commented Aug 3, 2019

Job PR-863-11 is done.
Docs are uploaded to http://gluon-vision-staging.s3-website-us-west-2.amazonaws.com/PR-863/11/index.html
Code coverage of this PR vs. master: (badge images pr.svg and master.svg)

@xinyu-intel
Member Author

@zhreshold CI passes :)

@zhreshold zhreshold merged commit e2310b8 into dmlc:master Aug 3, 2019
@zhreshold
Member

@xinyu-intel merged 😃

@xinyu-intel
Member Author

@zhreshold Thanks:)
