Training time error in multi gpu #444

Closed
legolas123 opened this issue Nov 23, 2017 · 1 comment

legolas123 commented Nov 23, 2017

I installed the latest caffe-0.16 with CUDA 8, cuDNN 7 and nccl-1.3.4-1. The training net uses an ImageData layer defined as follows:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 103.940002441
    mean_value: 116.779998779
    mean_value: 123.680000305
  }
  image_data_param {
    source: "/home/ubuntu/caffe_1/models/project/train.txt"
    batch_size: 12
    shuffle: true
    new_height: 240
    new_width: 240
  }
}
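
Because the `source` file determines everything the ImageData layer reads, one quick sanity check is to confirm that every line of that list is well formed. The sketch below assumes the usual list format of one `path label` pair per line with an integer label. The function name `find_bad_lines` is hypothetical and not part of Caffe, and this check is not known to be related to the errors reported below.

```python
from pathlib import Path


def find_bad_lines(list_path, check_exists=False):
    """Return (line_number, reason) pairs for malformed list entries.

    Assumes each non-empty line is '<image path> <integer label>'.
    """
    problems = []
    text = Path(list_path).read_text()
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last run of whitespace so the label is the final token.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            problems.append((lineno, "missing label"))
            continue
        path, label = parts
        try:
            int(label)
        except ValueError:
            problems.append((lineno, "non-integer label"))
            continue
        # Optional and potentially slow on large lists: check the image exists.
        if check_exists and not Path(path).is_file():
            problems.append((lineno, "file not found"))
    return problems
```

Calling `find_bad_lines("/home/ubuntu/caffe_1/models/project/train.txt")` would return an empty list if every entry parses.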

The command I ran is:

./build/tools/caffe train -solver models/seres_inception/solver.prototxt -gpu=0,1

This results in the following error message:

F1123 09:52:47.834031 15880 syncedmem.cpp:178] Check failed: Caffe::current_device() == gpu_device_ (0 vs. 1) 

With the option -gpu=all, training instead hangs at a random point without any error message.
The last lines of output before it hangs are:

I1123 09:54:54.429864 18272 common.cpp:228] New stream 0x7fe958334c20 on device 3, thread 140642958915328
I1123 09:54:54.432126 18048 common.cpp:228] New stream 0x7fea1c33a420 on device 0, thread 140644109477632
I1123 09:54:54.439190 18271 common.cpp:228] New stream 0x7fe954334c20 on device 2, thread 140641080178432
I1123 09:54:54.441017 18273 common.cpp:228] New stream 0x7fe960001060 on device 0, thread 140641071785728
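
To see at a glance how the streams in a log like this are spread over the GPUs, the lines can be tallied per device. This is a minimal sketch that assumes the `New stream ... on device N, thread T` wording shown above. `streams_per_device` is a hypothetical helper, not part of Caffe.

```python
import re
from collections import Counter

# Matches the "New stream <ptr> on device <n>, thread <id>" part of a glog line.
STREAM_RE = re.compile(
    r"New stream (0x[0-9a-fA-F]+) on device (\d+), thread (\d+)"
)


def streams_per_device(log_lines):
    """Count 'New stream' log lines per device index."""
    counts = Counter()
    for line in log_lines:
        match = STREAM_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return dict(counts)
```

Applied to the four lines above, this counts two streams on device 0, one each on devices 2 and 3, and none on device 1 within this snippet. Whether that distribution is related to the hang is not established here.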

Issue #357 seems to suggest that this type of deadlock was already fixed in the latest version. Can somebody please help me figure out what the problem could be?

@drnikolaev

Please check v0.16.5 and reopen the issue if the problem still exists.
