ImageDataLayer deadlock problem #357

Closed
smuelpeng opened this issue Jun 18, 2017 · 11 comments
@smuelpeng

Hello, I've used NVIDIA Caffe for months, and I cloned the newest version, caffe-0.16, this week.
I found that the ImageData layer in NVIDIA Caffe doesn't work properly, while DataLayer with LMDB works as usual (0.5 times faster on our K40 servers, which is very cool).
I tried ImageDataLayer on a simple 5-class classification task using ResNet-18, with a prototxt like:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    scale: 1
    crop_size: 224
  }
  image_data_param {
    source: "data/age_5/train.txt"
    batch_size: 64
    mirror: true
    shuffle: true
    new_height: 250
    new_width: 250
    root_folder: "/home/yuzhipeng/data/"
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
·····
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "loss"
}

After the first batch, training stalls at:
4510 I0618 23:22:02.069660 28888 image_data_layer.cpp:79] output data size: 32, 3,224, 224
I then added debug output to base_data_layer.cpp:
[image]
and found that "qid:0 queryid:0" was printed only once.
So I suspect that the ImageDataLayer thread is being reused more than once through improper use of shared_ptr,
or that a lock is taken and never released, or something similar.
I hope you can look into this issue or point out any mistake I've made.
Thank you!

@drnikolaev

Hi @smuelpeng, thanks!
This seems like a bug; we are looking into it.

@drnikolaev drnikolaev added the bug label Jun 19, 2017
@mathmanu

mathmanu commented Jun 20, 2017

I faced a similar issue when creating a custom layer. It finally worked when I overrode InternalThreadEntryN, commented out the highlighted portion of the code under
if (this->auto_mode_)
and set threads and parser_threads to 1.

template<typename Ftype, typename Btype>
void ImageLabelListDataLayer<Ftype, Btype>::InternalThreadEntryN(size_t thread_id) {
.
.
.
//comment this out to avoid hang.
//if (this->auto_mode_) {
// break;
//} // manual otherwise, thus keep rolling

iter0 = false;
.
.
.
}
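
To make the workaround above easier to follow, here is a toy model of a prefetch loop that leaves after its first batch when an auto-mode flag is set. It is only a sketch: `produce_batches` and its parameters are hypothetical names, and the assumption that something else is expected to restart the loop after that first batch is inferred from this thread, not taken from NVCaffe's source.

```cpp
#include <cstddef>

// Toy model only: this is NOT NVCaffe's InternalThreadEntryN.
// Returns how many batches the loop produces before it exits.
std::size_t produce_batches(bool auto_mode, std::size_t batches_requested) {
  std::size_t produced = 0;
  while (produced < batches_requested) {
    ++produced;  // stand-in for loading one batch and pushing it to a queue
    if (auto_mode) {
      break;     // leave after the first batch, assuming a later restart
    }
  }
  return produced;
}
```

In this model, `produce_batches(true, 5)` stops after a single batch. If nothing restarts the loop, a consumer waiting for the second batch would block forever, which would be consistent with the debug line being printed only once. With the `break` removed (or the flag false), all requested batches are produced.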

@drnikolaev

@smuelpeng @mathmanu This was reproduced and fixed. Pre-release: https://github.com/drnikolaev/caffe/tree/caffe-0.16
Please verify. Thank you.

@mathmanu

@drnikolaev I can see that you have added a fix in DataLayer. But the layer that I am using is similar to ImageDataLayer. It derives directly from BasePrefetchingDataLayer

So the fix that you applied doesn't help me - it still hangs.

Shouldn't the fix be applied to BasePrefetchingDataLayer, since DataLayer and ImageDataLayer are both derived from this class?

@drnikolaev

@mathmanu The hang I reproduced was fixed by this commit:

drnikolaev@232d38b#diff-8c16c57bbe3538ff698add14cb67a7dd

It seems there is something else going on. Could you give me a self-contained sample?

@mathmanu

scripts_debug_nvcaffe_issue357_v1.zip

The attachment is a self-contained example.

  1. Unzip the attachment.
  2. Copy image_label_list_data_layer.cpp and image_label_list_data_layer.hpp into the respective folders in your Caffe tree.
  3. Build.
  4. Change directory into the scripts_debug folder (of the unzipped attachment).
  5. Run ./run_all.sh
    You will see that it hangs.
  6. To remove the hang, open image_label_list_data_layer.cpp and find the line with the comment
    #if 1 //set this to 0 to remove the hang
    Set it to 0 to remove those three lines of code, then rebuild.
    It should now run fine.

@mathmanu

mathmanu commented Aug 28, 2017

Please correct the path to your caffe.bin in the script
scripts_debug/training/cityscapes5_jsegnet21v2_2017-08-28_14-33-49/initial/run.sh
before running.

@drnikolaev

Well, the code given is not complete:

/home/snikolaev/CODE/caffe/drnikolaev/caffe/src/caffe/layers/image_label_list_data_layer.cpp:72:13: error: ‘Slice’ in namespace ‘caffe’ does not name a type
       const caffe::Slice &label_slice, Dtype *slice_data) {
etc. etc.

The whole concept of auto_mode_ doesn't make any sense in file-based storage. This is why in my recent fix I added virtual bool auto_mode();. Please give it a try, i.e. replace the line

if (this->auto_mode_) {

with the line

if (this->auto_mode()) {
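
The difference between reading the field and calling the accessor can be sketched as follows. This is a minimal illustration, not NVCaffe code: `ToyPrefetchLayer`, `ToyFileLayer`, and `Run` are hypothetical names, and only the idea of a `virtual bool auto_mode()` that a file-based layer overrides to return false comes from this thread.

```cpp
#include <cstddef>

// Hypothetical stand-in for a prefetching base layer (not an NVCaffe class).
class ToyPrefetchLayer {
 public:
  explicit ToyPrefetchLayer(bool auto_mode_flag) : auto_mode_(auto_mode_flag) {}
  virtual ~ToyPrefetchLayer() = default;

  // Default behaviour: report whatever the stored flag says.
  virtual bool auto_mode() const { return auto_mode_; }

  // Toy prefetch loop: consults the virtual accessor, not the raw field,
  // so a derived class can opt out of the early exit.
  std::size_t Run(std::size_t batches_requested) const {
    std::size_t produced = 0;
    while (produced < batches_requested) {
      ++produced;               // stand-in for producing one batch
      if (auto_mode()) break;   // early exit only if the layer allows it
    }
    return produced;
  }

 protected:
  bool auto_mode_;
};

// Hypothetical file-based layer: auto mode never applies to it.
class ToyFileLayer : public ToyPrefetchLayer {
 public:
  using ToyPrefetchLayer::ToyPrefetchLayer;
  bool auto_mode() const override { return false; }
};
```

In this sketch a `ToyFileLayer` keeps producing batches even when the stored flag is true, because the loop asks the object rather than reading `auto_mode_` directly. A derived layer then no longer needs to override the whole loop just to skip the early exit.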

@mathmanu

Sorry - my mistake. The following definition is required in caffe.proto to make the example compile.
message Slice {
  repeated uint32 dim = 1;
  repeated uint32 stride = 2;
  repeated uint32 offset = 3;
}

@mathmanu

mathmanu commented Aug 29, 2017

I looked at how you did it in image_data_layer.hpp and added the same definition of auto_mode to my image_label_list_data_layer.hpp header file:
bool auto_mode() const override {
  return false;
}

Now I can remove the overriding function InternalThreadEntryN() from my image_label_list_data_layer class. (That function was added only to work around this auto_mode issue by overriding the base class implementation.)

And it works! No hang.

Thanks for the suggestion!

@drnikolaev

Excellent!
