ImageDataLayer deadlock problem #357

smuelpeng · 2017-06-18T16:39:35Z

Hello, I've used nvidia-caffe for months, and I cloned the newest version, caffe-0.16, this week.
I found that the image data layer in nvidia-caffe doesn't work, while DataLayer with LMDB works as usual (0.5 times faster on our K40 servers, that's very cool).
When I try ImageDataLayer in a simple 5-class classification task using resnet18, with a prototxt like:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    scale: 1
    crop_size: 224
  }
  image_data_param {
    source: "data/age_5/train.txt"
    batch_size: 64
    mirror: true
    shuffle: true
    new_height: 250
    new_width: 250
    root_folder: "/home/yuzhipeng/data/"
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
·····
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "loss"
}

I found that after the first batch, training stalls at:
4510 I0618 23:22:02.069660 28888 image_data_layer.cpp:79] output data size: 32, 3,224, 224
and I debugged base_data_layer.cpp by setting:

I found that " qid:0 queryid:0" was printed only once.
So I suspect that the ImageDataLayer thread may be reused more than once through improper use of shared_ptr,
or that something takes a lock and never releases it, or something else.
I hope you can check this issue or point out any mistake I've made.
Thank you!

drnikolaev · 2017-06-19T05:51:25Z

Hi @smuelpeng thanks!
Seems like a bug, we are looking into it.

mathmanu · 2017-06-20T13:05:14Z

I faced a similar issue when I was creating a custom layer. It finally worked when I overrode InternalThreadEntryN, commented out the highlighted portion of the code under
if (this->auto_mode_)
and also set threads and parser_threads to 1.

template<typename Ftype, typename Btype>
void ImageLabelListDataLayer<Ftype, Btype>::InternalThreadEntryN(size_t thread_id) {
  .
  .
  .
  //comment this out to avoid hang.
  //if (this->auto_mode_) {
  //  break;
  //} // manual otherwise, thus keep rolling
  iter0 = false;
  .
  .
  .
}
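
To illustrate why leaving the prefetch loop early can look like a hang, here is a minimal toy model in C++. It is not NVCaffe code: ToyBatchQueue, toy_prefetch and toy_train are hypothetical names, and the assumption that nothing restarts the loop after the auto-mode break is mine. The consumer uses a timeout so the sketch terminates; a truly blocking pop would wait forever instead.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Toy stand-in for a prefetch queue (hypothetical, not NVCaffe's queue type).
class ToyBatchQueue {
 public:
  void push(int batch) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      items_.push(batch);
    }
    ready_.notify_one();
  }

  // Waits up to `timeout` for a batch; returns false if none arrives in time.
  bool pop_for(std::chrono::milliseconds timeout, int* batch) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (!ready_.wait_for(lock, timeout, [this] { return !items_.empty(); })) {
      return false;
    }
    *batch = items_.front();
    items_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<int> items_;
};

// Hypothetical prefetch loop: in "auto mode" it leaves after the first batch.
void toy_prefetch(ToyBatchQueue* queue, bool auto_mode, int max_batches) {
  for (int i = 0; i < max_batches; ++i) {
    queue->push(i);
    if (auto_mode) {
      break;  // assumption: nobody restarts this loop in the toy model
    }
  }
}

// Returns how many of `wanted` batches the consumer received before a timeout.
int toy_train(bool auto_mode, int wanted) {
  assert(wanted > 0);
  ToyBatchQueue queue;
  std::thread producer(toy_prefetch, &queue, auto_mode, wanted);
  int received = 0;
  int batch = 0;
  while (received < wanted &&
         queue.pop_for(std::chrono::milliseconds(200), &batch)) {
    ++received;
  }
  producer.join();
  return received;
}
```

Under these assumptions, toy_train(true, 3) should receive a single batch and then time out, while toy_train(false, 3) should receive all three.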

drnikolaev · 2017-08-25T07:41:58Z

@smuelpeng @mathmanu This was reproduced and fixed. Pre-release: https://github.com/drnikolaev/caffe/tree/caffe-0.16
Please verify. Thank you.

mathmanu · 2017-08-28T07:49:04Z

@drnikolaev I can see that you have added a fix in DataLayer. But the layer that I am using is similar to ImageDataLayer. It derives directly from BasePrefetchingDataLayer.

So the fix that you applied doesn't help me - it still hangs.

Shouldn't you be applying the fix to BasePrefetchingDataLayer, since DataLayer and ImageDataLayer both derive from this class?

drnikolaev · 2017-08-28T08:03:47Z

@mathmanu The hang I reproduced was fixed by this commit:

drnikolaev@232d38b#diff-8c16c57bbe3538ff698add14cb67a7dd

Seems like there is something else. May I ask you to give me a self-contained sample?

mathmanu · 2017-08-28T09:37:28Z

scripts_debug_nvcaffe_issue357_v1.zip

The attachment is a self-contained example.

1. Unzip the attachment.
2. Copy image_label_list_data_layer.cpp and image_label_list_data_layer.hpp into the respective folders in your caffe.
3. Build.
4. Change directory into the scripts_debug folder (of the unzipped attachment).
5. Run ./run_all.sh. You can see that it hangs.
6. To remove the hang, open image_label_list_data_layer.cpp and find the line with the comment
   #if 1 //set this to 0 to remove the hang
   Set this to 0 to remove those three lines of code, then build.
7. It should now run fine.

mathmanu · 2017-08-28T09:49:34Z

Please correct the path to your caffe.bin in the script
scripts_debug/training/cityscapes5_jsegnet21v2_2017-08-28_14-33-49/initial/run.sh
before running.

drnikolaev · 2017-08-29T01:36:15Z

Well, the code given is not complete:

/home/snikolaev/CODE/caffe/drnikolaev/caffe/src/caffe/layers/image_label_list_data_layer.cpp:72:13: error: ‘Slice’ in namespace ‘caffe’ does not name a type
       const caffe::Slice &label_slice, Dtype *slice_data) {
etc. etc.

The whole concept of auto_mode_ doesn't make any sense in file-based storage. This is why in my recent fix I added virtual bool auto_mode();. Please give it a try, i.e. replace the line

if (this->auto_mode_) {

with the line

if (this->auto_mode()) {

mathmanu · 2017-08-29T05:00:38Z

Sorry - my mistake. The following definition is required in caffe.proto to make the example compile.
message Slice {
  repeated uint32 dim = 1;
  repeated uint32 stride = 2;
  repeated uint32 offset = 3;
}

mathmanu · 2017-08-29T05:04:37Z

I looked at how you did it in image_data_layer.hpp and I added the same definition for auto_mode in my image_label_list_data_layer.hpp header file as well:
bool auto_mode() const override {
  return false;
}

Now I can remove the overriding function InternalThreadEntryN() from my image_label_list_data_layer class. (That function was added only to take care of this auto_mode issue by overriding the base class implementation.)
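
The shape of this fix can be sketched as follows, assuming the base loop consults a virtual auto_mode() instead of reading the auto_mode_ member directly. ToyPrefetchingLayer and ToyFileListLayer are hypothetical stand-ins, not the real NVCaffe classes, and the loop body is reduced to collecting batch ids.

```cpp
#include <cassert>
#include <vector>

// Hypothetical base class standing in for a prefetching data layer.
class ToyPrefetchingLayer {
 public:
  explicit ToyPrefetchingLayer(bool auto_mode_flag)
      : auto_mode_(auto_mode_flag) {}
  virtual ~ToyPrefetchingLayer() = default;

  // Assumed default: report whatever the configured flag says.
  virtual bool auto_mode() const { return auto_mode_; }

  // Produces up to `max_batches` batch ids, asking the virtual hook
  // (not the raw member) whether to stop after the first one.
  std::vector<int> prefetch(int max_batches) const {
    assert(max_batches >= 0);
    std::vector<int> batches;
    for (int i = 0; i < max_batches; ++i) {
      batches.push_back(i);
      if (this->auto_mode()) {
        break;
      }
    }
    return batches;
  }

 private:
  bool auto_mode_;
};

// File-based layer: opts out of auto mode regardless of the flag.
class ToyFileListLayer : public ToyPrefetchingLayer {
 public:
  using ToyPrefetchingLayer::ToyPrefetchingLayer;
  bool auto_mode() const override { return false; }
};
```

In this sketch, ToyPrefetchingLayer(true).prefetch(3) stops after one batch, whereas ToyFileListLayer(true).prefetch(3) keeps going for all three, without the derived class having to reimplement the loop.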

And it works! No hang.

Thanks for the suggestion!

drnikolaev · 2017-09-01T23:07:28Z

Excellent!

drnikolaev added the bug label Jun 19, 2017

drnikolaev closed this as completed Sep 1, 2017

drnikolaev mentioned this issue Sep 3, 2017

August release: fixes and optimizations #410

Merged

legolas123 mentioned this issue Nov 23, 2017

Training time error in multi gpu #444

Closed
