ImageDataLayer deadlock problem #357
Hi @smuelpeng, thanks!
I faced a similar issue when I was creating a custom layer. It finally worked once I commented out the highlighted portion of the code under template<typename Ftype, typename Btype>.
@smuelpeng @mathmanu This was reproduced and fixed. Pre-release: https://github.com/drnikolaev/caffe/tree/caffe-0.16
@drnikolaev I can see that you have added a fix in DataLayer, but the layer I am using is similar to ImageDataLayer: it derives directly from BasePrefetchingDataLayer, so the fix you applied doesn't help me - it still hangs. Shouldn't the fix be applied to BasePrefetchingDataLayer, since both DataLayer and ImageDataLayer derive from that class?
@mathmanu The hang I reproduced was fixed by this commit: drnikolaev@232d38b#diff-8c16c57bbe3538ff698add14cb67a7dd It seems there is something else going on. May I ask you for a self-contained sample?
scripts_debug_nvcaffe_issue357_v1.zip
The attachment is a self-contained example. Please correct the path to your caffe.bin in the script.
Well, the code given is not complete.
The whole concept of auto_mode_ doesn't make any sense for file-based storage. This is why, in my recent fix, I added
with the line
Sorry, my mistake. The following definition is required in caffe.proto to make the example compile.
I looked at how you did it in image_data_layer.hpp and added the same definition for auto_mode in my image_label_list_data_layer.hpp header file as well. Now I can remove the overriding function InternalThreadEntryN() from my image_label_list_data_layer class (that function had been added only to work around this auto_mode issue by overriding the base-class implementation). And it works! No hang. Thanks for the suggestion!
Excellent!
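The override pattern discussed above can be sketched in a few lines. This is a minimal illustration, not the actual NVCaffe declarations: the class names, the auto_mode() accessor, and its role in sizing the prefetch machinery are simplified stand-ins for what image_data_layer.hpp does in the 0.16 pre-release. The idea is that the base prefetching layer consults a virtual auto_mode() flag, and a file-based derived layer reports false so the base code never waits on queues that auto mode would have created.

```cpp
#include <cassert>

// Hypothetical simplification of BasePrefetchingDataLayer: the base class
// exposes a virtual auto_mode() that the prefetch logic consults when
// deciding how to set up its parser threads and queues.
class BasePrefetchingDataLayerSketch {
 public:
  virtual ~BasePrefetchingDataLayerSketch() {}
  // Default behavior: auto-tune the prefetch configuration.
  virtual bool auto_mode() const { return true; }
};

// Hypothetical file-list-based layer: auto mode makes no sense for
// file-based storage, so the derived class reports false, mirroring the
// definition added to image_label_list_data_layer.hpp in the comment above.
class ImageLabelListDataLayerSketch : public BasePrefetchingDataLayerSketch {
 public:
  bool auto_mode() const override { return false; }
};
```

With this override in place, the derived class no longer needs to replace the whole InternalThreadEntryN() implementation just to sidestep the auto-mode path.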
Hello, I've used nvidia-caffe for months, and I cloned the newest version, caffe-0.16, this week.
But I found that the image data layer in nvidia-caffe does not work well, while DataLayer with LMDB works as usual (0.5 times faster on our K40 servers - that's very cool).
When I try ImageDataLayer in a simple 5-class classification task using ResNet-18, with a prototxt like:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    scale: 1
    crop_size: 224
  }
  image_data_param {
    source: "data/age_5/train.txt"
    batch_size: 64
    mirror: true
    shuffle: true
    new_height: 250
    new_width: 250
    root_folder: "/home/yuzhipeng/data/"
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
.....
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "loss"
}
Well, I found that after the first batch, training stalls at:
4510 I0618 23:22:02.069660 28888 image_data_layer.cpp:79] output data size: 32, 3,224, 224
I debugged base_data_layer.cpp by setting:
and found that "qid:0 queryid:0" was printed only once.
So I suspect that the ImageDataLayer thread may be reused more than once through improper use of shared_ptr, or that a lock is taken and never released, or something else.
I hope you can check this issue or point out a mistake I've made.
Thank you!
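The "lock and didn't unlock it" guess above is one of the classic causes of exactly this kind of hang. As a minimal illustration, unrelated to the actual NVCaffe internals, the sketch below shows the RAII idiom that rules this class of bug out: holding a mutex through a std::lock_guard guarantees the lock is released on every exit path, so a second thread touching the shared state can never block forever on a forgotten unlock.

```cpp
#include <mutex>
#include <thread>

// Shared state guarded by a mutex, standing in for whatever the prefetch
// threads would contend on (queues, counters, etc.).
std::mutex batch_mutex;
int batches_consumed = 0;

// Each call locks, updates, and - thanks to RAII - always unlocks, even if
// the body returned early or threw. A manual lock()/unlock() pair would
// deadlock the second thread if any path skipped the unlock().
void consume_batch() {
  std::lock_guard<std::mutex> lock(batch_mutex);
  ++batches_consumed;
}

// Two threads taking the lock in turn; if the mutex were ever left locked,
// join() here would hang just like the training run described above.
int consume_from_two_threads() {
  std::thread t1(consume_batch);
  std::thread t2(consume_batch);
  t1.join();
  t2.join();
  return batches_consumed;
}
```

This does not show where (or whether) NVCaffe actually leaks a lock; it only demonstrates the pattern that makes such a leak impossible at the call sites that use it.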