inferring by using multi-thread will be hung and the results are not right #2565
Comments
I also encountered this bug when running language_model. Here is the gdb log:
(gdb) bt
#0 0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
#2 0x00007fffef932ad3 in _wrap_GradientMachine_forward () at /home/lizhao/Paddle/build/paddle/api/PaddlePYTHON_wrap.cxx:22906
#3 0x00007ffff7d1e3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#4 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#5 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff1c2cab0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=4, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#6 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#7 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#8 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#9 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b38d30, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=2, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#10 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007ffff7ca16b7 in gen_send_ex (gen=0x37966e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#14 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#15 0x00007ffff7ca16b7 in gen_send_ex (gen=0x379e1e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#16 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#17 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x37a9bb0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=1, kws=<value optimized out>, kwcount=1, defs=0x37a6468, defcount=1, closure=0x0)
at Python/ceval.c:3253
#18 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#19 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#21 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b26830, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=3, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#22 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00007ffff7d1ec56 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b28ab0, globals=<value optimized out>, locals=<value optimized out>,
args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
at Python/ceval.c:3253
#29 0x00007ffff7d20242 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>)
at Python/ceval.c:667
#30 0x00007ffff7d3a62c in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=0x640160, locals=0x640160,
flags=<value optimized out>, arena=<value optimized out>) at Python/pythonrun.c:1353
#31 0x00007ffff7d3a700 in PyRun_FileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", start=<value optimized out>, globals=
0x640160, locals=0x640160, closeit=1, flags=0x7fffffffdd10) at Python/pythonrun.c:1339
#32 0x00007ffff7d3bc0c in PyRun_SimpleFileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", closeit=1, flags=0x7fffffffdd10)
at Python/pythonrun.c:943
#33 0x00007ffff7d4d4cc in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:639
#34 0x000000318ae1ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400659 in _start ()
(gdb) f 1
#1 0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
354 void waitOutArgsReady() { outArgsReadySem_.wait(); }
(gdb) l
349
350 void start();
351
352 void onPassEnd() { gradientMachine_->onPassEnd(); }
353
354 void waitOutArgsReady() { outArgsReadySem_.wait(); }
355
356 void notifyTaskReady() { taskReadySem_.post(); }
357
358 int getDeviceId() const { return deviceId_; }
(gdb) i threads
29 Thread 0x7fffa29cc700 (LWP 15404) 0x000000318aeddfc3 in poll () from /lib64/libc.so.6
28 Thread 0x7fffa33cd700 (LWP 15403) 0x000000318aee99af in accept4 () from /lib64/libc.so.6
* 1 Thread 0x7ffff7c3b700 (LWP 14790) 0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
And the same as @lcy-seso, when I set
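For readers following the backtrace: frame #1 shows the main thread blocked in waitOutArgsReady(), which waits on a semaphore that a worker thread is expected to post. Below is a toy Python model of that kind of handshake, only to illustrate how a missing post leaves the caller stuck. It is not Paddle's implementation; the class ToyWorker, the post_result flag, and the timeout are assumptions made up for this sketch.

```python
import threading


class ToyWorker:
    """Toy stand-in for one trainer thread (hypothetical, not Paddle code)."""

    def __init__(self, post_result=True):
        self.task_ready = threading.Semaphore(0)
        self.out_args_ready = threading.Semaphore(0)
        self.post_result = post_result
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            # Block until the main thread hands over a batch.
            self.task_ready.acquire()
            if self.post_result:
                # Signal that the outputs for this batch are ready.
                self.out_args_ready.release()
            # With post_result=False the signal is never sent,
            # so the waiting side would block forever without a timeout.

    def notify_task_ready(self):
        self.task_ready.release()

    def wait_out_args_ready(self, timeout=None):
        # Returns False on timeout instead of blocking indefinitely.
        return self.out_args_ready.acquire(timeout=timeout)


def forward(workers, timeout=0.5):
    """Dispatch one batch to every worker, then wait for all outputs."""
    for w in workers:
        w.notify_task_ready()
    return all([w.wait_out_args_ready(timeout) for w in workers])
```

With every worker posting, forward() returns True on repeated calls. If a single worker skips its post, the wait for that worker never completes, which is the same shape as the sem_wait frame at the top of the backtrace. The real code waits without a timeout, so the process hangs instead of returning False.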
I am running text generation with an encoder-decoder model. Here is my code: https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/generate.py.
I found that:
- When trainer_count is set larger than 1, the generation process hangs when infer is called the second time.
- The outputs differ between trainer_count=1 and trainer_count > 1: with use_gpu=True, setting trainer_count=1 and setting trainer_count=4 give different outputs.
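Both symptoms can be checked from outside the process, which helps because a call stuck inside native code may not respond to in-process timeouts. The sketch below is one possible harness, not part of the original report. It runs a command in a child process with a watchdog timeout, and it finds the first line at which two captured outputs differ. The helper names run_with_watchdog and first_difference are hypothetical.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout):
    """Run cmd in a child process; return (finished, captured_output).

    If the child does not exit within `timeout` seconds it is killed
    and (False, None) is returned.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.stdout


def first_difference(lines_a, lines_b):
    """Return the index of the first differing line, or None if equal."""
    for i, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return i
    if len(lines_a) != len(lines_b):
        # One output is a strict prefix of the other.
        return min(len(lines_a), len(lines_b))
    return None


if __name__ == "__main__":
    # Stand-in command; replace with the real generation script.
    finished, output = run_with_watchdog(
        [sys.executable, "-c", "print('ok')"], timeout=30
    )
    print(finished, output)
```

To use it on this issue, one would run the generation script once per trainer_count setting and pass the two outputs, split into lines, to first_difference. How trainer_count is passed to the script depends on the script itself and is not shown here.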