
Inference using multiple threads hangs and the results are not correct #2565

Closed

lcy-seso opened this issue Jun 22, 2017 · 3 comments

lcy-seso commented Jun 22, 2017

I am running text generation using an encoder-decoder model; here is my code: https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/generate.py.

I found that:

  1. If trainer_count is set larger than 1, the generation process hangs when infer is called the second time (a minimal repro sketch follows this list).
  2. The prediction results differ between trainer_count=1 and trainer_count > 1.
  3. The bug occurs in both CPU and GPU mode.
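
For reference, the failure pattern boils down to the following. This is only a minimal sketch, assuming the paddle.v2 API used in the linked generate.py; prediction, parameters, and test_batch are placeholders for the output layer, the loaded parameters, and the prepared input in that script.

    import paddle.v2 as paddle

    # Any trainer_count > 1 triggers the problem; trainer_count=1 behaves normally.
    paddle.init(use_gpu=True, trainer_count=4)

    # prediction, parameters, and test_batch stand in for the output layer,
    # the loaded parameters, and the prepared input of the linked generate.py.
    result = paddle.infer(output_layer=prediction,
                          parameters=parameters,
                          input=test_batch)   # first call returns, but the
                                              # result is already wrong
    result = paddle.infer(output_layer=prediction,
                          parameters=parameters,
                          input=test_batch)   # second call never returns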

The output when setting trainer_count=1 and use_gpu=True looks like this:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-119.7212       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> . <e>
-170.2804       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> , <unk> <unk> <unk>
-170.3101       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the
-170.5066       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> <unk>
-170.5434       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of <unk>

But when setting trainer_count=4 and use_gpu=True, the outputs are different:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-8.0064 <e>
-16.0127        <s> <e>
-16.0127        the <e>
-16.0127        , <e>
-16.0127        <unk> <e>
lcy-seso added the Bug label on Jun 22, 2017
lcy-seso assigned reyoung and ghost on Jun 22, 2017

livc commented Jun 22, 2017

I also encountered this bug when running language_model.

With trainer_count=4, it hangs when infer is called the second time.

gdb log:

(gdb) bt
#0  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
#2  0x00007fffef932ad3 in _wrap_GradientMachine_forward () at /home/lizhao/Paddle/build/paddle/api/PaddlePYTHON_wrap.cxx:22906
#3  0x00007ffff7d1e3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#4  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#5  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff1c2cab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=4, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#6  0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#7  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#8  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#9  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b38d30, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=2, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#10 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007ffff7ca16b7 in gen_send_ex (gen=0x37966e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#14 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#15 0x00007ffff7ca16b7 in gen_send_ex (gen=0x379e1e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#16 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#17 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x37a9bb0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=<value optimized out>, kwcount=1, defs=0x37a6468, defcount=1, closure=0x0)
    at Python/ceval.c:3253
#18 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#19 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#21 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b26830, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=3, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#22 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00007ffff7d1ec56 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b28ab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#29 0x00007ffff7d20242 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>)
    at Python/ceval.c:667
#30 0x00007ffff7d3a62c in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=0x640160, locals=0x640160,
    flags=<value optimized out>, arena=<value optimized out>) at Python/pythonrun.c:1353
#31 0x00007ffff7d3a700 in PyRun_FileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", start=<value optimized out>, globals=
    0x640160, locals=0x640160, closeit=1, flags=0x7fffffffdd10) at Python/pythonrun.c:1339
#32 0x00007ffff7d3bc0c in PyRun_SimpleFileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", closeit=1, flags=0x7fffffffdd10)
    at Python/pythonrun.c:943
#33 0x00007ffff7d4d4cc in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:639
#34 0x000000318ae1ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400659 in _start ()
(gdb) f 1
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
(gdb) l
349
350       void start();
351
352       void onPassEnd() { gradientMachine_->onPassEnd(); }
353
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
355
356       void notifyTaskReady() { taskReadySem_.post(); }
357
358       int getDeviceId() const { return deviceId_; }
(gdb) i threads
  29 Thread 0x7fffa29cc700 (LWP 15404)  0x000000318aeddfc3 in poll () from /lib64/libc.so.6
  28 Thread 0x7fffa33cd700 (LWP 15403)  0x000000318aee99af in accept4 () from /lib64/libc.so.6
* 1 Thread 0x7ffff7c3b700 (LWP 14790)  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
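
The backtrace shows the main thread blocked in MultiGradientMachine::getOutArgs, i.e. in waitOutArgsReady() on outArgsReadySem_, waiting for a worker thread that apparently never posts the semaphore on the second call. For illustration only (a plain-Python analogue, not Paddle code), the handshake it is stuck on looks like this:

    import threading

    # Analogue of outArgsReadySem_: a worker thread posts it when the forward
    # results are ready; the main thread waits on it in getOutArgs.
    out_args_ready = threading.Semaphore(0)

    def worker_forward():
        # ... pretend to compute the output arguments ...
        out_args_ready.release()          # notifyOutArgsReady

    # First infer: a worker is dispatched and posts the semaphore, so the wait returns.
    threading.Thread(target=worker_forward).start()
    out_args_ready.acquire()              # waitOutArgsReady -> returns
    print("first call returned")

    # Second infer: if no worker is dispatched again, nothing ever posts the
    # semaphore and the wait blocks forever.  A timeout is used here only so
    # the sketch terminates instead of hanging like the real process.
    if not out_args_ready.acquire(timeout=2):
        print("second call would hang: semaphore never posted")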

livc commented Jun 22, 2017

The same as @lcy-seso: when I set trainer_count=4, the first infer result is [[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]], which is obviously wrong.
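
One way to confirm the wrong results independently of the hang is to dump the probabilities from a trainer_count=1 run and a trainer_count=4 run of the same script and compare them. A rough sketch only; the file names are hypothetical, and each run is a separate process because trainer_count is fixed at paddle.init time:

    import numpy as np

    # Hypothetical dumps: the same infer script run twice, once with
    # trainer_count=1 and once with trainer_count=4, each saving its
    # probability output with np.save.
    single = np.load("probs_trainer_count_1.npy")
    multi = np.load("probs_trainer_count_4.npy")

    print(single[0])   # expected: a peaked distribution over the 10 classes
    print(multi[0])    # observed here: a uniform [0.1] * 10

    # In a correct build these should agree; with this bug they do not.
    print("outputs match:", np.allclose(single, multi, atol=1e-5))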


lcy-seso commented Jun 23, 2017

I found that this issue is a duplicate of #2534, so I am closing it. We can track the problem-solving progress in #2534.
