
Inference using multiple threads hangs and the results are not correct #2565

Closed

lcy-seso opened this issue Jun 22, 2017 · 3 comments

lcy-seso commented Jun 22, 2017

I am running text generation using an encoder-decoder model; here is my code: https://github.com/lcy-seso/models/blob/refine_seq2seq/nmt_without_attention/generate.py.

I found that:

  1. If trainer_count is set larger than 1, the generation process hangs when infer is called the second time (a minimal repro sketch follows this list).
  2. The prediction results differ between trainer_count=1 and trainer_count > 1.
  3. The bug occurs in both CPU and GPU mode.
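
For reference, the failure pattern boils down to the following. This is only a minimal sketch, assuming the paddle.v2 API used in the linked generate.py; prediction, parameters, and test_batch are placeholders for the output layer, the loaded parameters, and the prepared input in that script.

    import paddle.v2 as paddle

    # Any trainer_count > 1 triggers the problem; trainer_count=1 behaves normally.
    paddle.init(use_gpu=True, trainer_count=4)

    # prediction, parameters, and test_batch stand in for the output layer,
    # the loaded parameters, and the prepared input of the linked generate.py.
    result = paddle.infer(output_layer=prediction,
                          parameters=parameters,
                          input=test_batch)   # first call returns, but the
                                              # result is already wrong
    result = paddle.infer(output_layer=prediction,
                          parameters=parameters,
                          input=test_batch)   # second call never returns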

The output when setting trainer_count=1 and use_gpu=True looks like this:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-119.7212       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> . <e>
-170.2804       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> , <unk> <unk> <unk>
-170.3101       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the
-170.5066       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> <unk> <unk>
-170.5434       The <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of the <unk> of <unk>

But when setting trainer_count=4 and use_gpu=True, the outputs are different:

Les <unk> se <unk> au sujet de la <unk> des <unk> alors que de <unk> <unk> sont en jeu
-8.0064 <e>
-16.0127        <s> <e>
-16.0127        the <e>
-16.0127        , <e>
-16.0127        <unk> <e>
lcy-seso added the Bug label on Jun 22, 2017
lcy-seso assigned reyoung and ghost on Jun 22, 2017

livc commented Jun 22, 2017

I also encountered this bug when running language_model.

With trainer_count=4, it hangs when infer is called the second time.

gdb log:

(gdb) bt
#0  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
#2  0x00007fffef932ad3 in _wrap_GradientMachine_forward () at /home/lizhao/Paddle/build/paddle/api/PaddlePYTHON_wrap.cxx:22906
#3  0x00007ffff7d1e3a3 in ext_do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4331
#4  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2705
#5  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff1c2cab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=4, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#6  0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#7  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#8  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#9  0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b38d30, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=2, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#10 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007ffff7ca16b7 in gen_send_ex (gen=0x37966e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#14 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#15 0x00007ffff7ca16b7 in gen_send_ex (gen=0x379e1e0, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#16 0x00007ffff7d199ed in PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2497
#17 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x37a9bb0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=1, kws=<value optimized out>, kwcount=1, defs=0x37a6468, defcount=1, closure=0x0)
    at Python/ceval.c:3253
#18 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#19 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#21 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b26830, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=3, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#22 0x00007ffff7d1e4a1 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00007ffff7d1ec56 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00007ffff7d20130 in PyEval_EvalCodeEx (co=0x7ffff7b28ab0, globals=<value optimized out>, locals=<value optimized out>,
    args=<value optimized out>, argcount=0, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#29 0x00007ffff7d20242 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>)
    at Python/ceval.c:667
#30 0x00007ffff7d3a62c in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=0x640160, locals=0x640160,
    flags=<value optimized out>, arena=<value optimized out>) at Python/pythonrun.c:1353
#31 0x00007ffff7d3a700 in PyRun_FileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", start=<value optimized out>, globals=
    0x640160, locals=0x640160, closeit=1, flags=0x7fffffffdd10) at Python/pythonrun.c:1339
#32 0x00007ffff7d3bc0c in PyRun_SimpleFileExFlags (fp=0x6c69c0, filename=0x7fffffffe137 "infer.py", closeit=1, flags=0x7fffffffdd10)
    at Python/pythonrun.c:943
#33 0x00007ffff7d4d4cc in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:639
#34 0x000000318ae1ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400659 in _start ()
(gdb) f 1
#1  0x00007fffefaad23f in paddle::MultiGradientMachine::getOutArgs(std::vector<paddle::Argument, std::allocator<paddle::Argument> >*,
 paddle::enumeration_wrapper::PassType) () at /home/lizhao/Paddle/paddle/gserver/gradientmachines/MultiGradientMachine.h:354
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
(gdb) l
349
350       void start();
351
352       void onPassEnd() { gradientMachine_->onPassEnd(); }
353
354       void waitOutArgsReady() { outArgsReadySem_.wait(); }
355
356       void notifyTaskReady() { taskReadySem_.post(); }
357
358       int getDeviceId() const { return deviceId_; }
(gdb) i threads
  29 Thread 0x7fffa29cc700 (LWP 15404)  0x000000318aeddfc3 in poll () from /lib64/libc.so.6
  28 Thread 0x7fffa33cd700 (LWP 15403)  0x000000318aee99af in accept4 () from /lib64/libc.so.6
* 1 Thread 0x7ffff7c3b700 (LWP 14790)  0x000000318b20d720 in sem_wait () from /lib64/libpthread.so.0
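
The backtrace shows the main thread blocked in MultiGradientMachine::getOutArgs, i.e. in waitOutArgsReady() on outArgsReadySem_, waiting for a worker thread that apparently never posts the semaphore on the second call. For illustration only (a plain-Python analogue, not Paddle code), the handshake it is stuck on looks like this:

    import threading

    # Analogue of outArgsReadySem_: a worker thread posts it when the forward
    # results are ready; the main thread waits on it in getOutArgs.
    out_args_ready = threading.Semaphore(0)

    def worker_forward():
        # ... pretend to compute the output arguments ...
        out_args_ready.release()          # notifyOutArgsReady

    # First infer: a worker is dispatched and posts the semaphore, so the wait returns.
    threading.Thread(target=worker_forward).start()
    out_args_ready.acquire()              # waitOutArgsReady -> returns
    print("first call returned")

    # Second infer: if no worker is dispatched again, nothing ever posts the
    # semaphore and the wait blocks forever.  A timeout is used here only so
    # the sketch terminates instead of hanging like the real process.
    if not out_args_ready.acquire(timeout=2):
        print("second call would hang: semaphore never posted")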

livc commented Jun 22, 2017

The same as @lcy-seso: when I set trainer_count=4, the first infer result is [[ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]], which is obviously wrong.
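
One way to confirm the wrong results independently of the hang is to dump the probabilities from a trainer_count=1 run and a trainer_count=4 run of the same script and compare them. A rough sketch only; the file names are hypothetical, and each run is a separate process because trainer_count is fixed at paddle.init time:

    import numpy as np

    # Hypothetical dumps: the same infer script run twice, once with
    # trainer_count=1 and once with trainer_count=4, each saving its
    # probability output with np.save.
    single = np.load("probs_trainer_count_1.npy")
    multi = np.load("probs_trainer_count_4.npy")

    print(single[0])   # expected: a peaked distribution over the 10 classes
    print(multi[0])    # observed here: a uniform [0.1] * 10

    # In a correct build these should agree; with this bug they do not.
    print("outputs match:", np.allclose(single, multi, atol=1e-5))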


lcy-seso commented Jun 23, 2017

I found that this issue is a duplicate of #2534, so I am closing it. We can track the problem-solving progress in #2534.
