parallel_nn related crash #241

Closed
byzhang opened this issue Oct 25, 2016 · 6 comments

@byzhang

byzhang commented Oct 25, 2016

If I turn on --parallel_nn, the translation demo (./train.sh) will crash:

#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:33
#1  0x0000000000740f90 in paddle::BaseMatrixT<float>::assign(paddle::BaseMatrixT<float>&) ()
#2  0x00000000005b4fb5 in paddle::GruStepLayer::forward (this=0x16875e80, passType=<optimized out>) at /home/byzhang/src/github/Paddle/paddle/gserver/layers/GruStepLayer.cpp:98
#3  0x000000000062639e in paddle::NeuralNetwork::forward (this=0x16866ac0, inArgs=..., outArgs=0x7fff86369b10, passType=paddle::enumeration_wrapper::PASS_TRAIN) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/NeuralNetwork.cpp:242
#4  0x00000000006164af in paddle::RecurrentGradientMachine::forward (this=<optimized out>, inArgs=..., outArgs=<optimized out>, passType=paddle::enumeration_wrapper::PASS_TRAIN) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp:546
#5  0x000000000057257a in paddle::RecurrentLayerGroup::forward (this=<optimized out>, passType=<optimized out>) at /home/byzhang/src/github/Paddle/paddle/gserver/layers/RecurrentLayerGroup.cpp:42
#6  0x0000000000621854 in paddle::ParallelThread::computeThread (this=0x168c59b0) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/ParallelNeuralNetwork.cpp:174
#7  0x00007ffff6486c30 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff78c2184 in start_thread (arg=0x7fff8636a700) at pthread_create.c:312
#9  0x00007ffff5c0e37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The commit is byzhang@de9d6c6

I tried to fix the aforementioned crash, but it then crashed somewhere else, related to the unassigned realLayer_ of ScatterAgentLayer. So I need your help debugging this crash.

@luotao1
Contributor

luotao1 commented Oct 25, 2016

Thanks for reporting this. We have reproduced the crash on our machine and are debugging it now.

@byzhang
Author

byzhang commented Oct 25, 2016

👍

@hedaoyuan
Contributor

I looked at the commit byzhang/Paddle1@de9d6c6 and have one question: do you just want to train on GPU? If so, simply configure --use_gpu=true and --gpu_id=0 on the command line; you do not need parallel_nn or default_device(0) in the config.
PS: the translation demo uses RecurrentGradientMachine, and RecurrentGradientMachine does not support --parallel_nn=true; this is why Paddle crashes.
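
For reference, a minimal sketch of the plain single-GPU invocation described above, assuming the demo's train.sh ultimately calls the paddle train command line: --use_gpu and --gpu_id are the flags named in this comment, --parallel_nn is simply left off, and the config and output paths are placeholders rather than the demo's actual values.

# Hypothetical single-GPU run: no --parallel_nn flag, no default_device(0) in the config.
# --use_gpu / --gpu_id are taken from the comment above; the paths below are assumed.
paddle train \
  --config=seqToseq_net.py \
  --use_gpu=true \
  --gpu_id=0 \
  --save_dir=./model_output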

reyoung added the Bug label Oct 25, 2016
@byzhang
Author

byzhang commented Oct 25, 2016

I am building another network which uses both gru and parallel_nn, but to make it simpler I reproduced the crash using the translation demo as an example. What is the key reason why RecurrentGradientMachine doesn't support parallel_nn?

@emailweixu
Collaborator

At the moment, we should change ParallelNeuralNetwork to report an error when recurrent_nn is used. In the longer term, we should consider making ParallelNeuralNetwork support recurrent_nn.

@hedaoyuan
Contributor

The key reason is that ParallelNeuralNetwork was designed earlier and RecurrentGradientMachine later; each was built to handle training for a different kind of model, and compatibility between the two was never considered. ParallelNeuralNetwork focuses on using different computing devices to train a model (e.g. CPU+GPU, multi-GPU), while RecurrentGradientMachine focuses on how to train RNN models.
PS: For now, we will follow emailweixu's comment and add some checks in the code.

hedaoyuan added this to the 0.8.1 milestone Nov 3, 2016