parallel_nn related crash #241

Closed
byzhang opened this issue Oct 25, 2016 · 6 comments

@byzhang

byzhang commented Oct 25, 2016

If I turn on --parallel_nn, the translation demo (./train.sh) will crash:

#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:33
#1  0x0000000000740f90 in paddle::BaseMatrixT<float>::assign(paddle::BaseMatrixT<float>&) ()
#2  0x00000000005b4fb5 in paddle::GruStepLayer::forward (this=0x16875e80, passType=<optimized out>) at /home/byzhang/src/github/Paddle/paddle/gserver/layers/GruStepLayer.cpp:98
#3  0x000000000062639e in paddle::NeuralNetwork::forward (this=0x16866ac0, inArgs=..., outArgs=0x7fff86369b10, passType=paddle::enumeration_wrapper::PASS_TRAIN) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/NeuralNetwork.cpp:242
#4  0x00000000006164af in paddle::RecurrentGradientMachine::forward (this=<optimized out>, inArgs=..., outArgs=<optimized out>, passType=paddle::enumeration_wrapper::PASS_TRAIN) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp:546
#5  0x000000000057257a in paddle::RecurrentLayerGroup::forward (this=<optimized out>, passType=<optimized out>) at /home/byzhang/src/github/Paddle/paddle/gserver/layers/RecurrentLayerGroup.cpp:42
#6  0x0000000000621854 in paddle::ParallelThread::computeThread (this=0x168c59b0) at /home/byzhang/src/github/Paddle/paddle/gserver/gradientmachines/ParallelNeuralNetwork.cpp:174
#7  0x00007ffff6486c30 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff78c2184 in start_thread (arg=0x7fff8636a700) at pthread_create.c:312
#9  0x00007ffff5c0e37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The commit is byzhang@de9d6c6

I tried to fix the aforementioned crash, but it then crashed somewhere else, related to the unassigned realLayer_ of ScatterAgentLayer. So I need your help debugging this crash.

@luotao1
Contributor

luotao1 commented Oct 25, 2016

Thanks for reporting this. We have reproduced the crash on our machine and are debugging it now.

@byzhang
Author

byzhang commented Oct 25, 2016

👍

@hedaoyuan
Contributor

I looked at the commit byzhang/Paddle1@de9d6c6 and have one question: do you just want to train on GPU? If so, simply configure --use_gpu=true and --gpu_id=0 on the command line; you do not need parallel_nn or default_device(0) in the config.
PS: the translation demo uses RecurrentGradientMachine, and RecurrentGradientMachine does not support --parallel_nn=true; this is why Paddle crashes.
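
For reference, a minimal sketch of the plain single-GPU invocation described above, assuming the demo's train.sh ultimately calls the paddle train command line: --use_gpu and --gpu_id are the flags named in this comment, --parallel_nn is simply left off, and the config and output paths are placeholders rather than the demo's actual values.

# Hypothetical single-GPU run: no --parallel_nn flag, no default_device(0) in the config.
# --use_gpu / --gpu_id are taken from the comment above; the paths below are assumed.
paddle train \
  --config=seqToseq_net.py \
  --use_gpu=true \
  --gpu_id=0 \
  --save_dir=./model_output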

reyoung added the Bug label Oct 25, 2016
@byzhang
Author

byzhang commented Oct 25, 2016

I am building another network which uses both gru and parallel_nn, but to make it simpler I reproduced the crash using the translation demo as an example. What is the key reason why RecurrentGradientMachine doesn't support parallel_nn?

@emailweixu
Collaborator

At the moment, we should change ParallelNeuralNetwork to report an error when recurrent_nn is used. In the longer term, we should consider making ParallelNeuralNetwork support recurrent_nn.

@hedaoyuan
Contributor

The key reason is that ParallelNeuralNetwork was designed earlier and RecurrentGradientMachine later; each was built to handle training for a different kind of model, and compatibility between the two was never considered. ParallelNeuralNetwork focuses on using different computing devices to train a model (e.g. CPU+GPU, multi-GPU), while RecurrentGradientMachine focuses on how to train RNN models.
PS: For now, we will follow emailweixu's comment and add some checks in the code.

hedaoyuan added this to the 0.8.1 milestone Nov 3, 2016