Profiling result on single GPU device #637
@zhxfl Can you give some basic conclusions? What do the self-defined flags stand for?
LstmGradMatmul is the total matrix-multiply time inside lstm_grad.
The main cost difference is concentrated in "lstmp_grad"; see LSTMPGradKernel in lstmp_op.h: w = dw + learn_rate * dw

houyi: It seems that combining the matrix multiplies that have no data dependence on each other should give high performance.
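The idea above, fusing independent matrix multiplies into a single batched call instead of launching them one by one, can be sketched in NumPy. All shapes and names here are illustrative assumptions, not taken from PaddlePaddle's lstmp_op.h:

```python
import numpy as np

# Hypothetical per-gate operands standing in for LSTM gradient matmuls;
# the four multiplies have no data dependence on each other.
batch, hidden, num_gates = 64, 512, 4
rng = np.random.default_rng(0)
hs = [rng.random((hidden, batch)) for _ in range(num_gates)]
dgs = [rng.random((batch, hidden)) for _ in range(num_gates)]

# Naive: one GEMM per gate -- each separate call pays dispatch overhead.
dws_loop = [h @ dg for h, dg in zip(hs, dgs)]

# Combined: stack the independent operands and issue one batched
# multiply, amortizing the per-call overhead.
dws_batched = np.stack(hs) @ np.stack(dgs)

assert all(np.allclose(a, b) for a, b in zip(dws_loop, dws_batched))
```

On a GPU the same pattern corresponds to using a batched GEMM (e.g. cuBLAS's batched interfaces) rather than a loop of kernel launches.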
Hello, this issue has had no updates for nearly a month, so we will close it today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure. Thank you for your support of PaddlePaddle!
See the script for profiling in #636
Device: Tesla K40m (12GB)
Conclusion: the computation of the LSTMP layer, especially the backward pass, takes the most time.
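The observation that the backward pass dominates is consistent with it doing roughly twice the matmul work of the forward pass (one GEMM each for the input gradient and the weight gradient). A crude stand-in timer, with NumPy matrices as assumed placeholders rather than the actual kernels profiled in #636:

```python
import time
import numpy as np

def best_time(fn, repeat=10):
    """Best-of-repeat wall time for fn(), reducing noise from other load."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n = 256
rng = np.random.default_rng(0)
x, w, dy = (rng.random((n, n)) for _ in range(3))

# Forward stand-in (y = x @ w): one GEMM.
fwd = best_time(lambda: x @ w)
# Backward stand-in: two GEMMs, dx = dy @ w.T and dw = x.T @ dy.
bwd = best_time(lambda: (dy @ w.T, x.T @ dy))

print(f"forward {fwd:.6f}s, backward {bwd:.6f}s")
```

This is only a sketch of why gradient kernels tend to dominate; the real breakdown on the K40m comes from the profiling script in #636.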