Profiling result on single GPU device #637

Closed
kuke opened this issue Feb 5, 2018 · 5 comments
@kuke (Collaborator) commented Feb 5, 2018

See the profiling script in #636.
Device: Tesla K40m (12GB)
Conclusion: The computation of the LSTMP layer, especially its backward pass, takes the most time.

-----------  Configuration Arguments -----------
batch_size: 32
device: GPU
feature_lst: data/feature.lst
first_batches_to_skip: 1
hidden_dim: 1024
label_lst: data/label.lst
learning_rate: 0.002
max_batch_num: 10
mean_var: data/global_mean_var_search26kHr
parallel: False
print_train_acc: False
proj_dim: 512
sorted_key: total
stacked_num: 5
------------------------------------------------
..........
Time consumed: 18.386745 s, performance: 3199.098050 frames/s.

------------------------->     Profiling Report     <-------------------------

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.
thread0::lstmp_grad              45          9387.12     203.389     213.289     208.603
thread0::lstmp                   45          4344.11     94.435      98.2327     96.5359
thread0::mul_grad                54          1233.86     8.95155     42.5825     22.8492
thread0::mul                     54          599.213     4.5631      20.7512     11.0965
thread0::batch_norm_grad         54          570.634     8.05872     19.7179     10.5673
thread0::batch_norm              54          505.514     7.44077     17.1559     9.36137
thread0::sequence_conv_grad      9           441.255     45.7588     52.7421     49.0283
thread0::sequence_conv           9           232.747     24.8565     27.555      25.8607
thread0::elementwise_add_grad    63          105.817     0.516352    2.19734     1.67963
thread0::elementwise_add         63          89.0073     0.371936    2.18442     1.41281
thread0::adam                    369         55.4231     0.00704     0.808288    0.150198
thread0::softmax                 9           44.3238     4.69805     5.15946     4.92487
thread0::softmax_grad            9           19.5171     2.03376     2.31587     2.16857
thread0::sigmoid_grad            54          17.3561     0.25632     0.58752     0.32141
thread0::sigmoid                 54          12.0755     0.180064    0.403328    0.22362
thread0::top_k                   9           8.62765     0.901312    1.02582     0.958628
thread0::mean                    9           5.47344     0.571712    0.644768    0.60816
thread0::elementwise_mul         369         2.65594     0.005856    0.008512    0.00719766
thread0::cross_entropy_grad      9           2.55389     0.260096    0.311232    0.283765
thread0::fill_constant           378         2.31142     0.005216    0.01536     0.00611488
thread0::fill_zeros_like         216         1.52355     0.005504    0.012224    0.00705348
thread0::fetch                   18          0.783648    0.025696    0.072096    0.043536
thread0::accuracy                9           0.422208    0.046432    0.048224    0.046912
thread0::feed                    18          0.192256    0.006144    0.02        0.0106809
thread0::mean_grad               9           0.13536     0.013824    0.01536     0.01504
thread0::cross_entropy           9           0.125536    0.012672    0.014688    0.0139484
thread0::scale                   18          0.110144    0.005536    0.007744    0.00611911
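
For reference, a minimal sketch of how a report like the one above is typically produced with the Fluid-era profiler, assuming the `paddle.fluid.profiler.profiler(state, sorted_key)` context manager; the tiny network below is only a stand-in, and the actual profiling script is the one linked in #636:

```python
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# Tiny stand-in network; the real LSTMP acoustic model comes from the #636 script.
x = fluid.layers.data(name='x', shape=[512], dtype='float32')
y = fluid.layers.fc(input=x, size=1024)
cost = fluid.layers.mean(y)

place = fluid.CUDAPlace(0)        # device: GPU
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Profile GPU kernels and sort events by total time, matching `sorted_key: total`.
with profiler.profiler('GPU', 'total'):
    for _ in range(10):           # max_batch_num: 10
        exe.run(fluid.default_main_program(),
                feed={'x': np.random.rand(32, 512).astype('float32')},  # batch_size: 32
                fetch_list=[cost])
```
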
@zhxfl (Member) commented Apr 13, 2018

[Screenshot: profiling result on a Tesla P40]

@kuke (Collaborator, Author) commented Apr 13, 2018

@zhxfl Can you give some basic conclusions? What do the self-defined flags stand for?

@zhxfl (Member) commented Apr 13, 2018

LstmGradMatmul is the total matrix-multiply time inside lstm_grad.
LstmUnitGradFunctor is the time of the single kernel in lstm_grad other than the matrix multiplies.
LstmGradFor is the total cost of the "for loop" in lstm_grad.
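
To make the meaning of these flags concrete, here is a purely hypothetical Python illustration (this is not Paddle's profiler API, and the nesting shown is an assumption): each flag accumulates the total time of a named sub-region of a kernel that is called many times, so the matmul part, the element-wise part, and the whole backward loop can be compared against each other.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)   # accumulated seconds per named region

@contextmanager
def region(name):
    # Add the elapsed time of this block to the running total for `name`.
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start

def fake_lstm_grad():
    # Placeholder for one lstm_grad call; sleeps stand in for GPU kernels.
    with region("LstmGradFor"):              # whole backward "for loop"
        with region("LstmGradMatmul"):       # matrix-multiply portion
            time.sleep(0.002)
        with region("LstmUnitGradFunctor"):  # element-wise gate gradients
            time.sleep(0.001)

for _ in range(45):  # 45 calls, like lstmp_grad in the report above
    fake_lstm_grad()
print(dict(totals))
```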

@zhxfl (Member) commented Apr 15, 2018

[Screenshot: houyi profile]
[Screenshot: paddle profile]

The main cost difference is in "lstmp_grad".

According to LSTMPGradKernel in lstmp_op.h:

paddle:
for (t = frame_num - 1; t >= 0; t--) {
    compute diff(t)
    compute dw(t) = function2(diff(t), in(t))    // one weight-gradient GEMM per timestep
}
w = w - learn_rate * dw

houyi:
for (t = frame_num - 1; t >= 0; t--) {
    compute diff(t)
}
compute dw = function(diff, in)                  // one merged weight-gradient GEMM over all timesteps
w = w - learn_rate * dw

It can be seen that combining the matrix multiplies that have no data dependence into a single larger GEMM should give higher performance.
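
As a sanity check on that reasoning, here is a minimal NumPy sketch (illustrative shapes, not Paddle's kernel code) showing that accumulating dw per timestep inside the loop and computing one merged GEMM after the loop give the same result, while the merged form replaces frame_num small, serialized matrix multiplies with a single large one:

```python
import numpy as np

# Illustrative shapes only: input roughly proj_dim-sized, output roughly the
# 4*hidden_dim gate width; frame_num and batch are arbitrary.
frame_num, batch, in_dim, out_dim = 20, 32, 512, 4096
rng = np.random.default_rng(0)
inputs = rng.standard_normal((frame_num, batch, in_dim))   # in(t)
diffs = rng.standard_normal((frame_num, batch, out_dim))   # diff(t), already computed

# "paddle" style: one small weight-gradient GEMM per timestep, inside the loop.
dw_loop = np.zeros((in_dim, out_dim))
for t in range(frame_num - 1, -1, -1):
    dw_loop += inputs[t].T @ diffs[t]

# "houyi" style: stack all timesteps and issue a single large GEMM.
dw_merged = inputs.reshape(-1, in_dim).T @ diffs.reshape(-1, out_dim)

assert np.allclose(dw_loop, dw_merged)
```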

@shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, feel free to reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
