Profiling result on single GPU device #637
@zhxfl Can you give some basic conclusions? What do the self-defined flags stand for?
LstmGradMatmul is the total matrix-multiply time inside lstm_grad.
The main cost difference is concentrated in "lstmp_grad"; see LSTMPGradKernel in lstmp_op.h: w = dw + learn_rate * dw

houyi: It seems that combining the matrix multiplies that have no data dependence on each other should give high performance.
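The idea above, fusing independent matrix multiplies into a single batched call instead of launching them one by one, can be sketched in NumPy. All shapes and names here are illustrative assumptions, not taken from PaddlePaddle's lstmp_op.h:

```python
import numpy as np

# Hypothetical per-gate operands standing in for LSTM gradient matmuls;
# the four multiplies have no data dependence on each other.
batch, hidden, num_gates = 64, 512, 4
rng = np.random.default_rng(0)
hs = [rng.random((hidden, batch)) for _ in range(num_gates)]
dgs = [rng.random((batch, hidden)) for _ in range(num_gates)]

# Naive: one GEMM per gate -- each separate call pays dispatch overhead.
dws_loop = [h @ dg for h, dg in zip(hs, dgs)]

# Combined: stack the independent operands and issue one batched
# multiply, amortizing the per-call overhead.
dws_batched = np.stack(hs) @ np.stack(dgs)

assert all(np.allclose(a, b) for a, b in zip(dws_loop, dws_batched))
```

On a GPU the same pattern corresponds to using a batched GEMM (e.g. cuBLAS's batched interfaces) rather than a loop of kernel launches.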
Hello, this issue has had no updates for nearly a month, so we will close it today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closure. Thank you for your support of PaddlePaddle!
See the script for profiling in #636
Device: Tesla K40m (12GB)
Conclusion: the computation of the LSTMP layer, especially the backward pass, takes the most time.
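The observation that the backward pass dominates is consistent with it doing roughly twice the matmul work of the forward pass (one GEMM each for the input gradient and the weight gradient). A crude stand-in timer, with NumPy matrices as assumed placeholders rather than the actual kernels profiled in #636:

```python
import time
import numpy as np

def best_time(fn, repeat=10):
    """Best-of-repeat wall time for fn(), reducing noise from other load."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n = 256
rng = np.random.default_rng(0)
x, w, dy = (rng.random((n, n)) for _ in range(3))

# Forward stand-in (y = x @ w): one GEMM.
fwd = best_time(lambda: x @ w)
# Backward stand-in: two GEMMs, dx = dy @ w.T and dw = x.T @ dy.
bwd = best_time(lambda: (dy @ w.T, x.T @ dy))

print(f"forward {fwd:.6f}s, backward {bwd:.6f}s")
```

This is only a sketch of why gradient kernels tend to dominate; the real breakdown on the K40m comes from the profiling script in #636.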