Speed/sequence expand #9289
Conversation
Force-pushed from a4470fe to 5447046
@@ -362,6 +362,9 @@ def __assert_is_close(self, numeric_grads, analytic_grads, names,
        for a, b, name in itertools.izip(numeric_grads, analytic_grads, names):
            abs_a = np.abs(a)
            abs_a[abs_a < 1e-3] = 1
            print("actual", a)
            print("*****")
            print("expected", b)
Please remove this debug code.
the instance length*/
const int x_item_length, T* out_data) {
  constexpr int N = 1024;
  __shared__ int mem[N];
- The length of the shared memory can be defined outside the kernel (see the sketch below).
- I'm curious how much using shared memory affects performance. Do you have a benchmark?
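For what it's worth, a minimal sketch of the "define the length outside" idea using dynamically sized shared memory, where the launcher picks the buffer size. The kernel name and parameters below are illustrative, not the PR's actual code:

```cpp
// Sketch only: size the shared buffer at launch time instead of hard-coding
// constexpr int N = 1024 inside the kernel.
__global__ void expand_like_kernel(const size_t* ref_lod, size_t lod_size) {
  extern __shared__ size_t mem[];  // length chosen by the launcher
  for (size_t i = threadIdx.x; i < lod_size; i += blockDim.x) {
    mem[i] = ref_lod[i];  // stage the LoD in shared memory once
  }
  __syncthreads();
  // ... read mem[] instead of ref_lod from global memory ...
}

// Launch: the third <<<...>>> argument is the dynamic shared-memory size in bytes.
// expand_like_kernel<<<grid, block, lod_size * sizeof(size_t)>>>(ref_lod, lod_size);
```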
> I'm curious how much using shared memory affects performance. Do you have a benchmark?

Good question.
Without shared memory the speed stays about the same as with shared memory. Note that shared memory is about 100x faster than global memory; maybe that covers the overhead.
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
| Event | Calls | Total | Min. | Max. | Ave. |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 60819 | 4296.31 | 0.010656 | 4.47222 | 0.0706409 |
| thread0::mul_grad | 21669 | 3030.08 | 0.032448 | 2.92419 | 0.139835 |
| thread0::sequence_softmax_grad | 1959 | 1761.28 | 0.039712 | 4.09632 | 0.89907 |
| thread0::mul | 21669 | 1613.08 | 0.02496 | 2.75418 | 0.0744419 |
| thread0::sequence_softmax | 1959 | 1441.05 | 0.038592 | 3.60509 | 0.735606 |
| thread0::elementwise_add_grad | 9795 | 661.935 | 0.022528 | 2.10016 | 0.0675789 |
| thread0::sequence_expand_grad | 1959 | 654.569 | 0.121216 | 2.8896 | 0.334134 |
| thread0::sequence_expand | 1959 | 553.549 | 0.119872 | 6.49626 | 0.282567 |
Moving the offset computation out of the kernel gives some benefit (a sketch of the idea follows the table):
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
| Event | Calls | Total | Min. | Max. | Ave. |
| --- | --- | --- | --- | --- | --- |
| thread0::sum | 60695 | 4183.43 | 0.010816 | 6.24307 | 0.0689254 |
| thread0::mul_grad | 21625 | 2983.08 | 0.032448 | 2.84122 | 0.137946 |
| thread0::mul | 21625 | 1599.23 | 0.026432 | 14.0889 | 0.0739527 |
| thread0::sequence_softmax_grad | 1955 | 1559.16 | 0.039232 | 3.93517 | 0.797522 |
| thread0::sequence_softmax | 1955 | 1243.7 | 0.035968 | 2.35155 | 0.636165 |
| thread0::elementwise_add_grad | 9775 | 645.952 | 0.020096 | 2.46816 | 0.0660821 |
| thread0::sequence_expand_grad | 1955 | 517.621 | 0.12704 | 2.53744 | 0.264768 |
| thread0::lstm_grad | 60 | 460.934 | 6.75344 | 8.54125 | 7.68223 |
| thread0::sequence_expand | 1955 | 416.672 | 0.124384 | 2.38714 | 0.213131 |
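For reference, a hedged sketch of the "offset out of the kernel" idea: the per-sequence output offsets are accumulated on the CPU from the two LoDs (matching the `offset += (ref_lod[i + 1] - ref_lod[i]) * (x_lod[i + 1] - x_lod[i])` recurrence in the kernel) and the resulting array is copied to the device, so threads only do a lookup. The helper name is illustrative:

```cpp
#include <vector>

// Sketch: precompute output offsets on the host so the kernel no longer has
// to scan ref_lod/x_lod to find its write position.
std::vector<size_t> ComputeOutputOffsets(const std::vector<size_t>& x_lod,
                                         const std::vector<size_t>& ref_lod) {
  std::vector<size_t> offsets(ref_lod.size(), 0);
  for (size_t i = 1; i < ref_lod.size(); ++i) {
    // sequence i-1 of X contributes (repeat count) * (sequence length) rows
    offsets[i] = offsets[i - 1] +
                 (ref_lod[i] - ref_lod[i - 1]) * (x_lod[i] - x_lod[i - 1]);
  }
  return offsets;  // copy to device memory and pass as `offset` to the kernel
}
```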
// TODO(dzhwinter) : too many atomicAdd
// use shared memory to reduce memory visits
constexpr int N = 1024;
__shared__ int mem[N];
Same as above.
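Regarding the TODO about too many atomicAdd: one common pattern, sketched below under the assumption that the gradient of a repeated row is the sum of the corresponding output-gradient rows, is to reduce the repeated copies in shared memory first and issue a single atomicAdd per element. This is only an illustration, not the PR's kernel:

```cpp
// Sketch: one block per element of a dx row; threads cooperatively sum the
// repeated copies in shared memory, then one thread commits the result with
// a single atomicAdd instead of one atomic per (repeat, element) pair.
template <typename T, int BLOCK>
__global__ void RowGradReduce(const T* dout_rows,  // [repeats, width]
                              T* dx_row,           // [width]
                              int width, int repeats) {
  __shared__ T buf[BLOCK];
  int j = blockIdx.x;  // element index within the row, same for whole block
  if (j >= width) return;

  T sum = 0;
  for (int r = threadIdx.x; r < repeats; r += BLOCK) {
    sum += dout_rows[r * width + j];  // strided partial sum over repeats
  }
  buf[threadIdx.x] = sum;
  __syncthreads();

  // Standard tree reduction in shared memory.
  for (int s = BLOCK / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(&dx_row[j], buf[0]);
}
```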
if (i < lod_size - 1) {
  offset += (ref_lod[i + 1] - ref_lod[i]) * (x_lod[i + 1] - x_lod[i]);
}
}
If the thread block size is 16x16, lines 35~40 run 256 times, which means `mem` is assigned 256 times; that is unnecessary. Please double-check.
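A hedged sketch of one possible fix: let only the threads with `threadIdx.y == 0` and `threadIdx.z == 0` fill the shared buffer, then synchronize. The loop body here is illustrative; the real kernel stages the LoD arrays:

```cpp
// Sketch: write each element of mem[] once per block rather than once per
// (threadIdx.y, threadIdx.z) combination.
if (threadIdx.y == 0 && threadIdx.z == 0) {
  for (int i = threadIdx.x; i < lod_size; i += blockDim.x) {
    mem[i] = ref_lod[i];
  }
}
__syncthreads();  // make the staged LoD visible to every thread in the block
```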
int thread_y = 16;
int thread_z = 1024 / thread_x / thread_y;
int block_x = static_cast<int>(ref_lod.size());
dim3 block_size(thread_x, thread_y, thread_z);
Please double-check the `block_size`.
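Not the PR's code, but as an illustration of what to check: a CUDA block is limited to 1024 threads in total and to 64 threads along z, so the derived `thread_z` deserves a sanity check, e.g.:

```cpp
#include <cassert>

// Illustrative launch-configuration check (limits per the CUDA programming
// guide: at most 1024 threads per block, z dimension at most 64).
inline void CheckBlockSize(int thread_x, int thread_y, int thread_z) {
  assert(thread_x > 0 && thread_y > 0 && thread_z > 0);
  assert(thread_x * thread_y * thread_z <= 1024);
  assert(thread_z <= 64);
}
```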
@@ -13,7 +13,135 @@ See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Maybe `#define EIGEN_USE_GPU` is no longer needed.
const size_t* offset,
const size_t lod_size,
/* default=1,
   the instance length*/
Are lines 30~31 useful? If not, please remove them.
const T* dout_data, const size_t* ref_lod, const size_t* dx_lod,
const size_t* offset, const size_t lod_size,
/* default=1,
   the instance length*/
Same as above.
dim3 block_size(thread_x, thread_y, thread_z);
dim3 grid_size(block_x, 1);

sequence_expand_kernel<<<grid_size, block_size, 0, context.stream()>>>(
The design of sequence_expand_kernel is excellent, but I think the logic is a little complex; maybe it can be made simpler.
From my limited understanding, the function of sequence_expand_kernel is to copy one matrix to another according to the row indices of the source matrix, and those row indices can be computed on the CPU side (a sketch follows the two cases below).
For example:
Case 1:
Given a 1-level LoDTensor input(X)
X.lod = [[0, 2, 4]]
X.data = [[a], [b], [c], [d]]
X.dims = [4, 1]
and input(Y)
Y.lod = [[0, 2, 4],
[0, 3, 6, 7, 8]]
ref_level: 0
then we get 1-level LoDTensor
Out.lod = [[0, 2, 4, 6, 8]]
Out.data = [[a], [b], [a], [b], [c], [d], [c], [d]]
Out.dims = [8, 1]
The row index should be [0,1,0,1,2,3,2,3].
Case 2:
Given 1-level LoDTensor input(X)
X.lod = [[0, 1, 4]]
X.data = [[a], [b], [c], [d]]
X.dims = [4, 1]
and input(Y)
Y.lod = [[0, 2, 4],
[0, 3, 6, 6, 8]]
ref_level: 0
then we get 1-level LoDTensor
Out.lod = [[0, 1, 2, 5, 8]]
Out.data = [[a], [a], [b], [c], [d], [b], [c], [d]]
Out.dims = [8, 1]
The row index should be [0,0,1,2,3,1,2,3].
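To make the suggestion concrete, here is a hedged sketch (the helper and kernel names are made up for illustration): the row index is built on the CPU from X's LoD and the reference LoD, and the expand kernel reduces to a plain row gather.

```cpp
#include <vector>

// Sketch: build the row index on the host. For each sequence i, the rows
// [x_lod[i], x_lod[i+1]) of X are repeated (ref_lod[i+1] - ref_lod[i]) times.
std::vector<int> ComputeRowIndex(const std::vector<size_t>& x_lod,
                                 const std::vector<size_t>& ref_lod) {
  std::vector<int> row_index;
  for (size_t i = 0; i + 1 < ref_lod.size(); ++i) {
    size_t repeats = ref_lod[i + 1] - ref_lod[i];
    for (size_t r = 0; r < repeats; ++r) {
      for (size_t row = x_lod[i]; row < x_lod[i + 1]; ++row) {
        row_index.push_back(static_cast<int>(row));
      }
    }
  }
  // Case 1 gives [0,1,0,1,2,3,2,3]; Case 2 gives [0,0,1,2,3,1,2,3].
  return row_index;
}

// The device side then only gathers rows of X into Out.
template <typename T>
__global__ void GatherRowsKernel(const T* x, const int* row_index,
                                 int out_rows, int width, T* out) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < out_rows * width) {
    int row = idx / width;   // output row
    int col = idx % width;   // column within the row
    out[idx] = x[row_index[row] * width + col];
  }
}
```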
Merge it first to avoid the PR becoming too large.
fix #9220
Single-batch sequence_expand op: 0.595545 -> 0.331348, about 1.8x faster.
Note that there is a lot of memory access inside the CUDA kernel; shared memory should give us a further enhancement.
Before optimization:
After optimization: