
Faster Transpose 2D #16104

Merged
merged 24 commits into from
Oct 10, 2019

Conversation

ChaiBapchya
Contributor

@ChaiBapchya ChaiBapchya commented Sep 5, 2019

Description

Faster 2D transpose

How to achieve this

The existing implementation leverages mshadow's expression templates. However, we found that they are not needed here and that the operation can be optimized.

The principle behind this implementation is to derive maximum utilization from the L1 cache. Most L1 caches have a size of at least 32 KB. Hence, the aim is to transpose the 2D matrix block by block.

Block size calculation

  • Total size to be utilized: 32 KB (2^15 bytes)
  • Largest size of a single element of any dtype: <= 8 bytes (2^3)
  • Number of elements that fit: 2^12 (2^15 / 2^3)
  • Block size: 2^6 × 2^6 (64 × 64)

However, we can also leverage unrolling of the for loops (for parallelization), so the block size used is 2^5 × 2^5 (32 × 32) with the inner for loop unrolled 4 ways via a pragma. A rough sketch of the tiling pattern follows.
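Below is a hedged sketch of the tiling idea in NumPy, purely for illustration: the PR's actual kernel is C++ (with OpenMP and pragma-based unrolling), and the function name, block-size argument, and min()-based edge handling here are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def blocked_transpose(a, blocksize=32):
    # Illustrative tile-by-tile 2D transpose (not the PR's C++ kernel).
    m, n = a.shape
    out = np.empty((n, m), dtype=a.dtype)
    # Walk the matrix in 32x32 tiles so each source tile and its
    # destination tile stay resident in L1 cache while being transposed.
    for ib in range(0, m, blocksize):
        for jb in range(0, n, blocksize):
            i_end = min(ib + blocksize, m)  # clamp tiles at the matrix edge
            j_end = min(jb + blocksize, n)
            out[jb:j_end, ib:i_end] = a[ib:i_end, jb:j_end].T
    return out

a = np.random.rand(100, 37)  # a shape that is not a multiple of 32
assert np.array_equal(blocked_transpose(a), a.T)
```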

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Bypass expression template way of execution
  • Implement cache-optimal transpose

Flags used

[✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]

Performance

Tested with DEBUG=OFF (for consistency with the release pip build)

| Shape | Branch | MKL | INT | AVG | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1024, 1024) | master | off | int64 | 0.99276 | 0.92972 | 0.94741 | 1.77753 |
| (1024, 1024) | latest transpose | off | int64 | 0.46001 | 0.34991 | 0.3681 | 1.0809 |
| (10000, 10000) | master | off | int64 | 228.6221 | 214.03416 | 260.60906 | 404.57939 |
| (10000, 10000) | latest transpose | off | int64 | 144.08654 | 136.49568 | 162.72096 | 273.59494 |
| (1024, 1024) | master | off | int32 | 0.96307 | 0.90052 | 0.92238 | 1.75075 |
| (1024, 1024) | latest transpose | off | int32 | 0.44294 | 0.38529 | 0.40005 | 1.30894 |
| (10000, 10000) | master | off | int32 | 232.91236 | 214.38403 | 243.06632 | 486.65916 |
| (10000, 10000) | latest transpose | off | int32 | 152.77504 | 137.72863 | 161.21566 | 394.53675 |

@apeforest
Contributor

Please fix the test_numpy_op.test_np_transpose

@access2rohit
Contributor

Can you add more description of how you are achieving the faster transpose?
You can either paste a link that describes your approach or write it up briefly yourself.


@sxjscience
Member

Will the GPU side be accelerated?

@access2rohit
Contributor

> Will the GPU side be accelerated?

@sxjscience Not yet! Work on that will be done after the CPU side is accelerated first.

@wuxun-zhang
Contributor

@ChaiBapchya From your performance table, I also noticed that int32 has a performance number very close to int64 (if I am right, int64 here means using large tensor support) for the non-MKL 2D transpose op in the master repo. Is there any performance regression with 3-D or 4-D input shapes?

@ChaiBapchya
Contributor Author

Yes, @wuxun-zhang.
This week we found out that benchmarks shouldn't be run with DEBUG=ON; it turns out the results can differ once debug is disabled. So I'll rerun the benchmarks for all configurations with DEBUG=OFF (basically Release mode) and update the tracking issue. As for this table, you're right: MKL doesn't see a performance difference for 2D input.

@apeforest
Contributor

apeforest commented Sep 6, 2019

> Here numpy transpose calls the same TransposeImpl function that I have already handled.

Yes, please check why your code broke their unit test.

@ChaiBapchya
Contributor Author

@ptrendx Any update on the PR by @dtracz?

@zhreshold
Member

Can you add test cases and update the benchmark, since the previous error was not caught?

@ChaiBapchya
Contributor Author

Added the test case to catch the previous bug. @zhreshold @sxjscience

@access2rohit
Contributor

@ChaiBapchya also paste the results of the new unit test run here.

@ChaiBapchya
Contributor Author

nosetests tests/python/unittest/test_operator.py:test_larger_transpose
.
----------------------------------------------------------------------
Ran 1 test in 0.042s

OK

@sxjscience
Member

What is the performance for input sizes that are not divisible by the block size?

@ChaiBapchya
Contributor Author

Script used - https://gist.github.com/ChaiBapchya/65a41eccf6c09a98b5a5012a211c820e
Performance numbers have been updated with the latest PR changes.
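For reference, here is a minimal sketch of how such latency percentiles could be collected. This is an assumption about the methodology; the actual script is the gist linked above, and the warmup/run counts and units here are illustrative.

```python
import time
import numpy as np
import mxnet as mx

def bench_transpose(shape, runs=100, warmup=10):
    x = mx.nd.random.uniform(shape=shape)
    mx.nd.waitall()  # make sure the input is materialized before timing
    times_ms = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        y = mx.nd.transpose(x)
        y.wait_to_read()  # block until the asynchronous op has finished
        elapsed = (time.perf_counter() - start) * 1000.0
        if i >= warmup:
            times_ms.append(elapsed)
    t = np.array(times_ms)
    return t.mean(), np.percentile(t, 50), np.percentile(t, 90), np.percentile(t, 99)

print(bench_transpose((1024, 1024)))  # AVG, p50, p90, p99 for this sketch
```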

@access2rohit
Contributor

> What is the performance for input sizes that are not divisible by the block size?

@sxjscience (10000, 10000) is not divisible by 32. Is there any specific input shape that you want benchmarked?

@access2rohit
Contributor

@anirudh2290 can you merge this PR?

@@ -2840,6 +2840,13 @@ def test_transpose():
assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())


@with_seed()
def test_larger_transpose():
Member

Why does this say larger_transpose? Is it a bug with specific sizes?

Member

@ChaiBapchya I think we should merge the test with the current tests of transpose.

Contributor Author

The current test_transpose only uses sizes smaller than the block size.
This PR uses a 32×32 block size, and that edge case was not exercised by the existing set of tests.
Hence this test was added to catch it.
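For illustration, a hedged sketch of what such an edge-case check can look like (the shape below is an assumption chosen to exceed a 32×32 block and to not be a multiple of 32; the actual test_larger_transpose added by this PR may use different sizes):

```python
import numpy as np
import mxnet as mx
from numpy.testing import assert_allclose

def check_larger_transpose():
    # 50x51 is larger than one 32x32 tile and not a multiple of 32,
    # so the blocked kernel's edge-tile path gets exercised.
    x = mx.nd.random.uniform(shape=(50, 51))
    y = mx.nd.transpose(x)
    assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())

check_larger_transpose()
```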

Member

Can we add this to the current tests of transpose, as @sxjscience suggests? Otherwise LGTM.

Member

@anirudh2290 anirudh2290 left a comment

Thanks for the contribution! Merging to avoid another CI round. Please merge the two transpose tests into one in your next PR.

@anirudh2290 anirudh2290 merged commit 1d0d1e6 into apache:master Oct 10, 2019
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Oct 16, 2019
* 2d transpose naive

* omp pragma

* omp pragma unroll

* blocksize

* make it 2d tile

* loop peeling

* better loop peeling

* redundancy

* removed bool

* removing excess for loops, memory save

* fix internal forloop

* remove commented code, lint fix

* Trigger notification

* explain params, indent fix, explain blocksize

* fix p,n and reduce for loop computation j+a,i+b

* kernel

* gpu thread 1

* remove gpu implementation

* fix internal for loop

* unittest to catch the previous error

* optimizations

* microsoft cpp doesn't support omp collapse
@ChaiBapchya ChaiBapchya mentioned this pull request Nov 1, 2019
7 tasks
@ChaiBapchya ChaiBapchya deleted the transpose branch April 5, 2021 10:06