
Faster Transpose 2D #16104

Merged
merged 24 commits into from
Oct 10, 2019

Conversation

ChaiBapchya
Contributor

@ChaiBapchya ChaiBapchya commented Sep 5, 2019

Description

Faster 2D transpose

How to achieve this

The existing implementation leverages mshadow's expression templates. However, we found that they are not needed here and that the operation can be optimized.

The principle behind this implementation is to derive maximum utilization from the L1 cache. Most L1 caches have a size of at least 32 KB. Hence, the aim is to transpose the 2D matrix block by block.

Block size calculation

  • Total size to be utilized: 32 KB (2^15 bytes)
  • Largest size of a single element of any dtype: <= 8 bytes (2^3)
  • Number of elements that fit: 2^12 (2^15 / 2^3)
  • Block size: 2^6 × 2^6 (64 × 64)

However, we can also leverage unrolling of the for loops (for parallelization), so the block size used is 2^5 × 2^5 (32 × 32) with the inner for loop unrolled 4 ways via a pragma. A rough sketch of the tiling pattern follows.
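Below is a hedged sketch of the tiling idea in NumPy, purely for illustration: the PR's actual kernel is C++ (with OpenMP and pragma-based unrolling), and the function name, block-size argument, and min()-based edge handling here are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def blocked_transpose(a, blocksize=32):
    # Illustrative tile-by-tile 2D transpose (not the PR's C++ kernel).
    m, n = a.shape
    out = np.empty((n, m), dtype=a.dtype)
    # Walk the matrix in 32x32 tiles so each source tile and its
    # destination tile stay resident in L1 cache while being transposed.
    for ib in range(0, m, blocksize):
        for jb in range(0, n, blocksize):
            i_end = min(ib + blocksize, m)  # clamp tiles at the matrix edge
            j_end = min(jb + blocksize, n)
            out[jb:j_end, ib:i_end] = a[ib:i_end, jb:j_end].T
    return out

a = np.random.rand(100, 37)  # a shape that is not a multiple of 32
assert np.array_equal(blocked_transpose(a), a.T)
```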

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Bypass expression template way of execution
  • Implement cache-optimal transpose

Flags used

[✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]

Performance

Tested with DEBUG=OFF (for consistency with the release pip build)

| Shape | Branch | MKL | INT | AVG | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (1024, 1024) | master | off | int64 | 0.99276 | 0.92972 | 0.94741 | 1.77753 |
| (1024, 1024) | latest transpose | off | int64 | 0.46001 | 0.34991 | 0.3681 | 1.0809 |
| (10000, 10000) | master | off | int64 | 228.6221 | 214.03416 | 260.60906 | 404.57939 |
| (10000, 10000) | latest transpose | off | int64 | 144.08654 | 136.49568 | 162.72096 | 273.59494 |
| (1024, 1024) | master | off | int32 | 0.96307 | 0.90052 | 0.92238 | 1.75075 |
| (1024, 1024) | latest transpose | off | int32 | 0.44294 | 0.38529 | 0.40005 | 1.30894 |
| (10000, 10000) | master | off | int32 | 232.91236 | 214.38403 | 243.06632 | 486.65916 |
| (10000, 10000) | latest transpose | off | int32 | 152.77504 | 137.72863 | 161.21566 | 394.53675 |

@apeforest
Contributor

Please fix the test_numpy_op.test_np_transpose

@access2rohit
Contributor

Can you add more description of how you are achieving the faster transpose?
You can either paste a link that describes your approach or write it up briefly yourself.


@sxjscience
Member

Will the GPU side be accelerated?

@access2rohit
Contributor

> Will the GPU side be accelerated?

@sxjscience Not yet! Work on that will be done after the CPU side is accelerated first.

@wuxun-zhang
Contributor

@ChaiBapchya From your performance table, I also noticed that int32 has a performance number very close to int64 (if I am right, int64 here means using large tensor support) for the non-MKL 2D transpose op in the master repo. Is there any performance regression with 3-D or 4-D input shapes?

@ChaiBapchya
Contributor Author

Yes, @wuxun-zhang.
This week we found out that benchmarks shouldn't be run with DEBUG=ON; it turns out the results can differ once debug is disabled. So I'll rerun the benchmarks for all configurations with DEBUG=OFF (basically Release mode) and update the tracking issue. As for this table, you're right: MKL doesn't see a performance difference for 2D input.

@apeforest
Contributor

apeforest commented Sep 6, 2019

> Here numpy transpose calls the same TransposeImpl function that I have already handled.

Yes, please check why your code broke their unit test.

@ChaiBapchya
Contributor Author

@ptrendx Any update on the PR by @dtracz?

@zhreshold
Member

Can you add test cases and update the benchmark, since the previous error was not caught?

@ChaiBapchya
Contributor Author

Added the test case to catch the previous bug. @zhreshold @sxjscience

@access2rohit
Contributor

@ChaiBapchya also paste the results of the new unit test run here.

@ChaiBapchya
Contributor Author

nosetests tests/python/unittest/test_operator.py:test_larger_transpose
.
----------------------------------------------------------------------
Ran 1 test in 0.042s

OK

@sxjscience
Member

What is the performance for input sizes that are not divisible by the block size?

@ChaiBapchya
Contributor Author

Script used - https://gist.github.com/ChaiBapchya/65a41eccf6c09a98b5a5012a211c820e
Performance numbers have been updated with the latest PR changes.
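For reference, here is a minimal sketch of how such latency percentiles could be collected. This is an assumption about the methodology; the actual script is the gist linked above, and the warmup/run counts and units here are illustrative.

```python
import time
import numpy as np
import mxnet as mx

def bench_transpose(shape, runs=100, warmup=10):
    x = mx.nd.random.uniform(shape=shape)
    mx.nd.waitall()  # make sure the input is materialized before timing
    times_ms = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        y = mx.nd.transpose(x)
        y.wait_to_read()  # block until the asynchronous op has finished
        elapsed = (time.perf_counter() - start) * 1000.0
        if i >= warmup:
            times_ms.append(elapsed)
    t = np.array(times_ms)
    return t.mean(), np.percentile(t, 50), np.percentile(t, 90), np.percentile(t, 99)

print(bench_transpose((1024, 1024)))  # AVG, p50, p90, p99 for this sketch
```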

@access2rohit
Contributor

> What is the performance for input sizes that are not divisible by the block size?

@sxjscience (10000, 10000) is not divisible by 32. Is there any specific input shape that you want benchmarked?

@access2rohit
Contributor

@anirudh2290 can you merge this PR?

@@ -2840,6 +2840,13 @@ def test_transpose():
assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())


@with_seed()
def test_larger_transpose():
Member

Why does this say larger_transpose? Is it a bug with specific sizes?

Member

@ChaiBapchya I think we should merge the test with the current tests of transpose.

Contributor Author

The current test_transpose only uses sizes smaller than the block size.
This PR uses a 32×32 block size, and that edge case was not exercised by the existing set of tests.
Hence this test was added to catch it.
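For illustration, a hedged sketch of what such an edge-case check can look like (the shape below is an assumption chosen to exceed a 32×32 block and to not be a multiple of 32; the actual test_larger_transpose added by this PR may use different sizes):

```python
import numpy as np
import mxnet as mx
from numpy.testing import assert_allclose

def check_larger_transpose():
    # 50x51 is larger than one 32x32 tile and not a multiple of 32,
    # so the blocked kernel's edge-tile path gets exercised.
    x = mx.nd.random.uniform(shape=(50, 51))
    y = mx.nd.transpose(x)
    assert_allclose(np.transpose(x.asnumpy()), y.asnumpy())

check_larger_transpose()
```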

Member

Can we add this to the current tests of transpose, as @sxjscience suggests? Otherwise LGTM.

Member

@anirudh2290 anirudh2290 left a comment

Thanks for the contribution! Merging to avoid another CI round. Please merge the two transpose tests into one in your next PR.

@anirudh2290 anirudh2290 merged commit 1d0d1e6 into apache:master Oct 10, 2019
aaronmarkham pushed a commit to aaronmarkham/incubator-mxnet that referenced this pull request Oct 16, 2019
* 2d transpose naive

* omp pragma

* omp pragma unroll

* blocksize

* make it 2d tile

* loop peeling

* better loop peeling

* redundancy

* removed bool

* removing excess for loops, memory save

* fix internal forloop

* remove commented code, lint fix

* Trigger notification

* explain params, indent fix, explain blocksize

* fix p,n and reduce for loop computation j+a,i+b

* kernel

* gpu thread 1

* remove gpu implementation

* fix internal for loop

* unittest to catch the previous error

* optimizations

* microsoft cpp doesn't support omp collapse
@ChaiBapchya ChaiBapchya mentioned this pull request Nov 1, 2019
7 tasks
@ChaiBapchya ChaiBapchya deleted the transpose branch April 5, 2021 10:06