
[Hexagon] Create tests to showcase vtcm loading capabilities on Hexagon. #12667

Merged · 7 commits merged into apache:main on Sep 13, 2022

Conversation

@nverke (Contributor) commented Aug 31, 2022

Background

To learn more about running efficiently on Hexagon, we investigated how to properly utilize VTCM. These are the initial results of that investigation, and they should serve as a good starting point for others looking to leverage VTCM when running on Hexagon.

Results

Below are the results from running a simple parallel vrmpy operation in several different configurations. Each configuration is described below.

Without VTCM: This runs the vrmpy operator without loading any data into VTCM.

Basic VTCM Loads: This introduces loops to copy the data into VTCM before running the compute, without any scheduling of those data copy loops.

Vec Loads: This applies the following scheduling to the data copy loops. For the DDR -> VTCM loops it uses unroll_split = 8 and vector_split = 64. For the VTCM -> DDR loop it uses unroll_split = 8 and vector_split = 8.

```python
vb, vi = sch.get_loops(block)
v = sch.fuse(vb, vi)
_, vio, vii = sch.split(v, factors=[None, unroll_split, vector_split])
sch.vectorize(vii)
sch.unroll(vio)
```
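
(For reference, `sch` is a `tvm.tir.Schedule` over the benchmark module and `block` is the DDR -> VTCM copy block that the schedule operates on. The snippet below is a minimal, hypothetical sketch of how such a copy block can be created with `cache_read`; the toy `add_one` PrimFunc, its shapes, and the block name are assumptions, not the vrmpy kernel from this PR.)

```python
import tvm
from tvm.script import tir as T


# Hypothetical 2-D element-wise PrimFunc standing in for the vrmpy workload.
@T.prim_func
def add_one(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), dtype="int8")
    C = T.match_buffer(c, (128, 128), dtype="int8")
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] + T.int8(1)


sch = tvm.tir.Schedule(add_one)
# cache_read inserts a DDR -> VTCM copy stage; its loops are what the
# fuse/split/vectorize/unroll schedule above is applied to.
block = sch.cache_read(sch.get_block("C"), 0, "global.vtcm")
```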

Vec + Para Loads: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is always 4.

```python
vb, vi = sch.get_loops(block)
v = sch.fuse(vb, vi)
vbo, vbi, vio, vii = sch.split(v, factors=[outer_split, None, unroll_split, vector_split])
sch.vectorize(vii)
sch.unroll(vio)
sch.parallel(vbo)
```

Pre + Vec Loads: Same as "Vec Loads" except the VTCM buffers are allocated before runtime and passed into the operator.

Pre + Vec + Para Loads: Same as "Vec + Para Loads" except the VTCM buffers are allocated before runtime and passed into the operator.
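
The two "Pre" configurations boil down to allocating the `global.vtcm` buffers once, outside the timed region, and passing them to the operator. Below is a hedged sketch of that pre-allocation; the `hexagon_session` handle, dtype, and size are assumptions, not code taken from this PR.

```python
import tvm


def preallocate_vtcm(hexagon_session, size_bytes: int):
    """Allocate an int8 NDArray placed in VTCM on the Hexagon device."""
    dev = hexagon_session.device
    # mem_scope="global.vtcm" asks the Hexagon runtime to back the NDArray
    # with VTCM instead of DDR; the array can then be handed to the operator.
    return tvm.nd.empty((size_bytes,), dtype="int8", device=dev, mem_scope="global.vtcm")
```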

Single DMA Load: A single DMA command is used to copy all of the data over to VTCM. The VTCM buffer was preallocated since I could not get this to work without preallocation.

Preloaded: All of the data is already loaded into VTCM before the test starts.

8Gen1 HDK

| Total Vrmpy Operations | Total Transfer (MB) | Without VTCM (Gops) | Basic VTCM Loads (Gops) | Vec Loads (Gops) | Vec + Para Loads (Gops) | Pre + Vec Loads (Gops) | Pre + Vec + Para Loads (Gops) | Single DMA Load (Gops) | Preloaded (Gops) |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 0.39 | 95.0256 | 0.345 | 0.5408 | 0.5905 | 44.4814 | 32.8886 | 15.0813 | 124.7117 |
| 2048 | 0.79 | 124.2389 | 0.4002 | 0.7063 | 0.8826 | 43.5238 | 47.2871 | 16.1339 | 209.2688 |
| 4096 | 1.57 | 41.5497 | 0.4215 | 0.8664 | 1.1977 | 10.9374 | 26.5749 | 18.1754 | 241.1628 |
| 10240 | 3.93 | 33.2139 | 0.4419 | 1.0506 | 1.7311 | 11.7886 | 34.0405 | 25.4214 | 370.2948 |
| 16384 | 6.29 | 20.7683 | 0.4195 | 1.0568 | 1.898 | 7.7292 | 22.5898 | 29.7011 | 397.4137 |
| 20480 | 7.86 | 20.2128 | 0.4406 | 1.069 | 1.9779 | 6.6829 | 17.7941 | 25.4929 | 338.294 |

888 HDK

| Total Vrmpy Operations | Total Transfer (MB) | Without VTCM (Gops) | Basic VTCM Loads (Gops) | Vec Loads (Gops) | Vec + Para Loads (Gops) | Pre + Vec Loads (Gops) | Pre + Vec + Para Loads (Gops) | Single DMA Load (Gops) | Preloaded (Gops) |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 0.39 | 92.2826 | 0.5363 | 1.1438 | 1.3951 | 42.2813 | 37.2929 | 13.1085 | 121.4004 |
| 2048 | 0.79 | 98.8228 | 0.5269 | 1.1818 | 1.6554 | 43.9298 | 43.1773 | 14.5703 | 205.442 |
| 4096 | 1.57 | 22.1415 | 0.4095 | 0.988 | 1.5843 | 6.3227 | 16.1113 | 15.397 | 271.1367 |
| 10240 | 3.93 | 15.3377 | 0.4323 | 1.1091 | 1.9689 | 6.4958 | 18.68 | 17.9959 | 360.6824 |

The following tables show the results of copying data into VTCM with several different strategies. Each strategy is described below.

Base: This copies the data into VTCM with a simple loop.
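
A minimal TIR sketch of what that simple copy loop looks like, reusing the same `global` / `global.vtcm` `match_buffer` pattern that appears in the review comments below; the function name and `size` value are illustrative, not the exact PrimFunc from this PR.

```python
from tvm.script import tir as T

size = 256 * 1024  # hypothetical transfer size in bytes


@T.prim_func
def base_vtcm_copy(a: T.handle, a_v: T.handle) -> None:
    # DDR-resident source buffer and VTCM-resident destination buffer.
    A = T.match_buffer(a, size, dtype="int8", align=128, scope="global")
    A_vtcm = T.match_buffer(a_v, size, dtype="int8", align=128, scope="global.vtcm")
    # "Base" strategy: a single unscheduled element-wise copy loop.
    for i in T.serial(size):
        with T.block("A_vtcm"):
            vi = T.axis.remap("S", [i])
            A_vtcm[vi] = A[vi]
```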

Unroll + Vectorize: This applies the following scheduling to the data copy loop. These tests use unroll_split = 2 and vector_split = 128.

```python
vb = sch.get_loops(vtcm_block_a)
vbi_a, vio_a, vii_a = sch.split(vb[0], factors=[None, unroll_split, vector_split])
sch.unroll(vio_a)
sch.vectorize(vii_a)
```

Unroll + Vectorize + Parallel: This applies the same schedule as above except it also parallelizes on an outer loop. The outer_split is 4.

```python
vb = sch.get_loops(vtcm_block_a)
vbo_a, vbi_a, vio_a, vii_a = sch.split(vb[0], factors=[outer_split, None, unroll_split, vector_split])
sch.unroll(vio_a)
sch.vectorize(vii_a)
sch.parallel(vbo_a)
```

Single DMA: Copies the data into VTCM with a single DMA instruction.

8Gen1 HDK

| Total Transfer (MB) | Base (GBps) | Unroll + Vectorize (GBps) | Unroll + Vectorize + Parallel (GBps) | Single DMA (GBps) |
|---|---|---|---|---|
| 0.01 | 2.2122 | 15.9211 | 4.8287 | 2.2524 |
| 0.02 | 2.3207 | 26.1998 | 9.5082 | 4.6669 |
| 0.04 | 2.4425 | 38.1089 | 17.5147 | 6.4492 |
| 0.08 | 2.5067 | 48.5949 | 32.507 | 9.1469 |
| 0.16 | 2.5507 | 57.6021 | 55.1855 | 11.1598 |
| 0.31 | 2.7053 | 62.8063 | 83.4726 | 15.2878 |
| 0.62 | 2.9199 | 74.3696 | 114.7925 | 17.6438 |
| 1 | 2.2645 | 49.8653 | 63.8026 | 18.8814 |
| 2 | 1.1232 | 10.3933 | 29.1977 | 20.6719 |
| 4 | 1.0683 | 9.6105 | 26.5143 | 25.201 |
| 8 | 0.6814 | 6.1916 | 24.049 | 26.1883 |

888 HDK

| Total Transfer (MB) | Base (GBps) | Unroll + Vectorize (GBps) | Unroll + Vectorize + Parallel (GBps) | Single DMA (GBps) |
|---|---|---|---|---|
| 0.01 | 2.6699 | 12.1178 | 4.3369 | 1.8245 |
| 0.02 | 2.7955 | 24.6427 | 8.6658 | 3.4972 |
| 0.04 | 3.0016 | 35.7516 | 14.4496 | 5.0863 |
| 0.08 | 3.1047 | 37.8442 | 25.2964 | 7.2166 |
| 0.16 | 3.2119 | 55.4663 | 43.0918 | 9.4149 |
| 0.31 | 3.2614 | 61.023 | 65.6292 | 9.8254 |
| 0.62 | 3.4791 | 70.5527 | 111.0134 | 10.7716 |
| 1.0 | 1.5253 | 42.0009 | 45.3035 | 11.5082 |
| 2.0 | 0.7137 | 5.29 | 17.3306 | 13.3808 |
| 4.0 | 0.721 | 5.2936 | 19.2567 | 13.639 |

cc @csullivan @mehrdadh

@tmoreau89 (Contributor) commented:

Very neat work @nverke ! CC @masahi @kparzysz-quic

@csullivan (Contributor) left a comment:

👏 Amazing work characterizing the performance of VTCM and DMA @nverke. :shipit:

```diff
@@ -58,7 +58,7 @@ def __init__(
 remote_kw: dict,
 session_name: str = "hexagon-rpc",
 remote_stack_size_bytes: int = 256 * 1024,  # Min size for main thread in QuRT/sim
-rpc_receive_buffer_size_bytes: int = 5 * 1024 * 1024,  # Size for passing hexagon tests
+rpc_receive_buffer_size_bytes: int = 1024 * 1024 * 1024,  # Size for passing hexagon tests
```
Review comment:

Left over from testing? A gigabyte for the RPC buffer can impact available memory for model execution.

Comment on lines +50 to +51:

```python
A = T.match_buffer(a, size, dtype="int8", align=128, scope="global")
A_global_vtcm = T.match_buffer(a_v, size, dtype="int8", align=128, scope="global.vtcm")
```
Review comment:

A nice follow-up would be to support measuring the bandwidth in both directions (ddr->vtcm and vtcm->ddr), given that in tests we saw a significant perf asymmetry; making that easily reproducible can help the QC experts debug or provide us with insights.

@nverke marked this pull request as ready for review on September 2, 2022.
@tmoreau89 merged commit 8058423 into apache:main on Sep 13, 2022.
@tmoreau89 (Contributor) commented:

Thank you @csullivan and @nverke for the PR; it's been merged!

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
…on. (apache#12667)

* [Hexagon] Increase max buffer size for tvm_rpc_android to 1GB.

* [Hexagon] Make errors more clear when unable to allocate VTCM buffers and throw an error to fail early.

* [Hexagon] Add mem_copy_DLTensor to enable directly calling DMA for mem copies.

* [Hexagon] Add new tests as examples of the performance to expect when copying data to VTCM.

* [Hexagon] Reduce rpc max size.

* [Hexagon] Fix test_parallel_hvx_load_vtcm.py test output to be human readable.

* Comment out tests that only work on 8Gen1 HDKs to get CI to pass