-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TIR][TOPI][x86][CI] Support skylake avx512 #13621
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot |
5601b8d
to
3a8fbf3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @vvchernov for this excellent addition on x86 coverage !
- I've suggested few nits in the code, mostly cosmetic.
- We could add (next PR?) 4x4 (ssse3/m128), 8x4 (avx2/m256)
skylake
counterparts:
tvm/python/tvm/topi/x86/tensor_intrin.py
Lines 90 to 107 in 58f924d
if int32_lanes == 4: int_lx32 = "int16x8" int_8xl = "int8x16" int_32xl = "int32x4" pmaddubs = "llvm.x86.ssse3.pmadd.ub.sw.128" pmaddw = "llvm.x86.sse2.pmadd.wd" elif int32_lanes == 8: int_lx32 = "int16x16" int_8xl = "int8x32" int_32xl = "int32x8" pmaddubs = "llvm.x86.avx2.pmadd.ub.sw" pmaddw = "llvm.x86.avx2.pmadd.wd" elif int32_lanes == 16: int_lx32 = "int16x32" int_8xl = "int8x64" int_32xl = "int32x16" pmaddubs = "llvm.x86.avx512.pmaddubs.w.512" pmaddw = "llvm.x86.avx512.pmaddw.d.512"
python/tvm/relay/op/strategy/x86.py
Outdated
and target_has_vnni(mcpu) | ||
and target_has_avx512(mcpu) | ||
and inputs[0].dtype == "uint8" | ||
and inputs[1].dtype == "int8" | ||
and inputs[1].shape[-2] % 16 == 0 | ||
and inputs[1].shape[-1] % 4 == 0 | ||
): | ||
strategy.add_implementation( | ||
wrap_compute_batch_matmul(topi.x86.batch_matmul_vnni_compute, need_out_dtype=True), | ||
wrap_topi_schedule(topi.x86.schedule_batch_matmul_vnni), | ||
name="batch_matmul_vnni.x86", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Don't remove the
vnni
one completley - Instead, maybe a descend strategy would be better:
- ifhas_vnni
// llvm -mcpu=cascadelake
- elifhas_avx512
// llvm -mcpu=skylake
- elifhas_avx2
(in a future PR) // llvm -mcpu=haswell
- elifhas_ssse3
(in a future PR) // llvm -mcpu=sandybridge - See strategy/x86.py, also descending
plevel
, in the upcoming PR#13642
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @cbalint13! Thank you for your nits and remarks! In this case VNNI was not removed but extended, as you know VNNI is a part of AVX512 architectures. The fork is here:
https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
As you correctly remarked avx2 and ssse3 are also processed here, but they are not accessable due to high-level check target_has_avx512. Possibly you suggestion is good way how to resolve it further. Now I only extended existed approach for avx512.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
know VNNI is a part of AVX512 architectures. The fork is here:
https://github.com/apache/tvm/blob/main/python/tvm/topi/x86/tensor_intrin.py#:~:text=def%20dot_16x1x16_uint8_int8_int32()%3A,return%20dot_16x1x16_uint8_int8_int32_skylake()
As you correctly remarked avx2 and ssse3 are also processed here, but they are not accessable due to high-level check target_has_avx512.
Can't see original llvm.x86.vnni instrinsic one in the above.
This switch, right in the provided tensor_intrin.py fork:
if target_has_vnni(mcpu):
# VNNI capable platform
return dot_16x1x16_uint8_int8_int32_cascadelake()
# vpmaddubsw/vpmaddwd fallback
return dot_16x1x16_uint8_int8_int32_skylake()
As +SIMD keeps coming, would't be better to stay as upcoming strategy/x86.py if/elif
+ preferece plevel
?
- The tensor_intrin.py would remain only as enums of SIMD (no decisions), triages would stay strategy, etc.
- User may control the fall into strategy by narrowing
"llvm +mattr={avx512bw,avxvnni,...}"
, as llvm flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @cbalint13! Your view looks reasonable and there is no problems to reimplement it from my side. But I did not implement method dot_16x1x16_uint8_int8_int32
with conditions on tensor_intrin.py side and thought that it is brick to build some concept. @elvin-n and @jwfromm what do you think about Balint's view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @cbalint13! Your view looks reasonable and there is no problems to reimplement it from my side. But I did not implement method
dot_16x1x16_uint8_int8_int32
with conditions on tensor_intrin.py side and thought that it is brick to build some concept. @elvin-n and @jwfromm what do you think about Balint's view?
Thanks for clarifications, I see your point, it is perfectly fine way too.
I think by making all CI tests to pass in green more reviewers will come.
I try sum up, on this very pinned PR change on strategy/x86.py, visibile on top of this thread:
-from tvm.topi.x86.utils import target_has_vnni
+from tvm.topi.x86.utils import target_has_avx512
- and target_has_vnni(mcpu)
+ and target_has_avx512(mcpu)
- wrap_compute_dense(topi.x86.dense_vnni),
- wrap_topi_schedule(topi.x86.schedule_dense_vnni),
+ wrap_compute_dense(topi.x86.dense_int8),
+ wrap_topi_schedule(topi.x86.schedule_dense_int8),
- This merge vnni to avx512 (under new dense_int8 umbrella) arguing that VNNI is subset of AVX512 group.
- VNNI is subset of AVX512 group, however there are CPU having AVX512 but no VNNI [1].
[1] https://en.wikipedia.org/wiki/AVX-512#VNNI
My view was to leave separate avx512 & vnni(as was) in strategy/x86.py (not to merge vnni->avx512)
My argument was to triage any SIMD right in strategy/x86.py as upcoming AMX do here + plevel control.
I saw VNNI and AVX512 +(AVX2, SSSE3) as potentialy independend things, moreover choosable via "llvm +mattr=...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instruction set for SSE/AVX2/AVX512 for int8 is absolutely the same, the only difference is the number of lanes. Additionally, the patterns how these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used, is the same as the only VNNI instruction (VPDPBUSD). I.e. it is reasonable to have the only tvm intrinsic, it is reasonable to remove VNNI from the name of the function, and it is reasonable to extend these intrinsic function to SSE and AVX2 that is not done yet in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instruction set for SSE/AVX2/AVX512 for int8 is absolutely the same, the only difference is the number of lanes.
Yes, same class doing integer dot products on immediate registers, but mention:
- different clocking, timing & implementation on ASIC
- (auto)tensorization opportunities differ as inner loops match differently
Additionally, the patterns how these int8 instructions (VPMADDUBSW/VPMADDWD/VPADDD) are used, is the >same as the only VNNI instruction (VPDPBUSD).
Right.
- VNNI insn. accumulates into int32 lanes in single step: vpdpbusd
- AVX512 (incl. AVX2, SSSE3 ones) does same in two-step, e.g: pmaddubs + pmadd
I.e. it is reasonable to have the only tvm intrinsic, it is reasonable to remove VNNI from the name of the function, and it is reasonable to extend these intrinsic function to SSE and AVX2 that is not done yet in this PR
- Indeed the proposed intrinsic merger is perfectly fine.
- It was possible to question it with reasonable arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
different clocking, timing & implementation on ASIC
What kind of ASIC do you mean?
(auto)tensorization opportunities differ as inner loops match differently
Under tensorization opportunities differ
do yo mean different number of lanes for different instruction set which can be reflected in potential different blocking size? Or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
different clocking, timing & implementation on ASIC
What kind of ASIC do you mean?
- CPU, family of x86, different generations, varying extended ISA layouts: amx avx512 vnni avx2 ssse3 sse2
(auto)tensorization opportunities differ as inner loops match differently
Undertensorization opportunities differ
do yo mean different number of lanes for different instruction set which can be reflected in potential different blocking size?
- Yes, both input-widths and output-lanes yields different outcomes, varying performances.
- E.g. autotensorizer will opportunistically search to permute & match inner loops to these varying sizes.
Or something else?
- TVM is a compiler after all, to my knowledge the only capable of auto-tensorization with arbitrary intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider amx vs vnni avx512 avx2 sse3 (btw, there is no sse2 for int8, required instructions appeared if I am not mistaken in sse3.x) because first is matrix multiplication, other ones are vector instructions. For now I propose to go from local to generic and when we see needs in differentiate vector sets, we will do this. For now pattern look similar for all of vector instructions, the aspect of blocking should be added separately if it is not done yet, The aspect of lanes in TVM intrinsic should be covered in this PR
match inner loops to these varying sizes.
The inner loop is the same for all these instructions. It will be
for (int k = 0; k < 4; k++){
output[i] += data[k] * kernel[i][k]
}
TVM is a compiler after all, to my knowledge the only capable of auto-tensorization with arbitrary intrinsic.
I agree, at the same time I propose to move from local to generic patterns. We do not limit anything for now
Thanks! I agree and also thought how it can be done in reasonable way. And yes, I think it should be done in separated PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvchernov ,
It is important to make tests for this PR pass all in green.
Unsure for the actual failures, maybe CI lacks related avx512 ISA ?
0000b49
to
1aa2093
Compare
…ted shedules and postprocs for it. add TODOs for further check and development
cccd755
to
a289d4b
Compare
Hello @areusch, @driazati, @junrushao! Could you see this PR? |
Happy to take a look tomorrow :-) |
Hello @masahi! Could you see this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Please remove all diffs from some unittests where VNNI vs AVX512 difference doesn't matter. The diffs in this PR is unnecessarily big.
- Please verify that VNNI tests are still functional after this change.
@@ -28,6 +28,7 @@ | |||
from tvm.tir.schedule import BlockRV, Schedule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename it to test_meta_schedule_cpu_dot_product.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. But I'm not sure that it is clearest name due to cpu includes not only Intel architectures. Nevertheless there is no other similar test to disturb somebody
@@ -41,6 +41,71 @@ | |||
from tvm.te import create_prim_func |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also remove change from this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no big changes, I tried to unify tests using the same classes, but my try failed and I return it back (as fact it was replaced inside the file). I've rollbacked the trasferred code. Just now there is pylint fix and renaming for the sake of clarity (not only VNNI is checked)
Hello @masahi! Some words about check of VNNI functionality after changing: 1. Unfortunately I do not have machine with VNNI to check it locally therefore I based on CI test for VNNI, 2. This PR is devoted to support of avx512 which was implemented in paralell to VNNI functionality. Changes touching VNNI are related to renaming or unifying test common code. 3. I plan to open new PR with fixes for VNNI of some small issues observed during this work if they (fixes) are correct. |
* add skylake-avx512 tests * extend tests by skylake-avx512 * lint fixes * fix misprinting * misprinting fix * TODOs for further development * add temporally commented tests for skylake-avx512 due to not implemented shedules and postprocs for it. add TODOs for further check and development * update int8-acc32 test for vnni and avx512 w/o it * pylint fix * once more pylint fix * fix Feature init for skylake * fix test * fix intrin names for assert for skylake * small fix * return back fast int8 intrinsic tests * test connect of dense and batch_matmul to avx512 tensorization * extend dense_alter_layout on avx512 (currently) instead of VNNI. some renaming vnni to int8 for the sake of clarity * more renaming vnni to int8 for dense schedule, compute, strategy for the sake of clarity * update for batch_matmul with avx512 * extend space generator init for avx512. Add Default AVX512 schedule rules * avx512 dot 16x4 intrin was implemented for MS default schedule rule * small fix * update * pylint fixes * test workaround for const alloc in tir * test fix (broadcasting) * remove excess instructions from dot_product_16x4_u8i8i32_avx512 * pylint fix * skip asm check for askew weight shapes * fix pylint * revert test fix * set number of args * test fix * fix const allocation in tir for avx512 dot 16x4 * fix signature of dot_product_16x4_u8i8i32_avx512 * use script instead of tvm.tir for const allocation * extend auto tensorize test by skylake-avx512 target * clean code * update test_op_level1, resolve TODO * small update test_op_level2 * update test_op_level10, resolve TODO * update qnn legalize pass test, resolve TODOs * pylint fixes * update ms test for avx512 * update more ms test for avx512 * try to fix i386 CI tests * fix intrin name for check * skip test due to model downloading issue * fix test failure * use ORT for conv2d check * lint fix after rebasing * comment ORT part of test * extend tests tir schedule analysis and transform for avx512. unify test classes * extend test tir schedule tensorize for avx512 * extend test meta schedule vnni integration for avx512 * rename test file * pylint fix * tag fix * update test meta schedule trace apply with avx512 * rollback test class unifying in utils * pylint fixes * separate TIRs for scheduled conv2d for vnni and avx512 * fix registering issue in test * update conv+bias onnx model for intermediate test * fix int16 overflow * fix int16 overflow for dense test * update input data for test of dense * small rollback * fix misprinting * fix * restart CI * DefaultVNNI was renamed to DefaultLLVM for mutator * rename test file for the sake of clarity * DefaultVNNI was renamed to DefaultCPUTensorization for postproc * remove resolved TODO * DefaultVNNI and AVX512 for ScheduleRule were unified * replace code to upstream with initial version * fix arg type * lint fix * small fix * lint fix * fix misprinting * rollback trace apply test for avx512 (reviewer remark) * fix pylint Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
It looks like despite of some avx512 intrinsics were supported (see topi/x86 and tir), they are not used during simple compilation or tuning model by meta-scheduler on skylake-avx512 target.
The aim is end-to-end support of Skylake X architecture on TVM side for dense, batch_matmul and conv ops.
Details