This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[LLM Runtime] refactor itrex backend based on the latest Jblas #769

Merged · 94 commits · Dec 13, 2023

Conversation

zhewang1-intc
Contributor

Type of Change

feature or bug fix or documentation or others
API changed or not

Description

detail description
JIRA ticket: xxx

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

@DDEle (Contributor) left a comment


Glad to see the great refactor!

(It seems someone needs to fix the cpp graph.)

intel_extension_for_transformers/llm/library/.clang-format (review thread, resolved)
@luoyu-intel
Contributor

luoyu-intel commented Dec 1, 2023

Next-token decoding, beam_number=1, Xeon 8480+, 48 threads, group_size=128

CompType=fp32: 1.3x speedup
main branch:
perf_total_per_op_us[ FFN_SILU] = 16.995 ms
perf_total_per_op_us[ INNER PRODUCT] = 13.321 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 5.460 ms
perf_total_per_op_us[ FFN_SILU] = 12.781 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.931 ms

CompType=int8: 1.08x speedup
main:
perf_total_per_op_us[ MUL_QKV] = 4.812 ms
perf_total_per_op_us[ FFN_SILU] = 11.572 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.455 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 4.343 ms
perf_total_per_op_us[ FFN_SILU] = 10.654 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.180 ms

CompType=bf16: ~1.3x speedup
main (no FFN/QKV fusion):
perf_total_per_op_us[ INNER PRODUCT] = 41.611 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 6.667 ms
perf_total_per_op_us[ FFN_SILU] = 16.416 ms
perf_total_per_op_us[ INNER PRODUCT] = 3.089 ms
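The per-op timings above can be rolled up into a rough end-to-end speedup estimate. The sketch below does this for the int8 case, summing the three ops quoted in the logs; note it ignores any ops not listed, so it is only an approximation of the reported figure:

```python
# Rough speedup estimate from the per-op timings quoted above (int8 case).
# Only the three listed ops are summed, so this approximates the end-to-end number.
main_ms = {"MUL_QKV": 4.812, "FFN_SILU": 11.572, "INNER PRODUCT": 2.455}
pr_ms = {"MUL_QKV": 4.343, "FFN_SILU": 10.654, "INNER PRODUCT": 2.180}

speedup = sum(main_ms.values()) / sum(pr_ms.values())
print(f"int8 per-op speedup: {speedup:.2f}x")  # ~1.10x, in line with the reported 1.08x
```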

@DDEle
Contributor

DDEle commented Dec 4, 2023

Fused-Attention part (intel_extension_for_transformers/llm/runtime/graph/core/layers/mha_dense.cpp) is ready for review.

@luoyu-intel
Contributor

luoyu-intel commented Dec 5, 2023

Long prompt, len=2023, Xeon 8480+, 48 threads, group_size=128

CompType=int8: 1.2x speedup
main:
model_print_timings: prompt eval time = 1430.17 ms / 2023 tokens ( 0.71 ms per token)
model_print_timings: eval time = 410.02 ms / 15 runs ( 27.33 ms per token)
model_print_timings: total time = 2969.41 ms
========== eval time log of each prediction ==========
prediction 0, time: 1430.17ms
prediction 1, time: 28.56ms
prediction 2, time: 27.65ms
prediction 3, time: 27.34ms
prediction 4, time: 27.46ms

pr branch:
model_print_timings: prompt eval time = 1184.35 ms / 2023 tokens ( 0.59 ms per token)
model_print_timings: eval time = 389.48 ms / 15 runs ( 25.97 ms per token)
model_print_timings: total time = 2577.57 ms
========== eval time log of each prediction ==========
prediction 0, time: 1184.35ms
prediction 1, time: 27.49ms
prediction 2, time: 26.47ms
prediction 3, time: 25.85ms
prediction 4, time: 25.77ms

CompType=bf16: 1.14x speedup
main:
model_print_timings: prompt eval time = 1624.27 ms / 2023 tokens ( 0.80 ms per token)
model_print_timings: eval time = 722.67 ms / 15 runs ( 48.18 ms per token)
model_print_timings: total time = 3281.25 ms
========== eval time log of each prediction ==========
prediction 0, time: 1624.27ms
prediction 1, time: 49.15ms
prediction 2, time: 48.79ms
prediction 3, time: 48.60ms
prediction 4, time: 48.33ms
prediction 5, time: 48.33ms

pr branch:
model_print_timings: prompt eval time = 1422.43 ms / 2023 tokens ( 0.70 ms per token)
model_print_timings: eval time = 549.44 ms / 15 runs ( 36.63 ms per token)
model_print_timings: total time = 3063.09 ms
========== eval time log of each prediction ==========
prediction 0, time: 1422.43ms
prediction 1, time: 37.75ms
prediction 2, time: 36.69ms
prediction 3, time: 36.61ms
prediction 4, time: 36.57ms

CompType=fp32: 0.98x speedup
main:
model_print_timings: prompt eval time = 5066.65 ms / 2023 tokens ( 2.50 ms per token)
model_print_timings: eval time = 567.13 ms / 15 runs ( 37.81 ms per token)
model_print_timings: total time = 6633.77 ms
========== eval time log of each prediction ==========
prediction 0, time: 5066.65ms
prediction 1, time: 38.81ms
prediction 2, time: 38.33ms
prediction 3, time: 38.15ms
prediction 4, time: 37.60ms

pr branch:
model_print_timings: prompt eval time = 5184.93 ms / 2023 tokens ( 2.56 ms per token)
model_print_timings: eval time = 427.62 ms / 15 runs ( 28.51 ms per token)
model_print_timings: total time = 6701.66 ms
========== eval time log of each prediction ==========
prediction 0, time: 5184.93ms
prediction 1, time: 29.46ms
prediction 2, time: 29.46ms
prediction 3, time: 29.28ms
prediction 4, time: 29.23ms
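The reported per-token costs and speedups follow directly from the quoted timings. As a sanity check, the sketch below recomputes them from the `model_print_timings` prompt-eval lines above (values copied from the logs; the 2023-token prompt length is from the same lines):

```python
# Recompute ms/token and prompt-eval speedup from the timings quoted above.
N_TOKENS = 2023  # prompt length reported in the model_print_timings lines
prompt_ms = {
    "int8": {"main": 1430.17, "pr": 1184.35},
    "bf16": {"main": 1624.27, "pr": 1422.43},
    "fp32": {"main": 5066.65, "pr": 5184.93},
}
for comp, t in prompt_ms.items():
    speedup = t["main"] / t["pr"]
    print(f"{comp}: {t['pr'] / N_TOKENS:.2f} ms/token, {speedup:.2f}x vs main")
```

This reproduces the stated ~1.2x (int8), 1.14x (bf16), and 0.98x (fp32) prompt-eval ratios.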

@zhewang1-intc zhewang1-intc marked this pull request as ready for review December 7, 2023 02:55
@luoyu-intel luoyu-intel force-pushed the refactor_itrex_backend_based_on_new_jblas branch from c65666f to 31764ca Compare December 7, 2023 03:51
@luoyu-intel luoyu-intel force-pushed the refactor_itrex_backend_based_on_new_jblas branch from 31764ca to ecea677 Compare December 7, 2023 03:56
@airMeng
Contributor

airMeng commented Dec 7, 2023


@VincyZhang
Contributor

VincyZhang commented Dec 10, 2023

@VincyZhang
Contributor

https://inteltf-jenk.sh.intel.com/view/nlp-toolkit-validation/job/ITREX-cpp-graph-extension/200/

Some models failed without producing output and just hang for days (not a disk issue); please check @zhewang1-intc @luoyu-intel


@airMeng airMeng force-pushed the refactor_itrex_backend_based_on_new_jblas branch from d4d52ce to 221b82b Compare December 12, 2023 08:36
@airMeng airMeng force-pushed the refactor_itrex_backend_based_on_new_jblas branch from bc9c455 to 192f979 Compare December 12, 2023 12:05

@VincyZhang
Contributor

The Neuralchat UT failure is unrelated to this PR; please check @lvliang-intel

@VincyZhang VincyZhang merged commit 43e30bc into main Dec 13, 2023
20 of 22 checks passed
@VincyZhang VincyZhang deleted the refactor_itrex_backend_based_on_new_jblas branch December 13, 2023 06:17
8 participants