This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[LLM Runtime] refactor itrex backend based on the latest Jblas #769

Merged · 94 commits · Dec 13, 2023

Conversation

zhewang1-intc
Contributor

Type of Change

feature or bug fix or documentation or others
API changed or not

Description

detail description
JIRA ticket: xxx

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed

@DDEle (Contributor) left a comment


Glad to see the great refactor!

(It seems someone needs to fix the cpp graph.)

intel_extension_for_transformers/llm/library/.clang-format (review thread, resolved)
@luoyu-intel
Contributor

luoyu-intel commented Dec 1, 2023

Next-token decoding, beam_number=1, Xeon 8480+, 48 threads, group_size=128

CompType=fp32: 1.3x speedup
main branch:
perf_total_per_op_us[ FFN_SILU] = 16.995 ms
perf_total_per_op_us[ INNER PRODUCT] = 13.321 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 5.460 ms
perf_total_per_op_us[ FFN_SILU] = 12.781 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.931 ms

CompType=int8: 1.08x speedup
main:
perf_total_per_op_us[ MUL_QKV] = 4.812 ms
perf_total_per_op_us[ FFN_SILU] = 11.572 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.455 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 4.343 ms
perf_total_per_op_us[ FFN_SILU] = 10.654 ms
perf_total_per_op_us[ INNER PRODUCT] = 2.180 ms

CompType=bf16: ~1.3x speedup
main (no FFN/QKV fusion):
perf_total_per_op_us[ INNER PRODUCT] = 41.611 ms

pr branch:
perf_total_per_op_us[ MUL_QKV] = 6.667 ms
perf_total_per_op_us[ FFN_SILU] = 16.416 ms
perf_total_per_op_us[ INNER PRODUCT] = 3.089 ms
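The per-op timings above can be rolled up into a rough end-to-end speedup estimate. The sketch below does this for the int8 case, summing the three ops quoted in the logs; note it ignores any ops not listed, so it is only an approximation of the reported figure:

```python
# Rough speedup estimate from the per-op timings quoted above (int8 case).
# Only the three listed ops are summed, so this approximates the end-to-end number.
main_ms = {"MUL_QKV": 4.812, "FFN_SILU": 11.572, "INNER PRODUCT": 2.455}
pr_ms = {"MUL_QKV": 4.343, "FFN_SILU": 10.654, "INNER PRODUCT": 2.180}

speedup = sum(main_ms.values()) / sum(pr_ms.values())
print(f"int8 per-op speedup: {speedup:.2f}x")  # ~1.10x, in line with the reported 1.08x
```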

@DDEle
Contributor

DDEle commented Dec 4, 2023

Fused-Attention part (intel_extension_for_transformers/llm/runtime/graph/core/layers/mha_dense.cpp) is ready for review.

@luoyu-intel
Contributor

luoyu-intel commented Dec 5, 2023

Long prompt, len=2023, Xeon 8480+, 48 threads, group_size=128

CompType=int8: 1.2x speedup
main:
model_print_timings: prompt eval time = 1430.17 ms / 2023 tokens ( 0.71 ms per token)
model_print_timings: eval time = 410.02 ms / 15 runs ( 27.33 ms per token)
model_print_timings: total time = 2969.41 ms
========== eval time log of each prediction ==========
prediction 0, time: 1430.17ms
prediction 1, time: 28.56ms
prediction 2, time: 27.65ms
prediction 3, time: 27.34ms
prediction 4, time: 27.46ms

pr branch:
model_print_timings: prompt eval time = 1184.35 ms / 2023 tokens ( 0.59 ms per token)
model_print_timings: eval time = 389.48 ms / 15 runs ( 25.97 ms per token)
model_print_timings: total time = 2577.57 ms
========== eval time log of each prediction ==========
prediction 0, time: 1184.35ms
prediction 1, time: 27.49ms
prediction 2, time: 26.47ms
prediction 3, time: 25.85ms
prediction 4, time: 25.77ms

CompType=bf16: 1.14x speedup
main:
model_print_timings: prompt eval time = 1624.27 ms / 2023 tokens ( 0.80 ms per token)
model_print_timings: eval time = 722.67 ms / 15 runs ( 48.18 ms per token)
model_print_timings: total time = 3281.25 ms
========== eval time log of each prediction ==========
prediction 0, time: 1624.27ms
prediction 1, time: 49.15ms
prediction 2, time: 48.79ms
prediction 3, time: 48.60ms
prediction 4, time: 48.33ms
prediction 5, time: 48.33ms

pr branch:
model_print_timings: prompt eval time = 1422.43 ms / 2023 tokens ( 0.70 ms per token)
model_print_timings: eval time = 549.44 ms / 15 runs ( 36.63 ms per token)
model_print_timings: total time = 3063.09 ms
========== eval time log of each prediction ==========
prediction 0, time: 1422.43ms
prediction 1, time: 37.75ms
prediction 2, time: 36.69ms
prediction 3, time: 36.61ms
prediction 4, time: 36.57ms

CompType=fp32: 0.98x speedup
main:
model_print_timings: prompt eval time = 5066.65 ms / 2023 tokens ( 2.50 ms per token)
model_print_timings: eval time = 567.13 ms / 15 runs ( 37.81 ms per token)
model_print_timings: total time = 6633.77 ms
========== eval time log of each prediction ==========
prediction 0, time: 5066.65ms
prediction 1, time: 38.81ms
prediction 2, time: 38.33ms
prediction 3, time: 38.15ms
prediction 4, time: 37.60ms

pr branch:
model_print_timings: prompt eval time = 5184.93 ms / 2023 tokens ( 2.56 ms per token)
model_print_timings: eval time = 427.62 ms / 15 runs ( 28.51 ms per token)
model_print_timings: total time = 6701.66 ms
========== eval time log of each prediction ==========
prediction 0, time: 5184.93ms
prediction 1, time: 29.46ms
prediction 2, time: 29.46ms
prediction 3, time: 29.28ms
prediction 4, time: 29.23ms
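The reported per-token costs and speedups follow directly from the quoted timings. As a sanity check, the sketch below recomputes them from the `model_print_timings` prompt-eval lines above (values copied from the logs; the 2023-token prompt length is from the same lines):

```python
# Recompute ms/token and prompt-eval speedup from the timings quoted above.
N_TOKENS = 2023  # prompt length reported in the model_print_timings lines
prompt_ms = {
    "int8": {"main": 1430.17, "pr": 1184.35},
    "bf16": {"main": 1624.27, "pr": 1422.43},
    "fp32": {"main": 5066.65, "pr": 5184.93},
}
for comp, t in prompt_ms.items():
    speedup = t["main"] / t["pr"]
    print(f"{comp}: {t['pr'] / N_TOKENS:.2f} ms/token, {speedup:.2f}x vs main")
```

This reproduces the stated ~1.2x (int8), 1.14x (bf16), and 0.98x (fp32) prompt-eval ratios.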

@zhewang1-intc zhewang1-intc marked this pull request as ready for review December 7, 2023 02:55
@luoyu-intel luoyu-intel force-pushed the refactor_itrex_backend_based_on_new_jblas branch from c65666f to 31764ca Compare December 7, 2023 03:51
@luoyu-intel luoyu-intel force-pushed the refactor_itrex_backend_based_on_new_jblas branch from 31764ca to ecea677 Compare December 7, 2023 03:56
@airMeng
Contributor

airMeng commented Dec 7, 2023


@VincyZhang
Contributor

VincyZhang commented Dec 10, 2023

@VincyZhang
Contributor

https://inteltf-jenk.sh.intel.com/view/nlp-toolkit-validation/job/ITREX-cpp-graph-extension/200/

Some models failed without producing output and just hang for days (not a disk issue); please check @zhewang1-intc @luoyu-intel


@airMeng airMeng force-pushed the refactor_itrex_backend_based_on_new_jblas branch from d4d52ce to 221b82b Compare December 12, 2023 08:36
@airMeng airMeng force-pushed the refactor_itrex_backend_based_on_new_jblas branch from bc9c455 to 192f979 Compare December 12, 2023 12:05

@VincyZhang
Contributor

The Neuralchat UT failure is unrelated to this PR; please check @lvliang-intel

@VincyZhang VincyZhang merged commit 43e30bc into main Dec 13, 2023
20 of 22 checks passed
@VincyZhang VincyZhang deleted the refactor_itrex_backend_based_on_new_jblas branch December 13, 2023 06:17
8 participants