
[CPU] Enabled FP16 Compressed FC on models with PagedAttention #26279

Conversation

dmitry-gorokhov
Contributor

No description provided.

Contributor

@usstq usstq left a comment


LGTM. The vcvtph2ps instruction used to decompress FP16 into FP32 introduced some overhead in the tests shown in https://jira.devtools.intel.com/browse/CVS-133453, but given the following:

  • most serving workloads with big batch sizes (compute-bound) would run on Xeon with the (src-bf16, wei-bf16) or (src-f16, wei-f16) kernels
  • workloads running on AI PC would most likely use small batch sizes, where (src-f32, wei-f16) helps regardless of whether SDPA or PA is used

we can safely enable it.

@dmitry-gorokhov dmitry-gorokhov added this pull request to the merge queue Oct 14, 2024
Merged via the queue into openvinotoolkit:master with commit 7250c1e Oct 14, 2024
153 checks passed
@dmitry-gorokhov dmitry-gorokhov deleted the feature/fp16_compressed_fc branch October 14, 2024 07:19
Labels
category: CPU OpenVINO CPU plugin
3 participants