[CPU] Avoid copy result and force allocation #15

Conversation

@luo-cheng2021 (Author)

  • Avoid copying the result by using the same approach genai uses; for an input length of 1024, the first-inference cost drops from ~830 ms to ~790 ms.
  • Force KV-cache allocation ahead of time so the first few prompts do not cost longer than expected; this saves about 20-30 ms on the first-inference cost of the affected prompts (a rough sketch of the idea follows this list).
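
The pre-allocation point is only described in prose in this PR, so here is a hedged, hypothetical illustration of the idea; none of the names or shapes below come from the actual vLLM/OpenVINO code, and the values are deliberately small:

```python
import numpy as np

# Illustrative values only; the real shapes come from the model config.
NUM_LAYERS, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_SIZE = 2, 64, 16, 32, 128


def preallocate_kv_cache(dtype=np.float32):
    """Materialize the KV-cache buffers before the first prompt arrives.

    Writing to every element forces the pages to be committed up front,
    so the first few inferences do not pay the allocation cost.
    """
    kv_cache = []
    for _ in range(NUM_LAYERS):
        key_cache = np.empty((NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_SIZE), dtype=dtype)
        value_cache = np.empty((NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_SIZE), dtype=dtype)
        key_cache.fill(0)    # touch the memory so allocation is not deferred
        value_cache.fill(0)
        kv_cache.append((key_cache, value_cache))
    return kv_cache


kv_cache = preallocate_kv_cache()
```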

@@ -51,8 +51,9 @@ def ov_wrapper(self, *args, **kwargs) -> torch.Tensor:
     else:
         inputs.append(np.array(0, dtype=np.int32))  # for optimum-based models this parameter can be used even on the first iteration
 
-    outputs = self._ov_request.infer(inputs, share_inputs=True, share_outputs=False)
-    return torch.from_numpy(outputs[0])
+    self._ov_request.start_async(inputs, share_inputs=True)
@ilya-lavrenov (Owner) commented on Mar 25, 2024

Is it the same as share_outputs=True?

@luo-cheng2021 (Author) replied on Mar 25, 2024

Yes, it is. But if there are several outputs, the current approach should be the best option.
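
To make the trade-off concrete, here is a minimal sketch of the two variants being discussed. The helper names and the get_output_tensor(0) readback are illustrative assumptions, not the exact code in this PR:

```python
import torch


def logits_via_shared_outputs(ov_request, inputs):
    # share_outputs=True: infer() returns views over *all* output tensors,
    # and the first one is wrapped for torch.
    outputs = ov_request.infer(inputs, share_inputs=True, share_outputs=True)
    return torch.from_numpy(outputs[0])


def logits_via_async_request(ov_request, inputs):
    # The asynchronous variant: run the request, then read only the single
    # tensor that is needed. With several model outputs, nothing beyond that
    # one tensor has to be wrapped or copied. (The exact readback after
    # wait() is not shown in the diff above; get_output_tensor(0) is one
    # possibility.)
    ov_request.start_async(inputs, share_inputs=True)
    ov_request.wait()
    return torch.from_numpy(ov_request.get_output_tensor(0).data)
```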

@ilya-lavrenov merged commit f3a397f into ilya-lavrenov:openvino-model-executor on Mar 25, 2024
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Mar 26, 2024
…g bf16 (#23620)

### Details:
- *Use a dedicated kernel for the 2D f32 to bf16 conversion instead of
multiple calls to cpu_convert.*
- *There is an invocation of parallel_for inside cpu_convert; when the copy
count is small, e.g. only a head size of 128, each core copies only ~2
elements on a 60-core machine, which results in false sharing. After the fix
the cost drops from ~1700 ms to ~860 ms. The SDPA path copies a block of
heads, e.g. 32*128, so it is not easily impacted, but very small prompt sizes
may also suffer from the problem.*
- *Change the loop order from B,H,L to B,L,H to match the physical layout;
this reduces the cost from ~860 ms to ~830 ms (see the sketch after this
message).*
- *Changes in vLLM: ilya-lavrenov/vllm#15*

### Tickets:
 - *ticket-id*
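
A small numpy sketch of the loop-order point, under illustrative shapes (numpy has no bfloat16, so float16 stands in): when the physical layout is [B, L, H, S], iterating in B,L,H order lets every conversion cover a contiguous [H, S] block, whereas B,H,L order touches many small strided [S] slices.

```python
import numpy as np

B, L, H, S = 2, 64, 32, 128                             # illustrative shapes only
src = np.random.rand(B, L, H, S).astype(np.float32)     # physical layout: B, L, H, S
dst = np.empty((B, L, H, S), dtype=np.float16)          # float16 stands in for bf16

# B, H, L order: each innermost step converts one strided [S] slice, so
# consecutive iterations land far apart in memory.
for b in range(B):
    for h in range(H):
        for l in range(L):
            dst[b, l, h] = src[b, l, h].astype(np.float16)

# B, L, H order: H and S are the innermost, contiguous axes, so one
# conversion per (b, l) handles a whole contiguous [H, S] block.
for b in range(B):
    for l in range(L):
        dst[b, l] = src[b, l].astype(np.float16)
```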
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Mar 26, 2024
…g bf16 (#23620)
itikhono pushed a commit to itikhono/openvino that referenced this pull request Mar 28, 2024
…g bf16 (openvinotoolkit#23620)
dnkurek pushed a commit to dnkurek/openvino that referenced this pull request Apr 8, 2024
…g bf16 (openvinotoolkit#23620)
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024
…g bf16 (openvinotoolkit#23620)
alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024
…g bf16 (openvinotoolkit#23620)