[CPU] Avoid copy result and force allocation #15

Conversation

@luo-cheng2021 (Author)

  • Avoid copying the result by using the same approach genai uses; for an input length of 1024, the first-inference cost drops from ~830 ms to ~790 ms.
  • Force KV-cache allocation ahead of time so the first few prompts do not cost longer than expected; this saves about 20-30 ms on the first-inference cost of the affected prompts (a rough sketch of the idea follows this list).
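
The pre-allocation point is only described in prose in this PR, so here is a hedged, hypothetical illustration of the idea; none of the names or shapes below come from the actual vLLM/OpenVINO code, and the values are deliberately small:

```python
import numpy as np

# Illustrative values only; the real shapes come from the model config.
NUM_LAYERS, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_SIZE = 2, 64, 16, 32, 128


def preallocate_kv_cache(dtype=np.float32):
    """Materialize the KV-cache buffers before the first prompt arrives.

    Writing to every element forces the pages to be committed up front,
    so the first few inferences do not pay the allocation cost.
    """
    kv_cache = []
    for _ in range(NUM_LAYERS):
        key_cache = np.empty((NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_SIZE), dtype=dtype)
        value_cache = np.empty((NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_SIZE), dtype=dtype)
        key_cache.fill(0)    # touch the memory so allocation is not deferred
        value_cache.fill(0)
        kv_cache.append((key_cache, value_cache))
    return kv_cache


kv_cache = preallocate_kv_cache()
```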

@@ -51,8 +51,9 @@ def ov_wrapper(self, *args, **kwargs) -> torch.Tensor:
     else:
         inputs.append(np.array(0, dtype=np.int32))  # for optimum-based models this parameter can be used even on the first iteration
 
-    outputs = self._ov_request.infer(inputs, share_inputs=True, share_outputs=False)
-    return torch.from_numpy(outputs[0])
+    self._ov_request.start_async(inputs, share_inputs=True)
@ilya-lavrenov (Owner) commented on Mar 25, 2024

Is it the same as share_outputs=True?

@luo-cheng2021 (Author) replied on Mar 25, 2024

Yes, it is. But if there are several outputs, the current approach should be the best option.
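
To make the trade-off concrete, here is a minimal sketch of the two variants being discussed. The helper names and the get_output_tensor(0) readback are illustrative assumptions, not the exact code in this PR:

```python
import torch


def logits_via_shared_outputs(ov_request, inputs):
    # share_outputs=True: infer() returns views over *all* output tensors,
    # and the first one is wrapped for torch.
    outputs = ov_request.infer(inputs, share_inputs=True, share_outputs=True)
    return torch.from_numpy(outputs[0])


def logits_via_async_request(ov_request, inputs):
    # The asynchronous variant: run the request, then read only the single
    # tensor that is needed. With several model outputs, nothing beyond that
    # one tensor has to be wrapped or copied. (The exact readback after
    # wait() is not shown in the diff above; get_output_tensor(0) is one
    # possibility.)
    ov_request.start_async(inputs, share_inputs=True)
    ov_request.wait()
    return torch.from_numpy(ov_request.get_output_tensor(0).data)
```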

@ilya-lavrenov merged commit f3a397f into ilya-lavrenov:openvino-model-executor on Mar 25, 2024
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Mar 26, 2024
…g bf16 (#23620)

### Details:
- *Use a dedicated kernel for the 2D f32 to bf16 conversion instead of
multiple calls to cpu_convert.*
- *There is an invocation of parallel_for inside cpu_convert; when the copy
count is small, e.g. only a head size of 128, each core copies only ~2
elements on a 60-core machine, which results in false sharing. After the fix
the cost drops from ~1700 ms to ~860 ms. The SDPA path copies a block of
heads, e.g. 32*128, so it is not easily impacted, but very small prompt sizes
may also suffer from the problem.*
- *Change the loop order from B,H,L to B,L,H to match the physical layout;
this reduces the cost from ~860 ms to ~830 ms (see the sketch after this
message).*
- *Changes in vLLM: ilya-lavrenov/vllm#15*

### Tickets:
 - *ticket-id*
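
A small numpy sketch of the loop-order point, under illustrative shapes (numpy has no bfloat16, so float16 stands in): when the physical layout is [B, L, H, S], iterating in B,L,H order lets every conversion cover a contiguous [H, S] block, whereas B,H,L order touches many small strided [S] slices.

```python
import numpy as np

B, L, H, S = 2, 64, 32, 128                             # illustrative shapes only
src = np.random.rand(B, L, H, S).astype(np.float32)     # physical layout: B, L, H, S
dst = np.empty((B, L, H, S), dtype=np.float16)          # float16 stands in for bf16

# B, H, L order: each innermost step converts one strided [S] slice, so
# consecutive iterations land far apart in memory.
for b in range(B):
    for h in range(H):
        for l in range(L):
            dst[b, l, h] = src[b, l, h].astype(np.float16)

# B, L, H order: H and S are the innermost, contiguous axes, so one
# conversion per (b, l) handles a whole contiguous [H, S] block.
for b in range(B):
    for l in range(L):
        dst[b, l] = src[b, l].astype(np.float16)
```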
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Mar 26, 2024
…g bf16 (#23620)
itikhono pushed a commit to itikhono/openvino that referenced this pull request Mar 28, 2024
…g bf16 (openvinotoolkit#23620)
dnkurek pushed a commit to dnkurek/openvino that referenced this pull request Apr 8, 2024
…g bf16 (openvinotoolkit#23620)
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request Apr 12, 2024
…g bf16 (openvinotoolkit#23620)
alvoron pushed a commit to alvoron/openvino that referenced this pull request Apr 29, 2024
…g bf16 (openvinotoolkit#23620)