forked from vllm-project/vllm
[CPU] Avoid copy result and force allocation #15
Merged: ilya-lavrenov merged 1 commit into ilya-lavrenov:openvino-model-executor from luo-cheng2021:luocheng/openvino-model-executor-opt on Mar 25, 2024
Conversation
luo-cheng2021 commented on Mar 25, 2024:
- Avoid copying the result, using the same approach as GenAI. For an input length of 1024, the first-inference latency drops from ~830 ms to ~790 ms.
- Force KV-cache allocation ahead of time so the first few prompts do not cost more than expected; this saves about 20~30 ms on the first inference for the affected prompts.
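The second point (forcing allocation up front) can be sketched as follows. This is an illustration only, not the PR's actual code: the shapes, the `preallocate_kv_cache` helper, and the layer/block counts are all hypothetical placeholders for values that would come from the model configuration.

```python
import numpy as np

# Hypothetical shapes -- the real values come from the model config.
num_layers, num_blocks, block_size = 2, 4, 16
num_heads, head_size = 8, 64

def preallocate_kv_cache():
    """Allocate and touch every KV-cache block up front, so the first
    prompts do not pay the lazy-allocation / page-fault cost."""
    kv_cache = []
    for _ in range(num_layers):
        k = np.zeros((num_blocks, num_heads, block_size, head_size),
                     dtype=np.float32)
        v = np.zeros_like(k)
        # Writing into the buffers forces the OS to back the pages now,
        # instead of on the first real prompt.
        k[...] = 0.0
        v[...] = 0.0
        kv_cache.append((k, v))
    return kv_cache

cache = preallocate_kv_cache()
```

The point is simply to move the one-time allocation cost out of the first inference, which is what the 20~30 ms saving above refers to.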
@@ -51,8 +51,9 @@ def ov_wrapper(self, *args, **kwargs) -> torch.Tensor:
     else:
         inputs.append(np.array(0, dtype=np.int32))  # for optimum-based models this parameter can be used even on the first iteration

-    outputs = self._ov_request.infer(inputs, share_inputs=True, share_outputs=False)
-    return torch.from_numpy(outputs[0])
+    self._ov_request.start_async(inputs, share_inputs=True)
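The shape of the sync-to-async change can be sketched with a stand-in object. `FakeInferRequest` below is hypothetical; its method names mirror OpenVINO's `InferRequest` API (`start_async` / `wait` / `get_output_tensor`), but the data and behavior are made up for illustration.

```python
import numpy as np

class FakeInferRequest:
    """Stand-in for an OpenVINO InferRequest (illustration only)."""
    def start_async(self, inputs, share_inputs=True):
        # Real code launches inference without copying the inputs.
        self._out = np.asarray(inputs[0]) * 2.0
    def wait(self):
        pass  # real code blocks here until inference completes
    def get_output_tensor(self, idx=0):
        return self._out  # returned buffer aliases internal memory

req = FakeInferRequest()
req.start_async([np.ones(4, dtype=np.float32)], share_inputs=True)
req.wait()
out = req.get_output_tensor(0)  # read the result without an extra copy
```

Compared with the old `infer(...)` + `torch.from_numpy(outputs[0])` path, the result is consumed directly from the request's output tensor instead of being copied out.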
is it the same as share_outputs=True?
Yes, it is. But when there are several outputs, the current approach should still be the best.
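The trade-off behind `share_outputs` can be shown in miniature with NumPy (an analogy, not OpenVINO itself): sharing hands back a view of existing memory, while the default copies the data into a fresh buffer the caller owns.

```python
import numpy as np

internal = np.arange(6, dtype=np.float32)  # pretend: plugin-owned output

shared = internal.view()   # share_outputs=True analogue: zero-copy view
copied = internal.copy()   # share_outputs=False analogue: safe copy

assert np.shares_memory(internal, shared)
assert not np.shares_memory(internal, copied)

internal[0] = 99.0
# The shared view observes the mutation, the copy does not -- which is
# why sharing is only safe if the output is consumed before the next
# inference overwrites the buffer.
```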
ilya-lavrenov approved these changes on Mar 25, 2024.
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request on Mar 26, 2024:

…g bf16 (#23620)

### Details:
- *Use a specific kernel for the 2D f32-to-bf16 conversion instead of multiple calls to cpu_convert.* cpu_convert invokes parallel_for internally; when the copy count is small (e.g. only a head size of 128), each core copies only ~2 elements on a 60-core machine, which results in false sharing. The fix reduces the cost from ~1700 ms to ~860 ms. The SDPA path copies a block of heads (e.g. 32*128), so it is not easily impacted, but very small prompt sizes would also suffer from the problem.
- *Change the loop order from B,H,L to B,L,H to match the physical layout, reducing the cost from ~860 ms to ~830 ms.*
- *Changes in vLLM: ilya-lavrenov/vllm#15*

### Tickets:
- *ticket-id*
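What the f32-to-bf16 conversion computes can be sketched in NumPy. bfloat16 keeps the top 16 bits of an IEEE float32; the sketch below uses plain truncation for clarity (production kernels, including OpenVINO's, typically use round-to-nearest-even), and the helper names are made up.

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Return bf16 bit patterns (as uint16) for a float32 array,
    by truncating to the top 16 bits."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
    """Widen bf16 bit patterns back to float32 (this direction is exact)."""
    return (b.astype(np.uint32) << 16).view(np.float32)

# Values chosen to be exactly representable in bf16's 8-bit mantissa,
# so the round-trip is lossless here.
x = np.array([1.0, 3.140625, -2.5], dtype=np.float32)
roundtrip = bf16_bits_to_f32(f32_to_bf16_bits(x))
```

The commit's point is not the arithmetic itself but doing it in one dedicated 2D kernel, so a tiny per-call copy count is never split across 60 cores.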
itikhono pushed a commit to itikhono/openvino that referenced this pull request on Mar 28, 2024 (same change, openvinotoolkit#23620).
dnkurek pushed a commit to dnkurek/openvino that referenced this pull request on Apr 8, 2024 (same change).
bbielawx pushed a commit to bbielawx/openvino that referenced this pull request on Apr 12, 2024 (same change).
alvoron pushed a commit to alvoron/openvino that referenced this pull request on Apr 29, 2024 (same change).
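The commit's second optimization, reordering the loops from B,H,L to B,L,H, can be illustrated without timing anything: under an assumed contiguous [B, L, H] physical layout, the B,L,H loop visits strictly increasing memory offsets, while B,H,L jumps through memory. The dimensions and layout below are hypothetical.

```python
# Assumed physical layout: contiguous [B, L, H], so the element strides are:
B, L, H = 2, 3, 4
strides = {"B": L * H, "L": H, "H": 1}
dims = {"B": B, "L": L, "H": H}

def visit_offsets(order: str) -> list:
    """Flat memory offsets visited by a triple loop in the given order."""
    out = []
    for i in range(dims[order[0]]):
        for j in range(dims[order[1]]):
            for k in range(dims[order[2]]):
                out.append(i * strides[order[0]]
                           + j * strides[order[1]]
                           + k * strides[order[2]])
    return out

seq = visit_offsets("BLH")  # matches layout: sequential, cache-friendly
jmp = visit_offsets("BHL")  # strides through memory, poor locality
```

Sequential access keeps each cache line fully used before it is evicted, which is where the ~860 ms to ~830 ms reduction quoted above comes from.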