Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Cache weight for large batch inference for full bf16 and WOQ lowp-mode=bf16 (#2898)

* Keep bf16 weight for WOQ first token
* Keep first token weight for WOQ int4 and full bf16
* Revert unnecessary changes
* Fix clang-format issue
* Fix UT failures
* Fix concat linear
* Fix UT failures
* Fix lint issue
* Cache extra weight at runtime instead of ahead-of-time
* Fix lint
1 parent 2795053, commit 52f8c48
Showing 30 changed files with 968 additions and 283 deletions.