Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking #20149

Merged


kunal-vaishnavi
Contributor

Description

This PR adds flash attention v2 and support for INT4 CUDA benchmarking in PyTorch.

Motivation and Context

The flash attention v2 algorithm helps improve model performance in PyTorch. INT4 CUDA support in PyTorch is provided through the bitsandbytes package.
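
The diff itself is not reproduced on this page, so the following is only a sketch of how the two options could be switched on when loading a PyTorch model. It assumes the model is loaded through Hugging Face `transformers`, which the description does not state. The helper names `build_load_plan` and `load_model` and the precision labels `fp32`, `fp16`, `int4` are hypothetical and are not taken from the PR.

```python
from typing import Any, Dict

# Hypothetical precision labels; the PR's actual option names are not shown here.
SUPPORTED_PRECISIONS = {"fp32", "fp16", "int4"}


def build_load_plan(precision: str, use_flash_attention: bool) -> Dict[str, Any]:
    """Describe how a model should be loaded, without importing torch."""
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision}")

    plan: Dict[str, Any] = {
        # INT4 weights are still computed on in half precision.
        "torch_dtype": "float32" if precision == "fp32" else "float16",
        "load_in_4bit": precision == "int4",
    }
    if use_flash_attention:
        # The flash-attention kernels support only fp16 and bf16 inputs,
        # so this sketch rejects fp32 up front (a design choice of the sketch).
        if precision == "fp32":
            raise ValueError("flash attention v2 needs fp16 or bf16")
        plan["attn_implementation"] = "flash_attention_2"
    return plan


def load_model(model_name: str, plan: Dict[str, Any]):
    """Load a causal LM according to the plan (needs a CUDA GPU and the packages below)."""
    # Heavy imports are deferred so the plan can be built and inspected anywhere.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    dtype = getattr(torch, plan["torch_dtype"])
    kwargs: Dict[str, Any] = {"torch_dtype": dtype}
    if "attn_implementation" in plan:
        # Requires the flash-attn package to be installed.
        kwargs["attn_implementation"] = plan["attn_implementation"]
    if plan["load_in_4bit"]:
        # 4-bit quantization is delegated to bitsandbytes.
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=dtype,
        )
    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
```

Under these assumptions, a benchmark script would call `build_load_plan("int4", True)` and pass the result to `load_model` together with a model name. Keeping the plan separate from the load call makes the chosen configuration easy to log alongside the benchmark numbers.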

@kunal-vaishnavi kunal-vaishnavi merged commit a0ebd5f into microsoft:main Mar 30, 2024
94 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Apr 2, 2024
### Description
This PR adds flash attention v2 and support for INT4 CUDA benchmarking
in PyTorch.

### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention)
algorithm helps improve model performance in PyTorch. Support for INT4
CUDA in PyTorch is done through the
[`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
…osoft#20149)
