Write README and front page of doc #147
Conversation
@zhuohan123 PTAL.
LGTM! Left some comments on sentence phrasing and formatting.
```bash
python test_cli_client.py
```
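For context, a minimal CLI test client along these lines might build a JSON request and POST it to a locally running server. This is a hypothetical sketch, not the repository's actual `test_cli_client.py`; the endpoint path, port, and field names are assumptions.

```python
import json
from urllib import request

# Assumed endpoint of a locally running generation server (placeholder,
# not a guaranteed default).
SERVER_URL = "http://localhost:8001/generate"

def build_request(prompt: str, max_tokens: int = 32) -> bytes:
    """Encode a generation request as JSON bytes for an HTTP POST."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(payload).encode("utf-8")

def send_request(prompt: str) -> str:
    """POST the prompt to the server and return the generated text."""
    req = request.Request(
        SERVER_URL,
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Usage (requires a running server):
#   send_request("Hello, my name is")
```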
- State-of-the-art performance in serving throughput
Suggested change: "State-of-the-art performance in serving throughput" → "State-of-the-art serving throughput"
README.md
Outdated
```bash
python test_cli_client.py
```
- State-of-the-art performance in serving throughput
- Efficient management of cached attention keys and values with **PagedAttention** |
Suggested change: "Efficient management of cached attention keys and values with **PagedAttention**" → "Efficient management of cached attention keys and values memory with **PagedAttention**"
I think "memory" here is redundant and a bit confusing as we already said they are "cached".
Fixed to "Efficient management of attention key and value memory with **PagedAttention**".
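For readers following the wording discussion, the idea behind PagedAttention is that KV-cache memory is managed in fixed-size blocks through a per-sequence indirection table, so cache storage need not be contiguous. A toy sketch of that bookkeeping (all names and sizes are illustrative, not vLLM's implementation):

```python
BLOCK_SIZE = 4  # tokens per physical cache block (illustrative)

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = list(free_blocks)  # pool of physical block ids
        self.blocks = []                      # blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        """Reserve cache space for one more token, allocating a new
        physical block only when the current one is full."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

table = BlockTable(free_blocks=range(8))
for _ in range(6):
    table.append_token()
# 6 tokens with block size 4 occupy exactly 2 physical blocks
```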
docs/source/index.rst
Outdated
- Efficient support for various decoding algorithms such as parallel sampling and beam search
- Tensor parallelism support for multi-GPU inference
- Streaming outputs
- OpenAI-compatible API
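Since the list advertises an OpenAI-compatible API, a hedged sketch of what a request body for such an endpoint could look like follows. The `/v1/completions` path mirrors the OpenAI API shape; the port and model name are placeholders, not guaranteed defaults.

```python
import json

# Assumed local endpoint following the OpenAI completions API shape.
API_URL = "http://localhost:8000/v1/completions"

def completion_payload(model: str, prompt: str, *, max_tokens: int = 16,
                       temperature: float = 0.0, stream: bool = False) -> str:
    """Serialize an OpenAI-style completion request to a JSON string."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,  # set True for the streaming outputs listed above
    })

payload = completion_payload("facebook/opt-125m", "The capital of France is")
```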
Ditto on comments in README.
Closes #124