📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A nearly-live implementation of OpenAI's Whisper.
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
Higher-performance OpenAI LLM service than vLLM serve: a pure C++ high-performance OpenAI-compatible LLM service implemented with GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function calls, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.
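Since the service above exposes an OpenAI-compatible chat API, a client request can be sketched as follows. This is a minimal sketch: the host, port, and model name are assumptions for illustration, not values taken from the project.

```python
import json

# Hypothetical endpoint of an OpenAI-compatible server (host/port are assumptions).
BASE_URL = "http://localhost:9997/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "qwen2-instruct") -> dict:
    """Build an OpenAI-style chat-completion payload; the model name is a placeholder."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

payload = build_chat_request("What is TensorRT-LLM?")
print(json.dumps(payload, indent=2))

# To actually send it (requires a running server and the `requests` package):
# import requests
# resp = requests.post(BASE_URL, json=payload, timeout=60)
# print(resp.json()["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (including the official `openai` Python SDK pointed at a custom `base_url`) would work the same way against such a server.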
Chat With RTX Python API
Add-in for the new Outlook that adds LLM-powered features (composition, summarization, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.
A tool for benchmarking LLMs on Modal
LLM inference framework acceleration: make LLMs fly.
LLM tutorial materials including, but not limited to, NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.
Whisper optimization for real-time applications.
A simple project demonstrating LLM-assisted review of documentation on Atlassian Confluence.
A Large Language Model (LLM) oriented project providing easy-to-use features such as RAG, translation, and summarization.