A high-throughput and memory-efficient inference and serving engine for LLMs
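For a sense of how such an engine is typically driven, here is a minimal offline-inference sketch using vLLM's Python API; the model name and sampling settings are placeholder choices, not recommendations.

```python
# Minimal offline-inference sketch with vLLM (model name is a placeholder).
from vllm import LLM, SamplingParams

prompts = ["Explain KV-cache paging in one sentence."]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Loads the model and runs batched generation on the available GPU(s).
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server (via `vllm serve <model>` in recent releases) for online serving.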
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
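As a rough illustration, a BentoML service can be defined with the decorator-based API introduced in the 1.2+ releases; this is only a sketch, and the class name, method name, and toy logic below are placeholders rather than anything from the repo.

```python
# Minimal BentoML service sketch (assumes the 1.2+ decorator-based API;
# class, method, and logic are placeholders for a real model).
import bentoml


@bentoml.service()
class TextUppercaser:
    @bentoml.api
    def predict(self, text: str) -> str:
        # Stand-in for real model inference; only demonstrates the API shape.
        return text.upper()
```

A service like this would typically be started locally with `bentoml serve`, which exposes `predict` as an HTTP endpoint.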
Standardized Serverless ML Inference Platform on Kubernetes
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
The simplest way to serve AI/ML models in production
Community-maintained hardware plugin for vLLM on Ascend
A high-performance ML model serving framework, offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
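Dynamic batching is the key idea behind frameworks like this: requests arriving within a short window are grouped into one batch so the model runs fewer, larger forward passes. The sketch below illustrates the technique generically in plain Python; it is not this framework's actual API, and the queue, timeout, and batch-size values are arbitrary.

```python
# Generic dynamic-batching sketch (illustrative only, not any framework's real API).
import queue
import threading
import time

MAX_BATCH_SIZE = 8       # upper bound on requests per forward pass (arbitrary)
MAX_WAIT_SECONDS = 0.01  # how long to wait for more requests to arrive (arbitrary)

request_queue: "queue.Queue[tuple[str, queue.Queue]]" = queue.Queue()


def model_forward(batch):
    # Stand-in for a real batched model call.
    return [item.upper() for item in batch]


def batching_loop():
    while True:
        # Block for the first request, then greedily collect more until the
        # batch is full or the wait budget is spent.
        first = request_queue.get()
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [inp for inp, _ in batch]
        for (_, reply), result in zip(batch, model_forward(inputs)):
            reply.put(result)


threading.Thread(target=batching_loop, daemon=True).start()


def handle_request(text: str) -> str:
    # Called once per incoming request; blocks until its batched result is ready.
    reply: queue.Queue = queue.Queue(maxsize=1)
    request_queue.put((text, reply))
    return reply.get()


if __name__ == "__main__":
    print(handle_request("hello"))
```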
Python + Inference: a model deployment library in Python, and the simplest model inference server ever.
Serverless LLM Serving for Everyone.
Take control of your context. Orchestrate LLMs through APIs or private deployments with context automation using your data. Run anywhere - local, cloud, or bare metal.
A FastAPI skeleton app for serving machine learning models, production-ready.
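For a rough idea of what such a skeleton looks like, here is a minimal FastAPI prediction endpoint; the route, request/response schema, and toy "model" are placeholder assumptions, not this repo's actual layout.

```python
# Minimal FastAPI model-serving sketch (endpoint, schema, and "model" are placeholders).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-serving-skeleton")


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    prediction: float


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # Stand-in for a real model: return the mean of the input features.
    value = sum(request.features) / max(len(request.features), 1)
    return PredictResponse(prediction=value)
```

Served locally with an ASGI server such as `uvicorn main:app`, the endpoint accepts JSON and returns a typed, validated response.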
Learn to serve Stable Diffusion models on cloud infrastructure at scale. This Lightning App shows load balancing, orchestration, pre-provisioning, dynamic batching, GPU inference, and micro-services working together via the Lightning Apps framework.
BentoDiffusion: A collection of diffusion models served with BentoML
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance in only 2k lines of code (2% of vLLM).
A multi-functional library for full-stack Deep Learning. Simplifies Model Building, API development, and Model Deployment.