- Full Stack Optimization of Transformer Inference: a Survey: https://arxiv.org/pdf/2302.14017.pdf
- Large Transformer Model Inference Optimization: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- High-throughput Generative Inference of Large Language Models with a Single GPU: https://arxiv.org/pdf/2303.06865.pdf
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: https://arxiv.org/pdf/2304.11267.pdf
- Up or Down? Adaptive Rounding for Post-Training Quantization: https://arxiv.org/pdf/2004.10568.pdf
- 8-bit Optimizers via Block-wise Quantization: https://arxiv.org/pdf/2110.02861.pdf
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: https://arxiv.org/pdf/2208.07339.pdf (see the quantization sketch after this list)
- ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware: https://proceedings.mlsys.org/paper/2022/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers: https://arxiv.org/pdf/2210.17323.pdf
- RPTQ: Reorder-based Post-training Quantization for Large Language Models: https://arxiv.org/pdf/2304.01089.pdf
- Training Compute-Optimal Large Language Models: https://arxiv.org/pdf/2203.15556.pdf
- Decentralized Training of Foundation Models in Heterogeneous Environments: https://arxiv.org/pdf/2206.01288.pdf
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/pdf/1910.02054.pdf
- Stable and low-precision training for large-scale vision-language models: https://arxiv.org/pdf/2304.13013.pdf
- Sparse is Enough in Scaling Transformers: https://proceedings.neurips.cc/paper/2021/file/51f15efdd170e6043fa02a74882f0470-Paper.pdf
- Scaling Transformer to 1M tokens and beyond with RMT: https://arxiv.org/pdf/2304.11062.pdf
- RAF: Holistic Compilation for Deep Learning Model Training: https://arxiv.org/pdf/2303.04759v1.pdf
- Graphene: An IR for Optimized Tensor Computations on GPUs: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs: https://arxiv.org/pdf/2210.09603.pdf
- LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/pdf/2106.09685.pdf (see the LoRA sketch after this list)
- JaxPruner: A Concise Library for Sparsity Research: https://arxiv.org/pdf/2304.14082.pdf
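For orientation on the quantization entries above (LLM.int8(), GPTQ, RPTQ), here is a minimal sketch of the symmetric per-tensor absmax round-to-nearest int8 baseline that those papers improve on. This is the textbook scheme, not any single paper's method, and the function names are illustrative:

```python
import torch

def absmax_quantize(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric round-to-nearest int8 quantization with one per-tensor absmax scale."""
    scale = x.abs().max().clamp_min(1e-8) / 127.0        # map largest magnitude to the int8 range
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def absmax_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a float approximation of the original tensor."""
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, s = absmax_quantize(w)
print((w - absmax_dequantize(q, s)).abs().max())         # worst-case quantization error
```

Roughly, LLM.int8() replaces the single per-tensor scale with per-vector scales and routes outlier feature dimensions through fp16, while GPTQ rounds weights by solving a layer-wise reconstruction problem instead of rounding to nearest.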
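And for the LoRA entry, a minimal sketch of the low-rank adaptation idea: freeze the pretrained weight W and train only an additive low-rank update BA, so the adapted layer computes Wx + (alpha/r)·BAx. The class and hyperparameter names here are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```

With rank 8 on a 768x768 layer, the trainable parameters drop from ~590K to ~12K, which is why LoRA is attractive for fine-tuning large models.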