- Full Stack Optimization of Transformer Inference: a Survey: https://arxiv.org/pdf/2302.14017.pdf
- Large Transformer Model Inference Optimization: https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- High-throughput Generative Inference of Large Language Models with a Single GPU: https://arxiv.org/pdf/2303.06865.pdf
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: https://arxiv.org/pdf/2304.11267.pdf
- Up or Down? Adaptive Rounding for Post-Training Quantization: https://arxiv.org/pdf/2004.10568.pdf
- 8-bit Optimizers via Block-wise Quantization: https://arxiv.org/pdf/2110.02861.pdf
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: https://arxiv.org/pdf/2208.07339.pdf (see the quantization sketch after this list)
- ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware: https://proceedings.mlsys.org/paper/2022/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers: https://arxiv.org/pdf/2210.17323.pdf
- RPTQ: Reorder-based Post-training Quantization for Large Language Models: https://arxiv.org/pdf/2304.01089.pdf
- Training Compute-Optimal Large Language Models: https://arxiv.org/pdf/2203.15556.pdf
- Decentralized Training of Foundation Models in Heterogeneous Environments: https://arxiv.org/pdf/2206.01288.pdf
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: https://arxiv.org/pdf/1910.02054.pdf
- Stable and low-precision training for large-scale vision-language models: https://arxiv.org/pdf/2304.13013.pdf
- Sparse is Enough in Scaling Transformers: https://proceedings.neurips.cc/paper/2021/file/51f15efdd170e6043fa02a74882f0470-Paper.pdf
- Scaling Transformer to 1M tokens and beyond with RMT: https://arxiv.org/pdf/2304.11062.pdf
- RAF: Holistic Compilation for Deep Learning Model Training: https://arxiv.org/pdf/2303.04759v1.pdf
- Graphene: An IR for Optimized Tensor Computations on GPUs: https://dl.acm.org/doi/pdf/10.1145/3582016.3582018
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs: https://arxiv.org/pdf/2210.09603.pdf
- LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/pdf/2106.09685.pdf (see the LoRA sketch after this list)
- JaxPruner: A Concise Library for Sparsity Research: https://arxiv.org/pdf/2304.14082.pdf
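For orientation on the quantization entries above (LLM.int8(), GPTQ, RPTQ), here is a minimal sketch of the symmetric per-tensor absmax round-to-nearest int8 baseline that those papers improve on. This is the textbook scheme, not any single paper's method, and the function names are illustrative:

```python
import torch

def absmax_quantize(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric round-to-nearest int8 quantization with one per-tensor absmax scale."""
    scale = x.abs().max().clamp_min(1e-8) / 127.0        # map largest magnitude to the int8 range
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def absmax_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a float approximation of the original tensor."""
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, s = absmax_quantize(w)
print((w - absmax_dequantize(q, s)).abs().max())         # worst-case quantization error
```

Roughly, LLM.int8() replaces the single per-tensor scale with per-vector scales and routes outlier feature dimensions through fp16, while GPTQ rounds weights by solving a layer-wise reconstruction problem instead of rounding to nearest.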
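And for the LoRA entry, a minimal sketch of the low-rank adaptation idea: freeze the pretrained weight W and train only an additive low-rank update BA, so the adapted layer computes Wx + (alpha/r)·BAx. The class and hyperparameter names here are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```

With rank 8 on a 768x768 layer, the trainable parameters drop from ~590K to ~12K, which is why LoRA is attractive for fine-tuning large models.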