- Introduction
- Performance and bandwidth
- Model parallelism
- Computational complexity of transformers
- Efficient transformers: Inference optimizations
- Efficient transformers: Architecture modifications
- Kernel programming
- Accelerators
- Conclusion
- Single Instruction/Multiple Data (SIMD) and GPUs
- FLOPs vs FMACs (see the worked example after this list)
- Data parallel vs model parallel vs tensor parallel
- SRAM vs DRAM
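As a worked example for the FLOPs vs FMACs bullet above: a dense (M, K) x (K, N) matmul performs M·N·K fused multiply-accumulates, which is conventionally counted as 2·M·N·K FLOPs (one multiply plus one add each). A minimal Python sketch; the shapes are illustrative only:

```python
def matmul_cost(m: int, k: int, n: int) -> dict:
    """Count the arithmetic in a dense (m, k) @ (k, n) matmul.

    Each output element needs k multiply-accumulates, so the full
    product is m * n * k FMACs; counting the multiply and the add
    separately gives 2 * m * n * k FLOPs.
    """
    fmacs = m * n * k
    flops = 2 * fmacs
    return {"FMACs": fmacs, "FLOPs": flops}

# Example: a 4096 x 4096 weight matrix applied to a batch of 8 tokens.
print(matmul_cost(m=8, k=4096, n=4096))
# {'FMACs': 134217728, 'FLOPs': 268435456}
```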
- Hooker, S. (2020). The hardware lottery.
- Sevilla, J. et al. (2022). Compute trends across three eras of machine learning.
- He, H. (2022). Making deep learning go brrrr from first principles.
- Geiping, J. & Goldstein, T. (2022). Cramming: Training a language model on a single GPU in one day.
- Spector, B. (2024). GPUs go brrr.
Roofline plots (see the sketch after this list):
- Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures.
- Chen, L. (2023). Dissecting batching effects in GPT inference.
- Chng, P. (2024). The naive roofline model in performance modeling.
- Kao, S.C. et al. (2022). FRAME: Fast Roofline Analytical Modeling and Estimation. https://github.com/maestro-project/frame
- Yuan, Z. et al. (2024). LLM inference unveiled: Survey and roofline model insights. https://arxiv.org/abs/2402.16363
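A hedged sketch of the naive roofline model the references above describe: attainable throughput is the minimum of peak compute and peak memory bandwidth times arithmetic intensity (FLOPs per byte moved). The peak numbers below are placeholders of roughly the right order of magnitude, not the specs of any particular GPU:

```python
def roofline_throughput(flops: float, bytes_moved: float,
                        peak_flops: float, peak_bw: float) -> dict:
    """Naive roofline: attainable FLOP/s is capped either by peak compute
    or by memory bandwidth times arithmetic intensity, whichever is lower."""
    intensity = flops / bytes_moved                 # FLOPs per byte of traffic
    attainable = min(peak_flops, peak_bw * intensity)
    regime = "compute-bound" if peak_bw * intensity >= peak_flops else "memory-bound"
    return {"intensity": intensity, "attainable_flops": attainable, "regime": regime}

# Illustrative peaks only.
PEAK_FLOPS = 300e12   # 300 TFLOP/s
PEAK_BW = 2e12        # 2 TB/s

# A GEMV-like op: ~2 FLOPs per byte of weights loaded -> memory-bound.
print(roofline_throughput(flops=2e9, bytes_moved=1e9,
                          peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW))
```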
- Model parallelism - HuggingFace
- Pipeline parallelism
- Tensor parallelism
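A minimal NumPy sketch of the tensor-parallel idea behind the links above: the weight matrix is split column-wise across devices, each device computes a partial output against its shard, and the shards are concatenated afterwards (the role an all-gather plays on real hardware). Plain arrays stand in for devices here; this is illustrative, not the API of any particular framework:

```python
import numpy as np

def tensor_parallel_matmul(x: np.ndarray, w: np.ndarray, n_devices: int) -> np.ndarray:
    """Column-wise tensor parallelism: each 'device' holds a slice of W's
    columns, computes x @ W_slice locally, and the partial outputs are
    concatenated along the feature dimension."""
    w_shards = np.split(w, n_devices, axis=1)        # one column block per device
    partial_outputs = [x @ shard for shard in w_shards]
    return np.concatenate(partial_outputs, axis=-1)

x = np.random.randn(4, 512)          # batch of 4 activations
w = np.random.randn(512, 2048)       # weight matrix, columns split 4 ways
assert np.allclose(tensor_parallel_matmul(x, w, n_devices=4), x @ w)
```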
- Chen, C. (2022). Transformer inference arithmetic.
- Bahdanau, D. (2022). The FLOPs calculus of language model training.
- Sanger, A. (2023). Inference characteristics of Llama-2.
- Shenoy, V. & Kiely, P. (2023). A guide to LLM inference and performance.
- Anthony, Q., Biderman, S., & Schoelkopf, H. (2023). Transformer math 101.
- Ouyang, A. (2023). Understanding the Performance of Transformer. (MS thesis)
- Casson, A. (2023). Transformer FLOPs.
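A back-of-the-envelope sketch of the standard estimates worked out in the references above (e.g. the FLOPs calculus and Transformer Math 101): training costs roughly 6 FLOPs per parameter per token (forward plus backward), and decoding roughly 2 FLOPs per parameter per generated token, ignoring the attention term. The example numbers are illustrative:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """~6 * N * D: roughly 2 FLOPs/param/token forward plus ~4 backward."""
    return 6.0 * n_params * n_tokens

def decode_flops_per_token(n_params: float) -> float:
    """~2 * N: each generated token touches every weight once (one FMAC each)."""
    return 2.0 * n_params

# Example: a 7B-parameter model trained on 1T tokens.
print(f"training: {training_flops(7e9, 1e12):.2e} FLOPs")          # ~4.2e22
print(f"decode:   {decode_flops_per_token(7e9):.2e} FLOPs/token")  # ~1.4e10
```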
- Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness.
- Pope, R. et al. (2022). Efficiently scaling transformer inference. - KV cache
- Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning.
- Kim, S. et al. (2023). Full stack optimization of transformer inference: A survey.
- PyTorch. (2023). Accelerating generative AI with PyTorch II: GPT, Fast.
- Nvidia. (2023). Mastering LLM techniques: Inference optimization.
- Weng, L. (2023). Large transformer model inference optimization.
- Kwon, W. et al. (2023). Efficient memory management for large language model serving with PagedAttention. (vLLM)
- Zhang, L. (2023). Dissecting the runtime performance of the training, fine-tuning, and inference of large language models.
- Fu, Y. (2023). Towards 100x speedup: Full stack transformer inference optimization.
- Fu, Y. (2024). Challenges in deploying long-context transformers: A theoretical peak performance analysis.
- Fu, Y. et al. (2024). Data engineering for scaling language models to 128K context.
- Chng, P. (2024). What is the transformer KV cache?
- Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.
- Shi, L. et al. (2024). Keep the cost down: A review on methods to optimize LLM's KV-cache consumption.
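A minimal sketch of the KV cache that several of the references above analyze (Pope et al., Chng): each layer stores one key and one value vector per KV head per token, so cache size grows linearly with batch size and sequence length. The configuration below is roughly Llama-2-7B-shaped and is used only for illustration:

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, each of shape
    (batch, seq_len, n_kv_heads, head_dim)."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Roughly Llama-2-7B-shaped: 32 layers, 32 KV heads of dim 128, fp16.
size = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.2f} GiB")   # ~2 GiB for a single 4k-token sequence
```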
- Shazeer, N. (2019). Fast transformer decoding: One write-head is all you need. - MQA
- Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). Efficient transformers: A survey.
- Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast inference from transformers via speculative decoding.
- Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. - GQA
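A NumPy shape-level sketch of the MQA/GQA idea from Shazeer (2019) and Ainslie et al. (2023): all query heads are kept, but K/V heads are shared within groups, shrinking the KV cache by a factor of n_heads / n_kv_heads. This illustrates the head sharing only, not either paper's training recipe:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one K/V head,
    so the KV cache stores only n_kv_heads heads instead of n_heads."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)        # broadcast shared K/V to every query head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                     # (n_heads, seq, d)

q = np.random.randn(8, 16, 64)   # 8 query heads
k = np.random.randn(2, 16, 64)   # only 2 KV heads (GQA); 1 would be MQA
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 16, 64)
```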
- Nvidia. (2024). CUDA C++ Programming Guide.
- Harris, M. (2017). Nvidia blog: An Even Easier Introduction to CUDA.
- Boehm, S. (2022). How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog.
- github.com/ANSANJAY/KernelDev101
- github.com/cupy/cupy
- github.com/NVIDIA/cuda-python
- github.com/NVIDIA/nvmath-python
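For the kernel-programming references above, a minimal CUDA kernel launched through CuPy's RawKernel interface (the cupy repo is listed above); the kernel body is the usual element-wise add from the introductory CUDA guides. The sketch assumes a CUDA-capable GPU and a working cupy install:

```python
import cupy as cp

# One thread per element; the global thread index is
# blockIdx.x * blockDim.x + threadIdx.x.
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void vector_add(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = x[i] + y[i];
    }
}
''', 'vector_add')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
add_kernel((blocks,), (threads,), (x, y, out, cp.int32(n)))   # (grid, block, args)

assert cp.allclose(out, x + y)
```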
- Volkov, V. (2016). Understanding Latency Hiding on GPUs. (PhD thesis)
- Dettmers, T. (2023). Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning.
- Nvidia. (2023). GPU performance background user's guide.
- tinybox: the "red" variant uses 6x AMD Radeon RX 7900 XTX GPUs
- Blaize: the first AI chip startup to go public, via a SPAC in January 2025
- Blaize. (2025). S-1 filing with the SEC. 2025/01/21.
- Cerebras. (2020). Fast stencil-code computation on a wafer-scale processor.
- Cerebras. (2021). The path to successful wafer-scale integration: The cerebras story.
- Cerebras. (2022). Wafer-scale fast fourier transforms.
- Cerebras. (2023). Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning.
- Cerebras. (2023). Training giant neural networks using weight streaming on cerebras wafer-scale systems.
- Cerebras. (2024). S-1 filing with the SEC. 2024/09/30.
- Furiosa. (2024). TCP: A Tensor Contraction Processor for AI workloads (industrial product).
- Groq. (2020). Think Fast: A Tensor Streaming Processor (TSP) for accelerating deep learning workloads.
- Linley Group. (2020). Groq rocks neural networks.
- Groq. (2022). A software-defined tensor streaming multiprocessor for large-scale machine learning.
- Groq. (2024). Optimized simulation methodology of warpage and localized stress hotspot prediction for assembly risk assessment.
- Rebellions. (2024). ATOM Architecture: Finding the Sweet Spot for GenAI.
- SambaNova. (2024). SambaNova SN40L: Scaling the AI memory wall with dataflow and composition of experts.
- SambaNova. (2024). Why SambaNova's SN40L chip is the best for inference.
- Thüning, M. (2024). Attention in SRAM on Tenstorrent Grayskull.
- Tenstorrent. (2024). Onepager with Wormhole and Grayskull.
- Tenstorrent. (2024). Wormhole Tensix Processor.
- Brown, N. & Barton, R. (2024). Accelerating stencils on the Tenstorrent Grayskull RISC-V accelerator.
Others:
- d-Matrix
- Etched
- Graphcore
- In July 2024, SoftBank Group agreed to acquire Graphcore for around $500 million; the deal is under review by the investment security unit of the UK's Department for Business. [Wikipedia]
- Lightmatter
- MatX
- Taalas
- Untether AI
TODO
- Up next: Misc
- Previous: Natural language