- Introduction
- Performance and bandwidth
- Model parallelism
- Computational complexity of transformers
- Efficient transformers: Inference optimizations
- Efficient transformers: Architecture modifications
- Kernel programming
- Accelerators
- Conclusion
- Single Instruction/Multiple Data (SIMD) and GPUs
- FLOPs vs FMACs (see the worked example after this list)
- Data parallel vs model parallel vs tensor parallel
- SRAM vs DRAM
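As a worked example for the FLOPs vs FMACs bullet above: a dense (M, K) x (K, N) matmul performs M·N·K fused multiply-accumulates, which is conventionally counted as 2·M·N·K FLOPs (one multiply plus one add each). A minimal Python sketch; the shapes are illustrative only:

```python
def matmul_cost(m: int, k: int, n: int) -> dict:
    """Count the arithmetic in a dense (m, k) @ (k, n) matmul.

    Each output element needs k multiply-accumulates, so the full
    product is m * n * k FMACs; counting the multiply and the add
    separately gives 2 * m * n * k FLOPs.
    """
    fmacs = m * n * k
    flops = 2 * fmacs
    return {"FMACs": fmacs, "FLOPs": flops}

# Example: a 4096 x 4096 weight matrix applied to a batch of 8 tokens.
print(matmul_cost(m=8, k=4096, n=4096))
# {'FMACs': 134217728, 'FLOPs': 268435456}
```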
- Hooker, S. (2020). The hardware lottery.
- Sevilla, J. et al. (2022). Compute trends across three eras of machine learning.
- He, H. (2022). Making deep learning go brrrr from first principles.
- Geiping, J. & Goldstein, T. (2022). Cramming: Training a language model on a single GPU in one day.
- Spector, B. (2024). GPUs go brrr.
Roofline plots (see the sketch after this list):
- Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures.
- Chen, L. (2023). Dissecting batching effects in GPT inference.
- Chng, P. (2024). The naive roofline model in performance modeling.
- Kao, S.C. et al. (2022). FRAME: Fast Roofline Analytical Modeling and Estimation. https://github.com/maestro-project/frame
- Yuan, Z. et al. (2024). LLM inference unveiled: Survey and roofline model insights. https://arxiv.org/abs/2402.16363
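A hedged sketch of the naive roofline model the references above describe: attainable throughput is the minimum of peak compute and peak memory bandwidth times arithmetic intensity (FLOPs per byte moved). The peak numbers below are placeholders of roughly the right order of magnitude, not the specs of any particular GPU:

```python
def roofline_throughput(flops: float, bytes_moved: float,
                        peak_flops: float, peak_bw: float) -> dict:
    """Naive roofline: attainable FLOP/s is capped either by peak compute
    or by memory bandwidth times arithmetic intensity, whichever is lower."""
    intensity = flops / bytes_moved                 # FLOPs per byte of traffic
    attainable = min(peak_flops, peak_bw * intensity)
    regime = "compute-bound" if peak_bw * intensity >= peak_flops else "memory-bound"
    return {"intensity": intensity, "attainable_flops": attainable, "regime": regime}

# Illustrative peaks only.
PEAK_FLOPS = 300e12   # 300 TFLOP/s
PEAK_BW = 2e12        # 2 TB/s

# A GEMV-like op: ~2 FLOPs per byte of weights loaded -> memory-bound.
print(roofline_throughput(flops=2e9, bytes_moved=1e9,
                          peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW))
```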
- Model parallelism - HuggingFace
- Pipeline parallelism
- Tensor parallelism
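A minimal NumPy sketch of the tensor-parallel idea behind the links above: the weight matrix is split column-wise across devices, each device computes a partial output against its shard, and the shards are concatenated afterwards (the role an all-gather plays on real hardware). Plain arrays stand in for devices here; this is illustrative, not the API of any particular framework:

```python
import numpy as np

def tensor_parallel_matmul(x: np.ndarray, w: np.ndarray, n_devices: int) -> np.ndarray:
    """Column-wise tensor parallelism: each 'device' holds a slice of W's
    columns, computes x @ W_slice locally, and the partial outputs are
    concatenated along the feature dimension."""
    w_shards = np.split(w, n_devices, axis=1)        # one column block per device
    partial_outputs = [x @ shard for shard in w_shards]
    return np.concatenate(partial_outputs, axis=-1)

x = np.random.randn(4, 512)          # batch of 4 activations
w = np.random.randn(512, 2048)       # weight matrix, columns split 4 ways
assert np.allclose(tensor_parallel_matmul(x, w, n_devices=4), x @ w)
```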
- Chen, C. (2022). Transformer inference arithmetic.
- Bahdanau, D. (2022). The FLOPs calculus of language model training.
- Sanger, A. (2023). Inference characteristics of Llama-2.
- Shenoy, V. & Kiely, P. (2023). A guide to LLM inference and performance.
- Anthony, Q., Biderman, S., & Schoelkopf, H. (2023). Transformer math 101.
- Ouyang, A. (2023). Understanding the Performance of Transformer. (MS thesis)
- Casson, A. (2023). Transformer FLOPs.
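A back-of-the-envelope sketch of the standard estimates worked out in the references above (e.g. the FLOPs calculus and Transformer Math 101): training costs roughly 6 FLOPs per parameter per token (forward plus backward), and decoding roughly 2 FLOPs per parameter per generated token, ignoring the attention term. The example numbers are illustrative:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """~6 * N * D: roughly 2 FLOPs/param/token forward plus ~4 backward."""
    return 6.0 * n_params * n_tokens

def decode_flops_per_token(n_params: float) -> float:
    """~2 * N: each generated token touches every weight once (one FMAC each)."""
    return 2.0 * n_params

# Example: a 7B-parameter model trained on 1T tokens.
print(f"training: {training_flops(7e9, 1e12):.2e} FLOPs")          # ~4.2e22
print(f"decode:   {decode_flops_per_token(7e9):.2e} FLOPs/token")  # ~1.4e10
```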
- Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness.
- Pope, R. et al. (2022). Efficiently scaling transformer inference. - KV cache
- Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning.
- Kim, S. et al. (2023). Full stack optimization of transformer inference: A survey.
- PyTorch. (2023). Accelerating generative AI with PyTorch II: GPT, Fast.
- Nvidia. (2023). Mastering LLM techniques: Inference optimization.
- Weng, L. (2023). Large transformer model inference optimization.
- Kwon, W. et al. (2023). Efficient memory management for large language model serving with PagedAttention. (vLLM)
- Zhang, L. (2023). Dissecting the runtime performance of the training, fine-tuning, and inference of large language models.
- Fu, Y. (2023). Towards 100x speedup: Full stack transformer inference optimization.
- Fu, Y. (2024). Challenges in deploying long-context transformers: A theoretical peak performance analysis.
- Fu, Y. et al. (2024). Data engineering for scaling language models to 128K context.
- Chng, P. (2024). What is the transformer KV cache?
- Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.
- Shi, L. et al. (2024). Keep the cost down: A review on methods to optimize LLM's KV-cache consumption.
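A minimal sketch of the KV cache that several of the references above analyze (Pope et al., Chng): each layer stores one key and one value vector per KV head per token, so cache size grows linearly with batch size and sequence length. The configuration below is roughly Llama-2-7B-shaped and is used only for illustration:

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, each of shape
    (batch, seq_len, n_kv_heads, head_dim)."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Roughly Llama-2-7B-shaped: 32 layers, 32 KV heads of dim 128, fp16.
size = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32,
                      n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.2f} GiB")   # ~2 GiB for a single 4k-token sequence
```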
- Shazeer, N. (2019). Fast transformer decoding: One write-head is all you need. - MQA
- Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). Efficient transformers: A survey.
- Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast inference from transformers via speculative decoding.
- Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. - GQA
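A NumPy shape-level sketch of the MQA/GQA idea from Shazeer (2019) and Ainslie et al. (2023): all query heads are kept, but K/V heads are shared within groups, shrinking the KV cache by a factor of n_heads / n_kv_heads. This illustrates the head sharing only, not either paper's training recipe:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_heads // n_kv_heads query heads shares one K/V head,
    so the KV cache stores only n_kv_heads heads instead of n_heads."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)        # broadcast shared K/V to every query head
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                     # (n_heads, seq, d)

q = np.random.randn(8, 16, 64)   # 8 query heads
k = np.random.randn(2, 16, 64)   # only 2 KV heads (GQA); 1 would be MQA
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 16, 64)
```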
- Nvidia. (2024). CUDA C++ Programming Guide.
- Harris, M. (2017). Nvidia blog: An Even Easier Introduction to CUDA.
- Boehm, S. (2022). How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog.
- github.com/ANSANJAY/KernelDev101
- github.com/cupy/cupy
- github.com/NVIDIA/cuda-python
- github.com/NVIDIA/nvmath-python
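For the kernel-programming references above, a minimal CUDA kernel launched through CuPy's RawKernel interface (the cupy repo is listed above); the kernel body is the usual element-wise add from the introductory CUDA guides. The sketch assumes a CUDA-capable GPU and a working cupy install:

```python
import cupy as cp

# One thread per element; the global thread index is
# blockIdx.x * blockDim.x + threadIdx.x.
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void vector_add(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = x[i] + y[i];
    }
}
''', 'vector_add')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
add_kernel((blocks,), (threads,), (x, y, out, cp.int32(n)))   # (grid, block, args)

assert cp.allclose(out, x + y)
```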
- Volkov, V. (2016). Understanding Latency Hiding on GPUs. (PhD thesis)
- Dettmers, T. (2023). Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning.
- Nvidia. (2023). GPU performance background user's guide.
- tinybox: the "red" variant uses 6x AMD Radeon RX 7900 XTX GPUs
- Blaize: the first AI chip startup to go public, via a SPAC in January 2025
- Blaize. (2025). S-1 filing with the SEC. 2025/01/21.
- Cerebras. (2020). Fast stencil-code computation on a wafer-scale processor.
- Cerebras. (2021). The path to successful wafer-scale integration: The cerebras story.
- Cerebras. (2022). Wafer-scale fast fourier transforms.
- Cerebras. (2023). Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning.
- Cerebras. (2023). Training giant neural networks using weight streaming on cerebras wafer-scale systems.
- Cerebras. (2024). S-1 filing with the SEC. 2024/09/30.
- Furiosa. (2024). TCP: A Tensor Contraction Processor for AI workloads (industrial product).
- Groq. (2020). Think Fast: A Tensor Streaming Processor (TSP) for accelerating deep learning workloads.
- Linley Group. (2020). Groq rocks neural networks.
- Groq. (2022). A software-defined tensor streaming multiprocessor for large-scale machine learning.
- Groq. (2024). Optimized simulation methodology of warpage and localized stress hotspot prediction for assembly risk assessment.
- Rebellions. (2024). ATOM Architecture: Finding the Sweet Spot for GenAI.
- SambaNova. (2024). SambaNova SN40L: Scaling the AI memory wall with dataflow and composition of experts.
- SambaNova. (2024). Why SambaNova's SN40L chip is the best for inference.
- Thüning, M. (2024). Attention in SRAM on Tenstorrent Grayskull.
- Tenstorrent. (2024). Onepager with Wormhole and Grayskull.
- Tenstorrent. (2024). Wormhole Tensix Processor.
- Brown, N. & Barton, R. (2024). Accelerating stencils on the Tenstorrent Grayskull RISC-V accelerator.
Others:
- d-Matrix
- Etched
- Graphcore
- In July 2024, SoftBank Group agreed to acquire Graphcore for around $500 million; the deal is under review by the investment security unit of the UK's Department for Business. [Wikipedia]
- Lightmatter
- MatX
- Taalas
- Untether AI
TODO
- Up next: Misc
- Previous: Natural language