Newton's method provides one of the earliest insights in optimization theory. Building on gradient descent, many optimization methods have been proposed, including momentum, adaptive learning rates, sign-of-gradient updates, second-order optimization, variance reduction, and schedule-free optimization. However, there is no comprehensive and clear summary of these approaches under a unified notation system. This paper attempts to give a systematic, explicit, and concise formulation of many of these methods, with citations. I hope it can promote innovation in optimization theory for deep learning while helping researchers find relevant references.
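To make the contrast between the method families concrete, here is a minimal, hypothetical NumPy sketch of two of the update rules mentioned above (plain gradient descent and heavy-ball momentum) on a toy quadratic. The problem data `A`, `b` and the hyperparameters `lr`, `mu` are made up for illustration only and are not taken from the summary sheet.

```python
import numpy as np

def quadratic_grad(theta, A, b):
    """Gradient of f(theta) = 0.5 * theta^T A theta - b^T theta."""
    return A @ theta - b

# Toy problem (hypothetical values, for illustration only)
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -1.0])
lr, mu = 0.1, 0.9

theta_gd = np.zeros(2)
theta_mom = np.zeros(2)
velocity = np.zeros(2)

for _ in range(200):
    # Plain gradient descent: theta <- theta - lr * grad
    theta_gd -= lr * quadratic_grad(theta_gd, A, b)

    # Heavy-ball momentum: accumulate an exponentially weighted velocity,
    # then step along it
    velocity = mu * velocity + quadratic_grad(theta_mom, A, b)
    theta_mom -= lr * velocity

print("Gradient descent :", theta_gd)
print("Momentum         :", theta_mom)
print("Closed-form      :", np.linalg.solve(A, b))
```

Both iterates approach the closed-form minimizer; the momentum variant illustrates how adding a velocity term changes the update while leaving the gradient computation untouched.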
You can also read this as:
- A brief history of variance-reduction optimization methods (some optimizers are omitted).
- A brief history of large language models, since some of these optimizers are used in training well-known large language models. The base figure is from https://medium.com/@lmpo/a-brief-history-of-lmms-from-transformers-2017-to-deepseek-r1-2025-dae75dd3f59a
Due to time constraints, I may have missed some excellent works on optimization in deep learning. I would be very glad to hear from others and to add more interesting works to this sheet.
The BibTeX file is uploaded in this repository.
References and citations are welcome!
@misc{liu2024summary,
  author       = {Yifeng Liu},
  title        = {A Summary Sheet of Optimization in Deep Learning},
  howpublished = {\url{https://github.com/lauyikfung/A-Summary-Sheet-of-Optimization-in-Deep-Learning/A_Summary_Sheet_of_Optimization_in_Deep_Learning.pdf}},
  month        = {Dec},
  year         = {2024},
}