Selecting the Optimizer
The following blog post by Sebastian Ruder gives a detailed introduction to the existing optimizers and their specifics.
An overview of gradient descent optimization algorithms
Sebastian Ruder
The blog post pasted below was published in The Batch on September 9, 2020. It gives a quick overview of the most important current optimizers and their empirical differences, based on a paper by Robin Schmidt and colleagues from the University of Tübingen.

Optimizer Shootout

14 most popular optimizers according to arXiv mentions
Everyone has a favorite optimization method, but it’s not always clear which one works best in a given situation. New research aims to establish a set of benchmarks.
What’s new: Robin Schmidt and colleagues at University of Tübingen evaluated 14 popular optimizers using the Deep Optimization Benchmark Suite some of them introduced last year.
Key insight: Choosing an optimizer is something of a dark art. Testing the most popular ones in several common tasks is a first step toward setting baselines for comparison.
How it works: The authors evaluated methods including AMSGrad, AdaGrad, Adam (see Andrew’s video on the topic), RMSProp (video), and stochastic gradient descent. Their selection was based on the number of mentions a given optimizer received in the abstracts of preprints.
  • The authors tested each optimization method on eight deep learning problems consisting of a dataset (image or text), standard architecture, and loss function. The problems include both generative and classification tasks.
  • They used the initial hyperparameter values proposed by each optimizer’s original authors. They also searched 25 and 50 randomly sampled hyperparameter settings to probe each one’s robustness.
  • They applied four different learning rate schedules including constant value, smooth decay, cyclical values, and a trapezoidal method (in which the learning rate increased linearly at the beginning, maintained its value, and decreased linearly at the very end).
  • Each experiment was performed using 10 different initializations in case a given initialization degraded performance.
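The four learning rate schedules listed above can be sketched in plain Python. This is only an illustration of the general shapes, not the code used in the paper: the function names, the cosine form chosen for "smooth decay", the triangular form chosen for "cyclical values", and all default parameters (`min_lr_factor`, `cycle_len`, `warmup_steps`, `cooldown_steps`) are assumptions made for this example.

```python
import math


def constant(step, total_steps, base_lr):
    """Constant schedule: the learning rate never changes."""
    return base_lr


def cosine_decay(step, total_steps, base_lr):
    """Smooth decay (assumed cosine-shaped): base_lr at step 0, 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def cyclical(step, total_steps, base_lr, min_lr_factor=0.1, cycle_len=10):
    """Cyclical schedule (assumed triangular wave between min_lr and base_lr)."""
    min_lr = base_lr * min_lr_factor
    pos = (step % cycle_len) / cycle_len      # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)          # 0 at cycle start, 1 at mid-cycle
    return min_lr + (base_lr - min_lr) * tri


def trapezoidal(step, total_steps, base_lr, warmup_steps=10, cooldown_steps=10):
    """Trapezoidal schedule: linear increase, plateau, linear decrease at the end."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - cooldown_steps:
        return base_lr * (total_steps - step) / cooldown_steps
    return base_lr


if __name__ == "__main__":
    # Print a few sample points of each schedule over a 100-step run.
    schedules = [constant, cosine_decay, cyclical, trapezoidal]
    for schedule in schedules:
        values = [round(schedule(s, 100, 0.1), 4) for s in (0, 5, 50, 95, 99)]
        print(f"{schedule.__name__:>13}: {values}")
```

All four functions share the signature `(step, total_steps, base_lr)`, so a training loop can swap one schedule for another without further changes.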
Results: No particular method yielded the best performance in all problems, but several popular ones worked well on the majority of problems. (These included Adam, giving weight to the common advice to use it as a default choice.) No particular hyperparameter search or learning rate schedule proved universally superior, but hyperparameter search raised median performance among all optimizers on every task.
Why it matters: Optimizers are so numerous that it’s impossible to compare them all, and differences among models and datasets are bound to introduce confounding variables. Rather than relying on a few personal favorites, machine learning engineers can use this work to get an objective read on the options.
We’re thinking: That’s 14 optimizers down and hundreds to go! The code is open source, so in time we may get to the rest.