Adam solvers are the hassle-free standard for optimizers.
Empirically, Adam solvers converge faster and are more robust to hyper-parameter settings than SGD. However, they generalize slightly worse. So a good approach can be to start with Adam and, if you struggle to get good results, switch to SGD, which is more costly in training time and tuning effort.
Most relevant hyper-parameters:
Hyper-parameter tuning usually yields 1-3% marginal gains in performance. Fixing your data is usually more effective.
The intuition behind Adam solvers is similar to the one behind SGD. The main difference is that Adam solvers are adaptive optimizers: Adam also adjusts the learning rate of each parameter based on the magnitude of its gradients, using Root Mean Square Propagation (RMSProp). This follows a similar logic to using momentum and dampening with SGD. It makes Adam robust in the non-convex optimization landscapes of neural networks.
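To make the adaptive part concrete, here is a minimal sketch of a single Adam update for one scalar parameter, written in plain Python. The function names `adam_step` and `minimize`, the toy objective, and the chosen learning rate are assumptions for illustration only; the default values for `beta1`, `beta2` and `eps` are the commonly used ones, not something prescribed by the text above.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    m: running average of gradients (momentum-like term)
    v: running average of squared gradients (RMSProp-like term)
    t: 1-based step counter, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Both averages start at zero, so early values are biased towards zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Dividing by sqrt(v_hat) scales the step by the recent gradient magnitude.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimize(grad_fn, x0, steps=2000, lr=0.01):
    """Run Adam on a one-dimensional problem (hypothetical helper)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, grad_fn(x), m, v, t, lr=lr)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); the minimum is at x = 3.
x_min = minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Note that on the very first step the update has a size of roughly `lr` whether the gradient is 6 or 600, because the gradient is divided by its own magnitude. This is one way to see why Adam tends to be less sensitive to the learning-rate setting than plain SGD.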