While the Adam optimizer, which made use of momentum as well as the RMS prop, was efficient in adjusting the learning rates and finding the optimal solution, we have found that certain convergence issues with it. Research has been able to show that there are simple one dimensional convex functions for which the Adam is not able to converge.

AMSgrad tackles this convergence issue.

Adam makes use of adaptive gradient and updates the parameters separately. The updates might increase or decrease depending upon the calculated exponential moving average of the gradients. But sometimes, the updates of the the parameters is large which doesn't result in convergence. Citing the paper "On The Convergence of ADAM and beyond",

The key difference between AMSGRAD and ADAM is that it maintains the maximum of all $\$v\_t\$$ until the present time step and uses this maximum value for normalizing the running average of the gradient instead of $\$v\_t\$$ in ADAM

Hence, the difference between the AMSgrad and Adam is the calculated second moment vector which is used to update the parameters. To put it simply, AMSgrad uses the maximum second moment up until the $\u20ac\u20aci^\{th\}\u20ac\u20ac$ iteration to update the parameters.

Performance of ADAM and AMSgrad has been presented on a synthetic function to outline the convergence problem of the ADAM. Source

```
import torch
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,amsgrad=true)
for t in range(500):
y_pred = model(x)
loss = loss_fn(y_pred, y)
print(t, loss.item())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```