Adam can be understood as scaling each weight update inversely proportional to the (scaled) L2 norm of past gradients, accumulated through their squares. AdaMax extends this to the infinity norm (the maximum) of past gradients.

Norms with a large finite $p$ tend to be numerically unstable, but the infinity norm behaves stably. Mathematically, the infinity-norm update can be written as,

$u_t=\beta_2^\infty v_{t-1} + \left(1-\beta_2^\infty\right)|g_t|^\infty=\max\left(\beta_2 \cdot v_{t-1},|g_t|\right)$
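The limit behind this formula can be checked numerically. The sketch below is only an illustration: the helper names `p_norm_update` and `inf_norm_update` are hypothetical, and $\beta_2 = 0.9$ is chosen (instead of the usual 0.999) so that the convergence is visible at moderate $p$ without floating-point underflow.

```python
def p_norm_update(prev, grad, beta2, p):
    """Exponentially weighted p-norm of the previous value and the new gradient."""
    weighted = beta2 ** p * abs(prev) ** p + (1.0 - beta2 ** p) * abs(grad) ** p
    return weighted ** (1.0 / p)


def inf_norm_update(prev, grad, beta2):
    """Limit of p_norm_update as p grows: a decayed running maximum."""
    return max(beta2 * abs(prev), abs(grad))


beta2 = 0.9  # assumption for the demo; smaller than the usual 0.999
for prev, grad in [(0.5, 0.8), (0.5, 0.3)]:
    for p in (2, 20, 200):
        # The p-norm value moves towards the max form as p increases.
        print(prev, grad, p, p_norm_update(prev, grad, beta2, p))
    print("max form:", inf_norm_update(prev, grad, beta2))
```

As $p$ increases, the printed values approach the result of the `max` form, which is the quantity AdaMax actually stores.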

The infinity-norm update is therefore equivalent to taking the maximum of the decayed previous value and the magnitude of the current gradient. The parameter update is then similar to Adam's,

$\theta_{t+1}=\theta_t- \eta \cdot \frac{m_t}{u_t}$

Here, $\eta$ is the base learning rate and $m_t$ is the momentum term (the exponential moving average of the gradients), as discussed for Adam.
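To make the update rule concrete, here is a minimal sketch in plain Python for a single scalar parameter. The function name `adamax_step`, the small `eps` guard against division by zero, and the toy objective are assumptions for illustration. The sketch follows the formula as written above; the original paper and library implementations additionally bias-correct $m_t$, which is left out here.

```python
def adamax_step(theta, grad, m, u, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax-style update for a scalar parameter (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * grad   # moving average of the gradients
    u = max(beta2 * u, abs(grad))          # exponentially weighted infinity norm
    theta = theta - lr * m / (u + eps)     # eps is an assumption to avoid 0/0
    return theta, m, u


# Toy problem: minimise f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta, m, u = 0.0, 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (theta - 3.0)
    theta, m, u = adamax_step(theta, grad, m, u, lr=0.01)

print("theta after 2000 steps:", theta)
```

Because $|m_t| \le u_t$ in this setting, each step is bounded in magnitude by roughly the learning rate, which is one reason the step size is easy to reason about.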

AdaMax therefore maintains two quantities: the exponential moving average of the gradients and the exponentially weighted infinity norm. They are given by,

$V_{dw}=\beta_1 \cdot V_{dw}+\left(1-\beta_1\right)\cdot \partial w$

$u_t=\beta_2^\infty v_{t-1} + \left(1-\beta_2^\infty\right)|g_t|^\infty$

Here $\beta_1$ and $\beta_2$ are the exponential decay rates of the first moment ($V_{dw}$, written $m_t$ above) and of the exponentially weighted infinity norm ($u_t$), respectively.
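The role of the two decay rates can be seen by feeding a single gradient spike followed by zeros through both recurrences. This is a rough sketch; `moment_traces` is a hypothetical helper, and $\beta_2 = 0.9$ is used only so the decay is visible within a few steps.

```python
def moment_traces(grads, beta1=0.9, beta2=0.999):
    """Return the per-step first moment and infinity-norm values (illustrative)."""
    m, u = 0.0, 0.0
    m_trace, u_trace = [], []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g  # decays at rate beta1 once g is zero
        u = max(beta2 * u, abs(g))         # decays at rate beta2 once g is small
        m_trace.append(m)
        u_trace.append(u)
    return m_trace, u_trace


# One spike, then zero gradients (beta2 lowered for the demo only).
m_trace, u_trace = moment_traces([1.0, 0.0, 0.0, 0.0], beta1=0.9, beta2=0.9)
print("m:", m_trace)
print("u:", u_trace)
```

After the spike, $m_t$ shrinks by a factor of $\beta_1$ per step and $u_t$ by a factor of $\beta_2$ per step, so a $\beta_2$ close to 1 makes the optimizer remember large past gradients for longer.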

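```python
# Import the libraries.
import torch
import torch.nn as nn

# Random inputs and targets.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build MSE loss function and optimizer.
# The learning rate is left at the library default.
criterion = nn.MSELoss()
optimizer = torch.optim.Adamax(linear.parameters(),
                               betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear old gradients, compute new ones, then apply one AdaMax update.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```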