Adam can be understood as scaling each weight's update inversely proportionally to a (scaled) L2 norm of its current and past gradients. AdaMax extends this idea to the infinity norm (max) of past gradients.
The calculation of the infinity norm exhibits stable behavior. Mathematically, replacing Adam's squared-gradient (L2) accumulator with its p → ∞ limit, i.e. the infinity norm, gives

u_t = max(β₂ · u_{t-1}, |g_t|)
This is equivalent to keeping an exponentially decayed running maximum of the absolute gradients observed up to step t. The update is then similar to Adam's:

θ_{t+1} = θ_t - (η / u_t) · m̂_t,   with m̂_t = m_t / (1 - β₁^t)

Here m_t is the usual Adam first-moment estimate; unlike Adam's second moment, u_t needs no bias correction.
Here β₁ and β₂ are the exponential decay rates of the first-moment estimate and of the exponentially weighted infinity norm, respectively.
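To make the update concrete, below is a minimal NumPy sketch of a single AdaMax step for one parameter array. The function adamax_update and its plain-array interface are illustrative only, not a library API; a small eps is added to the gradient magnitude (as PyTorch does) to avoid division by zero.

python
import numpy as np

def adamax_update(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # First-moment estimate, exactly as in Adam.
    m = beta1 * m + (1 - beta1) * grad
    # Exponentially weighted infinity norm: decayed running max of |grad|.
    u = np.maximum(beta2 * u, np.abs(grad) + eps)
    # Bias-correct the first moment and take the step.
    theta = theta - (lr / (1 - beta1 ** t)) * m / u
    return theta, m, u

# Example usage with arbitrary values (t counts steps starting from 1):
theta, m, u = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
theta, m, u = adamax_update(theta, grad, m, u, t=1)

The same optimizer is available out of the box in PyTorch as torch.optim.Adamax, used below on a small linear-regression example.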
python
# importing the library
import torch
import torch.nn as nn
x = torch.randn(10, 3)
y = torch.randn(10, 2)
# Build a fully connected layer.
linear = nn.Linear(3, 2)
# Build MSE loss function and optimizer.
criterion = nn.MSELoss()
# Optimization method using Adamax
optimizer = torch.optim.Adamax(linear.parameters(), lr=0.002,
betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
# Forward pass.
pred = linear(x)
# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())
# Backpropagate and take one AdaMax step.
loss.backward()
optimizer.step()
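Reusing the objects defined above, the single step extends naturally to a small training loop; the 100 iterations below are an arbitrary choice for illustration.

python
# Run a few optimization steps (100 iterations is an arbitrary illustrative choice).
for step in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    pred = linear(x)             # forward pass
    loss = criterion(pred, y)    # compute MSE loss
    loss.backward()              # backpropagate
    optimizer.step()             # AdaMax parameter update

print('final loss:', loss.item())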