Adagrad

Adagrad, short for adaptive gradient, is a gradient-based optimizer that automatically tunes its learning rate during training. The learning rate is adapted per parameter, i.e. each parameter has its own learning rate.

Parameters associated with frequently occurring features receive small updates (low learning rate), while parameters associated with rarely occurring features receive larger updates (high learning rate).

Due to this, Adagrad is a suitable solver for sparse data.
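
To make the frequent-versus-rare behaviour concrete, here is a minimal sketch in plain Python. It compares the effective learning rate, i.e. the base learning rate divided by the square root of the accumulated squared gradients (the update rule formalised later in this section), for two hypothetical features. The gradient values, the number of steps, and the helper name `effective_lr` are illustrative assumptions, not part of any library.

```python
import math

eta = 0.01   # base learning rate (illustrative value)
eps = 1e-8   # small constant for numerical stability


def effective_lr(grad_history):
    """Return eta / (sqrt(sum of squared gradients) + eps)."""
    accumulated = sum(g * g for g in grad_history)
    return eta / (math.sqrt(accumulated) + eps)


steps = 100
# Frequent feature: a gradient of 1.0 at every step.
frequent = [1.0] * steps
# Rare feature: a gradient of 1.0 only at every 10th step, 0.0 otherwise.
rare = [1.0 if t % 10 == 0 else 0.0 for t in range(steps)]

lr_frequent = effective_lr(frequent)
lr_rare = effective_lr(rare)
print(f"frequent feature: {lr_frequent:.6f}")
print(f"rare feature:     {lr_rare:.6f}")
```

Under these assumptions the rare feature accumulates ten times less squared gradient, so its effective learning rate ends up roughly $\sqrt{10} \approx 3.16$ times larger than that of the frequent feature.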

Mathematically, Adagrad can be formulated as follows:

$$g_{t,i} = \nabla J(\theta_{t,i})$$

Here, $g_{t,i}$ is the gradient of the objective function $J$ with respect to the parameter $\theta_i$ at time step $t$.

The parameter is updated as follows,

$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot \frac{g_{t,i}}{\sqrt{G_{t,ii}} + \epsilon}$$

Here, $\theta_{t,i}$ is the parameter to be updated, and $G_{t,ii}$ is the sum of the squares of all gradients with respect to $\theta_i$ up to time step $t$. The learning rate is therefore scaled according to the previously encountered gradients. $\eta$ is the base learning rate.

The base learning rate is usually initialized to 0.01.

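$\epsilon$ is a small constant added for numerical stability. A value of $10^{-8}$ is a common choice, although the exact default depends on the library (PyTorch's `torch.optim.Adagrad` uses $10^{-10}$).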
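
The update rule above can be sketched from scratch in plain Python. This is an illustrative implementation under stated assumptions, not the code of any particular library: the function name `adagrad_step`, the toy objective, and the choice of $\eta = 0.1$ with 500 iterations are all made up for the example.

```python
import math


def adagrad_step(theta, grads, accum, eta=0.01, eps=1e-8):
    """Apply one Adagrad update in place and return theta.

    theta, grads and accum are lists of floats of equal length;
    accum[i] plays the role of G_{t,ii}.
    """
    for i, g in enumerate(grads):
        # Accumulate the squared gradient of parameter i.
        accum[i] += g * g
        # Scale the base learning rate by the accumulated history.
        theta[i] -= eta * g / (math.sqrt(accum[i]) + eps)
    return theta


def grad_J(theta):
    """Gradient of the toy objective J = theta_0^2 + 10 * theta_1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta = [1.0, 1.0]   # initial parameters
accum = [0.0, 0.0]   # accumulated squared gradients, one per parameter

for _ in range(500):
    adagrad_step(theta, grad_J(theta), accum, eta=0.1)

print(theta)  # both entries should end up close to the minimum at 0
```

Although the gradient along the second coordinate is ten times larger, each coordinate is normalised by its own gradient history, so in this toy setup both parameters shrink towards zero at nearly the same pace. That is the per-parameter adaptation described above.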

Major Parameters

Learning Rate Decay

Learning rate decay is a technique where a large learning rate is used at the beginning of training and is then reduced by a certain factor after a predefined number of epochs. A higher learning rate decay means that the learning rate shrinks faster as training progresses.

Setting a learning rate decay can slow down training, since the learning rate is reduced over time.
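
As a rough illustration of how a decay value changes the base learning rate, the sketch below assumes an inverse-time schedule of the form `lr / (1 + (step - 1) * lr_decay)`. To our understanding this matches how the `lr_decay` argument of `torch.optim.Adagrad` behaves, applied per optimizer step rather than per epoch, but treat that as an assumption and check the library documentation. The helper `decayed_lr` is hypothetical.

```python
def decayed_lr(base_lr, lr_decay, step):
    """Base learning rate in effect at a 1-indexed optimizer step.

    Assumes an inverse-time schedule: base_lr / (1 + (step - 1) * lr_decay).
    """
    return base_lr / (1.0 + (step - 1) * lr_decay)


base_lr = 0.01
for lr_decay in (0.0, 0.01, 0.1):
    # Learning rate at steps 1, 10 and 100 for this decay value.
    schedule = [decayed_lr(base_lr, lr_decay, s) for s in (1, 10, 100)]
    print(lr_decay, [round(lr, 6) for lr in schedule])
```

With `lr_decay=0` the base learning rate stays constant, while larger values shrink it faster. This decay is applied on top of the per-parameter scaling that Adagrad already performs.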

Code Implementation

  

```python
# Import the libraries.
import torch
import torch.nn as nn

# Random inputs and targets.
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)

# Build the MSE loss function.
criterion = nn.MSELoss()

# Optimization method using Adagrad.
optimizer = torch.optim.Adagrad(
    linear.parameters(), lr=0.01, lr_decay=0, weight_decay=0, eps=1e-10
)

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('loss:', loss.item())

# Backward pass: clear stale gradients, then compute new ones.
optimizer.zero_grad()
loss.backward()

# Update the parameters with one Adagrad step.
optimizer.step()
```

