# Epsilon Coefficient

We will use the Adam optimizer to briefly explain the epsilon coefficient. In Adam, the first and second moments are calculated as:

$V_{dw}=\beta_1 \cdot V_{dw}+\left(1-\beta_1\right)\cdot \partial w$

$S_{dw}=\beta_2 \cdot S_{dw}+\left(1-\beta_2\right)\cdot \partial w^2$

$\partial w$ is the derivative of the loss function with respect to a parameter.

$V_{dw}$ is the exponentially decaying average of past gradients (the momentum term) and $S_{dw}$ is the exponentially decaying average of past squared gradients.
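As a minimal sketch of these two updates (using illustrative scalar names `grad`, `v_dw`, `s_dw` and the usual default decay rates, none of which come from the original text):

```python
beta1, beta2 = 0.9, 0.999   # typical default decay rates

v_dw, s_dw = 0.0, 0.0       # both moments start at zero
grad = 0.05                 # example gradient dL/dw for a single parameter

v_dw = beta1 * v_dw + (1 - beta1) * grad        # first moment (momentum term)
s_dw = beta2 * s_dw + (1 - beta2) * grad ** 2   # second moment (squared gradients)
```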

Because both moments are initialized at zero, they are biased toward zero during the first iterations, so they are bias-corrected before being used:

$V_{dw}^{corrected}=\frac{V_{dw}}{1-\beta_1^t}, \qquad S_{dw}^{corrected}=\frac{S_{dw}}{1-\beta_2^t}$

The parameters are then updated as follows:

$\theta_{k+1}=\theta_k-\eta \cdot \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}}+\epsilon}$

The $\epsilon$ in the update above is the epsilon coefficient.

Note that when the bias-corrected $S_{dw}$ is close to zero, the denominator becomes vanishingly small (and the division is undefined when it is exactly zero), so the update can blow up. To rectify this, a small epsilon is added to the denominator to keep the update numerically stable.
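To make the role of epsilon concrete, here is a from-scratch sketch of a few Adam steps for a single scalar parameter; the variable names and values are illustrative, not taken from any particular library:

```python
import math

beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-08

theta = 1.0            # parameter value
v_dw, s_dw = 0.0, 0.0  # first and second moments
grad = 0.05            # example gradient dL/d(theta)

for t in range(1, 4):  # a few illustrative steps
    v_dw = beta1 * v_dw + (1 - beta1) * grad
    s_dw = beta2 * s_dw + (1 - beta2) * grad ** 2

    # Bias correction: early on, v_dw and s_dw are biased toward zero.
    v_corr = v_dw / (1 - beta1 ** t)
    s_corr = s_dw / (1 - beta2 ** t)

    # Epsilon keeps the denominator away from zero when s_corr is tiny.
    theta -= eta * v_corr / (math.sqrt(s_corr) + eps)
    print(t, theta)
```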

The standard default value of epsilon is 1e-08 (this is, for example, the default `eps` in PyTorch's `torch.optim.Adam`).
```python
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
# Note that we set the epsilon coefficient explicitly to 1e-08 and enable amsgrad.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,
                             eps=1e-08, amsgrad=True)

for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model).
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters.
    optimizer.step()
```
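In `torch.optim.Adam`, the epsilon coefficient is exposed as the `eps` argument (default `1e-08`), alongside `amsgrad` (default `False`); the optimizer constructed above simply makes these values explicit.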