ASGD
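Averaged Stochastic Gradient Descent, abbreviated as ASGD, averages the weights that are calculated in every iteration. Plain SGD uses the update rule

$$w_{t+1} = w_t - \eta \nabla Q(w_t)$$

where $w_t$ is the weight tensor, $\eta$ is the base learning rate and $\nabla Q(w_t)$ is the gradient of the objective function $Q$ evaluated at $w_t$.
With this update rule, SGD assigns the most recently calculated weight to the model. ASGD instead assigns the following averaged weight,

$$\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$$

where $w_t$ is the weight tensor calculated in iteration $t$ and $T$ is the number of iterations included in the average.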
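The difference between the last SGD weight and the averaged weight can be illustrated with a small sketch. The objective, the noise model and the helper names (`noisy_grad`, `sgd_with_average`) are assumptions made for illustration only; they are not part of any particular library.

```python
import random


def noisy_grad(w, rng, target=3.0, noise=1.0):
    """Gradient of the toy objective 0.5 * (w - target)**2 plus Gaussian noise."""
    return (w - target) + rng.gauss(0.0, noise)


def sgd_with_average(steps=2000, lr=0.1, seed=0):
    """Run plain SGD and keep a running mean of every iterate."""
    rng = random.Random(seed)
    w = 0.0      # current SGD weight
    w_bar = 0.0  # running average of all weights seen so far
    for t in range(1, steps + 1):
        w = w - lr * noisy_grad(w, rng)  # SGD update rule
        w_bar += (w - w_bar) / t         # incremental form of the arithmetic mean
    return w, w_bar


w_last, w_avg = sgd_with_average()
print(f"last weight: {w_last:.3f}, averaged weight: {w_avg:.3f}")
```

The last weight keeps fluctuating around the minimum because of the gradient noise, while the averaged weight smooths these fluctuations out. The incremental mean avoids storing every past weight.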
Major Parameters
- Base Learning Rate
- Weight Decay
- Lambda
- Alpha
- T0
Lambda
It is the decay term for the past weights used in the average.
Alpha
It is the exponent (power value) used to update the learning rate.
T0
It is the optimization step at which averaging starts. If the total number of iterations is lower than the T0 value, averaging never takes place.
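How these parameters might interact in one training loop can be sketched as follows. The schedules used here (a learning rate decayed with the power `alpha`, and an averaging coefficient that stays at 1 until the start step `t0` has passed) are an assumption, loosely modeled on the formulation used by PyTorch's `torch.optim.ASGD`. The names `asgd_schedule` and `run_asgd` are hypothetical, and the code works on a single scalar weight for brevity.

```python
def asgd_schedule(step, lr=0.01, lambd=1e-4, alpha=0.75, t0=3):
    """Return (eta, mu) after `step` updates under the assumed schedules."""
    # Learning rate: base rate decayed with exponent alpha.
    eta = lr / (1.0 + lambd * lr * step) ** alpha
    # Averaging coefficient: equal to 1 (no averaging) until step exceeds t0 + 1.
    mu = 1.0 / max(1, step - t0)
    return eta, mu


def run_asgd(grad_fn, w0, steps, lr=0.01, lambd=1e-4, alpha=0.75, t0=3,
             weight_decay=0.0):
    """Run the assumed ASGD loop; return (last weight, averaged weight)."""
    w, ax = w0, w0
    eta, mu = lr, 1.0
    for step in range(1, steps + 1):
        g = grad_fn(w) + weight_decay * w      # L2 weight decay folded into the gradient
        w = w * (1.0 - lambd * eta) - eta * g  # decayed SGD step
        if mu != 1.0:
            ax = ax + mu * (w - ax)            # move the average towards the new weight
        else:
            ax = w                             # averaging not started: track the weight
        eta, mu = asgd_schedule(step, lr, lambd, alpha, t0)
    return w, ax


# Toy gradient of 0.5 * (w - 3)**2, chosen only for illustration.
w, ax = run_asgd(lambda w: w - 3.0, w0=0.0, steps=50, lr=0.1, t0=3)
print(f"last weight: {w:.4f}, averaged weight: {ax:.4f}")
```

In this sketch, a run that stops before the start step is reached returns an averaged weight identical to the last weight, which matches the behaviour described above.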