- Getting started
- Introduction to the Wiki
- Overview of topics
- How to contribute
- General best practices
- Key principles of Computer Vision
- Convolution
- Advanced convolution techniques and layers
- Pooling
- Overfitting
- Underfitting
- Overfitting Vs. Underfitting in Machine Learning
- Upsampling and Downsampling techniques in Machine Learning
- Computer Vision tasks
- The complete glossary of the modern Computer Vision tasks
- Classification / Tagging
- Object Detection
- Semantic Segmentation
- Instance Segmentation
- Panoptic Segmentation
- Attribute Prediction
- Computer Vision model architectures
- ResNet
- Faster R-CNN
- Mask R-CNN
- DeepLabv3+
- U-Net
- FBNetV3
- U-Net++
- Efficient Net
- PAN
- PSPNet
- LinkNet
- FPN
- RetinaNet
- Cascade R-CNN
- FBNetV3IS
- FBNetV3OD
- CascadeMask R-CNN
- HybridTask Cascade
- Computer Vision metrics
- Confusion Matrix
- Intersection over Union (IoU)
- Accuracy
- Hamming score
- Precision
- Recall
- Precision-Recall curve and AUC-PR
- F-score
- Average Precision
- mean Average Precision (mAP)
- Loss functions in Machine Learning
- Comprehensive overview of loss functions in Machine Learning
- Cross-Entropy Loss
- Binary Cross-Entropy Loss
- Focal loss
- Bounding Box Regression Loss
- CrossEntropyIoULoss2D
- Average Loss
- Solver / Optimizer
- Comprehensive overview of solvers/optimizers in Deep Learning
- Adam
- SGD
- Adadelta
- Adagrad
- AdaMax
- Adamw
- ASGD
- Rprop
- RMSprop
- Lion
- Weight Decay
- Base Learning Rate
- Momentum (SGD)
- Epsilon Coefficient
- Training Parameters
- Patience
- Min delta
- Seed
- Everything you need to know about batches in Machine Learning
- Iterations
- Epoch
- Scheduler
- Comprehensive overview of learning rate schedulers in Machine Learning
- ExponentialLR
- CyclicLR
- StepLR
- MultiStepLR
- ReduceLROnPlateau
- CosineAnnealingLR
- Computer Vision augmentations
- Comprehensive overview of augmentations in Machine Learning
- Horizontal Flip
- Vertical Flip
- Random Crop
- Random Sized Crop
- Rotate
- Resize
- Blur
- Smallest max size
- Center Crop
- Color Jitter
- Gaussian Noise
- Shift Scale Rotate
- Longest max size
- Equalize
- To gray
- Shear
- Mosaic
- Copy Paste
- Extrapolation methods
- Interpolation methods
- Deployment
- Primitive deployment using web frameworks
- Commonly used web frameworks
- Containerized Deployment
- Orchestrated Deployment
- Challenges of Deployment
- Splits
- Data Splitting in Machine Learning

- CloudFactory Computer Vision Wiki
- Solver / Optimizer
- Comprehensive overview of solvers/optimizers in Deep Learning

In deep learning, the optimizer (also known as a solver) is an algorithm used to **update the parameters (weights and biases) of the model**. The goal of an optimizer is to find parameters with which the model performs best on a given task.

On this page, we will discuss:

- What the idea behind solvers/optimizers is, and how they work;
- What the main solvers/optimizers used in DL are;
- What the difference between an optimizer and a scheduler is;
- How to use solvers/optimizers in Hasty.

Let’s jump in.

As we mentioned in the intro, an optimizer is an algorithm that updates the model’s parameters (weights and biases) to minimize the loss function and lead the model to its **best possible performance** for the given task.

The **loss function** reflects the difference between the output predicted by the neural network and the actual ground-truth output. There are different types of loss functions; the final choice depends on the nature of your task and the data you work with. You can learn more about the Loss function on our MP Wiki page.

The most classical example of an optimizer is an algorithm called **Gradient Descent (GD)**. It calculates the gradient (slope) of the loss function with respect to the weights and biases and updates them in the direction that minimizes the loss. The goal is to reach the global minimum of the loss, not merely one of its local minima.

Generally, the formula for computing Gradient Descent is given as θ = θ - α∇J(θ), where:

- θ is the vector of parameters to be updated;
- α is the learning rate, which determines the step size at each iteration;
- ∇J(θ) is the gradient of the cost function J(θ) with respect to the parameters θ.
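The update rule above can be sketched in a few lines of plain Python. The two-parameter quadratic cost used here is an illustrative assumption, not an example from the text:

```python
def gd_step(theta, grad, lr):
    # One Gradient Descent update: theta <- theta - lr * grad(theta)
    g = grad(theta)
    return [t - lr * gi for t, gi in zip(theta, g)]

# Example cost J(theta) = theta_0^2 + theta_1^2, whose gradient is (2*theta_0, 2*theta_1)
grad_j = lambda th: [2 * th[0], 2 * th[1]]
theta = gd_step([4.0, -2.0], grad_j, lr=0.25)   # -> [2.0, -1.0]
```

Each call moves the parameter vector one step against the gradient; repeating the call drives the parameters toward the minimum at the origin.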

To give you a better understanding of GD, let’s calculate one example manually. Say we have to minimize a loss function defined as f(x) = x^2 - 4x + 3.

Let’s initialize **x_1** with the value 0 and set the step size (learning rate) to **0.3**.

- First, we find the derivative of the loss function. In this case, it is:
  **f'(x) = 2x - 4**
- Then, we plug in the initial value:
  **x_1 = 0**
  **f'(0) = -4**
- Now, we update the parameter based on our learning rate and the gradient at the current value:
  **x_2 = x_1 - 0.3 * f'(x_1)**
  **x_2 = 0 - 0.3 * (-4) = 1.2**
  As you can see, the parameter shifted notably to the right, toward the global minimum (from 0 to 1.2).
- Next, we take the updated parameter and plug it into the derivative again:
  **x_2 = 1.2**
  **f'(1.2) = 2 * 1.2 - 4 = -1.6**
  The gradient’s magnitude decreased from 4 to 1.6, so we are getting closer to the minimum.
- Again, we update the parameter:
  **x_3 = x_2 - 0.3 * f'(x_2)**
  **x_3 = 1.2 - 0.3 * (-1.6) = 1.68**

Note that the second parameter update was notably smaller than the first one (compare 0 → 1.2 vs. 1.2 → 1.68). This is because the gradient shrinks as we approach the global minimum, so the steps naturally become smaller and we are less likely to overshoot it.

Steps 2 and 3 are repeated until the loss with the updated parameters **stops decreasing**. Of course, this fact alone does not guarantee that we have found the global minimum; it could also be some local minimum. Various algorithms that adapt the learning rate were developed to mitigate this issue.

Note that the example case has two dimensions only (x and y) and can be represented as a 2D graph. If there are more dimensions, however, you would have to find partial derivatives of the loss function with respect to each of your input parameters.
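The manual calculation above can be reproduced with a short loop. This is just the worked example in code, using the same loss, starting point, and learning rate:

```python
def f_prime(x):
    # Derivative of the example loss f(x) = x^2 - 4x + 3
    return 2 * x - 4

lr = 0.3          # learning rate (step size)
x = 0.0           # initial value x_1
trajectory = [x]
for _ in range(5):
    x = x - lr * f_prime(x)   # Gradient Descent update
    trajectory.append(x)

# The first steps match the manual calculation: 0.0 -> 1.2 -> 1.68 -> ...
# and the sequence approaches the true minimum at x = 2.
```

Running more iterations moves x ever closer to 2, with each step smaller than the last, exactly as in the hand calculation.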

Even though Gradient Descent is relatively easy to comprehend and compute, it comes with its own disadvantages, which amplify as the dataset gets larger:

- To calculate the gradient of the loss function, every training point in the dataset must be considered for each update. This can lead to slow training and **high computational costs**.
- Because the entire dataset is required for each computation, **memory constraints** might pose an issue.
- When the dataset is large, there might be **many local optima**, and Gradient Descent may not converge to the global optimum. Hence, the model might underperform.

There are other optimizers that address the shortcomings of Gradient Descent; each of them has its own advantages and disadvantages as well. Below, we describe the optimizers available to you in Hasty.

- **Stochastic Gradient Descent (SGD)**, for example, is a variant of GD that computes the gradient on a small random batch of training data (a so-called “mini-batch”). This can lead to faster convergence and better use of memory, making it preferable to GD when you work with large datasets.
- **Adagrad** (Adaptive Gradient) is worth considering if you work with sparse data, especially in high dimensions. The learning rate for each parameter is adjusted based on its gradient history. Nevertheless, Adagrad’s learning rate may decrease too quickly, which can result in slow or premature convergence.
- **Adadelta** addresses the decaying learning rate problem found in Adagrad. The difference is that Adagrad considers all past gradients when making an update, whereas Adadelta takes only a certain window of recent gradients into the calculation.
- **ASGD** (Averaged SGD) runs plain SGD but additionally maintains a running average of the parameters, which can produce more stable final weights.
- **Rprop** (resilient propagation) ignores the magnitude of the gradient and uses only its sign, keeping a separate step size per parameter that grows or shrinks depending on whether consecutive gradients agree in sign; it is intended for full-batch training.
- **RMSprop** divides each parameter’s learning rate by a running average of its recent squared gradients, which helps on noisy and non-stationary objectives.
- **Adam (Adaptive Moment Estimation)** combines momentum with RMSprop-style adaptive learning rates by keeping running estimates of both the first and second moments of the gradients; it is a popular default choice.
- **AdaMax** is a variant of Adam that scales updates by the infinity norm of past gradients instead of the second-moment estimate.
- **AdamW** differs from Adam in that it decouples weight decay from the gradient update to prevent overfitting. Hence, AdamW tends to generalize a bit better.
- **Lion (EvoLved Sign Momentum)** updates the parameters using only the sign of a momentum term, which keeps its memory footprint low.
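To illustrate the per-parameter adaptive learning rate idea that Adagrad introduced, here is a minimal pure-Python sketch. It is an illustration only, not Hasty’s or any framework’s implementation:

```python
import math

def adagrad_step(params, grads, sq_sums, lr=0.1, eps=1e-8):
    """One Adagrad update: each parameter's effective learning rate
    shrinks as its squared gradients accumulate in sq_sums."""
    for i, g in enumerate(grads):
        sq_sums[i] += g * g
        params[i] -= lr * g / (math.sqrt(sq_sums[i]) + eps)
    return params

params, sq_sums = [1.0, 1.0], [0.0, 0.0]
adagrad_step(params, [1.0, 1.0], sq_sums)   # both parameters move by ~0.1
adagrad_step(params, [1.0, 1.0], sq_sums)   # the same gradient now moves them by only ~0.071
```

Because the accumulated squared gradients only grow, each subsequent step for a frequently updated parameter is smaller, which is exactly the “learning rate may decrease too quickly” behavior noted above.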

To sum up, the choice of the optimizer depends on various factors:

- the task you are performing;
- the size of the dataset;
- the variance of your data;
- the complexity of the model;
- etc.

While both schedulers and optimizers aim at improving the model’s performance, these are two different types of algorithms.

- Optimizers adjust the **weights** of the neural network during training, with the goal of minimizing the loss function. They work by computing the gradient of the loss with respect to the weights and then updating the weights in a way that reduces the loss. Examples of optimizers include Gradient Descent, Stochastic Gradient Descent, Adagrad, RMSprop, Adam, and so on.
- Schedulers adjust the **learning rate** during training in order to improve the performance of the optimizer. They typically reduce the learning rate as training progresses, which can help the optimizer converge more efficiently and avoid getting stuck in local minima. Examples of schedulers include StepLR, MultiStepLR, CyclicLR, ExponentialLR, ReduceLROnPlateau, CosineAnnealingLR, and so on.

In other words, optimizers are responsible for updating the weights of the neural network to minimize the loss, while schedulers are responsible for adjusting the learning rate used by the optimizer to improve convergence and performance.
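To make the division of labor concrete, here is a StepLR-style schedule sketched in plain Python. The function and parameter names are illustrative assumptions, not a framework API:

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    # StepLR-style rule: multiply the learning rate by `gamma`
    # every `step_size` epochs; the optimizer then uses this value
    # when performing its weight updates.
    return base_lr * gamma ** (epoch // step_size)

step_lr(0.1, epoch=0)    # ~ 0.1   (no decay yet)
step_lr(0.1, epoch=10)   # ~ 0.01  (one decay)
step_lr(0.1, epoch=25)   # ~ 0.001 (two decays have occurred)
```

The scheduler only computes what the learning rate should be at a given point in training; the optimizer consumes that value when updating the weights.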

Note that certain frameworks, like PyTorch and TensorFlow, let you attach a scheduler to an optimizer so that the learning-rate schedule and the parameter updates can be managed together.

In Hasty, solvers/optimizers are used during model training, when running experiments.

- To start an experiment, first access Model Playground in the menu on the left.
- Select the split on which you want to run the experiment, or create a new split.
- Create a new experiment and choose the desired architecture and other parameters.
- Go to the **Solver & scheduler** section and select the solver you want to use.
- Select the solver parameters. They might differ from solver to solver, but the most common ones are:


© 2010-2024 CloudFactory Limited. All rights reserved.