When working on a Computer Vision (CV) project, the first thing one would want to know is probably the Convolution operation. With convolution being one of the basic CV concepts, understanding it is essential for exploring the vision AI field.

This page is dedicated to advanced techniques and approaches you can use when working with convolutions. Therefore, it requires basic knowledge of the convolution concept. So, if you are entirely new to the convolution term, please navigate to our comprehensive guide that explores the term in-depth.

• Parameters that deepen and expand the convolution concept:
  • Convolutional kernel;
  • Stride;
  • Receptive field;
• Different types of convolutions you might want to try out:
  • Transposed convolution;
  • Depthwise convolution;
  • Pointwise convolution;
  • Spatial separable convolution;
  • Depthwise separable convolution;
• Useful techniques related to convolutions:
  • Blurring;
  • Edge detection;
  • Embossing;
• And more.

Let’s jump in.

To put it simply, convolution is the process of sliding a matrix of weights over an image, multiplying it element-wise with the region it covers, and summing the products to produce a value of the new image. The window we move across the image is called the convolution kernel and is usually denoted by the letter K.

Performing the convolution operation reveals valuable features of an image, for example, edges, specific lines, color channel intensities, etc.
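As a sketch, here is what this sliding-and-summing looks like in plain NumPy. Note that, like most deep-learning frameworks, it does not flip the kernel, so strictly speaking it computes cross-correlation:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), multiplying
    element-wise and summing at every position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
K = np.ones((3, 3))                               # example 3x3 kernel
print(convolve2d(image, K).shape)  # (3, 3)
```

Real frameworks vectorize this heavily, but the nested-loop version shows exactly where every output value comes from.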

Convolution kernels are usually odd-sized square matrices (usually, but not necessarily!). Besides its width and height, a kernel is defined by its depth, which usually matches the number of input channels.

• If our input is a 32x32 colored image, then it has the dimensions 32x32x3, with a separate channel for each base color: Red, Green, and Blue. Accordingly, our kernel would also have a depth of 3.

However, you can still tune the width and height of your kernel. You can go with a small window of size 3x3 or with a larger 5x5 or 7x7 (and beyond).

Intuitively, a small 3x3 kernel will use only the 9 pixels surrounding the target pixel to compute an output value, while larger kernels cover more pixels. The logical question is: how should you select the perfect kernel size? Unfortunately, there is no definite answer to that.

Still, it is good to know that:

• Smaller kernels learn small features like edges well;

• Larger kernels are better at learning the shapes of objects but are more affected by noise.

Although smaller kernel sizes are usually a preferred choice, in general, kernel size is a neural network hyperparameter you can adjust when playing with the model’s architecture.

Moreover, if you opt for some well-known neural network architecture, you will not have to choose the kernel size, as the authors have already done it for you.

One of the uses of convolution operation is blurring images. Blurring an image means merging adjacent pixel values, making them less different from each other. There are various blurring techniques, but for simplicity, let's look at the most basic blurring with a mean filter.

When you convolve an image with a mean-filter kernel (a 3x3 kernel where every weight is 1/9), each output pixel will be the mean of 9 adjacent input pixels, resulting in a blurred image.
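For illustration, here is a mean filter applied to a small high-contrast patch (the pixel values are made up):

```python
import numpy as np

# A high-contrast 5x5 checkerboard patch (values made up for illustration).
img = np.array([[ 10., 200.,  10., 200.,  10.],
                [200.,  10., 200.,  10., 200.],
                [ 10., 200.,  10., 200.,  10.],
                [200.,  10., 200.,  10., 200.],
                [ 10., 200.,  10., 200.,  10.]])

mean_kernel = np.ones((3, 3)) / 9.0  # every output pixel = mean of 9 inputs

blurred = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        blurred[i, j] = np.sum(img[i:i + 3, j:j + 3] * mean_kernel)

# Neighbouring values are pulled toward each other: the blurred patch
# varies far less than the original.
print(img.std() > blurred.std())  # True
```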

Another use for convolution operations is edge detection. In images, edges are sudden changes in pixel values. This means you can identify an edge by subtracting two neighboring pixel values from each other: if the difference is large, it is an edge! Here you can see two examples of edge detection: horizontal edge detection and vertical edge detection.

Notice that you see only horizontal changes in pixel values. If you look closely, you can see that the tall building in the background has completely vanished, because it has the shape of a strictly vertical rectangle.

Here, on the other hand, you see vertical changes in pixel values, and the building is visible.
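Both effects can be reproduced with the classic Sobel kernels. A sketch on a synthetic image that contains only a vertical edge:

```python
import numpy as np

# Synthetic image with one vertical edge: dark left half, bright right half.
img = np.zeros((5, 5))
img[:, 3:] = 255.0

sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])  # responds to vertical edges
sobel_y = sobel_x.T                  # responds to horizontal edges

def conv3(img, k):
    out = np.zeros((img.shape[0] - 2, img.shape[1] - 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

print(np.abs(conv3(img, sobel_x)).max())  # 1020.0 -- strong vertical response
print(np.abs(conv3(img, sobel_y)).max())  # 0.0 -- no horizontal edges at all
```

Just as in the building example, the kernel tuned to one direction is completely blind to edges in the other.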

Another valuable type of convolutional kernel is the emboss kernel. Emboss kernels enhance the perceived depth of an image using edges and highlights. There are 4 primary emboss filter masks in total, and you can manipulate and combine them to enhance details from different directions as needed.
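For reference, here is one commonly used emboss mask (the other primary masks are rotations of this pattern toward different directions):

```python
import numpy as np

# One commonly used emboss mask.
emboss = np.array([[-2., -1., 0.],
                   [-1.,  1., 1.],
                   [ 0.,  1., 2.]])

# Its weights sum to 1, so flat regions pass through unchanged
# while transitions along the diagonal are exaggerated.
flat = np.full((3, 3), 100.0)
print(np.sum(flat * emboss))  # 100.0
```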

Relation to the receptive field: kernel size is directly related to a pixel’s receptive field, which will be covered in detail further down.

We defined convolution as sliding a kernel across an input image to produce an output image. If we only place the kernel at positions where it fits entirely inside the image, the output feature map shrinks.

Think of it as placing the kernel so that its center visits every pixel it can reach: the center cannot get closer than (k - 1) / 2 pixels to any edge. Performing such a convolution means that our image dimensions shrink by (k - 1) / 2 pixels on each side, where k is the size of our square kernel.

Let’s imagine you had an image of size 32x32 and applied a convolution with a 3x3 kernel. Then the output image has dimensions 30x30, because you had to ignore one line of pixels on every edge. To avoid this shrinkage, the concept of padding comes into play.

Padding suggests filling the edges of an image with temporary values. Applying it lets you provide the missing values needed to compute an output properly. There are multiple ways of performing padding, so let’s closely examine them.

• Zero padding is one of the most simple and popular approaches. It works by simply adding zeros around an image.
Remember that here we added zero padding of size 1, but you can choose whatever size you want: 2 or 3 lines of zeros, if needed. If you have a kernel of size 5x5, then adding 1 line of padding is not enough; a bigger kernel needs bigger padding, so you should adjust the padding size accordingly. Padding chosen this way is called same padding.
• Same padding is a padding that is set so the output dimensions after performing the convolution are the same as the input dimensions. If we input a 32x32 image and apply convolution, the output should have the same 32x32 size.
Although these methods are simple and effective, placing zeros around an image might not be the best idea. Sometimes we do not want to add non-existing black borders around our image. What do we do in such a case? We extend the image using reflective padding.
• Reflective padding mirrors the pixels just inside the border and uses their values as padding. This preserves the image structure at the edges and does not introduce values that never existed in the image.
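NumPy's `np.pad` covers all of these padding modes; a quick sketch:

```python
import numpy as np

img = np.array([[1., 2., 3.],
                [4., 5., 6.],
                [7., 8., 9.]])

zero_pad = np.pad(img, 1, mode="constant", constant_values=0)
reflect_pad = np.pad(img, 1, mode="reflect")  # mirrors the inner neighbours
edge_pad = np.pad(img, 1, mode="edge")        # repeats the edge pixel itself

print(zero_pad.shape)  # (5, 5): one line of padding added on every side
print(reflect_pad[0])  # [5. 4. 5. 6. 5] -- mirrored, no artificial black border
```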

Well, with padding, you can preserve the dimensions of the input. Now, let's talk about shifting the kernel across the image. The classic approach is shifting it 1 pixel at a time, but you can increase that value. This is called stride.

Stride is the number of pixels the convolution kernel shifts across the image.

But why don't we use larger strides? The simple answer is – they are not that effective.

• One possible use case for a stride larger than 1 is when you have a huge input image and want to downsample it.

• Setting stride to 2, in this case, reduces the image’s dimensions by half on the output, at the cost of losing valuable information.

• Setting strides to larger values also reduces the computational load of the system, as now it does not perform convolution at every pixel. However, there are better ways for both performance optimization and dimensionality reduction, so the stride of 1 is almost always used.

So far, you learned that you could adjust your convolution operation by choosing the kernel size K, padding size P, and stride S. The big question is - how do you calculate the dimensions of the output matrix using all these numbers?

You can use the following formula:

$Width = \left(\left(W - k + 2 * p\right) / s\right) + 1$

$Height = \left(\left(H - k + 2 * p\right) / s\right) + 1$

where W and H are the width and height of the input image.
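The formula translates into a one-line helper. Floor division is used here as an assumption, mirroring how frameworks discard windows that do not fit:

```python
def conv_output_size(size, k, p, s):
    """Output width or height along one spatial dimension."""
    return (size - k + 2 * p) // s + 1

print(conv_output_size(9, k=3, p=0, s=1))  # 7
print(conv_output_size(9, k=3, p=1, s=1))  # 9
print(conv_output_size(9, k=3, p=1, s=2))  # 5
```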

Let’s check out an example. Imagine having an input image of 9x9, a convolution kernel of 3x3, no padding, and a stride of 1. Then our output image dimensions are:

$Width = \left(\left(9 - 3 + 2 * 0\right) / 1\right) + 1 = 7$

$Height = \left(\left(9 - 3 + 2 * 0\right) / 1\right) + 1 = 7$

Now add a padding of 1 and keep everything else the same:

$Width = \left(\left(9 - 3 + 2 * 1\right) / 1\right) + 1 = 9$

$Height = \left(\left(9 - 3 + 2 * 1\right) / 1\right) + 1 = 9$

As you can see, adding padding preserved the initial size of the image.

Next, let's see how stride affects the output by increasing it to 2 (keeping the padding of 1). The output dimensions become:

$Width = \left(\left(9 - 3 + 2 * 1\right) / 2\right) + 1 = 5$

$Height = \left(\left(9 - 3 + 2 * 1\right) / 2\right) + 1 = 5$

The stride acts as a divisor of the image size: a stride of 2 roughly halves the output dimensions, a stride of 3 cuts them to a third, and so on.

If you use a kernel size of 3x3, then 9 pixels of the original image contribute to a single pixel of the output image. That is called the receptive field of the pixel. The term has a biological background: in biology, the receptive field is the portion of the sensory space of our eyes that can trigger specific neurons in our brains. In computer vision, the receptive field carries the same idea.

The receptive field is the number of pixels in the original image to which the output pixel is sensitive. For example, if you change a single value out of 9 original pixels contributing to the output pixel, the output pixel value will also change. That means that the receptive field of that pixel is 9.

If you chose a larger kernel size of 7x7, then the receptive field would also increase to 7 * 7 = 49, because now the output pixel value is affected by 49 input pixels.

Now, imagine executing two consecutive convolution operations, each with a 3x3 kernel. The receptive field of a pixel after the first operation is 9; that is simple. The second convolution uses the output of the first as its input, so each of the 9 pixels it reads was itself computed from 9 pixels of the original image. Those 3x3 neighborhoods overlap, however, so the receptive field after the second convolution is not 9 * 9 = 81 but a 5x5 window of the original image, i.e. 25 pixels.
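Because neighbouring kernel windows overlap, the receptive field of stacked convolutions grows additively rather than multiplicatively. A small helper makes the arithmetic concrete (this is the standard stride-aware receptive-field recursion):

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels, per dimension) of stacked convs.
    Each layer adds (k - 1) * (product of earlier strides) pixels."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3]))      # 3  -- one 3x3 layer sees 3x3 pixels
print(receptive_field([3, 3]))   # 5  -- two layers see 5x5 = 25 pixels
print(receptive_field([3] * 5))  # 11 -- five layers see 11x11 = 121 pixels
```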

What does the receptive field tell us? The receptive field of a pixel gives us an approximate idea of what this pixel represents. For example, a small receptive field of 9 might tell us that this pixel represents some edge or line in that area of the input image. But further into the model, as the receptive field increases, the pixel can already represent a specific shape or corner.

Can you increase the receptive field of a pixel without increasing the kernel size? Yes! Usually, convolutional operations are followed by pooling operations.

Pooling is a technique for downsampling an image. It takes several pixels and replaces them with one that represents them all. There are different types of pooling. Here are some of them:

• Mean pooling - take the mean of all pixels in the area;
• Max pooling - take the maximum value of the pixels in the area;
• Min pooling - take the minimum value of the pixels in the area.

The pooling operation increases the receptive field of the following convolutional layers by reducing the size of the feature map they receive as input.
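A sketch of all three pooling variants over non-overlapping 2x2 windows:

```python
import numpy as np

def pool2x2(img, op):
    """Replace every non-overlapping 2x2 block with a single value."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return op(img.reshape(h, 2, w, 2), axis=(1, 3))

img = np.array([[1., 2., 5., 6.],
                [3., 4., 7., 8.],
                [9., 8., 3., 2.],
                [7., 6., 1., 0.]])

print(pool2x2(img, np.max))   # [[4. 8.] [9. 3.]]
print(pool2x2(img, np.mean))  # [[2.5 6.5] [7.5 1.5]]
print(pool2x2(img, np.min))   # [[1. 5.] [6. 0.]]
```

The reshape trick groups each 2x2 block along its own axes, so a single reduction over those axes performs the pooling.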

• Another method for increasing the receptive field is the dilated convolution. It is easier to understand visually, so take a look at the animated illustration below.
• Dilated convolution works by inserting holes (zero values) between kernel values to make the kernel bigger. It has been shown to improve the results of classification models.
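A sketch of the kernel dilation itself: the 9 original weights stay, but the zeros between them widen the footprint:

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros between kernel values, enlarging the
    footprint without adding any new weights."""
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)
    out = np.zeros((size, size))
    out[::rate, ::rate] = kernel
    return out

K = np.ones((3, 3))
print(dilate_kernel(K, 2).shape)  # (5, 5): same 9 weights, wider view
```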

It is worth noting that although the receptive field grows as layers stack, not all input pixels contribute equally to the final output. Usually, the central pixels have the highest impact, while surrounding pixels lose influence. That is why the term Effective Receptive Field exists.

The Effective Receptive Field is the portion of the receptive field that has an actual influence on the output. With a 3x3 kernel, five stacked convolutional layers give a theoretical receptive field of 11x11 = 121 input pixels, but not all of them make a meaningful contribution to the output. Increasing the number of layers increases the theoretical receptive field, but the effective portion of it shrinks relative to the whole.

You already know that the convolution operation reduces the dimensions of the input and that you can use padding to avoid that effect. But is there a way to get increased dimensions after a convolution? Yes: there is a special type of convolution called Transposed Convolution. It serves as a reverse operation for regular convolution, and the output it produces has larger spatial dimensions than the input.

Transposed convolution consists of the following steps:

1. Place s - 1 zeros between each pair of input pixels, where s is the stride of our regular convolution.
2. Pad the resulting image with k - p - 1 zeros on each side, where k is the kernel size and p is the padding of our regular convolution.
3. Perform a regular convolution on the resulting input with a stride of 1.

From the idea of transposed convolution, you can get the intuition that it is used for upsampling. Upsampling is the process of getting an image with a higher resolution than the input image.

• This means that, for example, you feed 32x32 images into the convolution layer and get 64x64 images at the output.
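A sketch of the preparation steps for a tiny 2x2 input with k = 3, s = 2, p = 1, following the common formulation in which the border is padded with k - p - 1 zeros:

```python
import numpy as np

def prepare_for_transposed_conv(x, k, s, p):
    """Steps 1-2: dilate the input with s - 1 zeros between pixels,
    then pad the border with k - p - 1 zeros."""
    h, w = x.shape
    dilated = np.zeros((h + (h - 1) * (s - 1), w + (w - 1) * (s - 1)))
    dilated[::s, ::s] = x
    return np.pad(dilated, k - p - 1)

x = np.array([[0., 1.],
              [2., 3.]])
prepared = prepare_for_transposed_conv(x, k=3, s=2, p=1)
print(prepared.shape)  # (5, 5): a stride-1 3x3 conv on this gives a 3x3 output
```

Step 3 is then just a regular stride-1 convolution on the prepared array.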

Next comes depthwise convolution. Remember that convolution kernels have the same depth as the input, and convolving with such a kernel produces an output of depth 1? That is not the case for depthwise convolution. Depthwise convolution applies a separate kernel of depth 1 to each input channel and then stacks the results back together.

Doing convolution this way always preserves the depth of the input. Why you need this will become clear later, so for now, stick to the definition.
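A minimal sketch, looping over channels explicitly (the H x W x C layout here is an assumption for illustration):

```python
import numpy as np

def depthwise_conv(img, kernels):
    """One depth-1 kernel per channel; results are stacked, so the
    output depth equals the input depth."""
    h, w, c = img.shape
    k = kernels.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c))
    for ch in range(c):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, ch] = np.sum(img[i:i + k, j:j + k, ch]
                                       * kernels[:, :, ch])
    return out

img = np.random.rand(8, 8, 3)      # H x W x C input
kernels = np.random.rand(3, 3, 3)  # one 3x3 kernel per input channel
print(depthwise_conv(img, kernels).shape)  # (6, 6, 3): depth preserved
```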

Pointwise convolution is, in a sense, the mirror image of depthwise convolution. Depthwise convolution preserves the depth of the input but changes its width and height; pointwise convolution does the opposite.

Pointwise convolution uses a 1x1xN kernel, where N is the depth of the input. With a stride of 1, it always preserves the width and height of the input and needs no padding.

Pointwise convolution looks at only 1 pixel at a time and is used to regulate the number of feature channels. If you want to increase the number of channels from 3 to 64, for example, without changing the input resolution, you can use 64 pointwise kernels and do it far more efficiently than with traditional convolution. However, these feature channels will not learn much on their own, as each output is guided by a single pixel without context, so pointwise convolution needs to be used in conjunction with something else.
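Because a 1x1 convolution only mixes channels at each location, it reduces to a matrix product over the channel axis; a sketch:

```python
import numpy as np

def pointwise_conv(img, weights):
    """1x1 convolution: every output pixel mixes only the channels of
    the single input pixel at the same location.
    `weights` has shape (in_channels, out_channels)."""
    return img @ weights  # matmul broadcasts over the H and W axes

img = np.random.rand(32, 32, 3)
weights = np.random.rand(3, 64)  # 64 pointwise kernels of shape 1x1x3
print(pointwise_conv(img, weights).shape)  # (32, 32, 64)
```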

In real-life applications, a convolution layer is far "cheaper" than a fully connected layer. However, there is a way to make it even more efficient by introducing spatially separable convolutions. The mathematics of convolution allows us to split certain kernels into two one-dimensional kernels and apply them one after the other to obtain the same result.

This lets developers perform convolution where, instead of multiplying 9 values and adding them up, you multiply 3 values twice and add those up: 6 multiplications per pixel instead of 9.
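The Sobel kernel from classical edge detection is a well-known separable example: it factors into a column vector times a row vector:

```python
import numpy as np

# The 3x3 Sobel kernel is the outer product of two 1-D kernels.
col = np.array([[1.], [2.], [1.]])   # 3x1 smoothing pass
row = np.array([[1., 0., -1.]])      # 1x3 differencing pass
sobel = col @ row

print(sobel)
# [[ 1.  0. -1.]
#  [ 2.  0. -2.]
#  [ 1.  0. -1.]]
```

Convolving with `col` and then `row` therefore gives the same result as convolving with the full `sobel` kernel, at roughly two-thirds of the multiplication cost.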

If this method is more efficient, why doesn't everyone use it exclusively? The most significant problem is that it requires the kernel to be separable, which most kernels are not. When it works, it gives a significant performance boost, but in most cases it cannot be used. That is why depthwise separable convolutions exist.

Depthwise separable convolutions allow users to separate traditional convolutions into two operations even if the original kernel is not separable.

Quick recap: in traditional convolution, you have a kernel whose depth usually matches the depth of the input. For example, for a 32x32x3 input image, you could use a 3x3x3 kernel. That, in combination with padding, gives an output of dimensions 32x32x1.

As you can see, the image drops from 3 feature channels to 1. In reality, you want to increase the number of feature channels to learn more about an image, so you aim for something like a 32x32x64 output with 64 different feature channels.

For that, you take 64 kernels and convolve each kernel with the input to receive 64 different 32x32x1 feature outputs and then just combine them into one big 32x32x64 output.

So how can you separate this pipeline into two separate more efficient operations? You already have the answer: you use depthwise and pointwise convolutions.

• First, you use depthwise convolution to gather contextual data on the input, giving an intermediate output with the same depth as the input.
• Then you apply pointwise convolution with N kernels to this intermediate result to learn N feature channels.

Performing these two operations separately is equivalent to the traditional convolution, but more efficient in terms of the number of computations, which makes it a popular choice for large CNNs where performance is crucial.
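A sketch combining both steps, with a rough count of multiplications (the H x W x C layout is an assumption for illustration):

```python
import numpy as np

def depthwise_separable(img, dw_kernels, pw_weights):
    """Depthwise 3x3 per channel, then a pointwise 1x1 mix across channels."""
    h, w, c = img.shape
    k = dw_kernels.shape[0]
    inter = np.zeros((h - k + 1, w - k + 1, c))
    for ch in range(c):                      # step 1: depthwise
        for i in range(inter.shape[0]):
            for j in range(inter.shape[1]):
                inter[i, j, ch] = np.sum(
                    img[i:i + k, j:j + k, ch] * dw_kernels[:, :, ch])
    return inter @ pw_weights                # step 2: pointwise (1x1)

out = depthwise_separable(np.random.rand(32, 32, 3),
                          np.random.rand(3, 3, 3),  # 3x3 kernel per channel
                          np.random.rand(3, 64))    # 64 pointwise kernels
print(out.shape)  # (30, 30, 64)

# Per output position: 3*3*3 + 3*64 = 219 multiplications, versus
# 3*3*3*64 = 1728 for 64 full 3x3x3 kernels -- roughly 8x fewer.
```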