Computer Vision: How to Set Up Your CNN Architecture

Learn how to design the architecture for your next computer vision project and view example code in PyTorch

Published in Better Programming, Jan 8, 2021

An aerial shot of the United States, showing the most populated areas lit up by lights. — Photo by NASA on Unsplash

When creating a new computer vision project, you have to make many decisions that ultimately affect the performance of your model. You can choose between different types of layers, such as the convolutional layer, pooling layer, fully connected layer, softmax layer, and dropout layer. In addition, it’s quite common to have multiple layers of the same type.

Further, most layer types can be customized: you’ll usually have to set the number of input and output nodes as well as other parameters. This article will help you choose how many layers of each type to use and set reasonable parameter values.

Convolutional Layer

Convolutional layers perform convolutions, which are operations where a filter is moved over an input image, calculating the values in a resulting feature map.
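To make the sliding-filter idea concrete, here is a minimal sketch that computes a feature map by hand and compares it with PyTorch’s built-in operation. The helper name slide_filter, the 5x5 input, and the vertical-edge filter are illustrative assumptions, not part of any library. Note that, like most deep learning frameworks, PyTorch does not flip the filter, so the operation is technically a cross-correlation.

```python
import torch
import torch.nn.functional as F

def slide_filter(image, kernel):
    """Move `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the filter with the patch it currently covers and sum up
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = (patch * kernel).sum()
    return feature_map

# Hypothetical 5x5 "image" and a 3x3 filter that responds to vertical edges
image = torch.arange(25, dtype=torch.float32).reshape(5, 5)
kernel = torch.tensor([[1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0]])

manual = slide_filter(image, kernel)
# F.conv2d expects (batch, channels, height, width), so add two leading dimensions
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]
```

Both results are 3x3 feature maps with identical values, which is what a convolutional layer produces for each of its filters.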

A convolutional layer is usually built up of multiple filters, each of which produces its own feature map. During training of the CNN, the model learns the weights of these filters and, hence, which features to extract from the input images.

By increasing the number of convolutional layers in the CNN, the model will be able to detect more complex features in an image.

However, more layers also mean longer training times and a higher likelihood of overfitting. For a fairly simple classification task, two convolutional layers will usually be enough. The number of layers can then be increased if the resulting accuracy is too low.

The appropriate number of nodes (for a convolutional layer, the number of filters, i.e., output channels) is also highly dependent on the complexity of the images and the task at hand. You can vary this number, retrain the model, and evaluate the resulting accuracy, repeating the process until a satisfying result is achieved.

After doing multiple computer vision projects, developers will be better able to guess what number of nodes will work on a certain type of project and, hence, reduce the number of iterations needed.

Using PyTorch, the convolutional layers are usually defined inside the __init__ function of a CNN model class defined by the developer. Importing torch.nn as nn, one can define two convolutional layers like this:

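# 1 input channel (e.g., a grayscale image), 10 output feature maps, 3x3 filters
self.conv1 = nn.Conv2d(1, 10, 3)
# 10 input channels (the output of conv1), 32 output feature maps, 3x3 filters
self.conv2 = nn.Conv2d(10, 32, 3)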
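To see what these two layers do to an input, the following sketch passes a random batch through them and records the output shapes and the number of learnable parameters. The batch size of 4 and the 28x28 grayscale input are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 10, 3)   # arguments: in_channels, out_channels, kernel_size
conv2 = nn.Conv2d(10, 32, 3)

# Assumed input: a batch of 4 grayscale images of size 28x28
x = torch.randn(4, 1, 28, 28)

h1 = conv1(x)
h2 = conv2(h1)

# Without padding, each 3x3 convolution shrinks height and width by 2
conv1_shape = tuple(h1.shape)  # (4, 10, 26, 26)
conv2_shape = tuple(h2.shape)  # (4, 32, 24, 24)

# Parameters per layer: out_channels * in_channels * 3 * 3 weights + out_channels biases
conv1_params = sum(p.numel() for p in conv1.parameters())  # 10*1*9 + 10 = 100
conv2_params = sum(p.numel() for p in conv2.parameters())  # 32*10*9 + 32 = 2912
```

Increasing the number of filters in a layer grows the parameter count of both that layer and the next one, which is one reason to start small and scale up only if the accuracy requires it.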

Pooling Layer

A pooling layer is a layer that reduces the computational cost of the model and helps fight overfitting by reducing the dimensionality of its input. There are different types of pooling layers:

Max pooling: Select the largest value in each window
Min pooling: Select the smallest value in each window
Average pooling: Take the average of the values in each window
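The three pooling types can be compared on a small example. This is a sketch on a hand-picked 4x4 input with a 2x2 window and stride 2. As far as I know, PyTorch has no built-in min pooling, so the sketch emulates it by negating the input, max pooling, and negating the result.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4x4 input, shaped (batch, channels, height, width)
x = torch.tensor([[1.0, 2.0, 5.0, 6.0],
                  [3.0, 4.0, 7.0, 8.0],
                  [9.0, 10.0, 13.0, 14.0],
                  [11.0, 12.0, 15.0, 16.0]])[None, None]

# Each 2x2 window is reduced to a single value
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)   # [[4, 8], [12, 16]]
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)   # [[2.5, 6.5], [10.5, 14.5]]

# Emulated min pooling: the minimum of a window is minus the maximum of the negated window
min_pooled = -F.max_pool2d(-x, kernel_size=2, stride=2)  # [[1, 5], [9, 13]]
```

In all three cases, the 4x4 input is reduced to a 2x2 output.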

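In a pooling layer, a window is moved over the input, and each window is reduced to a single value. The window size decides how many values are pooled together, and the stride decides how far the window moves between steps. Together, they determine the size of the output. The most common choice is a window size of (2, 2) and a stride of 2, which halves the height and width of the input.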
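The relationship between window size, stride, and output size can be written as a one-line formula. The helper below is a hypothetical function for illustration, assuming no padding and that windows which do not fully fit are dropped (the default behavior of nn.MaxPool2d).

```python
def pooled_size(input_size, window, stride):
    """Output length along one dimension, assuming no padding."""
    return (input_size - window) // stride + 1

# A (2, 2) window with stride 2 halves each dimension
size_28 = pooled_size(28, window=2, stride=2)  # 14
size_26 = pooled_size(26, window=2, stride=2)  # 13

# With an odd input size, the last row and column are dropped
size_11 = pooled_size(11, window=2, stride=2)  # 5
```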

There’s no textbook answer for how often a pooling layer should be applied, so you’ll again have to iterate until you reach an acceptable result. As a reference point, the well-known computer vision model VGG-16 uses two to three convolutional layers between the pooling layers, while VGG-19 uses up to four convolutional layers.
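The VGG pattern of a few convolutional layers followed by one pooling layer can be sketched as a small reusable block. The function name vgg_style_block and the channel counts in the usage lines are assumptions for illustration. The 3x3 filters with padding 1, ReLU activations, and 2x2 max pooling follow the general VGG design, but this is not a full reproduction of either model.

```python
import torch
import torch.nn as nn

def vgg_style_block(in_channels, out_channels, num_convs):
    """Stack `num_convs` 3x3 convolutions (each followed by ReLU), then one 2x2 max pool."""
    layers = []
    for i in range(num_convs):
        channels_in = in_channels if i == 0 else out_channels
        # padding=1 keeps height and width unchanged for a 3x3 filter
        layers.append(nn.Conv2d(channels_in, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

# Two convolutional layers before pooling, as in the first blocks of VGG-16
block = vgg_style_block(in_channels=3, out_channels=64, num_convs=2)
out = block(torch.randn(1, 3, 32, 32))
out_shape = tuple(out.shape)  # (1, 64, 16, 16)
```

Only the pooling layer changes the spatial size here, so the number of convolutional layers per block can be varied without affecting the output shape.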

Using PyTorch, the pooling layers are usually defined inside the __init__ function of a CNN model class defined by the developer. Importing torch.nn as nn, one can define a pooling layer like this:

self.pool = nn.MaxPool2d(2, 2)

Fully Connected Layer

A fully connected layer transforms its input to the desired output format. In a classification task, this typically means flattening the image features and mapping them to a 1xC vector, where C is the number of classes.

There isn’t necessarily a correct answer to how many fully connected layers should be chosen in a CNN model. For most models, however, it’d be sufficient to start with one or two fully connected layers, later adjusting the number depending on the resulting performance.

Using PyTorch, the fully connected layers are usually defined inside the __init__ function of a CNN model class defined by the developer. Importing torch.nn as nn, one can define two fully connected layers like this:

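# in_features is channels*height*width of the flattened feature maps: 32*5*5, assuming 28x28 inputs and the layers above
self.fc1 = nn.Linear(32*5*5, 120)
# 60 outputs, one per class in this example
self.fc2 = nn.Linear(120, 60)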
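Putting the pieces together, a complete model class might look like the sketch below. The class name SmallCNN, the ReLU activations, the 28x28 grayscale input, and the pooling layer after each convolution are assumptions for illustration. With these assumptions, the flattened feature maps contain 32*5*5 values per image, which is why fc1 uses that as its input size here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """Hypothetical two-conv, two-FC classifier for 28x28 grayscale images."""

    def __init__(self, num_classes=60):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, 3)
        self.conv2 = nn.Conv2d(10, 32, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28x28 -> 26x26 -> 13x13
        x = self.pool(F.relu(self.conv2(x)))  # 13x13 -> 11x11 -> 5x5
        x = torch.flatten(x, 1)               # keep the batch dimension, flatten the rest
        x = F.relu(self.fc1(x))
        return self.fc2(x)                    # raw class scores, one per class

model = SmallCNN()
scores = model(torch.randn(8, 1, 28, 28))
scores_shape = tuple(scores.shape)  # (8, 60)
```

If the input size or the layer parameters change, the input size of fc1 has to be recomputed accordingly.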

Softmax Layer

A softmax layer is most commonly applied after the fully connected layers. This layer takes a vector of size 1xC as an input, where C is the number of classes. The values in this vector are raw scores that can be any real numbers.

The softmax layer then converts this vector into a new vector of the same size in which all values lie between 0 and 1 and add up to 1. Each entry represents the probability that the image belongs to that particular class. A softmax is therefore mostly used in classification tasks.
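As a minimal sketch, here is softmax applied to a hand-picked vector of three class scores, once with PyTorch’s built-in function and once written out by hand. The scores themselves are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores for C = 3 classes, shaped 1xC
scores = torch.tensor([[2.0, 1.0, 0.1]])

# dim=1 normalizes across the class dimension
probs = F.softmax(scores, dim=1)

# The same computation by hand: exponentiate, then divide by the sum
manual = torch.exp(scores) / torch.exp(scores).sum(dim=1, keepdim=True)

total = probs.sum().item()            # the probabilities add up to 1
predicted_class = probs.argmax(dim=1)  # index of the most likely class
```

One practical note for PyTorch users: nn.CrossEntropyLoss expects the raw scores and applies a log-softmax internally, so a model trained with that loss usually omits an explicit softmax layer and applies softmax only when probabilities are needed at inference time.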

For most computer vision projects, one softmax layer will be sufficient.

Dropout Layer

A dropout layer randomly turns off nodes, each with probability p, during training. At inference time, no nodes are turned off. Such layers are especially helpful to fight overfitting in models with a lot of complexity.

Dropout can be applied after both fully connected layers and convolutional layers.
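The difference between training and evaluation behavior can be seen in a short sketch. The probability p = 0.5 and the all-ones input are assumptions chosen to make the effect easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random mask is reproducible

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: each value is zeroed with probability p,
# and the surviving values are scaled by 1 / (1 - p) = 2
drop.train()
train_out = drop(x)
zero_fraction = (train_out == 0).float().mean().item()

# Evaluation mode: dropout does nothing and returns its input unchanged
drop.eval()
eval_out = drop(x)
```

Roughly half of the values are zeroed in training mode, while the output in evaluation mode equals the input. Inside a model class, the layer would be defined in __init__ like the other layers, and calling model.train() or model.eval() switches all dropout layers at once. For convolutional feature maps, PyTorch also offers nn.Dropout2d, which zeroes entire channels instead of individual values.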