Convolutional Neural Networks (CNNs) are a class of deep neural networks designed to process structured grid data, most commonly images. CNNs automatically learn spatial hierarchies of features — from low-level edges and textures to high-level objects and scenes.
A standard fully-connected (dense) network treats each pixel as an independent input. For a 224x224 RGB image, that means 224 * 224 * 3 = 150,528 input features — and with even one hidden layer of 1,000 neurons, the number of parameters explodes to over 150 million. This is impractical and ignores the spatial structure of images.
CNNs solve this by exploiting three key ideas:
| Principle | Description |
|---|---|
| Local connectivity | Each neuron connects to only a small local region of the input |
| Parameter sharing | The same filter (kernel) is applied across the entire input |
| Translation invariance | A feature learned in one part of the image can be detected anywhere |
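The parameter savings from local connectivity and sharing are easy to verify directly. A minimal sketch comparing a dense layer on the flattened 224x224 RGB image from above against a single convolutional layer (the 16-filter count is an arbitrary choice for illustration):

```python
import torch.nn as nn

# Dense: every one of the 150,528 inputs connects to each of 1,000 neurons
dense = nn.Linear(224 * 224 * 3, 1000)
dense_params = sum(p.numel() for p in dense.parameters())

# Convolution: one shared 3x3x3 filter per output channel, reused at every position
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())

print(dense_params)  # 150529000 (weights + biases)
print(conv_params)   # 448 = 16 * (3 * 3 * 3) + 16
```

The convolutional layer uses roughly 300,000x fewer parameters, regardless of the input's spatial size.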
A convolution slides a small filter (kernel) across the input, computing the element-wise product and summing the result at each position to produce a feature map (also called an activation map).
Input (5x5)       Filter (3x3)     Output (3x3)
1 1 1 0 0
0 1 1 1 0         1 0 1            4 3 4
0 0 1 1 1    *    0 1 0      =     2 4 3
0 0 1 1 0         1 0 1            2 3 4
0 1 1 0 0
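The sliding-window computation can be reproduced in a few lines of plain Python (strictly speaking this is cross-correlation, which is what deep-learning libraries call "convolution"):

```python
def conv2d(image, kernel):
    """Valid (no-padding), stride-1 2D cross-correlation on square inputs."""
    k = len(kernel)
    out = len(image) - k + 1
    return [
        [
            # Element-wise product of the k x k window with the kernel, summed
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out)
        ]
        for r in range(out)
    ]

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

print(conv2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```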
| Parameter | Description |
|---|---|
| Kernel size | The dimensions of the filter (e.g., 3x3, 5x5) |
| Stride | How many pixels the filter moves at each step |
| Padding | Zeros added around the input to control output size |
| Number of filters | How many different filters (and thus feature maps) to learn |
output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
Example: Input = 32, Kernel = 3, Padding = 1, Stride = 1 → Output = (32 - 3 + 2) / 1 + 1 = 32
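The formula can be wrapped in a small helper and cross-checked against PyTorch's actual output shapes (the layer sizes here are just for illustration):

```python
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # Floor division: window positions that don't fully fit are dropped
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))  # 32

# Cross-check with a real layer on a 1x3x32x32 batch
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```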
The convolutional layer learns filters that detect specific features (edges, textures, patterns).
import torch.nn as nn
# 3 input channels (RGB), 16 output filters, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
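Passing a batch through this layer shows the channel dimension growing from 3 to 16 while `padding=1` preserves the spatial size (the 32x32 input and batch of 8 are arbitrary examples):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB images
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32]): one feature map per filter
```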
The activation layer applies a non-linearity (most commonly ReLU) after each convolution.
relu = nn.ReLU()
The pooling layer reduces the spatial dimensions, making the network more computationally efficient and providing a degree of translation invariance.
| Pooling Type | Description |
|---|---|
| Max Pooling | Takes the maximum value in each window |
| Average Pooling | Takes the average value in each window |
| Global Average Pooling | Averages each entire feature map to a single value |
# Max pooling with a 2x2 window — reduces spatial dimensions by half
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Global average pooling
gap = nn.AdaptiveAvgPool2d(1)
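Applying both pooling layers to a feature-map tensor shows their effect on the spatial dimensions (the 16x32x32 shape is an assumed example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)            # one sample, 16 feature maps

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)                  # torch.Size([1, 16, 16, 16])

# Global average pooling collapses each map to a single value
gap = nn.AdaptiveAvgPool2d(1)
print(gap(x).shape)                       # torch.Size([1, 16, 1, 1])
```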
After the convolutional and pooling layers extract features, one or more fully-connected layers produce the final classification.
fc = nn.Linear(in_features=256, out_features=10)
import torch
import torch.nn as nn
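Putting the pieces together, a minimal end-to-end sketch of how these layers might be assembled (the 32x32 input size, channel counts, and use of global average pooling are assumptions for illustration, not a definitive architecture):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),    # 3x32x32 -> 16x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),         # -> 16x16x16
            nn.Conv2d(16, 256, kernel_size=3, padding=1),  # -> 256x16x16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # -> 256x1x1
        )
        self.fc = nn.Linear(in_features=256, out_features=num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # 256-dim vector per sample
        return self.fc(x)        # class logits

model = SimpleCNN()
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```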