Convolutional Neural Networks (CNNs) are a class of deep neural networks designed to process structured grid data, most commonly images. CNNs automatically learn spatial hierarchies of features — from low-level edges and textures to high-level objects and scenes.
A standard fully-connected (dense) network treats each pixel as an independent input. For a 224x224 RGB image, that means 224 * 224 * 3 = 150,528 input features — and with even one hidden layer of 1,000 neurons, the number of parameters explodes to over 150 million. This is impractical and ignores the spatial structure of images.
CNNs solve this by exploiting three key ideas:
| Principle | Description |
|---|---|
| Local connectivity | Each neuron connects to only a small local region of the input |
| Parameter sharing | The same filter (kernel) is applied across the entire input |
| Translation invariance | A feature learned in one part of the image can be detected anywhere |
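The parameter savings from local connectivity and sharing are easy to verify directly. A minimal sketch comparing a dense layer on the flattened 224x224 RGB image from above against a single convolutional layer (the 16-filter count is an arbitrary choice for illustration):

```python
import torch.nn as nn

# Dense: every one of the 150,528 inputs connects to each of 1,000 neurons
dense = nn.Linear(224 * 224 * 3, 1000)
dense_params = sum(p.numel() for p in dense.parameters())

# Convolution: one shared 3x3x3 filter per output channel, reused at every position
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())

print(dense_params)  # 150529000 (weights + biases)
print(conv_params)   # 448 = 16 * (3 * 3 * 3) + 16
```

The convolutional layer uses roughly 300,000x fewer parameters, regardless of the input's spatial size.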
A convolution slides a small filter (kernel) across the input, computing the element-wise product and summing the result at each position to produce a feature map (also called an activation map).
Input (5x5)       Filter (3x3)     Output (3x3)
1 1 1 0 0
0 1 1 1 0         1 0 1            4 3 4
0 0 1 1 1    *    0 1 0      =     2 4 3
0 0 1 1 0         1 0 1            2 3 4
0 1 1 0 0
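The sliding-window computation can be reproduced in a few lines of plain Python (strictly speaking this is cross-correlation, which is what deep-learning libraries call "convolution"):

```python
def conv2d(image, kernel):
    """Valid (no-padding), stride-1 2D cross-correlation on square inputs."""
    k = len(kernel)
    out = len(image) - k + 1
    return [
        [
            # Element-wise product of the k x k window with the kernel, summed
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out)
        ]
        for r in range(out)
    ]

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

print(conv2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```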
| Parameter | Description |
|---|---|
| Kernel size | The dimensions of the filter (e.g., 3x3, 5x5) |
| Stride | How many pixels the filter moves at each step |
| Padding | Zeros added around the input to control output size |
| Number of filters | How many different filters (and thus feature maps) to learn |
output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
Example: Input = 32, Kernel = 3, Padding = 1, Stride = 1 → Output = (32 - 3 + 2) / 1 + 1 = 32
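The formula can be wrapped in a small helper and cross-checked against PyTorch's actual output shapes (the layer sizes here are just for illustration):

```python
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    # Floor division: window positions that don't fully fit are dropped
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))  # 32

# Cross-check with a real layer on a 1x3x32x32 batch
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```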
The convolutional layer learns filters that detect specific features (edges, textures, patterns).
import torch.nn as nn
# 3 input channels (RGB), 16 output filters, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
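Passing a batch through this layer shows the channel dimension growing from 3 to 16 while `padding=1` preserves the spatial size (the 32x32 input and batch of 8 are arbitrary examples):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)   # batch of 8 RGB images
y = conv(x)
print(y.shape)  # torch.Size([8, 16, 32, 32]): one feature map per filter
```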
The activation layer applies a non-linearity (most commonly ReLU) after each convolution.
relu = nn.ReLU()
The pooling layer reduces the spatial dimensions, making the network more computationally efficient and providing a degree of translation invariance.
| Pooling Type | Description |
|---|---|
| Max Pooling | Takes the maximum value in each window |
| Average Pooling | Takes the average value in each window |
| Global Average Pooling | Averages each entire feature map to a single value |
# Max pooling with a 2x2 window — reduces spatial dimensions by half
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Global average pooling
gap = nn.AdaptiveAvgPool2d(1)
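Applying both pooling layers to a feature-map tensor shows their effect on the spatial dimensions (the 16x32x32 shape is an assumed example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)            # one sample, 16 feature maps

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)                  # torch.Size([1, 16, 16, 16])

# Global average pooling collapses each map to a single value
gap = nn.AdaptiveAvgPool2d(1)
print(gap(x).shape)                       # torch.Size([1, 16, 1, 1])
```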
After the convolutional and pooling layers extract features, one or more fully-connected layers produce the final classification.
fc = nn.Linear(in_features=256, out_features=10)
import torch
import torch.nn as nn
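Putting the pieces together, a minimal end-to-end sketch of how these layers might be assembled (the 32x32 input size, channel counts, and use of global average pooling are assumptions for illustration, not a definitive architecture):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),    # 3x32x32 -> 16x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),         # -> 16x16x16
            nn.Conv2d(16, 256, kernel_size=3, padding=1),  # -> 256x16x16
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # -> 256x1x1
        )
        self.fc = nn.Linear(in_features=256, out_features=num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # 256-dim vector per sample
        return self.fc(x)        # class logits

model = SimpleCNN()
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```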