Recurrent Neural Networks are designed to process sequential data — data where the order matters. While CNNs are ideal for spatial data like images, RNNs excel at temporal and sequential data such as text, time series, audio, and video frames.
Standard feedforward networks treat each input independently. They have no memory of previous inputs. But many real-world problems involve sequences where context from earlier elements influences later ones:
| Domain | Sequential Data |
|---|---|
| NLP | Words in a sentence, characters in a word |
| Time Series | Stock prices, sensor readings, weather data |
| Speech | Audio waveforms, phoneme sequences |
| Music | Note sequences, chord progressions |
| Video | Frames in a video sequence |
A Recurrent Neural Network maintains a hidden state that acts as memory. At each time step, the network takes the current input and the previous hidden state, producing a new hidden state and (optionally) an output.
```
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y
```
Where:

- h_t is the hidden state at time step t
- x_t is the input at time step t
- W_hh, W_xh, W_hy are weight matrices
- b_h, b_y are biases

```python
import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=10,     # Size of each input element
    hidden_size=64,    # Size of the hidden state
    num_layers=2,      # Number of stacked RNN layers
    batch_first=True,  # Input shape: (batch, seq_len, features)
)

# Input: batch of 32 sequences, each with 50 time steps, 10 features
x = torch.randn(32, 50, 10)
output, h_n = rnn(x)  # output: (32, 50, 64), h_n: (2, 32, 64)
```
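To make the recurrence concrete, here is a minimal sketch of the update equation applied step by step. The weight tensors here are drawn at random purely to illustrate the shapes; nn.RNN learns and stores its own parameters internally.

```python
import torch

input_size, hidden_size = 10, 64
W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

x = torch.randn(32, 50, 10)        # (batch, seq_len, features)
h = torch.zeros(32, hidden_size)   # initial hidden state

for t in range(x.size(1)):         # walk the sequence one step at a time
    x_t = x[:, t, :]               # current input: (32, 10)
    # h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
    h = torch.tanh(x_t @ W_xh.T + h @ W_hh.T + b_h)
```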
Vanilla RNNs struggle with long-range dependencies. During backpropagation through time (BPTT), the gradient is multiplied by the recurrent weight matrix (and the tanh derivative) at every time step. If those factors are consistently smaller than 1 in magnitude, the gradient shrinks exponentially with sequence length; this is the vanishing gradient problem.
| Sequence Length | Gradient Behaviour | Learning |
|---|---|---|
| Short (5–10 steps) | Gradients remain usable | RNN learns well |
| Medium (20–50 steps) | Gradients start to vanish | Degraded learning |
| Long (100+ steps) | Gradients effectively zero | RNN cannot learn long-range dependencies |
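A rough way to observe this empirically is to define a loss only on the final output and check how much gradient reaches the very first input element. The exact numbers depend on the random initialisation, but the decay with sequence length is typically sharp. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=64, batch_first=True)

for seq_len in (10, 50, 200):
    x = torch.randn(1, seq_len, 10, requires_grad=True)
    output, _ = rnn(x)
    loss = output[:, -1].sum()   # loss depends only on the final time step
    loss.backward()
    # Gradient that survives the trip back to the first input element
    print(seq_len, x.grad[0, 0].abs().mean().item())
```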
This motivated the development of gated architectures: LSTM and GRU.
The LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state — a separate memory pathway — and three gates that control the flow of information:
| Gate | Purpose |
|---|---|
| Forget gate | Decides what information to discard from the cell state |
| Input gate | Decides what new information to store in the cell state |
| Output gate | Decides what information from the cell state to output as the hidden state |
```
f_t     = sigmoid(W_f * [h_{t-1}, x_t] + b_f)   # Forget gate
i_t     = sigmoid(W_i * [h_{t-1}, x_t] + b_i)   # Input gate
c_hat_t = tanh(W_c * [h_{t-1}, x_t] + b_c)      # Candidate cell state
c_t     = f_t * c_{t-1} + i_t * c_hat_t         # New cell state
o_t     = sigmoid(W_o * [h_{t-1}, x_t] + b_o)   # Output gate
h_t     = o_t * tanh(c_t)                       # New hidden state
```
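Read alongside the equations, a single LSTM step looks roughly like the sketch below. The separate nn.Linear layers and names are illustrative only; nn.LSTM fuses these projections into one matrix multiply for efficiency.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 10, 64
W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate cell state
W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

def lstm_step(x_t, h_prev, c_prev):
    combined = torch.cat([h_prev, x_t], dim=1)  # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f(combined))
    i_t = torch.sigmoid(W_i(combined))
    c_hat_t = torch.tanh(W_c(combined))
    c_t = f_t * c_prev + i_t * c_hat_t          # keep some old memory, add some new
    o_t = torch.sigmoid(W_o(combined))
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t
```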
```python
lstm = nn.LSTM(
    input_size=10,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
    dropout=0.2,  # Dropout between LSTM layers
)

x = torch.randn(32, 50, 10)
output, (h_n, c_n) = lstm(x)  # Also returns cell state c_n
```
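As a usage note, a common pattern (shown here with a hypothetical classification head, continuing from the example above) is to treat the final hidden state of the top layer as a fixed-size summary of the whole sequence:

```python
num_classes = 5                     # hypothetical number of target classes
classifier = nn.Linear(64, num_classes)

last_hidden = h_n[-1]               # (32, 64): final hidden state of the top layer
logits = classifier(last_hidden)    # (32, num_classes)
```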
The GRU (Cho et al., 2014) is a simplified variant of the LSTM with only two gates and no separate cell state. It often performs comparably to LSTM with fewer parameters.
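PyTorch's nn.GRU exposes the same interface as nn.RNN, with no separate cell state, so the earlier example carries over directly:

```python
gru = nn.GRU(
    input_size=10,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
)

x = torch.randn(32, 50, 10)
output, h_n = gru(x)  # output: (32, 50, 64), h_n: (2, 32, 64); no cell state
```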