Recurrent Neural Networks are designed to process sequential data — data where the order matters. While CNNs are ideal for spatial data like images, RNNs excel at temporal and sequential data such as text, time series, audio, and video frames.
Standard feedforward networks treat each input independently. They have no memory of previous inputs. But many real-world problems involve sequences where context from earlier elements influences later ones:
| Domain | Sequential Data |
|---|---|
| NLP | Words in a sentence, characters in a word |
| Time Series | Stock prices, sensor readings, weather data |
| Speech | Audio waveforms, phoneme sequences |
| Music | Note sequences, chord progressions |
| Video | Frames in a video sequence |
A Recurrent Neural Network maintains a hidden state that acts as memory. At each time step, the network takes the current input and the previous hidden state, producing a new hidden state and (optionally) an output.
```
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y
```
Where:

- h_t is the hidden state at time step t
- x_t is the input at time step t
- W_hh, W_xh, W_hy are weight matrices
- b_h, b_y are biases

```python
import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=10,     # Size of each input element
    hidden_size=64,    # Size of the hidden state
    num_layers=2,      # Number of stacked RNN layers
    batch_first=True,  # Input shape: (batch, seq_len, features)
)

# Input: batch of 32 sequences, each with 50 time steps, 10 features
x = torch.randn(32, 50, 10)
output, h_n = rnn(x)  # output: (32, 50, 64), h_n: (2, 32, 64)
```
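To make the recurrence concrete, here is a minimal sketch of the update equation applied step by step. The weight tensors here are drawn at random purely to illustrate the shapes; nn.RNN learns and stores its own parameters internally.

```python
import torch

input_size, hidden_size = 10, 64
W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

x = torch.randn(32, 50, 10)        # (batch, seq_len, features)
h = torch.zeros(32, hidden_size)   # initial hidden state

for t in range(x.size(1)):         # walk the sequence one step at a time
    x_t = x[:, t, :]               # current input: (32, 10)
    # h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
    h = torch.tanh(x_t @ W_xh.T + h @ W_hh.T + b_h)
```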
Vanilla RNNs struggle with long-range dependencies. During backpropagation through time (BPTT), the gradient is multiplied by the recurrent weight matrix (and the tanh derivative) at every time step. If those factors are consistently smaller than 1 in magnitude, the gradient shrinks exponentially with sequence length; this is the vanishing gradient problem.
| Sequence Length | Gradient Behaviour | Learning |
|---|---|---|
| Short (5–10 steps) | Gradients remain usable | RNN learns well |
| Medium (20–50 steps) | Gradients start to vanish | Degraded learning |
| Long (100+ steps) | Gradients effectively zero | RNN cannot learn long-range dependencies |
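A rough way to observe this empirically is to define a loss only on the final output and check how much gradient reaches the very first input element. The exact numbers depend on the random initialisation, but the decay with sequence length is typically sharp. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=64, batch_first=True)

for seq_len in (10, 50, 200):
    x = torch.randn(1, seq_len, 10, requires_grad=True)
    output, _ = rnn(x)
    loss = output[:, -1].sum()   # loss depends only on the final time step
    loss.backward()
    # Gradient that survives the trip back to the first input element
    print(seq_len, x.grad[0, 0].abs().mean().item())
```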
This motivated the development of gated architectures: LSTM and GRU.
The LSTM (Hochreiter and Schmidhuber, 1997) introduces a cell state — a separate memory pathway — and three gates that control the flow of information:
| Gate | Purpose |
|---|---|
| Forget gate | Decides what information to discard from the cell state |
| Input gate | Decides what new information to store in the cell state |
| Output gate | Decides what information from the cell state to output as the hidden state |
```
f_t     = sigmoid(W_f * [h_{t-1}, x_t] + b_f)   # Forget gate
i_t     = sigmoid(W_i * [h_{t-1}, x_t] + b_i)   # Input gate
c_hat_t = tanh(W_c * [h_{t-1}, x_t] + b_c)      # Candidate cell state
c_t     = f_t * c_{t-1} + i_t * c_hat_t         # New cell state
o_t     = sigmoid(W_o * [h_{t-1}, x_t] + b_o)   # Output gate
h_t     = o_t * tanh(c_t)                       # New hidden state
```
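Read alongside the equations, a single LSTM step looks roughly like the sketch below. The separate nn.Linear layers and names are illustrative only; nn.LSTM fuses these projections into one matrix multiply for efficiency.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 10, 64
W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate cell state
W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

def lstm_step(x_t, h_prev, c_prev):
    combined = torch.cat([h_prev, x_t], dim=1)  # [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f(combined))
    i_t = torch.sigmoid(W_i(combined))
    c_hat_t = torch.tanh(W_c(combined))
    c_t = f_t * c_prev + i_t * c_hat_t          # keep some old memory, add some new
    o_t = torch.sigmoid(W_o(combined))
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t
```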
```python
lstm = nn.LSTM(
    input_size=10,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
    dropout=0.2,  # Dropout between LSTM layers
)

x = torch.randn(32, 50, 10)
output, (h_n, c_n) = lstm(x)  # Also returns cell state c_n
```
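As a usage note, a common pattern (shown here with a hypothetical classification head, continuing from the example above) is to treat the final hidden state of the top layer as a fixed-size summary of the whole sequence:

```python
num_classes = 5                     # hypothetical number of target classes
classifier = nn.Linear(64, num_classes)

last_hidden = h_n[-1]               # (32, 64): final hidden state of the top layer
logits = classifier(last_hidden)    # (32, num_classes)
```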
The GRU (Cho et al., 2014) is a simplified variant of the LSTM with only two gates and no separate cell state. It often performs comparably to LSTM with fewer parameters.
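PyTorch's nn.GRU exposes the same interface as nn.RNN, with no separate cell state, so the earlier example carries over directly:

```python
gru = nn.GRU(
    input_size=10,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
)

x = torch.randn(32, 50, 10)
output, h_n = gru(x)  # output: (32, 50, 64), h_n: (2, 32, 64); no cell state
```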