Sequence-to-sequence (Seq2Seq) models transform an input sequence into an output sequence, where the two sequences can have different lengths. They are the foundation of machine translation, text summarisation, question answering, and many other NLP tasks.
A Seq2Seq model has two components:

| Component | Role |
|---|---|
| Encoder | Reads the input sequence and compresses it into a fixed-size context vector |
| Decoder | Takes the context vector and generates the output sequence one token at a time |
Typical Seq2Seq tasks include:

| Task | Input | Output |
|---|---|---|
| Machine translation | "The cat sat on the mat" | "Le chat s'est assis sur le tapis" |
| Text summarisation | Long article | Short summary |
| Question answering | Context + question | Answer |
| Chatbots | User message | Bot response |
| Text-to-SQL | "Show all users" | "SELECT * FROM users" |
| Code generation | Natural language description | Code snippet |
The original Seq2Seq model (Sutskever et al., 2014) uses two RNNs:
```
Encoder: [The] → [cat] → [sat] → context_vector
Decoder: context_vector → [Le] → [chat] → [s'est] → [assis]
```
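As a minimal sketch of this two-RNN setup (the layer sizes, vocabulary sizes, and the greedy decoding loop below are illustrative assumptions, not the original paper's code):

```python
import torch
import torch.nn as nn

# Vanilla Seq2Seq: the encoder's final hidden state is the only thing
# the decoder ever sees (the "context vector").
src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1000, 64, 128  # assumed sizes

enc_embed = nn.Embedding(src_vocab, embed_dim)
enc_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
dec_embed = nn.Embedding(tgt_vocab, embed_dim)
dec_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
dec_out = nn.Linear(hidden_dim, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 6))      # stand-in for "The cat sat on the mat"
_, context = enc_rnn(enc_embed(src))           # context vector: (1, 1, hidden_dim)

# Greedy decoding: feed each predicted token back in, starting from a <sos> id (1 here).
token, hidden = torch.tensor([[1]]), context
for _ in range(10):
    out, hidden = dec_rnn(dec_embed(token), hidden)
    token = dec_out(out).argmax(dim=-1)        # id of the next output token
```

Note that the decoder's only view of the source sentence is the single `context` tensor, which is exactly the weakness discussed next.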
The basic Seq2Seq model compresses the entire input into a single fixed-size context vector. For long sequences this vector cannot hold all of the relevant information, so details are lost. This is known as the bottleneck problem, and it gets worse as the input grows:
| Input Length | Problem |
|---|---|
| Short (< 20 tokens) | Works reasonably well |
| Medium (20–50 tokens) | Noticeable quality degradation |
| Long (50+ tokens) | Severe information loss |
The attention mechanism (Bahdanau et al., 2014) solves the bottleneck problem by allowing the decoder to look at all encoder hidden states — not just the final one — and focus on the most relevant parts for each output step.
Common ways to compute the alignment score between a decoder state and an encoder state:

| Type | Formula | Description |
|---|---|---|
| Additive (Bahdanau) | score = v^T * tanh(W1 * h_enc + W2 * h_dec) | Learned alignment with a feed-forward network |
| Multiplicative (Luong) | score = h_dec^T * W * h_enc | Dot product with a weight matrix |
| Dot-product | score = h_dec^T * h_enc | Simple dot product (no learnable weights) |
| Scaled dot-product | score = (h_dec^T * h_enc) / sqrt(d) | Dot product scaled by dimension size (used in Transformers) |
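The table above maps directly onto a few lines of PyTorch. This is a hedged sketch for a single decoder step and a single encoder state; `hidden_dim` and the unbatched vector shapes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

hidden_dim = 128                          # assumed size
h_enc = torch.randn(hidden_dim)           # one encoder hidden state
h_dec = torch.randn(hidden_dim)           # current decoder hidden state

# Additive (Bahdanau): a small feed-forward network scores the pair.
W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
v = nn.Linear(hidden_dim, 1, bias=False)
score_additive = v(torch.tanh(W1(h_enc) + W2(h_dec)))

# Multiplicative (Luong): dot product through a learned weight matrix.
W = nn.Linear(hidden_dim, hidden_dim, bias=False)
score_multiplicative = h_dec @ W(h_enc)

# Dot-product and scaled dot-product: no learnable weights at all.
score_dot = h_dec @ h_enc
score_scaled = (h_dec @ h_enc) / hidden_dim ** 0.5
```

In practice the scores are computed for every encoder position at once and passed through a softmax to give the attention weights.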
A bidirectional GRU encoder in PyTorch:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        # Combine forward and backward hidden states
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        hidden = torch.tanh(self.fc(hidden))
        return outputs, hidden.unsqueeze(0)
```
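The preview cuts off before the decoder. The sketch below is one plausible way to complete the model with Bahdanau-style additive attention over the bidirectional encoder outputs; the class names, layer sizes, and shapes are assumptions, not the lesson's official code:

```python
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Encoder outputs are hidden_dim * 2 wide because the encoder is bidirectional.
        self.attn = nn.Linear(hidden_dim * 2 + hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: (1, batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim * 2)
        src_len = encoder_outputs.size(1)
        hidden = hidden.permute(1, 0, 2).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        return torch.softmax(self.v(energy).squeeze(2), dim=1)       # (batch, src_len)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(hidden_dim)
        self.rnn = nn.GRU(embed_dim + hidden_dim * 2, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, encoder_outputs):
        embedded = self.embedding(token)                              # (batch, 1, embed_dim)
        weights = self.attention(hidden, encoder_outputs)             # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs)    # (batch, 1, hidden_dim * 2)
        output, hidden = self.rnn(torch.cat((embedded, context), dim=2), hidden)
        return self.fc(output.squeeze(1)), hidden
```

Because the attention weights are recomputed at every decoding step, each output token can focus on a different part of the source sentence instead of relying on a single fixed context vector.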