Sequence-to-sequence (Seq2Seq) models transform an input sequence into an output sequence, where the two sequences can have different lengths. They are the foundation of machine translation, text summarisation, question answering, and many other NLP tasks.
A Seq2Seq model has two components:

| Component | Role |
|---|---|
| Encoder | Reads the input sequence and compresses it into a fixed-size context vector |
| Decoder | Takes the context vector and generates the output sequence one token at a time |
Typical Seq2Seq tasks include:

| Task | Input | Output |
|---|---|---|
| Machine translation | "The cat sat on the mat" | "Le chat s'est assis sur le tapis" |
| Text summarisation | Long article | Short summary |
| Question answering | Context + question | Answer |
| Chatbots | User message | Bot response |
| Text-to-SQL | "Show all users" | "SELECT * FROM users" |
| Code generation | Natural language description | Code snippet |
The original Seq2Seq model (Sutskever et al., 2014) uses two RNNs:
```
Encoder: [The] → [cat] → [sat] → context_vector
Decoder: context_vector → [Le] → [chat] → [s'est] → [assis]
```
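As a minimal sketch of this two-RNN setup (the layer sizes, vocabulary sizes, and the greedy decoding loop below are illustrative assumptions, not the original paper's code):

```python
import torch
import torch.nn as nn

# Vanilla Seq2Seq: the encoder's final hidden state is the only thing
# the decoder ever sees (the "context vector").
src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1000, 64, 128  # assumed sizes

enc_embed = nn.Embedding(src_vocab, embed_dim)
enc_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
dec_embed = nn.Embedding(tgt_vocab, embed_dim)
dec_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
dec_out = nn.Linear(hidden_dim, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 6))      # stand-in for "The cat sat on the mat"
_, context = enc_rnn(enc_embed(src))           # context vector: (1, 1, hidden_dim)

# Greedy decoding: feed each predicted token back in, starting from a <sos> id (1 here).
token, hidden = torch.tensor([[1]]), context
for _ in range(10):
    out, hidden = dec_rnn(dec_embed(token), hidden)
    token = dec_out(out).argmax(dim=-1)        # id of the next output token
```

Note that the decoder's only view of the source sentence is the single `context` tensor, which is exactly the weakness discussed next.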
The basic Seq2Seq model compresses the entire input into a single fixed-size context vector. For long sequences this vector cannot hold all of the relevant information, so details are lost. This is known as the bottleneck problem, and it gets worse as the input grows:
| Input Length | Problem |
|---|---|
| Short (< 20 tokens) | Works reasonably well |
| Medium (20–50 tokens) | Noticeable quality degradation |
| Long (50+ tokens) | Severe information loss |
The attention mechanism (Bahdanau et al., 2014) solves the bottleneck problem by allowing the decoder to look at all encoder hidden states — not just the final one — and focus on the most relevant parts for each output step.
Common ways to compute the alignment score between a decoder state and an encoder state:

| Type | Formula | Description |
|---|---|---|
| Additive (Bahdanau) | score = v^T * tanh(W1 * h_enc + W2 * h_dec) | Learned alignment with a feed-forward network |
| Multiplicative (Luong) | score = h_dec^T * W * h_enc | Dot product with a weight matrix |
| Dot-product | score = h_dec^T * h_enc | Simple dot product (no learnable weights) |
| Scaled dot-product | score = (h_dec^T * h_enc) / sqrt(d) | Dot product scaled by dimension size (used in Transformers) |
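The table above maps directly onto a few lines of PyTorch. This is a hedged sketch for a single decoder step and a single encoder state; `hidden_dim` and the unbatched vector shapes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

hidden_dim = 128                          # assumed size
h_enc = torch.randn(hidden_dim)           # one encoder hidden state
h_dec = torch.randn(hidden_dim)           # current decoder hidden state

# Additive (Bahdanau): a small feed-forward network scores the pair.
W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
v = nn.Linear(hidden_dim, 1, bias=False)
score_additive = v(torch.tanh(W1(h_enc) + W2(h_dec)))

# Multiplicative (Luong): dot product through a learned weight matrix.
W = nn.Linear(hidden_dim, hidden_dim, bias=False)
score_multiplicative = h_dec @ W(h_enc)

# Dot-product and scaled dot-product: no learnable weights at all.
score_dot = h_dec @ h_enc
score_scaled = (h_dec @ h_enc) / hidden_dim ** 0.5
```

In practice the scores are computed for every encoder position at once and passed through a softmax to give the attention weights.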
A bidirectional GRU encoder in PyTorch:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        # Combine forward and backward hidden states
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        hidden = torch.tanh(self.fc(hidden))
        return outputs, hidden.unsqueeze(0)
```
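The preview cuts off before the decoder. The sketch below is one plausible way to complete the model with Bahdanau-style additive attention over the bidirectional encoder outputs; the class names, layer sizes, and shapes are assumptions, not the lesson's official code:

```python
class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Encoder outputs are hidden_dim * 2 wide because the encoder is bidirectional.
        self.attn = nn.Linear(hidden_dim * 2 + hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: (1, batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim * 2)
        src_len = encoder_outputs.size(1)
        hidden = hidden.permute(1, 0, 2).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        return torch.softmax(self.v(energy).squeeze(2), dim=1)       # (batch, src_len)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(hidden_dim)
        self.rnn = nn.GRU(embed_dim + hidden_dim * 2, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, hidden, encoder_outputs):
        embedded = self.embedding(token)                              # (batch, 1, embed_dim)
        weights = self.attention(hidden, encoder_outputs)             # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs)    # (batch, 1, hidden_dim * 2)
        output, hidden = self.rnn(torch.cat((embedded, context), dim=2), hidden)
        return self.fc(output.squeeze(1)), hidden
```

Because the attention weights are recomputed at every decoding step, each output token can focus on a different part of the source sentence instead of relying on a single fixed context vector.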