A language model is a probabilistic model that assigns a probability to a sequence of words. Language models answer the question: "How likely is this sequence of words?" They are the foundation of many NLP applications — from autocomplete and spell checking to machine translation and text generation.
Given a sequence of words w1, w2, ..., wn, a language model estimates:
P(w1, w2, ..., wn) — the probability of the entire sequence
Equivalently, via the chain rule, this decomposes into next-word predictions:
P(wn | w1, w2, ..., wn-1) — the next-word prediction
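Multiplying the next-word probabilities recovers the joint probability:

P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1, w2, ..., wn-1)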
| Application | How Language Models Help |
|---|---|
| Autocomplete | Predicts the most likely next word |
| Spell checking | "teh cat" is less probable than "the cat" |
| Speech recognition | Disambiguates similar-sounding words using context |
| Machine translation | Scores fluency of candidate translations |
| Text generation | Generates coherent text word by word |
N-gram models are the simplest language models. An n-gram model approximates the probability of a word using only the previous n-1 words (the Markov assumption).
| Model | Context | Example |
|---|---|---|
| Unigram | No context | P(wn) |
| Bigram | Previous 1 word | P(wn \| wn-1) |
| Trigram | Previous 2 words | P(wn \| wn-2, wn-1) |
| 4-gram | Previous 3 words | P(wn \| wn-3, wn-2, wn-1) |
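For example, under the bigram assumption a three-word sentence factors as:

P(the cat sat) ≈ P(the) × P(cat | the) × P(sat | cat)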
```python
from collections import Counter, defaultdict

# Training corpus
corpus = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "the dog sat on the log",
]

# Build bigram counts: bigram_counts[prev_word][word] = frequency
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        bigram_counts[tokens[i]][tokens[i + 1]] += 1

# Maximum-likelihood bigram probability: count(prev, word) / count(prev, *)
def bigram_prob(word, prev_word):
    total = sum(bigram_counts[prev_word].values())
    if total == 0:
        return 0.0
    return bigram_counts[prev_word][word] / total

print(f"P(cat | the) = {bigram_prob('cat', 'the'):.3f}")
print(f"P(dog | the) = {bigram_prob('dog', 'the'):.3f}")
print(f"P(sat | cat) = {bigram_prob('sat', 'cat'):.3f}")
```
```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Training data: pre-tokenized sentences
text = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'cat', 'ate', 'the', 'mouse'],
    ['the', 'dog', 'sat', 'on', 'the', 'log'],
]

# Build a trigram model: pad each sentence and extract all n-grams up to order 3
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, text)
model = MLE(n)
model.fit(train_data, padded_sents)

# Generate 10 words of text
generated = model.generate(10, random_seed=42)
print(' '.join(generated))
```
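Note that padded_everygram_pipeline wraps each sentence in `<s>` and `</s>` boundary symbols, so the generated output may include these markers; with a corpus this small, the model will largely reproduce fragments of the training sentences.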
N-gram models assign zero probability to unseen n-grams. Smoothing redistributes probability mass to avoid this.
| Technique | Description |
|---|---|
| Laplace (Add-1) Smoothing | Add 1 to every n-gram count |
| Add-k Smoothing | Add a fraction k (e.g., 0.01) instead of 1 |
| Backoff | Fall back to shorter n-grams if the longer one is unseen |
| Interpolation | Weighted combination of unigram, bigram, and trigram probabilities |
| Kneser-Ney Smoothing | The most effective n-gram smoothing in practice — combines absolute discounting with lower-order continuation probabilities |
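As a minimal sketch of the first technique, here is the bigram estimator with Laplace smoothing added (rebuilt to be self-contained; the vocabulary is assumed to be simply the set of words seen in the corpus):

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "the dog sat on the log",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        bigram_counts[tokens[i]][tokens[i + 1]] += 1

# Vocabulary: every distinct word seen in the corpus
vocab = {word for sentence in corpus for word in sentence.split()}

def laplace_bigram_prob(word, prev_word):
    # Add 1 to the bigram count and |V| to the denominator,
    # so every vocabulary word gets a nonzero probability
    total = sum(bigram_counts[prev_word].values())
    return (bigram_counts[prev_word][word] + 1) / (total + len(vocab))

# "sat" never follows "the" in the corpus, yet now gets nonzero probability
print(f"P(sat | the) = {laplace_bigram_prob('sat', 'the'):.3f}")  # 1/15 ≈ 0.067
print(f"P(cat | the) = {laplace_bigram_prob('cat', 'the'):.3f}")  # 3/15 = 0.200
```

The price of smoothing is visible in the second line: P(cat | the) drops from 0.333 to 0.200 because probability mass has been shifted to unseen bigrams.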
Perplexity measures how well a language model predicts a sample of text: it is the inverse probability of the test sequence, normalized by its length — PP(W) = P(w1, w2, ..., wn)^(-1/n). Lower perplexity means a better model.
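As a sketch, nltk can compute perplexity directly on the trigram model trained above. The held-out sentence here is illustrative, and it is deliberately drawn from the training data: an unsmoothed MLE model assigns zero probability to any unseen trigram, which makes perplexity infinite.

```python
from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends

# Score a sentence with the trigram MLE model trained earlier.
# Pad it the same way the training data was padded, then extract trigrams.
test_sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
padded = list(pad_both_ends(test_sentence, n=3))
test_trigrams = list(ngrams(padded, 3))
print(f"Perplexity: {model.perplexity(test_trigrams):.2f}")
```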