A language model is a probabilistic model that assigns a probability to a sequence of words. Language models answer the question: "How likely is this sequence of words?" They are the foundation of many NLP applications — from autocomplete and spell checking to machine translation and text generation.
Given a sequence of words w1, w2, ..., wn, a language model estimates:
P(w1, w2, ..., wn) — the probability of the entire sequence
Equivalently, via the chain rule, this decomposes into next-word predictions:
P(wn | w1, w2, ..., wn-1) — the next-word prediction
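Multiplying the next-word probabilities recovers the joint probability:

P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × ... × P(wn | w1, w2, ..., wn-1)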
| Application | How Language Models Help |
|---|---|
| Autocomplete | Predicts the most likely next word |
| Spell checking | "teh cat" is less probable than "the cat" |
| Speech recognition | Disambiguates similar-sounding words using context |
| Machine translation | Scores fluency of candidate translations |
| Text generation | Generates coherent text word by word |
N-gram models are the simplest language models. An n-gram model approximates the probability of a word using only the previous n-1 words (the Markov assumption).
| Model | Context | Example |
|---|---|---|
| Unigram | No context | P(wn) |
| Bigram | Previous 1 word | P(wn \| wn-1) |
| Trigram | Previous 2 words | P(wn \| wn-2, wn-1) |
| 4-gram | Previous 3 words | P(wn \| wn-3, wn-2, wn-1) |
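For example, under the bigram assumption a three-word sentence factors as:

P(the cat sat) ≈ P(the) × P(cat | the) × P(sat | cat)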
```python
from collections import Counter, defaultdict

# Training corpus
corpus = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "the dog sat on the log",
]

# Build bigram counts: bigram_counts[prev_word][word] = frequency
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        bigram_counts[tokens[i]][tokens[i + 1]] += 1

# Maximum-likelihood bigram probability: count(prev, word) / count(prev, *)
def bigram_prob(word, prev_word):
    total = sum(bigram_counts[prev_word].values())
    if total == 0:
        return 0.0
    return bigram_counts[prev_word][word] / total

print(f"P(cat | the) = {bigram_prob('cat', 'the'):.3f}")
print(f"P(dog | the) = {bigram_prob('dog', 'the'):.3f}")
print(f"P(sat | cat) = {bigram_prob('sat', 'cat'):.3f}")
```
```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Training data: pre-tokenized sentences
text = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'cat', 'ate', 'the', 'mouse'],
    ['the', 'dog', 'sat', 'on', 'the', 'log'],
]

# Build a trigram model: pad each sentence and extract all n-grams up to order 3
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, text)
model = MLE(n)
model.fit(train_data, padded_sents)

# Generate 10 words of text
generated = model.generate(10, random_seed=42)
print(' '.join(generated))
```
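Note that padded_everygram_pipeline wraps each sentence in `<s>` and `</s>` boundary symbols, so the generated output may include these markers; with a corpus this small, the model will largely reproduce fragments of the training sentences.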
N-gram models assign zero probability to unseen n-grams. Smoothing redistributes probability mass to avoid this.
| Technique | Description |
|---|---|
| Laplace (Add-1) Smoothing | Add 1 to every n-gram count |
| Add-k Smoothing | Add a fraction k (e.g., 0.01) instead of 1 |
| Backoff | Fall back to shorter n-grams if the longer one is unseen |
| Interpolation | Weighted combination of unigram, bigram, and trigram probabilities |
| Kneser-Ney Smoothing | The most effective n-gram smoothing in practice — combines absolute discounting with lower-order continuation probabilities |
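As a minimal sketch of the first technique, here is the bigram estimator with Laplace smoothing added (rebuilt to be self-contained; the vocabulary is assumed to be simply the set of words seen in the corpus):

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "the dog sat on the log",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        bigram_counts[tokens[i]][tokens[i + 1]] += 1

# Vocabulary: every distinct word seen in the corpus
vocab = {word for sentence in corpus for word in sentence.split()}

def laplace_bigram_prob(word, prev_word):
    # Add 1 to the bigram count and |V| to the denominator,
    # so every vocabulary word gets a nonzero probability
    total = sum(bigram_counts[prev_word].values())
    return (bigram_counts[prev_word][word] + 1) / (total + len(vocab))

# "sat" never follows "the" in the corpus, yet now gets nonzero probability
print(f"P(sat | the) = {laplace_bigram_prob('sat', 'the'):.3f}")  # 1/15 ≈ 0.067
print(f"P(cat | the) = {laplace_bigram_prob('cat', 'the'):.3f}")  # 3/15 = 0.200
```

The price of smoothing is visible in the second line: P(cat | the) drops from 0.333 to 0.200 because probability mass has been shifted to unseen bigrams.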
Perplexity measures how well a language model predicts a sample of text: it is the inverse probability of the test sequence, normalized by its length — PP(W) = P(w1, w2, ..., wn)^(-1/n). Lower perplexity means a better model.
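As a sketch, nltk can compute perplexity directly on the trigram model trained above. The held-out sentence here is illustrative, and it is deliberately drawn from the training data: an unsmoothed MLE model assigns zero probability to any unseen trigram, which makes perplexity infinite.

```python
from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends

# Score a sentence with the trigram MLE model trained earlier.
# Pad it the same way the training data was padded, then extract trigrams.
test_sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
padded = list(pad_both_ends(test_sentence, n=3))
test_trigrams = list(ngrams(padded, 3))
print(f"Perplexity: {model.perplexity(test_trigrams):.2f}")
```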