Machine learning models work with numbers, not words. To apply ML to text, we must convert words, sentences, and documents into numerical representations — a process called text representation or text vectorisation.
The quality of your text representation directly affects model performance. A good representation captures semantic meaning, while a poor one treats every word as an independent, unrelated symbol.
| Representation | Captures Meaning? | Dimensionality | Use Case |
|---|---|---|---|
| Bag of Words | No | High (sparse) | Simple classification |
| TF-IDF | Partially | High (sparse) | Information retrieval, classification |
| Word Embeddings | Yes | Low (dense) | Most modern NLP tasks |
| Contextual Embeddings | Yes (context-aware) | Low (dense) | State-of-the-art NLP |
Bag of Words (BoW) is the simplest text representation: each document becomes a vector of word counts. Word order is ignored; only frequency matters.
Given a two-document corpus:

- Doc 1: "the cat sat on the mat"
- Doc 2: "the dog sat on the log"
The vocabulary is: [cat, dog, log, mat, on, sat, the]
| Document | cat | dog | log | mat | on | sat | the |
|---|---|---|---|---|---|---|---|
| Doc 1 | 1 | 0 | 0 | 1 | 1 | 1 | 2 |
| Doc 2 | 0 | 1 | 1 | 0 | 1 | 1 | 2 |
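The counts in this table can be reproduced by hand (a minimal sketch; the `vocabulary` list is simply the sorted word set from above):

```python
from collections import Counter

vocabulary = ["cat", "dog", "log", "mat", "on", "sat", "the"]

# Count each document's words and read them off in vocabulary order
for doc in ["the cat sat on the mat", "the dog sat on the log"]:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
# [1, 0, 0, 1, 1, 1, 2]
# [0, 1, 1, 0, 1, 1, 2]
```

scikit-learn's CountVectorizer does the same, returning a sparse matrix: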
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary and produce the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print("Matrix:\n", X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```
TF-IDF improves on BoW by weighting words by their importance. Words that appear frequently in one document but rarely across all documents get a higher score.
| Component | Formula | Meaning |
|---|---|---|
| TF (Term Frequency) | count(term, doc) / total_terms(doc) | How often the term appears in this document |
| IDF (Inverse Document Frequency) | log(N / df(term)) | How rare the term is; N is the number of documents and df(term) is the number of documents containing the term |
| TF-IDF | TF × IDF | Importance of the term to this document |
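To make the formulas concrete, here is a hand-rolled computation on the three-document corpus used below. This is a sketch for illustration only: the `tf`, `idf`, and `tf_idf` helpers are our own, and scikit-learn's TfidfVectorizer layers IDF smoothing and L2 normalisation on top, so its values differ slightly.

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

docs = [doc.split() for doc in corpus]
N = len(docs)  # number of documents

def tf(term, doc):
    # Term frequency: count(term, doc) / total_terms(doc)
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / df(term))
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "cat" appears once in the 6-word first document and in 2 of 3 documents:
# TF = 1/6 ≈ 0.167, IDF = log(3/2) ≈ 0.405, TF-IDF ≈ 0.068
print(round(tf_idf("cat", docs[0]), 3))  # 0.068

# "the" appears in every document, so IDF = log(3/3) = 0 and TF-IDF = 0
print(round(tf_idf("the", docs[0]), 3))  # 0.0
```

scikit-learn's TfidfVectorizer automates this on the same corpus: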
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Learn the vocabulary and IDF weights, then produce the weighted matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", X.toarray().round(2))
```
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sample data: a tiny sentiment corpus
texts = [
    "I love this movie", "Great film, really enjoyed it",
    "Terrible movie, waste of time", "Awful film, very boring",
    "Excellent acting and plot", "Poor storyline and bad acting",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorise the training texts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train the classifier on the TF-IDF features
clf = LogisticRegression()
clf.fit(X, labels)

# Predict: reuse the fitted vectoriser (transform, not fit_transform)
new_text = vectorizer.transform(["The acting was wonderful"])
print(clf.predict(new_text))  # Output: [1]
```
Tip: TF-IDF is a strong baseline for many text classification tasks. It often outperforms more complex methods on small datasets.
N-grams recover some word-order information by treating sequences of n consecutive words as single features: the bigrams (2-grams) of "the cat sat" are "the cat" and "cat sat".
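Both CountVectorizer and TfidfVectorizer accept an `ngram_range` parameter. A minimal sketch extracting unigrams and bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds two-word sequences
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["the cat sat on the mat"])

print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
```

Bigram features like "sat on" let a classifier distinguish phrases that identical unigram counts would conflate, at the cost of a much larger vocabulary.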