Machine learning models work with numbers, not words. To apply ML to text, we must convert words, sentences, and documents into numerical representations — a process called text representation or text vectorisation.
The quality of your text representation directly affects model performance. A good representation captures semantic meaning, while a poor one treats every word as an independent, unrelated symbol.
| Representation | Captures Meaning? | Dimensionality | Use Case |
|---|---|---|---|
| Bag of Words | No | High (sparse) | Simple classification |
| TF-IDF | Partially | High (sparse) | Information retrieval, classification |
| Word Embeddings | Yes | Low (dense) | Most modern NLP tasks |
| Contextual Embeddings | Yes (context-aware) | Low (dense) | State-of-the-art NLP |
Bag of Words (BoW) is the simplest text representation: each document becomes a vector of word counts. Word order is ignored; only frequency matters.
Given a two-document corpus:

- Doc 1: "the cat sat on the mat"
- Doc 2: "the dog sat on the log"
The vocabulary is: [cat, dog, log, mat, on, sat, the]
| Document | cat | dog | log | mat | on | sat | the |
|---|---|---|---|---|---|---|---|
| Doc 1 | 1 | 0 | 0 | 1 | 1 | 1 | 2 |
| Doc 2 | 0 | 1 | 1 | 0 | 1 | 1 | 2 |
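The counts in this table can be reproduced by hand (a minimal sketch; the `vocabulary` list is simply the sorted word set from above):

```python
from collections import Counter

vocabulary = ["cat", "dog", "log", "mat", "on", "sat", "the"]

# Count each document's words and read them off in vocabulary order
for doc in ["the cat sat on the mat", "the dog sat on the log"]:
    counts = Counter(doc.split())
    print([counts[word] for word in vocabulary])
# [1, 0, 0, 1, 1, 1, 2]
# [0, 1, 1, 0, 1, 1, 2]
```

scikit-learn's CountVectorizer does the same, returning a sparse matrix: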
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Learn the vocabulary and produce the document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print("Matrix:\n", X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```
TF-IDF improves on BoW by weighting words by their importance. Words that appear frequently in one document but rarely across all documents get a higher score.
| Component | Formula | Meaning |
|---|---|---|
| TF (Term Frequency) | count(term, doc) / total_terms(doc) | How often the term appears in this document |
| IDF (Inverse Document Frequency) | log(N / df(term)) | How rare the term is; N is the number of documents and df(term) is the number of documents containing the term |
| TF-IDF | TF × IDF | Importance of the term to this document |
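To make the formulas concrete, here is a hand-rolled computation on the three-document corpus used below. This is a sketch for illustration only: the `tf`, `idf`, and `tf_idf` helpers are our own, and scikit-learn's TfidfVectorizer layers IDF smoothing and L2 normalisation on top, so its values differ slightly.

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

docs = [doc.split() for doc in corpus]
N = len(docs)  # number of documents

def tf(term, doc):
    # Term frequency: count(term, doc) / total_terms(doc)
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / df(term))
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "cat" appears once in the 6-word first document and in 2 of 3 documents:
# TF = 1/6 ≈ 0.167, IDF = log(3/2) ≈ 0.405, TF-IDF ≈ 0.068
print(round(tf_idf("cat", docs[0]), 3))  # 0.068

# "the" appears in every document, so IDF = log(3/3) = 0 and TF-IDF = 0
print(round(tf_idf("the", docs[0]), 3))  # 0.0
```

scikit-learn's TfidfVectorizer automates this on the same corpus: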
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Learn the vocabulary and IDF weights, then produce the weighted matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", X.toarray().round(2))
```
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sample data: a tiny sentiment corpus
texts = [
    "I love this movie", "Great film, really enjoyed it",
    "Terrible movie, waste of time", "Awful film, very boring",
    "Excellent acting and plot", "Poor storyline and bad acting",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorise the training texts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train the classifier on the TF-IDF features
clf = LogisticRegression()
clf.fit(X, labels)

# Predict: reuse the fitted vectoriser (transform, not fit_transform)
new_text = vectorizer.transform(["The acting was wonderful"])
print(clf.predict(new_text))  # Output: [1]
```
Tip: TF-IDF is a strong baseline for many text classification tasks. It often outperforms more complex methods on small datasets.
N-grams recover some word-order information by treating sequences of n consecutive words as single features: the bigrams (2-grams) of "the cat sat" are "the cat" and "cat sat".
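Both CountVectorizer and TfidfVectorizer accept an `ngram_range` parameter. A minimal sketch extracting unigrams and bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds two-word sequences
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["the cat sat on the mat"])

print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']
```

Bigram features like "sat on" let a classifier distinguish phrases that identical unigram counts would conflate, at the cost of a much larger vocabulary.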