Text classification is the task of assigning a predefined category (or label) to a piece of text. It is one of the most fundamental and widely used NLP tasks, powering applications from spam filtering to content moderation.
| Type | Description | Example |
|---|---|---|
| Binary | Two classes | Spam vs Not Spam |
| Multi-class | Three or more classes (one label per document) | News category: Sports, Politics, Business, Tech |
| Multi-label | Multiple labels per document | A movie tagged as both "Action" and "Comedy" |
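In the multi-label case, the usual first step is to turn each document's tag list into a binary indicator matrix, one column per label. A minimal sketch using scikit-learn's `MultiLabelBinarizer` (the movie tags here are invented for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each document may carry several labels at once
tags = [["Action", "Comedy"], ["Drama"], ["Action", "Drama"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary column per label, alphabetical order

print(mlb.classes_)  # ['Action' 'Comedy' 'Drama']
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
```

The resulting matrix `Y` can then be fed to any estimator that supports multi-label targets (or wrapped with `OneVsRestClassifier`).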
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption that features are conditionally independent given the class.
| Variant | Best For |
|---|---|
| MultinomialNB | Word counts / TF-IDF (most common for text) |
| BernoulliNB | Binary features (word present or absent) |
| GaussianNB | Continuous features |
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset
texts = [
    "free money winner lottery", "claim your prize now",
    "meeting tomorrow at 10am", "project deadline next week",
    "congratulations you won", "buy cheap products now",
    "quarterly report attached", "lunch with team on Friday",
    "win a free iPhone today", "please review the document",
]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

# stratify keeps both classes in the tiny 3-example test split,
# which classification_report needs when given two target names
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Build pipeline: TF-IDF features followed by Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
```
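Beyond hard predictions, MultinomialNB can also report per-class probabilities via `predict_proba`. A self-contained sketch, refitting on a few of the toy messages above so it runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "free money winner lottery", "claim your prize now",
    "meeting tomorrow at 10am", "project deadline next week",
]
labels = [1, 1, 0, 0]  # 1=spam, 0=ham

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)

# predict_proba returns P(class | document), columns ordered
# according to pipeline.classes_
proba = pipeline.predict_proba(["claim your free prize"])
print(pipeline.classes_)  # [0 1]
print(proba.round(3))
```

Since every token in the query appears only in spam training messages, the spam column dominates.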
Logistic regression is a strong baseline for text classification and is often surprisingly competitive with more complex models.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2f}")
```
SVMs with a linear kernel are particularly effective for high-dimensional text data.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('clf', LinearSVC()),
])
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2f}")
```
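One caveat: `LinearSVC` exposes only `decision_function` margins, not `predict_proba`. When probability estimates are needed, a common pattern is wrapping it in `CalibratedClassifierCV`. A sketch on the toy spam data so it runs standalone (`cv=3` is an arbitrary choice for such a tiny set):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "free money winner lottery", "claim your prize now",
    "congratulations you won", "buy cheap products now",
    "win a free iPhone today",
    "meeting tomorrow at 10am", "project deadline next week",
    "quarterly report attached", "lunch with team on Friday",
    "please review the document",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1=spam, 0=ham

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Fits LinearSVC on cross-validation folds, then calibrates
    # its margins into probabilities
    ("clf", CalibratedClassifierCV(LinearSVC(), cv=3)),
])
pipeline.fit(texts, labels)

proba = pipeline.predict_proba(["win free money now"])
print(proba.round(3))
```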