Text classification is the task of assigning a predefined category (or label) to a piece of text. It is one of the most fundamental and widely used NLP tasks, powering applications from spam filtering to content moderation.
| Type | Description | Example |
|---|---|---|
| Binary | Two classes | Spam vs Not Spam |
| Multi-class | Three or more classes (one label per document) | News category: Sports, Politics, Business, Tech |
| Multi-label | Multiple labels per document | A movie tagged as both "Action" and "Comedy" |
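In the multi-label case, the usual first step is to turn each document's tag list into a binary indicator matrix, one column per label. A minimal sketch using scikit-learn's `MultiLabelBinarizer` (the movie tags here are invented for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each document may carry several labels at once
tags = [["Action", "Comedy"], ["Drama"], ["Action", "Drama"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary column per label, alphabetical order

print(mlb.classes_)  # ['Action' 'Comedy' 'Drama']
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
```

The resulting matrix `Y` can then be fed to any estimator that supports multi-label targets (or wrapped with `OneVsRestClassifier`).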
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption that features are conditionally independent given the class.
| Variant | Best For |
|---|---|
| MultinomialNB | Word counts / TF-IDF (most common for text) |
| BernoulliNB | Binary features (word present or absent) |
| GaussianNB | Continuous features |
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset
texts = [
    "free money winner lottery", "claim your prize now",
    "meeting tomorrow at 10am", "project deadline next week",
    "congratulations you won", "buy cheap products now",
    "quarterly report attached", "lunch with team on Friday",
    "win a free iPhone today", "please review the document",
]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

# stratify keeps both classes in the tiny 3-example test split,
# which classification_report needs when given two target names
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Build pipeline: TF-IDF features followed by Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))
```
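Beyond hard predictions, MultinomialNB can also report per-class probabilities via `predict_proba`. A self-contained sketch, refitting on a few of the toy messages above so it runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "free money winner lottery", "claim your prize now",
    "meeting tomorrow at 10am", "project deadline next week",
]
labels = [1, 1, 0, 0]  # 1=spam, 0=ham

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)

# predict_proba returns P(class | document), columns ordered
# according to pipeline.classes_
proba = pipeline.predict_proba(["claim your free prize"])
print(pipeline.classes_)  # [0 1]
print(proba.round(3))
```

Since every token in the query appears only in spam training messages, the spam column dominates.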
Logistic regression is a strong baseline for text classification and is often surprisingly competitive with more complex models.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2f}")
```
SVMs with a linear kernel are particularly effective for high-dimensional text data.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('clf', LinearSVC()),
])
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.2f}")
```
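One caveat: `LinearSVC` exposes only `decision_function` margins, not `predict_proba`. When probability estimates are needed, a common pattern is wrapping it in `CalibratedClassifierCV`. A sketch on the toy spam data so it runs standalone (`cv=3` is an arbitrary choice for such a tiny set):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "free money winner lottery", "claim your prize now",
    "congratulations you won", "buy cheap products now",
    "win a free iPhone today",
    "meeting tomorrow at 10am", "project deadline next week",
    "quarterly report attached", "lunch with team on Friday",
    "please review the document",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1=spam, 0=ham

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Fits LinearSVC on cross-validation folds, then calibrates
    # its margins into probabilities
    ("clf", CalibratedClassifierCV(LinearSVC(), cv=3)),
])
pipeline.fit(texts, labels)

proba = pipeline.predict_proba(["win free money now"])
print(proba.round(3))
```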