Building a machine learning model is only half the job — you also need to evaluate how well it performs and validate that it will generalise to new, unseen data. Choosing the right evaluation metrics and validation strategies is critical to building reliable ML systems.
A model that appears to perform well on training data may fail completely on new data. Proper evaluation answers two questions: how accurate are the model's predictions, and will that accuracy hold up on data the model has never seen?
The confusion matrix is the foundation of classification evaluation. For a binary classifier:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Compare true labels against model predictions.
# Note: scikit-learn orders rows/columns by sorted label value,
# so row 0 is the actual-negative class.
cm = confusion_matrix(y_test, y_pred)

# Render the matrix as a heatmap
disp = ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
```
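The four cell counts map directly onto the metric formulas in the table below. For a binary problem with labels 0 and 1, `confusion_matrix` returns the cells in a fixed order, so they can be unpacked in one line (a minimal sketch, reusing `y_test` and `y_pred` from above):

```python
from sklearn.metrics import confusion_matrix

# For binary labels [0, 1], ravel() flattens the 2x2 matrix
# row by row: [[TN, FP], [FN, TP]] -> tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```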
| Metric | Formula | Description | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Proportion of correct predictions | Balanced classes |
| Precision | TP / (TP + FP) | Of all positive predictions, how many are correct | When false positives are costly (e.g., spam filter) |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found | When false negatives are costly (e.g., disease detection) |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall | Imbalanced classes |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many were correct | When false positives matter |
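To make the formulas concrete, here is a small worked example. The counts (TP = 40, FP = 10, FN = 20, TN = 30) are invented purely for illustration:

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the formulas
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)                 # 70 / 100 = 0.700
precision = tp / (tp + fp)                                 # 40 / 50  = 0.800
recall = tp / (tp + fn)                                    # 40 / 60  ~ 0.667
f1 = 2 * precision * recall / (precision + recall)         # ~ 0.727
specificity = tn / (tn + fp)                               # 30 / 40  = 0.750

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}  Specificity={specificity:.3f}")
```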
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report
)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, y_pred))
```
Consider a dataset with 95% negative and 5% positive samples. A model that always predicts "negative" achieves 95% accuracy but fails to detect any positive cases. In such imbalanced scenarios, precision, recall, and F1 score are more informative.
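You can reproduce this failure mode directly with scikit-learn's `DummyClassifier`, which here always predicts the majority class (a minimal sketch on synthetic labels; `zero_division=0` makes the undefined precision report as 0 instead of raising a warning):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic imbalanced labels: 95% negative, 5% positive
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to a majority-class baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy:  {accuracy_score(y, y_pred):.2f}")  # 0.95 - looks great
print(f"Recall:    {recall_score(y, y_pred):.2f}")    # 0.00 - finds no positives
print(f"Precision: {precision_score(y, y_pred, zero_division=0):.2f}")  # 0.00
```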
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.
```python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot the curve against the random-guess diagonal
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
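If you only need the AUC value and not the plot, scikit-learn exposes it directly via `roc_auc_score`. And because `roc_curve` returns the thresholds, you can also pick an operating point from the curve; maximising Youden's J statistic (TPR − FPR) is one common heuristic (a sketch reusing `fpr`, `tpr`, and `thresholds` from the snippet above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Same value as auc(fpr, tpr) above, computed directly from probabilities
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Choose the threshold that maximises Youden's J statistic (TPR - FPR)
best = np.argmax(tpr - fpr)
print(f"Threshold maximising Youden's J: {thresholds[best]:.3f}")
```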
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.5 | Random guessing |