Building a machine learning model is only half the job — you also need to evaluate how well it performs and validate that it will generalise to new, unseen data. Choosing the right evaluation metrics and validation strategies is critical to building reliable ML systems.
A model that appears to perform well on training data may fail completely on new data. Proper evaluation answers two questions: how accurate are the model's predictions, and will that accuracy hold up on data the model has never seen?
The confusion matrix is the foundation of classification evaluation. For a binary classifier:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Compare true labels against model predictions.
# Note: scikit-learn orders rows/columns by sorted label value,
# so row 0 is the actual-negative class.
cm = confusion_matrix(y_test, y_pred)

# Render the matrix as a heatmap
disp = ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
```
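The four cell counts map directly onto the metric formulas in the table below. For a binary problem with labels 0 and 1, `confusion_matrix` returns the cells in a fixed order, so they can be unpacked in one line (a minimal sketch, reusing `y_test` and `y_pred` from above):

```python
from sklearn.metrics import confusion_matrix

# For binary labels [0, 1], ravel() flattens the 2x2 matrix
# row by row: [[TN, FP], [FN, TP]] -> tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```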
| Metric | Formula | Description | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Proportion of correct predictions | Balanced classes |
| Precision | TP / (TP + FP) | Of all positive predictions, how many are correct | When false positives are costly (e.g., spam filter) |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found | When false negatives are costly (e.g., disease detection) |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall | Imbalanced classes |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many were correct | When false positives matter |
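To make the formulas concrete, here is a small worked example. The counts (TP = 40, FP = 10, FN = 20, TN = 30) are invented purely for illustration:

```python
# Hypothetical confusion-matrix counts, chosen only to illustrate the formulas
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)                 # 70 / 100 = 0.700
precision = tp / (tp + fp)                                 # 40 / 50  = 0.800
recall = tp / (tp + fn)                                    # 40 / 60  ~ 0.667
f1 = 2 * precision * recall / (precision + recall)         # ~ 0.727
specificity = tn / (tn + fp)                               # 30 / 40  = 0.750

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}  Specificity={specificity:.3f}")
```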
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report
)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, y_pred))
```
Consider a dataset with 95% negative and 5% positive samples. A model that always predicts "negative" achieves 95% accuracy but fails to detect any positive cases. In such imbalanced scenarios, precision, recall, and F1 score are more informative.
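You can reproduce this failure mode directly with scikit-learn's `DummyClassifier`, which here always predicts the majority class (a minimal sketch on synthetic labels; `zero_division=0` makes the undefined precision report as 0 instead of raising a warning):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic imbalanced labels: 95% negative, 5% positive
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to a majority-class baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy:  {accuracy_score(y, y_pred):.2f}")  # 0.95 - looks great
print(f"Recall:    {recall_score(y, y_pred):.2f}")    # 0.00 - finds no positives
print(f"Precision: {precision_score(y, y_pred, zero_division=0):.2f}")  # 0.00
```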
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds.
```python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot the curve against the random-guess diagonal
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
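If you only need the AUC value and not the plot, scikit-learn exposes it directly via `roc_auc_score`. And because `roc_curve` returns the thresholds, you can also pick an operating point from the curve; maximising Youden's J statistic (TPR − FPR) is one common heuristic (a sketch reusing `fpr`, `tpr`, and `thresholds` from the snippet above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Same value as auc(fpr, tpr) above, computed directly from probabilities
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Choose the threshold that maximises Youden's J statistic (TPR - FPR)
best = np.argmax(tpr - fpr)
print(f"Threshold maximising Youden's J: {thresholds[best]:.3f}")
```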
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.5 | Random guessing |