Traditional software tests check for exact outputs. AI applications produce varied, probabilistic outputs — making evaluation fundamentally different. This lesson covers evaluation strategies, metrics, and testing frameworks for AI applications.
```python
# Traditional test: exact match
def test_add():
    assert add(2, 3) == 5  # Deterministic: always passes

# AI test: output varies between runs
def test_summarise():
    result = summarise("Long article about AI...")
    assert result == ???  # What do we compare against?
```
AI outputs are non-deterministic — the same input can produce different (but equally valid) outputs. We need new approaches.
Compare outputs against gold-standard reference answers:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "AI uses machine learning to make predictions."
generated = "Artificial intelligence leverages ML for predictions."

scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")
```

Note the caveat: these two sentences mean the same thing, but they share almost no words (after stemming, only "predict"), so the ROUGE scores come out low. N-gram overlap metrics penalize valid paraphrases, which is a core limitation of reference-based evaluation.