Traditional software tests check for exact outputs. AI applications produce varied, probabilistic outputs — making evaluation fundamentally different. This lesson covers evaluation strategies, metrics, and testing frameworks for AI applications.
```python
# Traditional test — exact match
def test_add():
    assert add(2, 3) == 5  # Always works

# AI test — output varies every time
def test_summarise():
    result = summarise("Long article about AI...")
    assert result == ???  # What do we compare against?
```
AI outputs are non-deterministic — the same input can produce different (but equally valid) outputs. We need new approaches.
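One pragmatic starting point is to assert on *properties* of the output rather than exact strings: length, type, presence of key entities. A minimal sketch, using a hypothetical `summarise` stub in place of a real model call:

```python
# Property-based checks: assert on characteristics of the output,
# not on an exact string. `summarise` is a hypothetical stand-in.
def summarise(text: str) -> str:
    # Stub for illustration; a real implementation would call a model.
    return text.split(".")[0] + "."

def test_summarise_properties():
    article = "AI systems learn patterns from data. They are used widely."
    result = summarise(article)
    assert len(result) < len(article)           # Summary is shorter
    assert isinstance(result, str) and result   # Non-empty string
    assert "AI" in result                       # Keeps the key entity

test_summarise_properties()
```

These checks pass for any reasonable summary, so they tolerate non-determinism while still catching gross failures like empty or runaway outputs.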
Compare outputs against gold-standard reference answers:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "AI uses machine learning to make predictions."
generated = "Artificial intelligence leverages ML for predictions."

scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")  # Unigram-overlap F-measure
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # Longest-common-subsequence F-measure
```

Note that these two sentences share almost no vocabulary despite meaning the same thing, so both scores will be low. That is the key weakness of n-gram metrics: they measure lexical overlap, not semantic equivalence.
| Metric | What It Measures | Best For |
|---|---|---|
| ROUGE | Overlap of n-grams with reference | Summarisation |
| BLEU | Precision of n-grams | Translation |
| Exact Match | Strict string equality | Factual QA |
| F1 | Token-level precision and recall | Extractive QA |
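Token-level F1 from the table above is easy to compute by hand. A minimal sketch using whitespace tokenisation and a set-based overlap (the official SQuAD script additionally normalises punctuation and counts duplicate tokens):

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

Unlike exact match, F1 gives partial credit when the prediction overlaps the reference but differs in word order or includes extra tokens.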
Use a powerful LLM to evaluate the output of another model:
```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, reference: str = "") -> dict:
    judge_prompt = f"""Rate the following answer on a scale of 1-5 for each criterion.
Return JSON with scores and brief justifications.

Question: {question}
Answer: {answer}
{"Reference answer: " + reference if reference else ""}

Criteria:
- relevance: Does the answer address the question?
- accuracy: Is the information factually correct?
- completeness: Does it cover all key points?
- clarity: Is it well-written and easy to understand?"""
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a strong model as judge
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return json.loads(response.choices[0].message.content)

scores = llm_judge(
    question="What is RAG?",
    answer="RAG stands for Retrieval-Augmented Generation...",
)
print(scores)
```
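A single judged example tells you little; judge scores become useful when aggregated across a test set. A minimal sketch for averaging per-criterion scores, assuming each judge response is a dict of numeric scores as the prompt above requests (non-numeric fields such as justifications are skipped):

```python
from collections import defaultdict

def aggregate_scores(all_scores: list[dict]) -> dict:
    """Average each numeric criterion across many judged examples."""
    totals: dict[str, float] = defaultdict(float)
    for scores in all_scores:
        for criterion, value in scores.items():
            if isinstance(value, (int, float)):
                totals[criterion] += value
    return {c: t / len(all_scores) for c, t in totals.items()}

print(aggregate_scores([
    {"relevance": 5, "accuracy": 4},
    {"relevance": 4, "accuracy": 4},
]))  # {'relevance': 4.5, 'accuracy': 4.0}
```

Tracking these averages over time lets you detect regressions when you change prompts or models.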
Design evaluations around the specific task:
```python
# For classification tasks
def evaluate_classifier(test_set: list[dict]) -> dict:
    correct = 0
    total = len(test_set)
    for item in test_set:
        prediction = classify(item["text"])
        if prediction == item["expected_label"]:
            correct += 1
    return {"accuracy": correct / total, "correct": correct, "total": total}
```
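The test set for this pattern is just a list of labelled examples. A self-contained sketch of the same accuracy computation, with a hypothetical keyword classifier standing in for a model call:

```python
def classify(text: str) -> str:
    # Hypothetical stand-in for a model-backed classifier.
    return "positive" if "great" in text.lower() else "negative"

test_set = [
    {"text": "This product is great!", "expected_label": "positive"},
    {"text": "Terrible experience.", "expected_label": "negative"},
]

correct = sum(
    classify(item["text"]) == item["expected_label"] for item in test_set
)
print(f"Accuracy: {correct / len(test_set):.2f}")  # Accuracy: 1.00
```

For a real model the same loop applies unchanged; only `classify` swaps in an API call, and you would typically also track per-label precision and recall rather than accuracy alone.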