Traditional software tests check for exact outputs. AI applications produce varied, probabilistic outputs — making evaluation fundamentally different. This lesson covers evaluation strategies, metrics, and testing frameworks for AI applications.
```python
# Traditional test — exact match
def test_add():
    assert add(2, 3) == 5  # Always works

# AI test — output varies every time
def test_summarise():
    result = summarise("Long article about AI...")
    assert result == ???  # What do we compare against?
```
AI outputs are non-deterministic — the same input can produce different (but equally valid) outputs. We need new approaches.
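One pragmatic starting point is to assert on *properties* of the output rather than exact strings: length, type, presence of key entities. A minimal sketch, using a hypothetical `summarise` stub in place of a real model call:

```python
# Property-based checks: assert on characteristics of the output,
# not on an exact string. `summarise` is a hypothetical stand-in.
def summarise(text: str) -> str:
    # Stub for illustration; a real implementation would call a model.
    return text.split(".")[0] + "."

def test_summarise_properties():
    article = "AI systems learn patterns from data. They are used widely."
    result = summarise(article)
    assert len(result) < len(article)           # Summary is shorter
    assert isinstance(result, str) and result   # Non-empty string
    assert "AI" in result                       # Keeps the key entity

test_summarise_properties()
```

These checks pass for any reasonable summary, so they tolerate non-determinism while still catching gross failures like empty or runaway outputs.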
Compare outputs against gold-standard reference answers:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "AI uses machine learning to make predictions."
generated = "Artificial intelligence leverages ML for predictions."

scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")  # Unigram-overlap F-measure
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # Longest-common-subsequence F-measure
```

Note that these two sentences share almost no vocabulary despite meaning the same thing, so both scores will be low. That is the key weakness of n-gram metrics: they measure lexical overlap, not semantic equivalence.
| Metric | What It Measures | Best For |
|---|---|---|
| ROUGE | Overlap of n-grams with reference | Summarisation |
| BLEU | Precision of n-grams | Translation |
| Exact Match | Strict string equality | Factual QA |
| F1 | Token-level precision and recall | Extractive QA |
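Token-level F1 from the table above is easy to compute by hand. A minimal sketch using whitespace tokenisation and a set-based overlap (the official SQuAD script additionally normalises punctuation and counts duplicate tokens):

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

Unlike exact match, F1 gives partial credit when the prediction overlaps the reference but differs in word order or includes extra tokens.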
Use a powerful LLM to evaluate the output of another model:
```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge(question: str, answer: str, reference: str = "") -> dict:
    judge_prompt = f"""Rate the following answer on a scale of 1-5 for each criterion.
Return JSON with scores and brief justifications.

Question: {question}
Answer: {answer}
{"Reference answer: " + reference if reference else ""}

Criteria:
- relevance: Does the answer address the question?
- accuracy: Is the information factually correct?
- completeness: Does it cover all key points?
- clarity: Is it well-written and easy to understand?"""
    response = client.chat.completions.create(
        model="gpt-4o",  # Use a strong model as judge
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return json.loads(response.choices[0].message.content)

scores = llm_judge(
    question="What is RAG?",
    answer="RAG stands for Retrieval-Augmented Generation...",
)
print(scores)
```
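A single judged example tells you little; judge scores become useful when aggregated across a test set. A minimal sketch for averaging per-criterion scores, assuming each judge response is a dict of numeric scores as the prompt above requests (non-numeric fields such as justifications are skipped):

```python
from collections import defaultdict

def aggregate_scores(all_scores: list[dict]) -> dict:
    """Average each numeric criterion across many judged examples."""
    totals: dict[str, float] = defaultdict(float)
    for scores in all_scores:
        for criterion, value in scores.items():
            if isinstance(value, (int, float)):
                totals[criterion] += value
    return {c: t / len(all_scores) for c, t in totals.items()}

print(aggregate_scores([
    {"relevance": 5, "accuracy": 4},
    {"relevance": 4, "accuracy": 4},
]))  # {'relevance': 4.5, 'accuracy': 4.0}
```

Tracking these averages over time lets you detect regressions when you change prompts or models.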
Design evaluations around the specific task:
```python
# For classification tasks
def evaluate_classifier(test_set: list[dict]) -> dict:
    correct = 0
    total = len(test_set)
    for item in test_set:
        prediction = classify(item["text"])
        if prediction == item["expected_label"]:
            correct += 1
    return {"accuracy": correct / total, "correct": correct, "total": total}
```
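The test set for this pattern is just a list of labelled examples. A self-contained sketch of the same accuracy computation, with a hypothetical keyword classifier standing in for a model call:

```python
def classify(text: str) -> str:
    # Hypothetical stand-in for a model-backed classifier.
    return "positive" if "great" in text.lower() else "negative"

test_set = [
    {"text": "This product is great!", "expected_label": "positive"},
    {"text": "Terrible experience.", "expected_label": "negative"},
]

correct = sum(
    classify(item["text"]) == item["expected_label"] for item in test_set
)
print(f"Accuracy: {correct / len(test_set):.2f}")  # Accuracy: 1.00
```

For a real model the same loop applies unchanged; only `classify` swaps in an API call, and you would typically also track per-label precision and recall rather than accuracy alone.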