Evaluating agents is harder than evaluating simple LLM calls. Agents take multiple steps, use tools, and produce intermediate results — so you need metrics that go beyond just "was the final answer correct?" This lesson covers task completion metrics, trajectory evaluation, cost and latency tracking, and benchmarking different agent architectures.
| Aspect | Simple LLM Call | Agent |
|---|---|---|
| Output | Single response | Final answer + full trajectory |
| Steps | 1 | Variable (1–50+) |
| Cost | Predictable | Variable (depends on steps taken) |
| Latency | Predictable | Variable (depends on tool calls, steps) |
| Success criteria | Output quality | Output quality + efficiency + safety |
```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """One agent run: the final answer plus trajectory and resource metadata."""
    task: str
    final_answer: str
    steps_taken: int
    tools_used: list[str]
    total_tokens: int
    total_cost_usd: float
    latency_seconds: float
    success: bool
    error: str | None = None


def task_completion_rate(results: list[AgentResult]) -> float:
    """Fraction (0.0–1.0) of tasks the agent completed successfully."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.success) / len(results)
```