Agents have the power to take real-world actions — which means they can also cause real-world harm. Without guardrails, an agent might delete critical files, send embarrassing emails, spend thousands of dollars on API calls, or get stuck in an infinite loop. This lesson covers input/output validation, action sandboxing, human-in-the-loop patterns, rate limiting, and preventing runaway agents.
| Risk | Example |
|---|---|
| Data loss | Agent deletes production database records |
| Financial damage | Agent makes unlimited API calls, costing thousands |
| Security breach | Agent exfiltrates sensitive data via a tool call |
| Reputation damage | Agent sends incorrect emails to customers |
| Infinite loops | Agent calls the same tool repeatedly with no progress |
| Prompt injection | Malicious input causes agent to bypass instructions |
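Several of these rows come down to the same missing control: a hard ceiling on what one run may do. Rate limiting and runaway agents are covered in depth later in this lesson; as a minimal sketch of the idea (the `RunGuard` name and its thresholds are invented for illustration, not part of the lesson's later code), a guard can count steps and abort when the agent loops on an identical call:

```python
# Illustrative sketch only: names and thresholds here are assumptions.
class RunGuard:
    """Abort a run that exceeds its step budget or loops on one call."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.last_call = None
        self.repeats = 0

    def check(self, tool_name: str, arguments: dict) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"Run aborted: exceeded {self.max_steps} steps")
        call = (tool_name, repr(sorted(arguments.items())))
        if call == self.last_call:
            self.repeats += 1
            if self.repeats >= self.max_repeats:
                raise RuntimeError(f"Run aborted: {tool_name} repeated {self.repeats} times")
        else:
            self.last_call, self.repeats = call, 0
```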
Validate all inputs before they reach the agent:
```python
from pydantic import BaseModel, validator  # Pydantic v1 style; on v2, use field_validator
import re

class AgentInput(BaseModel):
    task: str
    user_id: str
    max_steps: int = 10

    @validator("task")
    def task_not_empty(cls, v):
        if len(v.strip()) < 5:
            raise ValueError("Task must be at least 5 characters.")
        if len(v) > 5000:
            raise ValueError("Task must be under 5000 characters.")
        return v

    @validator("max_steps")
    def reasonable_step_limit(cls, v):
        if v < 1 or v > 50:
            raise ValueError("max_steps must be between 1 and 50.")
        return v

def detect_injection(text: str) -> bool:
    """Detect common prompt injection patterns."""
    suspicious_patterns = [
        r"ignore (?:all )?previous instructions",
        r"you are now",
        r"system:\s",
        r"forget (?:everything|all|your)",
        r"new persona",
    ]
    for pattern in suspicious_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False
```
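Combining the two checks at the entry point might look like this (a sketch; the sample task and `user_id` are invented):

```python
# Usage sketch: screen raw text first, then let Pydantic enforce field constraints.
raw_task = "Summarise the Q3 sales report and draft a short update email."

if detect_injection(raw_task):
    raise ValueError("Input rejected: possible prompt injection")

validated = AgentInput(task=raw_task, user_id="user-123")  # raises ValidationError on bad fields
```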
Validate what the agent produces before returning it to the user:
```python
class OutputValidator:
    """Validate agent outputs before returning to the user."""

    def __init__(self):
        self.blocked_patterns = [
            r"\b(?:password|secret|api.?key)\s*[:=]\s*\S+",
            r"\b\d{3}-\d{2}-\d{4}\b",  # SSN pattern
            r"\b\d{16}\b",  # Credit card pattern
        ]

    def validate(self, output: str) -> tuple[bool, str]:
        """Returns (is_safe, sanitised_output)."""
        for pattern in self.blocked_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return False, "[Output blocked: contains sensitive information]"
        if len(output) > 50000:
            return False, output[:50000] + "\n[Output truncated]"
        return True, output

# Usage (`agent_result` and `log_blocked_output` are placeholders)
output_validator = OutputValidator()  # renamed to avoid shadowing pydantic's `validator`
is_safe, clean_output = output_validator.validate(agent_result)
if not is_safe:
    log_blocked_output(agent_result)
```
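Two caveats worth noting: these regexes are best-effort (the 16-digit pattern misses card numbers written with spaces or dashes, and will flag any unformatted 16-digit number), and the truncation branch returns `is_safe=False` even though the truncated text is still passed back, so callers that log blocked outputs will also log truncations.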
Restrict which actions an agent can take:
```python
from enum import Enum

class ActionPermission(Enum):
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"
    ADMIN = "admin"

class ActionSandbox:
    """Restrict agent actions based on permission level."""

    TOOL_PERMISSIONS = {
        "web_search": ActionPermission.READ_ONLY,
        "read_file": ActionPermission.READ_ONLY,
        "write_file": ActionPermission.READ_WRITE,
        "delete_file": ActionPermission.ADMIN,
        "run_code": ActionPermission.READ_WRITE,
        "send_email": ActionPermission.ADMIN,
        "database_query": ActionPermission.READ_ONLY,
        "database_write": ActionPermission.READ_WRITE,
        "database_delete": ActionPermission.ADMIN,
    }

    PERMISSION_HIERARCHY = {
        ActionPermission.READ_ONLY: 1,
        ActionPermission.READ_WRITE: 2,
        ActionPermission.ADMIN: 3,
    }

    def __init__(self, user_permission: ActionPermission):
        self.user_level = self.PERMISSION_HIERARCHY[user_permission]

    def is_allowed(self, tool_name: str) -> bool:
        required = self.TOOL_PERMISSIONS.get(tool_name)
        if required is None:
            return False  # Unknown tools are blocked
        return self.user_level >= self.PERMISSION_HIERARCHY[required]

    def filter_tools(self, tools: list[str]) -> list[str]:
        return [t for t in tools if self.is_allowed(t)]

# Usage
sandbox = ActionSandbox(ActionPermission.READ_WRITE)
sandbox.is_allowed("web_search")   # True
sandbox.is_allowed("delete_file")  # False (requires ADMIN)
sandbox.is_allowed("send_email")   # False (requires ADMIN)
```
For high-risk actions, require human approval:
```python
import json

class HumanApprovalGate:
    """Require human approval for specified actions."""

    REQUIRES_APPROVAL = {
        "send_email", "delete_file", "database_delete",
        "make_payment", "deploy_code",
    }

    def check(self, tool_name: str, arguments: dict) -> bool:
        """Returns True if approved, False if rejected."""
        if tool_name not in self.REQUIRES_APPROVAL:
            return True  # Auto-approve safe actions
        print(f"\n{'=' * 60}")
        print("APPROVAL REQUIRED")
        print(f"Action: {tool_name}")
        print(f"Arguments: {json.dumps(arguments, indent=2)}")
        print(f"{'=' * 60}")
        # Block until a human decides; a console prompt is the simplest gate,
        # swap in your own review UI in production.
        response = input("Approve this action? [y/N]: ")
        return response.strip().lower() == "y"
```