08.03 · Guardrails & Policy Enforcement

Level: Advanced
Pre-reading: 08 · AI Security · 08.01 · Prompt Injection


What Are Guardrails?

Guardrails are validation layers that inspect agent inputs and outputs before they reach the model or trigger actions. They enforce your organisation's policies programmatically instead of trusting the LLM to behave correctly.

```mermaid
graph LR
    A[User input] --> B[Input Guardrail]
    B -->|Pass| C[LLM Agent]
    C --> D[Output Guardrail]
    D -->|Pass| E[Execute action / return response]
    B -->|Fail| F[Reject + log]
    D -->|Fail| G[Reject + re-prompt or escalate]
```

Never assume the LLM will follow your system prompt instructions perfectly. Guardrails are independent validators.
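
The sketch below shows this wrapper shape in Python. `call_agent` and the two check functions are hypothetical placeholders standing in for a real model call and real detectors, not a specific library.

```python
from dataclasses import dataclass

@dataclass
class GuardrailResult:
    passed: bool
    reason: str = ""

def check_input(text: str) -> GuardrailResult:
    # Stand-in for PII / secret / injection checks on the incoming request.
    if "BEGIN PRIVATE KEY" in text:
        return GuardrailResult(False, "secret material in input")
    return GuardrailResult(True)

def check_output(text: str) -> GuardrailResult:
    # Stand-in for diff-scope / secret-in-code checks on the agent's output.
    if "AKIA" in text:  # crude AWS access-key-ID prefix check, illustrative only
        return GuardrailResult(False, "possible hardcoded credential")
    return GuardrailResult(True)

def guarded_run(user_input: str, call_agent) -> str:
    pre = check_input(user_input)
    if not pre.passed:
        raise PermissionError(f"input rejected: {pre.reason}")    # reject + log
    output = call_agent(user_input)
    post = check_output(output)
    if not post.passed:
        raise PermissionError(f"output rejected: {post.reason}")  # reject + re-prompt or escalate
    return output
```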


Input Guardrails

| Guard | What It Checks | Action |
|---|---|---|
| PII detector | JIRA/ticket contains customer PII | Block + ask for anonymised replacement |
| Secret detector | Input contains API key / password pattern | Block immediately + alert security |
| Injection detector | Known prompt injection patterns in text | Block + log as potential attack |
| Scope validator | Ticket references a service the agent can't touch | Block + escalate to human |
| Language filter | Input language match (English-only agents) | Reject non-matching language |
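
As a concrete illustration of the secret and injection detector rows, here is a hedged sketch using regular expressions. The patterns are deliberately simple examples, not a production rule set:

```python
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                        # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|password)\s*[:=]\s*\S+"),
]
INJECTION_PATTERNS = [
    re.compile(r"(?i)ignore (all )?(previous|prior) instructions"),
    re.compile(r"(?i)disregard your system prompt"),
]

def input_guardrails(text: str) -> list[str]:
    """Return violation labels; an empty list means the input passes."""
    violations = []
    if any(p.search(text) for p in SECRET_PATTERNS):
        violations.append("secret_detected")       # block immediately + alert security
    if any(p.search(text) for p in INJECTION_PATTERNS):
        violations.append("injection_suspected")   # block + log as potential attack
    return violations
```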

Output Guardrails

| Guard | What It Checks | Action |
|---|---|---|
| Diff scope validator | Code changes outside allowed service directories | Reject + re-prompt with stricter constraints |
| Secret output detector | Generated code contains hardcoded secrets | Reject immediately |
| Diff size limiter | Diff > N lines (threshold for review) | Route to human review instead of auto-PR |
| Build validator | Code change compiles successfully | Reject if compilation fails |
| Security linter | OWASP-category issues in generated code (via SpotBugs, SonarQube) | Flag for human review |
| Test coverage | New code has adequate test coverage | Reject if < threshold |
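
A hedged sketch of the diff scope validator and diff size limiter, assuming the agent's change arrives as a unified git diff string; the allowed directories and the 100-line threshold mirror the policy file shown later in this section:

```python
ALLOWED_DIRS = ("order-service/", "notification-service/")  # from the policy config
MAX_DIFF_LINES = 100

def validate_diff(diff: str) -> list[str]:
    """Check a unified git diff against scope and size limits; return violations."""
    violations = []
    touched_files = [
        line.split(" b/", 1)[1]
        for line in diff.splitlines()
        if line.startswith("diff --git ")
    ]
    if any(not f.startswith(ALLOWED_DIRS) for f in touched_files):
        violations.append("out_of_scope_change")  # reject + re-prompt with stricter constraints
    changed_lines = sum(
        1 for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    if changed_lines > MAX_DIFF_LINES:
        violations.append("diff_too_large")       # route to human review instead of auto-PR
    return violations
```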

Guardrails AI Framework

Guardrails AI is a Python library for defining structured validators on LLM outputs:

```mermaid
graph LR
    A[LLM Output: code diff] --> B[Rail: no_secrets validator]
    B --> C[Rail: valid_json_diff validator]
    C --> D[Rail: max_files_changed: 5]
    D --> E[Rail: no_test_deletion]
    E --> F[Validated output ready to apply]
```

Each validator returns pass, fix (an automatic correction), or fail (reject and re-prompt). Chain multiple validators into a pipeline.
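
The snippet below is not the Guardrails AI API itself; it is a small hand-rolled sketch of the same pass / fix / fail semantics, showing how such a pipeline composes. Consult the library's documentation for its actual validator interface.

```python
from typing import Callable

# A validator returns ("pass", value), ("fix", corrected_value) or ("fail", reason).
Validator = Callable[[str], tuple[str, str]]

def run_pipeline(value: str, validators: list[Validator]) -> str:
    for validate in validators:
        status, result = validate(value)
        if status == "fix":
            value = result                                    # automatic correction, keep going
        elif status == "fail":
            raise ValueError(f"validation failed: {result}")  # caller re-prompts the LLM
    return value

def no_secrets(value: str) -> tuple[str, str]:
    return ("fail", "hardcoded secret") if "AKIA" in value else ("pass", value)

def strip_trailing_whitespace(value: str) -> tuple[str, str]:
    fixed = "\n".join(line.rstrip() for line in value.splitlines())
    return ("pass", value) if fixed == value else ("fix", fixed)

diff_text = "+int total = items.size();   "   # stand-in for an LLM-generated diff
validated = run_pipeline(diff_text, [no_secrets, strip_trailing_whitespace])
```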


Human Escalation Tiers

| Condition | Escalation Path |
|---|---|
| Injection attempt detected | Security team alert (email + Slack) |
| Diff scope violation | Tech lead for the ticket's team |
| Compilation failure after 3 retries | Ticket author + agent team lead |
| Cross-service change detected | Architect sign-off required |
| Sensitive file access (secrets, config) | Security team + DevOps |
| Cost anomaly (> 10x normal token usage) | Platform engineering alert |
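
A hedged sketch of routing these conditions to their escalation targets; the channel names and the `notify` callback are hypothetical:

```python
# Violation label -> escalation targets; channel and address names are made up.
ESCALATION_ROUTES = {
    "injection_suspected":   ["security@example.com", "#security-alerts"],
    "out_of_scope_change":   ["#team-tech-leads"],
    "compile_failed_3x":     ["ticket-author", "#agent-team"],
    "cross_service_change":  ["#architecture-review"],
    "sensitive_file_access": ["#security-alerts", "#devops"],
    "cost_anomaly":          ["#platform-engineering"],
}

def escalate(violation: str, context: dict, notify) -> None:
    """Fan a violation out to every configured target via the notify callback."""
    for target in ESCALATION_ROUTES.get(violation, ["#agent-team"]):
        notify(target, violation, context)
```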

Policy as Code

Define agent behaviour policy in a versioned config file, not hardcoded in prompts:

```yaml
# agent-policy.yaml
max_iterations: 20
max_diff_lines: 100
max_files_changed: 10
allowed_service_dirs:
  - order-service/
  - notification-service/
forbidden_file_patterns:
  - "**/*.env"
  - "**/secrets/**"
  - "**/.github/workflows/**"
require_tests: true
min_test_coverage: 0.80
require_human_approval_for_pr: true
auto_merge_enabled: false
```

Load this config at agent startup and use it in both the system prompt construction and output validators.
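
A hedged sketch of that startup wiring using PyYAML; the helper names are illustrative, and `fnmatch` only approximates the `**` glob patterns:

```python
import fnmatch
import yaml  # PyYAML

def load_policy(path: str = "agent-policy.yaml") -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def file_allowed(path: str, policy: dict) -> bool:
    """Check one changed file against the allow-listed dirs and forbidden globs."""
    in_scope = any(path.startswith(d) for d in policy["allowed_service_dirs"])
    forbidden = any(
        fnmatch.fnmatch(path, pattern)
        for pattern in policy["forbidden_file_patterns"]
    )
    return in_scope and not forbidden

def policy_prompt_fragment(policy: dict) -> str:
    """Restate key limits in the system prompt; the output validators still enforce them."""
    return (
        f"Only modify files under: {', '.join(policy['allowed_service_dirs'])}. "
        f"Keep diffs under {policy['max_diff_lines']} lines and change at most "
        f"{policy['max_files_changed']} files."
    )

policy = load_policy()
assert file_allowed("order-service/src/OrderService.java", policy)
assert not file_allowed("order-service/.env", policy)
```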


How do you handle a guardrail that keeps blocking legitimate code changes?

Treat guardrails like application tests — if they emit false positives, tune them rather than disabling them. For the diff size limiter: if legitimate refactors regularly exceed the limit, raise the threshold and add a "large change" label to the resulting PR instead of blocking. Document all guardrail tuning decisions with rationale.