Guardrails

Programmable safety controls that constrain what an AI system can say, do, and access — preventing off-topic responses, harmful outputs, and data leakage.

Guardrails are programmable constraints that control what an AI system can say, do, and access in production. They sit at the boundary between the model and the outside world: they filter inputs before the model runs, validate outputs before users see them, and limit which tools or data the model can touch at all.

Input guardrails catch prompt injections, off-topic requests, and toxic content. Output guardrails check for PII leakage, business-rule violations, and hallucinated claims. Behavioral guardrails govern what the model can actually do — which functions it can invoke via function calling, which records it can query, which actions it can take without human approval.
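
The three layers described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the regex patterns, the tool names (search_docs, get_order_status, issue_refund), and the Verdict type are all hypothetical, and real deployments usually combine pattern checks like these with trained classifiers.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; regexes alone are easy to evade.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
]
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical tool names: some are always allowed, some need a human sign-off.
ALLOWED_TOOLS = {"search_docs", "get_order_status"}
NEEDS_APPROVAL = {"issue_refund"}


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_text: str) -> Verdict:
    """Input guardrail: reject text that matches a known injection pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            return Verdict(False, "possible prompt injection")
    return Verdict(True)


def redact_output(model_text: str) -> str:
    """Output guardrail: mask email addresses before the reply reaches the user."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", model_text)


def check_tool_call(tool_name: str, human_approved: bool = False) -> Verdict:
    """Behavioral guardrail: allowlist tools and gate sensitive ones on approval."""
    if tool_name in ALLOWED_TOOLS:
        return Verdict(True)
    if tool_name in NEEDS_APPROVAL:
        if human_approved:
            return Verdict(True)
        return Verdict(False, "requires human approval")
    return Verdict(False, "tool not on allowlist")
```

Each function is independent, so a system can adopt one layer at a time. The tool check denies by default: anything not explicitly listed is refused.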

Treating guardrails as optional polish is one of the more common failure modes in production deployments. For any customer-facing system, they're table stakes: the same category as authentication and input validation, not a nice-to-have. The OWASP Top 10 for LLM Applications lays out the risk landscape clearly, and guardrails directly address several of its entries, including prompt injection and sensitive information disclosure.

Guardrails don't replace evals. Evals tell you how the system performs across a distribution of inputs; guardrails catch individual failures before they reach users. You need both. Start with guardrails on the output layer and work backwards toward the input. It's faster to ship something safe than to retrofit safety into a system that has already broken users' trust.
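
One way to see the difference between the two is to reuse a single check in both roles. The sketch below assumes a hypothetical policy check (violates_policy), a fallback message, and a generate callable standing in for the model. None of these come from a specific library.

```python
from typing import Callable, Iterable

# Hypothetical fallback text shown when a reply is blocked.
FALLBACK = "Sorry, I can't help with that request."


def violates_policy(text: str) -> bool:
    """Hypothetical business rule: replies must never promise guaranteed results."""
    return "guaranteed" in text.lower()


def guarded_reply(generate: Callable[[str], str], prompt: str) -> str:
    """Guardrail: check one reply at request time and block it if it fails."""
    reply = generate(prompt)
    return FALLBACK if violates_policy(reply) else reply


def eval_violation_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Eval: measure how often the raw model fails across a set of prompts."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    violations = sum(violates_policy(generate(p)) for p in prompt_list)
    return violations / len(prompt_list)
```

The eval returns a rate you can track across releases. The guardrail returns a safe reply for the one request in front of it. Wrapping only the output, as guarded_reply does, needs no change to prompts or tools, which is part of why the output layer is a practical place to start.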
