Wavelength
Glossary

Plain-language definitions — no jargon for jargon's sake.

Evals (Evaluations)

Systematic tests that measure how well an AI system performs on specific tasks — the AI equivalent of a test suite, used to catch regressions and compare models.

Think of evals as the test suite for a non-deterministic system: you define inputs, expected behaviors, and scoring criteria, then run your model against them to get a quantified performance score. Swap a model, change a prompt, or update your RAG pipeline, and evals tell you whether things got better, worse, or broke entirely.
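
To make "better, worse, or broke entirely" concrete, here is a minimal sketch of comparing two eval runs case by case. Everything in it is an assumption for illustration: the `find_regressions` helper, the case ids, and the scores are hypothetical and not part of any particular eval library.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return ids of cases whose score dropped by more than `tolerance`.

    A case missing from the candidate run is treated as scoring 0.0,
    so silently dropped cases show up as regressions too.
    """
    return sorted(
        case_id
        for case_id, old_score in baseline.items()
        if candidate.get(case_id, 0.0) < old_score - tolerance
    )


def mean(scores: dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


# Hypothetical per-case scores before and after a prompt change.
baseline = {"refund-policy": 1.0, "date-parsing": 1.0, "empty-input": 0.0}
candidate = {"refund-policy": 1.0, "date-parsing": 0.0, "empty-input": 1.0}

regressions = find_regressions(baseline, candidate)
print("mean before:", mean(baseline))
print("mean after: ", mean(candidate))
print("regressed cases:", regressions)
```

In this made-up run the average score is identical before and after the change, yet one case went from passing to failing. That is the argument for keeping per-case results rather than tracking a single aggregate number.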

This is the most underinvested area in AI development, and it's not close. Teams will spend weeks on prompt engineering and retrieval tuning, then evaluate the results by running a few examples and eyeballing the output. That's not engineering — that's shipping on vibes. Without evals, you have no idea if Tuesday's prompt change regressed the 15% of edge cases you stopped manually testing. You find out when a customer does.

A solid eval framework has three components: a dataset of representative inputs, a task that runs those inputs through your system, and scorers that grade the outputs. Scorers can be deterministic (exact match, regex, JSON schema validation) or model-graded (using an LLM to judge quality, relevance, or factual accuracy against a rubric). Model-graded evals are particularly valuable for open-ended generation where there's no single right answer — exactly the cases where hallucination risk is highest and guardrails alone aren't sufficient.
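
The three components can be sketched in a few lines of Python. This is a toy harness under stated assumptions, not a reference implementation: `Case`, `run_eval`, the scorer names, and `toy_task` (a lookup table standing in for a real model call) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

# A scorer takes (output, expected) and returns a score between 0 and 1.
Scorer = Callable[[str, str], float]


@dataclass
class Case:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 only if the output equals the expected text."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def matches_regex(pattern: str) -> Scorer:
    """Build a deterministic scorer that checks the output against a regex."""
    compiled = re.compile(pattern)

    def scorer(output: str, expected: str) -> float:
        return 1.0 if compiled.search(output) else 0.0

    return scorer


def run_eval(
    dataset: list[Case],
    task: Callable[[str], str],
    scorers: dict[str, Scorer],
) -> dict[str, float]:
    """Run every case through the task and return the mean score per scorer."""
    if not dataset:
        raise ValueError("dataset must contain at least one case")
    totals = {name: 0.0 for name in scorers}
    for case in dataset:
        output = task(case.input)
        for name, scorer in scorers.items():
            totals[name] += scorer(output, case.expected)
    return {name: total / len(dataset) for name, total in totals.items()}


def toy_task(text: str) -> str:
    """Stand-in for your real system (assumption: replace with a model call)."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(text, "unknown")


dataset = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("3*3", "9"),  # the toy task does not know this one
]

results = run_eval(
    dataset,
    toy_task,
    {"exact_match": exact_match, "one_word": matches_regex(r"^\w+$")},
)
print(results)
```

A model-graded scorer would plug into the same `Scorer` signature: instead of comparing strings, it would send the output and a rubric to an LLM and convert the judgment into a number. That part is left out here because it depends on which model API you use.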

The gap between a demo and a product is measurement. Evals are how you close it.
