AI Evals

Reproducible test suites that measure LLM output quality across model, prompt and code changes.

Definition
Evals are the unit tests of AI systems. You define a labeled dataset and one or more scoring functions (exact match, rubric grading, LLM-as-judge), then run the whole suite on every change. Because the dataset and scorers stay fixed, a drop in score points to a regression introduced by swapping models, tweaking prompts or upgrading a tool.
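A minimal sketch of that loop in Python may make the idea concrete. Everything named here (EvalCase, exact_match, run_eval, fake_model, the three sample cases and the 0.6 threshold) is hypothetical and chosen for illustration; fake_model stands in for a real LLM call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One labeled example: an input prompt and the expected answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the expected label."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over the labeled cases."""
    if not cases:
        raise ValueError("eval dataset is empty")
    total = sum(scorer(model(case.prompt), case.expected) for case in cases)
    return total / len(cases)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (assumption, not a real API)."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
    return answers.get(prompt, "I don't know")


# Tiny illustrative dataset; a real suite would load saved, reviewed cases.
CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

# Assumed pass threshold; in practice this is often the last known-good score.
THRESHOLD = 0.6

score = run_eval(fake_model, CASES)
passed = score >= THRESHOLD
print(f"score={score:.2f} passed={passed}")
```

Swapping exact_match for a rubric or LLM-as-judge scorer only requires another function with the same (output, expected) signature, which is one reason to keep the scorer a parameter.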
Example
Before promoting a new system prompt, a team runs 200 saved customer questions through both the old and new prompt and compares helpfulness, accuracy and refusal rates side by side.
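The side-by-side comparison could be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: old_prompt_model and new_prompt_model are hypothetical stubs for the same model run under each system prompt, accuracy is approximated by a substring check, refusals are detected with a small keyword list, and helpfulness is left out because it would normally need rubric or LLM-as-judge grading.

```python
from statistics import mean
from typing import Callable

# Assumed refusal phrases; a real suite would use a more robust detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_refusal(answer: str) -> bool:
    """Flag answers that contain one of the assumed refusal phrases."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def summarize(answers: list[str], expected: list[str]) -> dict[str, float]:
    """Compute mean accuracy (substring match) and refusal rate."""
    hits = [1.0 if e.lower() in a.lower() else 0.0
            for a, e in zip(answers, expected)]
    refusals = [1.0 if is_refusal(a) else 0.0 for a in answers]
    return {"accuracy": mean(hits), "refusal_rate": mean(refusals)}


def compare(
    old: Callable[[str], str],
    new: Callable[[str], str],
    questions: list[str],
    expected: list[str],
) -> dict[str, dict[str, float]]:
    """Run both variants on the same questions and report per-metric deltas."""
    old_stats = summarize([old(q) for q in questions], expected)
    new_stats = summarize([new(q) for q in questions], expected)
    return {
        name: {
            "old": old_stats[name],
            "new": new_stats[name],
            "delta": new_stats[name] - old_stats[name],
        }
        for name in old_stats
    }


# Hypothetical saved questions with the key phrase a correct answer contains.
QUESTIONS = ["How do I reset my password?", "What is the refund window?"]
EXPECTED = ["settings", "30 days"]


def old_prompt_model(question: str) -> str:
    """Stub for the model under the old system prompt (assumption)."""
    canned = {
        QUESTIONS[0]: "I cannot help with that.",
        QUESTIONS[1]: "Refunds are accepted within 30 days.",
    }
    return canned[question]


def new_prompt_model(question: str) -> str:
    """Stub for the model under the new system prompt (assumption)."""
    canned = {
        QUESTIONS[0]: "Open Settings and choose Reset password.",
        QUESTIONS[1]: "Refunds are accepted within 30 days.",
    }
    return canned[question]


report = compare(old_prompt_model, new_prompt_model, QUESTIONS, EXPECTED)
for metric, row in report.items():
    print(metric, row)
```

Running both variants over the identical question list is the important design choice: any difference in the report then comes from the prompt change rather than from a different sample of inputs.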