Dictionary

AI Evals

Reproducible test suites that measure LLM output quality across model, prompt and code changes.

Definition

Evals are the unit tests of AI systems. You define a labeled dataset and one or more scoring functions (exact match, rubric grading, LLM-as-judge) and run them on every change, so you catch regressions when swapping models, tweaking prompts or upgrading a tool.
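As a rough illustration of the definition above, here is a minimal sketch of an eval run using an exact-match scorer. The names `Case`, `exact_match`, `run_eval`, `toy_model`, `DATASET` and `BASELINE` are all hypothetical, and `toy_model` is a stand-in for a real model call; a real suite would likely add rubric or LLM-as-judge scorers alongside it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One labeled example: an input prompt and the expected answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output equals the label (ignoring case and outer whitespace), else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over the labeled cases."""
    if not cases:
        raise ValueError("eval dataset is empty")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned answers."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


# Tiny illustrative dataset; a real one would be saved and versioned.
DATASET = [
    Case("2 + 2 = ?", "4"),
    Case("Capital of France?", "paris"),
    Case("Largest planet?", "Jupiter"),
]

# Assumed regression threshold: fail the run if accuracy drops below it.
BASELINE = 0.6

score = run_eval(toy_model, DATASET)
print(f"exact-match accuracy: {score:.2f}")
if score < BASELINE:
    raise SystemExit("regression: accuracy fell below baseline")
```

Running the same script after every model, prompt or code change is what turns a one-off check into a regression test.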

Example

Before promoting a new system prompt, a team runs 200 saved customer questions through both the old and new prompt and compares helpfulness, accuracy and refusal rates side by side.
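The side-by-side comparison in this example could be sketched roughly as follows. Everything here is an assumption for illustration: `fake_generate` stands in for a real model call, the refusal check is a crude keyword heuristic, and accuracy is approximated by substring match. Helpfulness is left out because it would normally need a rubric or an LLM judge.

```python
from typing import Callable

# Assumed heuristic: phrases that mark an answer as a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_refusal(answer: str) -> bool:
    """Return True when the answer contains one of the refusal phrases."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def evaluate(system_prompt: str, questions: list[dict],
             generate: Callable[[str, str], str]) -> dict[str, float]:
    """Run every saved question under one system prompt and aggregate metrics."""
    correct = refused = 0
    for item in questions:
        answer = generate(system_prompt, item["question"])
        if is_refusal(answer):
            refused += 1
        elif item["expected"].lower() in answer.lower():
            correct += 1
    total = len(questions)
    return {"accuracy": correct / total, "refusal_rate": refused / total}


def compare(old_prompt: str, new_prompt: str, questions: list[dict],
            generate: Callable[[str, str], str]) -> dict[str, dict[str, float]]:
    """Return old, new and delta for each metric, side by side."""
    old = evaluate(old_prompt, questions, generate)
    new = evaluate(new_prompt, questions, generate)
    return {
        name: {"old": old[name], "new": new[name], "delta": new[name] - old[name]}
        for name in old
    }


def fake_generate(system_prompt: str, question: str) -> str:
    """Hypothetical model stub: the 'strict' prompt refuses refund questions."""
    if "strict" in system_prompt and "refund" in question.lower():
        return "I cannot help with that."
    canned = {
        "How do I reset my password?": "Use the reset link on the login page.",
        "Can I get a refund?": "Yes, refunds are available within 30 days.",
    }
    return canned.get(question, "Sorry, I am not sure.")


# Three saved questions stand in for the 200 in the example.
QUESTIONS = [
    {"question": "How do I reset my password?", "expected": "reset link"},
    {"question": "Can I get a refund?", "expected": "30 days"},
    {"question": "Do you ship overseas?", "expected": "international"},
]

report = compare("You are a helpful support agent.",
                 "You are a strict support agent.",
                 QUESTIONS, fake_generate)
for metric, row in report.items():
    print(f"{metric:13s} old={row['old']:.2f} new={row['new']:.2f} delta={row['delta']:+.2f}")
```

Reporting the delta per metric makes the trade-off explicit: in this toy data the new prompt raises the refusal rate and lowers accuracy, which is the kind of regression the team would want to see before promoting it.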

Related Workflows

Workflow · AI Agent Monitoring System

Track agent runs, failures, cost, and review queues from one operational surface.

Workflow · Prompt Library Operations

Version, evaluate, and reuse prompts as operational assets rather than loose text snippets.

Related Tool Stacks

Tool Stack · AI Ops Observability Stack

Monitoring layer for agent runs, workflow health, cost, errors, and review queues.