Dictionary
AI Evals
Reproducible test suites that measure LLM output quality across model, prompt and code changes.
Definition
Evals are the unit tests of AI systems. You define a labeled dataset and scoring functions (exact match, rubric grading, LLM-as-judge) and run them on every change, so you catch regressions when swapping models, tweaking prompts or upgrading a tool.
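A minimal sketch of that loop in Python: a labeled dataset, an exact-match scorer, and a check against a baseline. The names `Case`, `run_eval`, `stub_model` and the `BASELINE` value are hypothetical and chosen for illustration; in practice `stub_model` would be replaced by a call to the model under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One labeled example: an input prompt and the expected answer."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the label, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over `cases`."""
    if not cases:
        raise ValueError("eval dataset is empty")
    return sum(scorer(model(c.prompt), c.expected) for c in cases) / len(cases)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


DATASET = [
    Case("2 + 2 = ?", "4"),
    Case("Capital of France?", "paris"),
    Case("Largest planet?", "Jupiter"),
]

BASELINE = 0.6  # assumed pass threshold, e.g. the score of the last release

score = run_eval(stub_model, DATASET)
regressed = score < BASELINE  # flag a regression when the score drops below baseline
print(f"score={score:.2f} regressed={regressed}")
```

Rubric grading or an LLM-as-judge would slot in as a different `scorer` with the same signature, which keeps the dataset and the runner unchanged.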
Example
Before promoting a new system prompt, a team runs 200 saved customer questions through both the old and new prompt and compares helpfulness, accuracy and refusal rates side by side.
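A side-by-side comparison like this could be sketched as follows. Everything here is an assumption for illustration: `stub_run` stands in for a real model call, the refusal check is a crude keyword heuristic, and accuracy is approximated by checking whether the expected keyword appears in the answer. Helpfulness is left out because it typically needs rubric grading or a judge model.

```python
from typing import Callable

# Assumed heuristic: phrases that usually signal a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_refusal(answer: str) -> bool:
    """Return True when the answer contains a known refusal phrase."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def summarize(answers: list[str], expected: list[str]) -> dict[str, float]:
    """Compute accuracy (keyword containment) and refusal rate for one run."""
    n = len(answers)
    correct = sum(e.lower() in a.lower() for a, e in zip(answers, expected))
    refusals = sum(is_refusal(a) for a in answers)
    return {"accuracy": correct / n, "refusal_rate": refusals / n}


def compare(
    run: Callable[[str, str], str],
    old_prompt: str,
    new_prompt: str,
    questions: list[str],
    expected: list[str],
) -> dict[str, dict[str, float]]:
    """Run every saved question through both prompts and report metrics per prompt."""
    report = {}
    for name, system_prompt in (("old", old_prompt), ("new", new_prompt)):
        answers = [run(system_prompt, q) for q in questions]
        report[name] = summarize(answers, expected)
    return report


def stub_run(system_prompt: str, question: str) -> str:
    """Hypothetical stand-in for a model call with a system prompt."""
    if "cautious" in system_prompt and "refund" in question:
        return "I cannot help with that."
    canned = {
        "How do I get a refund?": "Open Orders and choose Refund.",
        "Where is my invoice?": "Invoices are under Billing.",
    }
    return canned[question]


questions = ["How do I get a refund?", "Where is my invoice?"]
expected = ["Refund", "Billing"]

report = compare(
    stub_run,
    "You are a helpful agent.",
    "You are a cautious agent.",
    questions,
    expected,
)
for name, metrics in report.items():
    print(name, metrics)
```

With these stubs the new prompt refuses the refund question, so the report shows lower accuracy and a higher refusal rate for it, which is the kind of regression the comparison is meant to surface before promotion.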
Related Workflows
AI Agent Monitoring System
Track agent runs, failures, cost, and review queues from one operational surface.
Prompt Library Operations
Version, evaluate, and reuse prompts as operational assets rather than loose text snippets.
Related Tool Stacks
AI Ops Observability Stack
Monitoring layer for agent runs, workflow health, cost, errors, and review queues.