Term

Eval

A reproducible test that measures how an LLM or LLM application performs on a specific task. Common forms include golden test sets, rubric grading, and A/B comparisons. The closest thing to unit tests for prompts.

Background

An eval is a repeatable procedure that scores how well a model or an LLM-powered feature does a defined task, turning subjective impressions of quality into a number you can track. Mechanically, an eval is a dataset of inputs paired with a scoring function. Each input is run through the system, the output is graded, and the results are aggregated into a metric such as accuracy, pass rate, or average score.

Grading comes in a few flavors: exact-match or regex against a gold answer, programmatic checks (does the JSON parse, does the code compile and pass unit tests), similarity metrics against a reference, or an LLM-as-judge that rates each output against a rubric. For example, to evaluate a support-bot summarizer you might collect a hundred real tickets, write the ideal summary or a rubric for each, run the current prompt, and have a judge model score faithfulness and completeness on every one. Because the dataset and scorer are fixed, you can rerun the same eval after changing a prompt, swapping models, or upgrading a retrieval step and see whether the metric moved.

Evals matter because building with LLMs is otherwise a guessing game: outputs are nondeterministic, and a change that looks better on three hand-checked cases can silently regress on many others. Evals catch those regressions, let you compare candidates objectively, and act as the test suite that makes iterating on prompts and models safe rather than superstitious.
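The "dataset plus scoring function" structure described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular framework's API: `Case`, `exact_match`, `run_eval`, and the two toy systems are hypothetical names, and the toy systems are lookup tables standing in for real model calls. It uses the simplest grader mentioned above (normalized exact match) and reruns the same fixed dataset against a baseline and a candidate.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One eval example: an input and its gold answer (hypothetical type)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if output equals the gold answer, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    cases: Sequence[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run every case through the system, grade it, and return the mean score."""
    if not cases:
        raise ValueError("eval dataset is empty")
    scores = [scorer(system(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


# Assumption: these lookup tables stand in for real model or application calls.
def baseline_system(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


def candidate_system(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(prompt, "unknown")


# The dataset and scorer stay fixed; only the system under test changes.
CASES = [
    Case("2+2", "4"),
    Case("capital of France", "paris"),
    Case("3*3", "9"),
]

baseline_score = run_eval(baseline_system, CASES, exact_match)
candidate_score = run_eval(candidate_system, CASES, exact_match)
print(f"baseline:  {baseline_score:.2f}")
print(f"candidate: {candidate_score:.2f}")
```

In a real setup the scorer would likely be swapped for a programmatic check or a judge-model call with the same `(output, expected)` shape, and each system would typically be sampled more than once per case to account for nondeterministic outputs.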

Tools that use it