Eval
A reproducible test that measures how an LLM or LLM application performs on a specific task. Common forms include golden test sets, rubric grading, and A/B comparisons. The closest thing to unit tests for prompts.
Background
Eval design is the discipline that separates serious AI applications from prompt-hacking. A useful eval has clear inputs, a measurable outcome (exact match, LLM-as-judge, or human grade), enough examples to spot regressions, and version control to track changes over time. LangSmith, Braintrust, OpenAI Evals, and Promptfoo are among the widely used evaluation platforms.
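As a rough illustration of those ingredients, here is a minimal sketch of an eval in plain Python. It is not the API of any platform named above. The golden set, the `fake_model` stand-in, and the helper names (`Example`, `exact_match`, `run_eval`, `is_regression`) are all hypothetical, and exact match is only one of the possible scoring methods.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One golden test case: an input and the answer we expect."""
    prompt: str
    expected: str


# Hypothetical golden set; a real one would be larger and kept in version control.
GOLDEN_SET = [
    Example("What is 2 + 2?", "4"),
    Example("What is the capital of France?", "Paris"),
    Example("What is the chemical formula for water?", "H2O"),
]


def exact_match(output: str, expected: str) -> bool:
    """Score by exact match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], examples: list[Example]) -> float:
    """Return the fraction of examples the model answers correctly."""
    passed = sum(exact_match(model(ex.prompt), ex.expected) for ex in examples)
    return passed / len(examples)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Flag a regression when the candidate scores below baseline minus tolerance."""
    return candidate < baseline - tolerance


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption); it gets one answer wrong."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": " paris ",
        "What is the chemical formula for water?": "HO2",
    }
    return canned.get(prompt, "")


score = run_eval(fake_model, GOLDEN_SET)
print(f"pass rate: {score:.2f}")
print("regression vs. baseline of 1.0:", is_regression(1.0, score))
```

Swapping `fake_model` for a function that calls a real model, and `exact_match` for a rubric or LLM-as-judge scorer, keeps the same shape: fixed inputs, a measurable score, and a comparison against a stored baseline.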
Tools that use it
- 01→Braintrust
Evaluation, prompt playground, and observability for LLM apps in production.
- 02→Humanloop
Collaboration platform for prompt engineering, evaluation, and deployment of LLM features.
- 03→LangSmith
Observability and evaluation for LLM apps — traces, datasets, A/B testing, and feedback collection.