Term

LLM as judge

Using an LLM (often a stronger model than the one under test) to grade outputs against a rubric. It replaces or supplements human grading when evals run at a scale humans cannot cover. The judge's own accuracy is a metric you have to measure, typically as agreement with human labels.
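One way to measure judge accuracy is to compare its grades with human grades on the same items. The sketch below is illustrative only: the function names and the pass/fail labels are assumptions, not part of any particular tool. It computes raw agreement and Cohen's kappa, which discounts the agreement two graders would reach by chance.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge_labels)
    p_observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both graders pick the same label
    # if each labelled independently at their own observed rates.
    labels = set(judge_counts) | set(human_counts)
    p_expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for four graded outputs.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
print(agreement_rate(judge, human))  # 0.75
print(cohens_kappa(judge, human))    # 0.5
```

Raw agreement alone can look high when one label dominates, which is why the chance-corrected figure is reported alongside it here.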

Background

LLM-as-judge became standard practice for evaluating chat applications, where open-ended responses make exact-match grading impossible. Best practices: a calibrated rubric, the judge model sees the rubric and not the original prompt, output is a structured score plus a one-sentence rationale, and judge bias is audited against human labels at regular intervals.
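The rubric and structured-output practices above can be sketched as follows. Everything here is an assumption made for illustration: the rubric text, the 1 to 5 scale, the JSON shape, and the names `build_judge_prompt`, `parse_verdict`, and `Verdict`. The call to the judge model is left out because it depends on the provider; only prompt construction and verdict validation are shown.

```python
import json
from dataclasses import dataclass

# Hypothetical rubric; a real one would be calibrated against human grades.
RUBRIC = (
    "Score the response from 1 (unusable) to 5 (excellent) for factual "
    "accuracy and clarity."
)


@dataclass(frozen=True)
class Verdict:
    score: int
    rationale: str


def build_judge_prompt(rubric, candidate_output):
    """Assemble the judge's input: the rubric and the output, not the original prompt."""
    return (
        f"Rubric:\n{rubric}\n\n"
        f"Response to grade:\n{candidate_output}\n\n"
        'Reply with JSON only: {"score": <integer>, "rationale": "<one sentence>"}'
    )


def parse_verdict(raw, min_score=1, max_score=5):
    """Validate the judge's reply; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    score = data.get("score")
    rationale = data.get("rationale")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError("score is outside the rubric's range")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return Verdict(score=score, rationale=rationale.strip())


prompt = build_judge_prompt(RUBRIC, "Paris is the capital of France.")
verdict = parse_verdict('{"score": 4, "rationale": "Accurate and concise."}')
print(verdict.score)  # 4
```

Rejecting malformed replies outright, instead of guessing a score from free text, keeps bad judge outputs from silently entering the eval results.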

Tools that use it

01
LangSmith
Observability and evaluation for LLM apps — traces, datasets, A/B testing, and feedback collection.
02
Braintrust
Evaluation, prompt playground, and observability for LLM apps in production.