Term
LLM-as-judge
Using an LLM (often a stronger one than the one being tested) to grade outputs against a rubric. Replaces or supplements human grading for evals at scale. Accuracy of the judge is itself a metric you have to measure.
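One way to measure judge accuracy is to compare the judge's labels with human labels on the same items. The sketch below is a minimal illustration, not a prescribed method: the function name `judge_agreement` and the pass/fail labels are assumptions made for the example. It reports raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def judge_agreement(judge_labels, human_labels):
    """Return (accuracy, Cohen's kappa) of judge labels against human labels.

    Hypothetical helper for illustration; labels can be any hashable values.
    """
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human_labels)
    # Observed agreement: fraction of items where judge and human match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: probability both pick the same class independently,
    # estimated from each rater's own label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)

    # Kappa is undefined when chance agreement is 1 (both raters used a single
    # identical class); returning 1.0 there is a choice made for this sketch.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
accuracy, kappa = judge_agreement(judge, human)
print(accuracy, kappa)  # 0.75 0.5
```

Raw agreement can look high when one label dominates, so reporting kappa alongside it gives a more honest picture of how much the judge adds over guessing.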
Background
LLM-as-judge became standard practice for evaluating chat applications, where exact-match grading isn't possible. Best practices: the rubric is calibrated, the judge model sees the rubric and not the original prompt, the output is a structured score plus a one-sentence rationale, and judge bias is audited against human labels periodically.
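The practices above can be sketched roughly as follows. Everything named here is an assumption for illustration: the example rubric text, the 1 to 5 scale, the JSON reply format, and the `call_judge` callable, which stands in for whatever client sends a prompt to the judge model and returns its text reply. Following the description above, the judge prompt contains the rubric and the response being graded, but not the original prompt.

```python
import json

# Hypothetical rubric for illustration only.
RUBRIC = (
    "Score the response from 1 to 5 for factual accuracy.\n"
    "5 = no factual errors; 3 = minor errors; 1 = mostly incorrect."
)


def build_judge_prompt(rubric, response):
    """Assemble the judge prompt from the rubric and the response to grade."""
    return (
        "You are grading a model response against a rubric.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response to grade:\n{response}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, '
        '"rationale": "<one sentence>"}'
    )


def parse_verdict(raw, min_score=1, max_score=5):
    """Validate the judge's structured reply and return (score, rationale)."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    score = data.get("score")
    rationale = data.get("rationale")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError("score out of range")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return score, rationale.strip()


def grade(response, call_judge, rubric=RUBRIC):
    """Grade one response; call_judge is an assumed prompt -> reply callable."""
    return parse_verdict(call_judge(build_judge_prompt(rubric, response)))


# Stub judge so the sketch runs without a model; swap in a real client call.
def stub_judge(prompt):
    return '{"score": 4, "rationale": "One minor date error."}'


print(grade("The treaty was signed in 1920.", stub_judge))
```

Validating the reply strictly means a malformed verdict fails loudly instead of silently skewing aggregate scores, and the stored rationale gives human auditors something to check when reviewing the judge for bias.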