Evals & scorers
Measure quality
You can’t improve what you only eyeball. A scorer (eval) is an automated check that grades an agent’s output — relevance, correctness, safety, or a custom rubric — into a score you can track and gate on. Run them ad hoc while building, on every output in production, or against a reference set in CI.
How a scorer fits
Define a scorer
createScorer gives the scorer an identity; .generateScore returns a number from the run context. This one is code-based (deterministic), so it is fast and free.
import { createScorer } from '@mastra/core/evals';

export const completenessScorer = createScorer({
  id: 'task-complete',
  name: 'Task Completeness',
}).generateScore(async (context) => {
  // Coerce the run output to a string so the substring checks are safe.
  const text = (context.run.output ?? '').toString();
  const hasAnalysis = text.includes('analysis');
  const hasRecommendation = text.includes('recommendation');
  // Score 1 only when both sections are present, otherwise 0.
  return hasAnalysis && hasRecommendation ? 1 : 0;
});

Two kinds of scorer
- Code-based: deterministic checks (length, contains, regex, schema-valid). Cheap, instant, great for structural guarantees.
- LLM-as-judge: a model grades against a rubric (relevance, faithfulness, tone). Slower and costs tokens, but catches quality a regex can’t.
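
As a rough illustration of the two kinds, the sketch below uses plain TypeScript functions rather than the Mastra API. Every name in it (`lengthScore`, `regexScore`, `jsonValidScore`, `buildJudgePrompt`, `parseJudgeScore`) and the 1-to-5 rating scale are assumptions made for this example. The model call itself is left out, so only the prompt-building and reply-parsing halves of a judge are shown.

```typescript
// Code-based checks: pure functions from output text to a score of 0 or 1.
// All names here are hypothetical and independent of any scorer library.

/** 1 if the text length falls within [min, max] characters, else 0. */
function lengthScore(text: string, min: number, max: number): number {
  return text.length >= min && text.length <= max ? 1 : 0;
}

/** 1 if the pattern matches anywhere in the text, else 0. */
function regexScore(text: string, pattern: RegExp): number {
  // String.prototype.search always scans from index 0, so a reused
  // global (/g) regex does not carry state between calls.
  return text.search(pattern) !== -1 ? 1 : 0;
}

/** 1 if the text parses as JSON, else 0. */
function jsonValidScore(text: string): number {
  try {
    JSON.parse(text);
    return 1;
  } catch {
    return 0;
  }
}

// LLM-as-judge: the model call is omitted; these helpers only build the
// prompt and turn the model's reply into a score (assumed 1-5 scale).

/** Builds the grading prompt that would be sent to the judge model. */
function buildJudgePrompt(rubric: string, output: string): string {
  return [
    'Grade the response against the rubric.',
    `Rubric: ${rubric}`,
    `Response: ${output}`,
    'Reply with a single integer from 1 (worst) to 5 (best).',
  ].join('\n');
}

/** Maps the judge's reply to [0, 1]; null if no standalone 1-5 digit is found. */
function parseJudgeScore(reply: string): number | null {
  const match = reply.match(/\b([1-5])\b/);
  if (match === null) return null;
  return (Number(match[1]) - 1) / 4;
}
```

Returning `null` for an unparseable reply, instead of defaulting to 0, keeps a flaky judge from silently dragging the tracked score down; the caller can retry or skip that sample.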
Where scorers run
- Inline, gating a supervisor: pass them to isTaskComplete so an agent keeps working until the bar is met.
- In CI: run scorers over a fixed reference set on every PR to catch quality regressions before they ship (pairs with Deployment → evals in CI).
- In production: score a sample of live outputs and chart it alongside observability traces.
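
The CI and production cases can be sketched without any framework. The example below is a minimal illustration, not Mastra's CI runner or sampling configuration: `Scorer`, `ReferenceCase`, `GateResult`, `gateOnReferenceSet`, `shouldSample`, the mean-score threshold, the per-case floor, and the hash-based sampling rule are all assumptions chosen for this example.

```typescript
// Hypothetical types: a scorer maps output text to a score in [0, 1].
type Scorer = (output: string) => number;

interface ReferenceCase {
  id: string;
  output: string; // the agent output recorded for this reference input
}

interface GateResult {
  mean: number;
  passed: boolean;
  failures: string[]; // ids of cases scoring below the per-case floor
}

/**
 * CI gate: score every reference case, then pass only if the mean score
 * reaches `threshold`. Cases scoring below `floor` are listed for debugging.
 */
function gateOnReferenceSet(
  cases: ReferenceCase[],
  scorer: Scorer,
  threshold: number,
  floor = 0.5,
): GateResult {
  if (cases.length === 0) {
    // An empty reference set proves nothing, so treat it as a failure.
    return { mean: 0, passed: false, failures: [] };
  }
  let total = 0;
  const failures: string[] = [];
  for (const c of cases) {
    const score = scorer(c.output);
    total += score;
    if (score < floor) failures.push(c.id);
  }
  const mean = total / cases.length;
  return { mean, passed: mean >= threshold, failures };
}

/**
 * Production sampling: decide deterministically whether to score a run,
 * so the same run id always gets the same decision. Uses a 32-bit
 * FNV-1a-style hash over the id's UTF-16 code units, mapped to [0, 1).
 */
function shouldSample(runId: string, rate: number): boolean {
  let hash = 0x811c9dc5;
  for (let i = 0; i < runId.length; i++) {
    hash ^= runId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000 < rate;
}
```

In a CI job, a `passed: false` result would typically translate into a nonzero exit code so the PR check fails. Hashing the run id rather than calling a random number generator means a given run is either always scored or never scored, which makes sampled charts reproducible.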
Reference: Scorers overview · createScorer · Running in CI
Next: Observability — see every call in production.