
Evals & scorers

Measure quality

You can’t improve what you only eyeball. A scorer (eval) is an automated check that grades an agent’s output — relevance, correctness, safety, or a custom rubric — into a score you can track and gate on. Run them ad hoc while building, on every output in production, or against a reference set in CI.

How a scorer fits

1. Agent output (text / result) → 2. Scorer(s) (rubric or LLM judge) → 3. Score (0 to 1) → 4. Gate / track (CI + dashboards)
An output flows through one or more scorers into a numeric score, which you record for trends and use to gate a release.
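
As a rough, library-independent illustration of that flow, the sketch below runs several scorers over one output, records a score per scorer, and applies a gate. The `ScoreFn` and `ScoreRecord` types, the scorer ids, and the 0.8 threshold are assumptions made for this example; they are not Mastra APIs.

```typescript
// Hypothetical shapes for illustration only; these are not Mastra types.
type ScoreFn = (output: string) => number; // expected to return a value in 0..1

interface ScoreRecord {
  scorerId: string;
  score: number;
}

// Run every scorer on one output and clamp each result into [0, 1].
// Non-finite results (NaN, Infinity) are recorded as 0.
function scoreOutput(output: string, scorers: Record<string, ScoreFn>): ScoreRecord[] {
  return Object.entries(scorers).map(([scorerId, fn]) => {
    const raw = fn(output);
    const score = Number.isFinite(raw) ? Math.min(1, Math.max(0, raw)) : 0;
    return { scorerId, score };
  });
}

// Gate: pass only if every recorded score meets the threshold.
// Note that an empty record list passes, so callers should supply at least one scorer.
function passesGate(records: ScoreRecord[], threshold: number): boolean {
  return records.every((r) => r.score >= threshold);
}

const records = scoreOutput('analysis: ... recommendation: ...', {
  'has-analysis': (text) => (text.includes('analysis') ? 1 : 0),
  'not-empty': (text) => (text.trim().length > 0 ? 1 : 0),
});
console.log(records, passesGate(records, 0.8));
```

Requiring every score to clear the threshold is one possible gating policy; averaging the scores is another, and which fits depends on whether a single failing check should block a release.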

createScorer gives the scorer an identity (an id and a name); .generateScore computes a number from the run context. The example below is code-based (deterministic), so it is fast and costs no tokens.

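import { createScorer } from '@mastra/core/evals';

// Code-based scorer: returns 1 only when the output mentions both required sections.
export const completenessScorer = createScorer({
  id: 'task-complete',
  name: 'Task Completeness',
}).generateScore(async (context) => {
  // Coerce the run output to a string; a missing output becomes ''.
  const text = (context.run.output ?? '').toString();
  // String.includes is case-sensitive, so 'Analysis' would not match.
  const hasAnalysis = text.includes('analysis');
  const hasRecommendation = text.includes('recommendation');
  return hasAnalysis && hasRecommendation ? 1 : 0;
});
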
Scorers come in two kinds:

  1. Code-based: deterministic checks (length, contains, regex, schema-valid). Cheap, instant, and well suited to structural guarantees.
  2. LLM-as-judge: a model grades the output against a rubric (relevance, faithfulness, tone). Slower and costs tokens, but catches quality issues a regex can't.
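
To make the two kinds concrete, here is a minimal sketch in plain TypeScript. The first three functions are deterministic checks of the sort listed above. The last two show one way to turn a judge model's free-text reply into a score. Every name here (`withinLength`, `matchesPattern`, `hasJsonKeys`, `JudgeFn`, `parseJudgeScore`, `judgeRelevance`) is hypothetical, the prompt wording is an assumption, and "schema-valid" is approximated by a required-keys check rather than a full schema validator.

```typescript
// --- Code-based checks: each returns 1 (pass) or 0 (fail). ---

function withinLength(text: string, maxChars: number): number {
  return text.length <= maxChars ? 1 : 0;
}

// Avoid passing a RegExp with the `g` flag: it makes test() stateful across calls.
function matchesPattern(text: string, pattern: RegExp): number {
  return pattern.test(text) ? 1 : 0;
}

// Approximate schema check: the text parses as a JSON object and has every required key.
function hasJsonKeys(text: string, requiredKeys: string[]): number {
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return 0;
    return requiredKeys.every((key) => key in parsed) ? 1 : 0;
  } catch {
    return 0; // not valid JSON
  }
}

// --- LLM-as-judge: the model call is injected, so any client can be plugged in. ---

// Hypothetical judge signature: takes a prompt, resolves to the model's raw text reply.
type JudgeFn = (prompt: string) => Promise<string>;

// Take the first number in the reply and clamp it into [0, 1].
// Returns 0 when no number is found, so an unparseable reply fails closed.
function parseJudgeScore(reply: string): number {
  const match = reply.match(/-?(?:\d+\.?\d*|\.\d+)/);
  if (!match) return 0;
  const value = Number(match[0]);
  return Number.isFinite(value) ? Math.min(1, Math.max(0, value)) : 0;
}

async function judgeRelevance(question: string, answer: string, judge: JudgeFn): Promise<number> {
  const prompt = [
    'Rate how relevant the answer is to the question on a scale from 0 to 1.',
    'Reply with the number only.',
    `Question: ${question}`,
    `Answer: ${answer}`,
  ].join('\n');
  return parseJudgeScore(await judge(prompt));
}
```

Keeping the reply parsing separate from the model call means the parsing can be unit-tested without spending tokens, and a stub `JudgeFn` can stand in for the model in tests.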
Where to run scorers:

  • Inline, gating a supervisor: pass scorers to isTaskComplete so an agent keeps working until its output meets the bar.
  • In CI: run scorers over a fixed reference set on every PR to catch quality regressions before they ship (pairs with Deployment → evals in CI).
  • In production: score a sample of live outputs and chart the scores alongside observability traces.
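
The CI and production cases can be sketched without any framework. The example below averages a scorer over a reference set, compares the result with a stored baseline, and picks a stable sample of production traces by hashing their ids. The `ReferenceCase` shape, the baseline and tolerance values, and the hash-based sampling scheme are all assumptions made for illustration, not part of any library.

```typescript
// Hypothetical reference-set shape; field names are illustrative.
interface ReferenceCase {
  id: string;
  output: string; // the agent output recorded for this case
}

type CaseScorer = (output: string) => number; // expected to return 0..1

// Mean score across the reference set. An empty set scores 0 so the gate fails loudly.
function meanScore(cases: ReferenceCase[], scorer: CaseScorer): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => sum + scorer(c.output), 0);
  return total / cases.length;
}

// CI gate: report a regression when the mean drops more than `tolerance` below the baseline.
function isRegression(current: number, baseline: number, tolerance = 0.02): boolean {
  return current < baseline - tolerance;
}

// Production sampling: hash the trace id so a given trace is always in or out of the sample.
function shouldSample(traceId: string, rate: number): boolean {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash / 0x100000000 < rate; // hash / 2^32 falls in [0, 1)
}

const referenceSet: ReferenceCase[] = [
  { id: 'case-1', output: 'analysis and recommendation' },
  { id: 'case-2', output: 'analysis only' },
];
const current = meanScore(referenceSet, (text) => (text.includes('recommendation') ? 1 : 0));
console.log(current, isRegression(current, 0.9), shouldSample('trace-123', 0.1));
```

In a CI script, a regression would typically end the job with a non-zero exit code so the PR check fails. Hashing the trace id rather than calling a random number generator keeps the sampling decision reproducible when a trace is re-processed.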

Reference: Scorers overview · createScorer · Running in CI

Next: Observability — see every call in production.