Evals & scorers

# ZajLibrary — handbook

You are a **technical tutor**. You help the user understand the handbook below and apply it to their own situation.

**Mission:** Explain the handbook below and help the user apply it to what they are building.

## Metadata
- title: Evals & scorers
- url: https://library.zajapps.com/ai-systems/agent-frameworks/mastra/learn/handbooks/evals/
- shelf: Learn & Understand
- doc_type: handbook
- status: current
- kind: handbook
- collection: mastra
- category: ai-systems
- subcategory: agent-frameworks
- topic: mastra
- description: Turn "looks fine" into a number — automated quality checks on agent output you can track over time and gate releases on.
- tags: mastra, evals, scorers, quality

## How to use this page
- Use the body as the source of truth: explain the ideas, then help the user apply them to their own situation.
- Surface trade-offs, decisions, and prerequisites — not just definitions.
- Cite section headings (`##`, `###`) when quoting or referring to specific parts.

---

# Evals & scorers

<p class="eyebrow">Measure quality</p>

You can't improve what you only eyeball. A **scorer** (eval) is an automated check that grades an agent's output on relevance, correctness, safety, or a custom rubric, and returns a score you can track and gate on. Run scorers ad hoc while building, on every output in production, or against a reference set in CI.

<StageFlow
  title="How a scorer fits"
  caption="An output flows through one or more scorers into a numeric score, which you record for trends and use to gate a release."
  stages={[
    { label: 'Agent output', sub: 'text / result' },
    { label: 'Scorer(s)', sub: 'rubric or LLM judge', tone: 'core' },
    { label: 'Score', sub: '0 → 1' },
    { label: 'Gate / track', sub: 'CI + dashboards', tone: 'good' },
  ]}
/>

## Define a scorer

`createScorer` gives the scorer an identity; `.generateScore` returns a number from the run context. This one is code-based (deterministic) — fast and free.

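```ts
import { createScorer } from '@mastra/core/evals';

export const completenessScorer = createScorer({
  id: 'task-complete',
  name: 'Task Completeness',
}).generateScore(async (context) => {
  // Coerce the run output to a string; a missing output scores 0.
  const text = (context.run.output ?? '').toString();
  const hasAnalysis = text.includes('analysis');
  const hasRecommendation = text.includes('recommendation');
  // Pass only when both required sections are mentioned.
  return hasAnalysis && hasRecommendation ? 1 : 0;
});
```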
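
The check above is pass/fail. If partial credit is more useful for trend lines, the scoring logic could be written as a plain function and called from inside `.generateScore`. The sketch below is framework-independent; `keywordCoverage` and `REQUIRED_SECTIONS` are hypothetical names for illustration, not part of Mastra.

```typescript
// Hypothetical helper: fraction of required keywords present in the output.
const REQUIRED_SECTIONS = ['analysis', 'recommendation'];

function keywordCoverage(output: unknown, required: string[] = REQUIRED_SECTIONS): number {
  if (required.length === 0) return 1; // nothing required, so nothing is missing
  // Case-insensitive match, so "Analysis:" counts as well as "analysis".
  const text = String(output ?? '').toLowerCase();
  const found = required.filter((word) => text.includes(word.toLowerCase()));
  return found.length / required.length;
}
```

An output that mentions only one of the two sections scores 0.5 here instead of 0, which makes a slow decline visible before it becomes a hard failure.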

## Two kinds of scorer

1. **Code-based** — deterministic checks (length, contains, regex, schema-valid). Cheap, instant, great for structural guarantees.
2. **LLM-as-judge** — a model grades against a rubric (relevance, faithfulness, tone). Slower and costs tokens, but catches quality a regex can't.
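
To make the second kind concrete, here is a minimal sketch of the LLM-as-judge pattern that does not depend on any particular SDK. The judge is passed in as a function, so `JudgeFn`, `RUBRIC`, `ratingToScore`, `judgeRelevance`, and `stubJudge` are all assumptions made for illustration; a real setup would replace the stub with a model call and would likely use the framework's own judge support.

```typescript
// Hypothetical judge shape: takes a prompt, resolves to the model's raw reply.
type JudgeFn = (prompt: string) => Promise<string>;

const RUBRIC =
  'Rate how well the answer addresses the question on a scale of 1 to 5. Reply with the number only.';

// Map the judge's reply onto 0..1. Replies without a standalone 1-5 digit score 0.
function ratingToScore(reply: string): number {
  const match = reply.match(/\b([1-5])\b/);
  if (!match) return 0;
  return (Number(match[1]) - 1) / 4;
}

async function judgeRelevance(judge: JudgeFn, question: string, answer: string): Promise<number> {
  const prompt = `${RUBRIC}\n\nQuestion: ${question}\nAnswer: ${answer}`;
  return ratingToScore(await judge(prompt));
}

// Stub judge so the sketch runs without a model; it always answers "4".
const stubJudge: JudgeFn = async () => '4';
```

Keeping the reply parsing in its own deterministic function means the only non-deterministic part is the model call itself, which makes the scorer easier to test.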

## Where scorers run

- **Inline, gating a supervisor** — pass them to [`isTaskComplete`](/ai-systems/agent-frameworks/mastra/learn/handbooks/multi-agent/#gate-on-task-completion) so an agent keeps working until the bar is met.
- **In CI** — run scorers over a fixed reference set on every PR to catch quality regressions before they ship (pairs with [Deployment → evals in CI](/ai-systems/agent-frameworks/mastra/learn/handbooks/deployment/#production-concerns)).
- **In production** — score a sample of live outputs and chart it alongside [observability](/ai-systems/agent-frameworks/mastra/learn/handbooks/observability/) traces.
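
The CI case boils down to scoring every item in the reference set and failing the build when quality drops. The sketch below shows that gate in plain TypeScript; `ScoreFn`, `GateResult`, `gate`, and the default thresholds are assumptions chosen for illustration, not Mastra APIs or recommended values.

```typescript
// Hypothetical scorer shape: output text in, score in 0..1 out.
type ScoreFn = (output: string) => number;

interface GateResult {
  mean: number;
  passed: boolean;
  failures: string[]; // outputs that scored below the per-item floor
}

function gate(outputs: string[], scorer: ScoreFn, minMean = 0.8, itemFloor = 0.5): GateResult {
  // An empty reference set should not pass silently.
  if (outputs.length === 0) return { mean: 0, passed: false, failures: [] };

  const scored = outputs.map((output) => ({ output, score: scorer(output) }));
  const mean = scored.reduce((sum, item) => sum + item.score, 0) / scored.length;
  const failures = scored.filter((item) => item.score < itemFloor).map((item) => item.output);

  // Pass only if the average clears the bar and no single item falls below the floor.
  return { mean, passed: mean >= minMean && failures.length === 0, failures };
}
```

A CI script would call `gate` with the outputs produced for the reference set and exit non-zero when `passed` is false. Checking a per-item floor as well as the mean keeps one badly broken output from hiding behind a good average.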

> [!TIP]
> Start with one or two **code-based** scorers that encode a hard requirement ("must return valid JSON", "must cite a source"). They're the cheapest way to stop a regression, and they never flake.
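
The two hard requirements named in the tip could be encoded roughly as follows. These are hypothetical helpers written for illustration, meant to be called from a scorer's `.generateScore` body; in particular, treating "cites a source" as "contains at least one http(s) URL" is an assumption, and a real check may need something stricter.

```typescript
// "Must return valid JSON": 1 if the whole output parses, 0 otherwise.
// Note that bare scalars such as "42" also count as valid JSON.
function isValidJson(output: string): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

// "Must cite a source", approximated as: the output contains an http(s) URL.
function citesSource(output: string): number {
  return /https?:\/\/\S+/.test(output) ? 1 : 0;
}
```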

---

**Reference:** [Scorers overview](https://mastra.ai/docs/scorers/overview) · [`createScorer`](https://mastra.ai/reference/scorers/create-scorer) · [Running in CI](https://mastra.ai/docs/scorers/running-in-ci)

Next: [**Observability**](/ai-systems/agent-frameworks/mastra/learn/handbooks/observability/) — see every call in production.
