5 · Observability

# ZajLibrary — implementation playbook

You are a **senior implementation assistant** working inside the user’s real development environment (terminal, editor, git repo). You are **not** a documentation writer, tutor, or library editor.

**Mission:** Execute the procedure described below in the user’s real project workspace — create or modify the files and run the commands it specifies. Do not just summarize it.

## Metadata
- title: 5 · Observability
- url: https://library.zajapps.com/tech-stack/laravel/codecanyon/build/playbooks/setup-new/07-security-monitoring/05-observability/
- shelf: Build & Create
- doc_type: implementation playbook
- status: current
- kind: playbook
- category: tech-stack
- subcategory: laravel
- topic: codecanyon
- phase: 7
- step_number: 5
- est_time: ~2 hr
- description: Extend the basic Sentry DSN into cron monitors, performance, and release health; add rotated + structured logs; expose a health endpoint with external uptime monitoring; and define alert escalation backed by a quarterly restore drill.
- tags: codecanyon, laravel, sentry, monitoring, uptime

## Execution rules (binding)
- This is step 5 of phase 7 (~2 hr) — do only this step, then stop at its `## Checklist`.
- Follow `## Steps` in order. Do not skip steps or declare the work complete early.
- Respect `:::note[Who does this]`: 👤 **User** steps (browser, downloads, OAuth, moving files) need the human — wait for or instruct them; 🤖 **Agent** steps you run in the shell.
- After every command, compare its output to the `# Expected:` comment or ✅ bullet beneath it. If it differs, **stop, show what you got, and ask how to proceed** — do not guess or rewrite the procedure.
- Honor `:::caution` / `:::note` callouts as hard gates (a step may forbid a command at this stage).
- Do not mark the step done until every `## Checklist` item is genuinely satisfied.
- Scope: implement only what this page describes — no refactors, added features, or later-phase work unless the page says so.
- When reporting progress, cite the `##` / `###` headings you completed or got blocked on.
- Secrets: never put API keys, passwords, or `.env` values in frontend code, commits, or chat logs — use server-side `.env` / config only, and keep `.env` gitignored.

---

# 5 · Observability

## TL;DR

**Objective** — make every failure loud: extend the basic Sentry DSN into cron monitors, performance, and release health; add rotated + structured logs; expose a health endpoint with external uptime monitoring; and define alert escalation backed by a quarterly restore drill.

:::note[User part + done-when]
**👤 User part** — create external uptime monitors and alert rules in the provider dashboard, generate the Sentry auth token, and decide the on-call escalation. The code, env, log channels, and health route are terminal-runnable.

**Done when** Sentry cron monitors are green, releases carry the git SHA, `LOG_STACK=daily,sentry_logs` ships warnings to Sentry, `/health` returns 200, external uptime monitors are wired (production), the escalation matrix is documented, and the first quarterly restore drill has passed. &nbsp;·&nbsp; **⏱ ~2 hr** &nbsp;·&nbsp; 🔀 mixed (👤 dashboards + monitors, 🤖 code + config).
:::

## Background

With the app hardened and backups landing off-server, make every failure **loud**. Each section here is a "should," not a "must" — but a failed backup that screams beats one that fails silently for weeks. These extend the basic Sentry DSN + incident runbook from the Phase 4 deploy pipeline; they don't recreate them.

:::note[Wire on staging, monitor on production]
You can rehearse this whole stack on staging, but **defer creating external uptime monitors and alert rules until production** — staging URLs churn and deliberate deploy downtime creates false alarms.
:::

## Steps

### 1. Full Sentry — crons, performance, release health

Confirm the SDK + DSN are already in place (Phase 3 installed the SDK, Phase 4 set the production DSN). Everything below is what comes *after* that — do not re-create the project or re-install the SDK.

:::note[Who does this]
🔀 **User + Agent** — the agent wires cron monitors, sample rates, and the release config; 👤 **User** generates the Sentry auth token at Sentry → Settings → Auth Tokens (the agent can't sign in to the Sentry dashboard).
:::

:::note[Alert rules are dashboard-only]
Sentry alert-rule creation has no CLI or MCP equivalent in this workflow — it is a dashboard task (Sentry → Alerts). The agent can wire crons, sample rates, and release config from the terminal, but the five alert rules in section 4 must be created by hand in the Sentry UI.
:::

<Steps>

1. **Wrap the most critical scheduled command in a cron monitor.** Without heartbeats, a failed `backup:run` fails silently.

   ```php
   // app/Console/Kernel.php (Laravel 10 and earlier; Laravel 11+ defines the schedule in routes/console.php)
   $schedule->command('backup:run --only-db')->dailyAt('02:00')->sentryMonitor('daily-db-backup');
   // Sentry's free tier = ONLY 1 cron monitor (each extra bills ~$0.78/mo). Wrap just the
   // single most-critical job — the daily DB backup — and leave the rest unmonitored here:
   $schedule->command('backup:run')->weeklyOn(0, '02:30');
   $schedule->command('backup:monitor')->dailyAt('04:00');
   $schedule->command('queue:prune-batches --hours=48')->daily();
   ```

   - ✅ The one monitor auto-registers in Sentry → **Crons** on first run; expect a green checkmark within 24 h. **Free tier = 1 monitor**, so wrap only the daily DB backup (wrapping all four silently bills ~$28/yr). MonSpark (section 3) can cover the rest — pick one, not both.

2. **Set performance sampling + PII stripping** in `.env`.

   ```dotenv
   SENTRY_TRACES_SAMPLE_RATE=0.1     # 10% of requests
   SENTRY_PROFILES_SAMPLE_RATE=0.1   # 10% profiling on traced requests
   SENTRY_SEND_DEFAULT_PII=false     # GDPR: strip IP + email by default
   ```

   - ✅ Traces sample at 10% of requests, profiles at 10% of those traced requests, and default PII is stripped.

3. **Tag each release with the git SHA** so a regression alert points at the commit that caused it. In `config/sentry.php` set `'release' => env('SENTRY_RELEASE')` and `'environment' => env('APP_ENV')`, then set `SENTRY_RELEASE` in a Deployer task (`git rev-parse --short HEAD`) and finalize the release with `sentry-cli` (`releases new` → `set-commits --auto` → `finalize`). Generate the auth token (👤) at Sentry → Settings → Auth Tokens (scope `project:releases`); store `SENTRY_ORG` / `SENTRY_PROJECT` / `SENTRY_AUTH_TOKEN` in the server `shared/.env`.

   - ✅ Releases are tagged with the SHA and finalized via `sentry-cli`.

4. **Upload JS source maps** (only if you ship a bundle). If the frontend has a build step (Vite/Webpack), add `sentry-cli sourcemaps upload` to the production `npm run build` flow with `SENTRY_AUTH_TOKEN` set in CI. Skip this entirely for Blade-only apps.

   - ✅ JS errors show readable stack traces (or this is skipped for a Blade-only app).

</Steps>
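
For reference, the `.env` values from step 2 and the release settings from step 3 are consumed in `config/sentry.php` roughly as shown here. This is a sketch modeled on the config file the Sentry Laravel SDK publishes; key names and defaults can differ between SDK versions, so compare it against the file already in the project instead of pasting it over.

```php
<?php
// config/sentry.php (excerpt; a sketch, verify against the published file)
return [
    'dsn' => env('SENTRY_LARAVEL_DSN', env('SENTRY_DSN')),

    // Step 3: the deploy task sets SENTRY_RELEASE to the short git SHA.
    'release' => env('SENTRY_RELEASE'),
    'environment' => env('APP_ENV'),

    // Step 2: null leaves tracing/profiling off; otherwise a float from 0 to 1.
    'traces_sample_rate' => env('SENTRY_TRACES_SAMPLE_RATE') === null
        ? null : (float) env('SENTRY_TRACES_SAMPLE_RATE'),
    'profiles_sample_rate' => env('SENTRY_PROFILES_SAMPLE_RATE') === null
        ? null : (float) env('SENTRY_PROFILES_SAMPLE_RATE'),

    // Step 2: keep default PII (IP, email) out of events.
    'send_default_pii' => env('SENTRY_SEND_DEFAULT_PII', false),
];
```

If the app caches its configuration, rebuild the cache after changing `.env` so the new values take effect.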

### 2. Log aggregation + rotation

Keep a rotated `daily` channel plus a `sentry_logs` channel so warnings/errors flow to Sentry.

<Steps>

1. **Confirm the log channels exist** in `config/logging.php` (the vendor or Phase 3 may already have them).

   ```php
   'daily' => ['driver' => 'daily', 'path' => storage_path('logs/laravel.log'),
               'level' => env('LOG_LEVEL', 'debug'), 'days' => env('LOG_DAILY_DAYS', 14)],
   'security' => ['driver' => 'daily', 'path' => storage_path('logs/security.log'), 'level' => 'info', 'days' => 30],
   'payments' => ['driver' => 'daily', 'path' => storage_path('logs/payments.log'), 'level' => 'info', 'days' => 90],
   'sentry_logs' => ['driver' => 'sentry', 'level' => env('SENTRY_LOG_LEVEL', 'warning'), 'bubble' => true],
   ```

   - ✅ The `daily`, `security`, `payments`, and `sentry_logs` channels are defined.

2. **Set the per-environment log stack + level.**

   | Env | `LOG_STACK` | `LOG_LEVEL` |
   |---|---|---|
   | Local | `single` | `debug` |
   | Production / staging | `daily,sentry_logs` | `warning` |

   - ✅ Production runs `LOG_STACK=daily,sentry_logs` at `warning`, so warnings reach Sentry.

</Steps>
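
`LOG_STACK` only has an effect if the `stack` channel reads it. Recent Laravel skeletons do this by default, while older vendor code often hardcodes the channel list. A sketch of the entry to look for in `config/logging.php`, assuming `LOG_CHANNEL=stack` is set:

```php
// config/logging.php (excerpt; a sketch based on the Laravel 11 skeleton)
'stack' => [
    'driver' => 'stack',
    // "daily,sentry_logs" becomes ['daily', 'sentry_logs']
    'channels' => explode(',', (string) env('LOG_STACK', 'single')),
    'ignore_exceptions' => false,
],
```

If the existing entry is a literal array such as `['single']`, either switch it to the `explode(...)` form or list the two channels directly.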

Prefer **structured** logging — `Log::channel('payments')->info('Stripe webhook', ['event' => $event->id])` — so fields stay searchable. On servers with sudo, add a `logrotate.d` rule (`daily`, `rotate 14`, `compress`); on shared hosting, Laravel's `daily` driver handles rotation. For centralized shipping, the `sentry_logs` channel (free with your existing Sentry plan) is the zero-new-vendor default; add Better Stack / Papertrail / Axiom only if volume exceeds Sentry's quota.

### 3. Health endpoint + uptime monitoring

Expose a health check (or use Laravel 11+'s built-in `/up`), then point an external monitor at it.

:::note[Who does this]
🔀 **User + Agent** — the agent adds the `/health` route; 👤 **User** creates the external uptime monitor and alert rules in the provider dashboard (defer to production).
:::

<Steps>

1. **Add a `/health` route** to `routes/web.php` if missing.

   ```php
   use Illuminate\Support\Facades\{DB, Cache, Route};

   Route::get('/health', function (\Illuminate\Http\Request $request) {
       $checks = ['app' => true, 'db' => false, 'cache' => false];
       try { DB::connection()->getPdo(); $checks['db'] = true; } catch (\Throwable $e) {}
       try { Cache::put('hc', '1', 10); $checks['cache'] = Cache::get('hc') === '1'; } catch (\Throwable $e) {}
       $healthy = !in_array(false, $checks, true);

       // Detailed subsystem breakdown only with the monitor token; anonymous
       // callers get a bare status so /health can't fingerprint internals.
       $token = config('services.healthcheck.token');
       $authed = $token && hash_equals($token, (string) $request->query('token'));
       $body = $authed ? ['status' => $healthy ? 'healthy' : 'unhealthy', 'checks' => $checks]
                       : ['status' => $healthy ? 'healthy' : 'unhealthy'];
       return response()->json($body, $healthy ? 200 : 503);
   })->middleware('throttle:30,1'); // 30 req/min/IP — a hit-the-DB route must be rate-limited
   ```

   - ✅ `/health` returns 200 (bare `{status}`) to anonymous callers, 503 when a check fails, full `checks` only with `?token=` matching `HEALTHCHECK_TOKEN` (read through `config('services.healthcheck.token')`, so map it in `config/services.php`), and is throttled to 30/min per IP so it can't be used as a DB-DoS amplifier.

2. **Pick one external monitor** (👤) and point it at the app every 1–5 min (defer this to production).

   | Provider | Free tier | Interval | Best for |
   |---|---|---|---|
   | **UptimeRobot** | 50 monitors | 5 min | Most monitors for free |
   | **HetrixTools** | 15 monitors | 1 min | Uptime + server RAM/CPU agent |
   | **MonSpark** | 2–4 monitors | 1 min | All-in-one: status page, cron monitor, phone calls |
   | **Better Stack** | 10 monitors | 3 min | On-call rotation + incident timeline |

   - ✅ External monitors watch homepage + `/health` (P0, 1–5 min), login path (P1, 5 min), and SSL-cert + DNS expiry (P1, daily). Optionally publish a public status page at `status.[DOMAIN]`.

   :::note[MonSpark as the all-in-one path]
   If you standardize on **MonSpark**, one vendor can cover uptime plus the rest:

   - Cron monitoring (heartbeat URLs) can replace Sentry Crons for backup monitoring — it is either/or, not both.
   - Server agent — RAM / CPU / disk monitoring, installed via SSH after production deploy.
   - Four monitor categories — availability, security, SEO, and integrity — plus alert channels, escalation policies, and a status page.
   :::

   :::tip[Status-page component map]
   If you publish a public status page at `status.[DOMAIN]`, map these components so each line corresponds to a real check:

   | Component | Backed by |
   |---|---|
   | **Application** | Homepage + `/health` endpoint |
   | **Authentication** | Login endpoint |
   | **Email** | Your ESP's third-party status page |
   | **Payments** | Stripe / PayPal status page |
   :::

</Steps>
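
The response rules for `/health` in step 1 (bare status for anonymous callers, full breakdown with the token, 503 on any failed check) can be sketched as a pure function. `healthPayload()` is a hypothetical helper, not part of the playbook's route; it only mirrors the closure's decision logic so the rules can be checked without a database or cache.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: map subsystem check results to
 * [HTTP status, JSON body] using the same rules as the route closure.
 *
 * @param array<string, bool> $checks e.g. ['app' => true, 'db' => false]
 * @return array{0: int, 1: array<string, mixed>}
 */
function healthPayload(array $checks, bool $authed): array
{
    // Any failed subsystem makes the whole app unhealthy.
    $healthy = !in_array(false, $checks, true);

    // Anonymous callers only ever see the bare status.
    $body = ['status' => $healthy ? 'healthy' : 'unhealthy'];
    if ($authed) {
        $body['checks'] = $checks;
    }

    return [$healthy ? 200 : 503, $body];
}

[$status, $body] = healthPayload(['app' => true, 'db' => false, 'cache' => true], false);
echo $status, ' ', json_encode($body), PHP_EOL; // prints: 503 {"status":"unhealthy"}
```

An uptime monitor only needs the status code, so the anonymous form is enough for the external check; the token is for a human or agent diagnosing which subsystem failed.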

### 4. Alert escalation + restore drills

Define who gets paged, how, and when — routed through one shared webhook config (the same one the backup and Sentry alerts use), never hardcoded URLs.

```mermaid
flowchart TD
  EV["Alert fires"] --> SEV{Severity}
  SEV -->|P0| P0["SMS + phone + Slack #alerts-p0<br/>5 min · 24/7"]
  SEV -->|P1| P1["Slack #alerts-p1 + email<br/>30 min · business hours"]
  SEV -->|P2| P2["Slack team channel<br/>4 hr"]
  SEV -->|P3| P3["Weekly email digest"]
```

<Steps>

1. **Document the severity → channel matrix**, routed via the shared webhook config.

   | Severity | Example trigger | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails · error rate > 10× baseline · homepage/health down > 2 min | SMS + phone + Slack |
   | **P1** | New unresolved production issue · warning-level log pattern | Slack + email |
   | **P2** | P95 latency regression > 2× baseline | Slack team channel |
   | **P3** | Crash-free sessions < 99% | Email digest |

   - ✅ Each severity maps to a channel, routed through one shared webhook config.

2. **Create these five rules in Sentry → Alerts** (👤). The generic matrix above defines who gets paged; these are the concrete Sentry rules that fire the P0–P3 signals.

   | Severity | Trigger (create in Sentry → Alerts) | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails (any) | SMS + phone + Slack `#alerts-p0` |
   | **P0** | Error rate > 10× the 1 h baseline | Slack `#alerts-p0` + email |
   | **P1** | Any new unresolved issue in `production` | Slack `#alerts-p1` + email |
   | **P2** | P95 transaction duration regression > 2× baseline | Slack team channel |
   | **P3** | Release health: crash-free sessions < 99% | Email digest |

   - ✅ All five Sentry alert rules are active and routed through the shared webhook config.

3. **Schedule the quarterly restore drill** and capture the runbook.

   - ✅ A quarterly restore drill is scheduled, the first drill passed, and provider + bucket, Sentry URL + cron slugs, log retention, uptime monitors, on-call, and next drill date are captured in a `monitoring-state.md` bus-factor document written as if someone else must use it without your help.

</Steps>

:::tip[Solo operator path]
No team to escalate to? Collapse the table: P0 → phone call (MonSpark gives ~10 calls/mo) or the Sentry mobile app; P1 → email; P2–P3 → a weekly review. Skip on-call rotation entirely.
:::

:::note[Weekly monitoring review]
Set a recurring calendar event to keep alerting honest and catch slow drift:

- Review last week's alerts — tune false positives.
- Check the backup success rate.
- Check the Sentry issue backlog — any P1 issues older than 48 h?
- Review the off-site storage cost trend for the `backup-offsite` disk.
:::

Whatever the escalation path, the backups behind it still have to be provably restorable:

:::danger[A backup you never restore is not a backup]
Schedule a **quarterly restore drill**: pull a random recent backup from the off-site disk, import it into a throwaway database, and verify row counts, a sample query, and that `APP_KEY` decryption works. Log the result (date, backup used, time-to-restore). **If a drill fails, STOP and fix it**, then re-run the drill until it passes; do not wait for the next quarter. Solo operators: set a recurring calendar reminder, because this is the most-skipped step.
:::
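
The row-count part of the drill can be scripted. The sketch below assumes PHP 8 (PDO throws on errors by default) and a PDO handle to the throwaway database; `restoredRowCounts()`, `drillFailures()`, and the per-table minimums are hypothetical names for illustration. It does not cover the sample query or the `APP_KEY` decryption check, which still need to run inside the app.

```php
<?php
declare(strict_types=1);

/**
 * Count rows per table on the restored throwaway database.
 * Table names are validated because identifiers cannot be bound as parameters.
 *
 * @param string[] $tables
 * @return array<string, int>
 */
function restoredRowCounts(PDO $restored, array $tables): array
{
    $counts = [];
    foreach ($tables as $table) {
        if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $table)) {
            throw new InvalidArgumentException("Unsafe table name: {$table}");
        }
        $counts[$table] = (int) $restored->query("SELECT COUNT(*) FROM {$table}")->fetchColumn();
    }
    return $counts;
}

/**
 * Return the tables whose restored count is below the expected minimum.
 * A table missing from $counts is treated as having zero rows.
 *
 * @param array<string, int> $counts
 * @param array<string, int> $minimums
 * @return string[]
 */
function drillFailures(array $counts, array $minimums): array
{
    $failures = [];
    foreach ($minimums as $table => $min) {
        if (($counts[$table] ?? 0) < $min) {
            $failures[] = $table;
        }
    }
    return $failures;
}
```

One way to use it: take the minimums from the last known-good counts, treat a non-empty `drillFailures()` result as a failed drill, and record the counts alongside the date and time-to-restore in the drill log.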

## Checklist

Do not mark this step done until **every** box below is checked.

- [ ] **🔀 Sentry extended** — the `daily-db-backup` cron monitor is green; performance data visible; releases tagged with the git SHA (👤 auth token generated).
- [ ] **🤖 Logs aggregated** — `LOG_STACK=daily,sentry_logs` + `LOG_LEVEL=warning` on the server; warnings reach Sentry.
- [ ] **🔀 Health + uptime wired** — `/health` returns 200 with a JSON status (or the built-in `/up` returns 200); external uptime monitors wired (👤, production).
- [ ] **🤖 Escalation documented** — severity → channel matrix routed via the shared webhook config.
- [ ] **🔀 Restore drill done** — quarterly drill scheduled; first drill passed; `monitoring-state.md` current.

:::next[Next step]
[Legal, privacy & GDPR](/tech-stack/laravel/codecanyon/build/playbooks/setup-new/07-security-monitoring/06-legal-gdpr/)
:::


Expose a health check (or use Laravel 11+‘s built-in /up), then point an external monitor at it.

Add a /health route to routes/web.php if missing.

use Illuminate\Support\Facades\{DB, Cache, Route};

Route::get('/health', function (\Illuminate\Http\Request $request) {
    $checks = ['app' => true, 'db' => false, 'cache' => false];
    try { DB::connection()->getPdo(); $checks['db'] = true; } catch (\Throwable $e) {}
    try { Cache::put('hc', '1', 10); $checks['cache'] = Cache::get('hc') === '1'; } catch (\Throwable $e) {}
    $healthy = !in_array(false, $checks, true);

    // Detailed subsystem breakdown only with the monitor token; anonymous
    // callers get a bare status so /health can't fingerprint internals.
    $token = config('services.healthcheck.token');
    $authed = $token && hash_equals($token, (string) $request->query('token'));
    $body = $authed ? ['status' => $healthy ? 'healthy' : 'unhealthy', 'checks' => $checks]
                    : ['status' => $healthy ? 'healthy' : 'unhealthy'];
    return response()->json($body, $healthy ? 200 : 503);
})->middleware('throttle:30,1'); // 30 req/min/IP — a hit-the-DB route must be rate-limited

✅ /health returns 200 (bare {status}) to anonymous callers, 503 when down, full checks only with ?token= matching HEALTHCHECK_TOKEN, and is throttled to 30/min so it can’t be used as a DB-DoS amplifier.

Pick one external monitor (👤) and point it at the app every 1–5 min (defer this to production).

Provider	Free tier	Interval	Best for
UptimeRobot	50 monitors	5 min	Most monitors for free
HetrixTools	15 monitors	1 min	Uptime + server RAM/CPU agent
MonSpark	2–4 monitors	1 min	All-in-one: status page, cron monitor, phone calls
Better Stack	10 monitors	3 min	On-call rotation + incident timeline

✅ External monitors watch homepage + /health (P0, 1–5 min), login path (P1, 5 min), and SSL-cert + DNS expiry (P1, daily). Optionally publish a public status page at status.[DOMAIN].

If you publish a public status page at status.[DOMAIN], map these components so each line corresponds to a real check:

Component	Backed by
Application	Homepage + `/health` endpoint
Authentication	Login endpoint
Email	Your ESP’s third-party status page
Payments	Stripe / PayPal status page

4. Alert escalation + restore drills

Define who gets paged, how, and when — routed through one shared webhook config (the same one the backup and Sentry alerts use), never hardcoded URLs.

flowchart TD
  EV["Alert fires"] --> SEV{Severity}
  SEV -->|P0| P0["SMS + phone + Slack #alerts-p0<br/>5 min · 24/7"]
  SEV -->|P1| P1["Slack #alerts-p1 + email<br/>30 min · business hours"]
  SEV -->|P2| P2["Slack team channel<br/>4 hr"]
  SEV -->|P3| P3["Weekly email digest"]

Document the severity → channel matrix, routed via the shared webhook config.

Severity	Example trigger	Channel
P0	Cron monitor fails · error rate > 10× baseline · homepage/health down > 2 min	SMS + phone + Slack
P1	New unresolved production issue · warning-level log pattern	Slack + email
P2	P95 latency regression > 2× baseline	Slack team channel
P3	Crash-free sessions < 99%	Email digest

✅ Each severity maps to a channel, routed through one shared webhook config.

Create these five rules in Sentry → Alerts (👤). The generic matrix above defines who gets paged; these are the concrete Sentry rules that fire the P0–P3 signals.

Severity	Trigger (create in Sentry → Alerts)	Channel
P0	Cron monitor fails (any)	SMS + phone + Slack `#alerts-p0`
P0	Error rate > 10× the 1 h baseline	Slack `#alerts-p0` + email
P1	Any new unresolved issue in `production`	Slack `#alerts-p1` + email
P2	P95 transaction duration regression > 2× baseline	Slack team channel
P3	Release health: crash-free sessions < 99%	Email digest

✅ All five Sentry alert rules are active and routed through the shared webhook config.

Schedule the quarterly restore drill and capture the runbook.
- ✅ A quarterly restore drill is scheduled, the first drill passed, and provider + bucket, Sentry URL + cron slugs, log retention, uptime monitors, on-call, and next drill date are captured in a monitoring-state.md bus-factor document written as if someone else must use it without your help.

Whatever the escalation path, the backups behind it still have to be provably restorable:

Checklist

Do not mark this step done until every box below is checked.

🔀 Sentry extended — cron monitors green; performance data visible; releases tagged with the git SHA (👤 auth token generated).
🤖 Logs aggregated — LOG_STACK=daily,sentry_logs + LOG_LEVEL=warning on the server; warnings reach Sentry.
🔀 Health + uptime wired — /health (or /up) returns 200 with JSON; external uptime monitors wired (👤, production).
🤖 Escalation documented — severity → channel matrix routed via the shared webhook config.
🔀 Restore drill done — quarterly drill scheduled; first drill passed; monitoring-state.md current.

Legal, privacy & GDPR

# 5 · Observability

## TL;DR

**Objective** — make every failure loud: extend the basic Sentry DSN into cron monitors, performance, and release health; add rotated + structured logs; expose a health endpoint with external uptime monitoring; and define alert escalation backed by a quarterly restore drill.

:::note[User part + done-when]
**👤 User part** — create external uptime monitors and alert rules in the provider dashboard, generate the Sentry auth token, and decide the on-call escalation. The code, env, log channels, and health route are terminal-runnable.

**Done when** Sentry cron monitors are green, releases carry the git SHA, `LOG_STACK=daily,sentry_logs` ships warnings to Sentry, `/health` returns 200, external uptime monitors are wired (production), the escalation matrix is documented, and the first quarterly restore drill has passed. &nbsp;·&nbsp; **⏱ ~2 hr** &nbsp;·&nbsp; 🔀 mixed (👤 dashboards + monitors, 🤖 code + config).
:::

## Background

With the app hardened and backups landing off-server, make every failure **loud**. Each section here is a "should," not a "must" — but a failed backup that screams beats one that fails silently for weeks. These extend the basic Sentry DSN + incident runbook from the Phase 4 deploy pipeline; they don't recreate them.

:::note[Wire on staging, monitor on production]
You can rehearse this whole stack on staging, but **defer creating external uptime monitors and alert rules until production** — staging URLs churn and deliberate deploy downtime creates false alarms.
:::

## Steps

### 1. Full Sentry — crons, performance, release health

Confirm the SDK + DSN are already in place (Phase 3 installed the SDK, Phase 4 set the production DSN). Everything below is what comes *after* that — do not re-create the project or re-install the SDK.

:::note[Who does this]
🔀 **User + Agent** — the agent wires cron monitors, sample rates, and the release config; 👤 **User** generates the Sentry auth token at Sentry → Settings → Auth Tokens (the agent can't sign in to the Sentry dashboard).
:::

:::note[Alert rules are dashboard-only]
Sentry alert-rule creation has no CLI or MCP equivalent in this workflow — it is a dashboard task (Sentry → Alerts). The agent can wire crons, sample rates, and release config from the terminal, but the five alert rules in section 4 must be created by hand in the Sentry UI.
:::

<Steps>

1. **Wrap the most critical scheduled command in a cron monitor.** Without a heartbeat, a failed `backup:run` goes unnoticed.

   ```php
   // app/Console/Kernel.php, inside schedule(); on Laravel 11+ the equivalent is Schedule::command(...) in routes/console.php
   $schedule->command('backup:run --only-db')->dailyAt('02:00')->sentryMonitor('daily-db-backup');
   // Sentry's free tier = ONLY 1 cron monitor (each extra bills ~$0.78/mo). Wrap just the
   // single most-critical job — the daily DB backup — and leave the rest unmonitored here:
   $schedule->command('backup:run')->weeklyOn(0, '02:30');
   $schedule->command('backup:monitor')->dailyAt('04:00');
   $schedule->command('queue:prune-batches --hours=48')->daily();
   ```

   - ✅ The one monitor auto-registers in Sentry → **Crons** on first run; expect a green checkmark within 24 h. **Free tier = 1 monitor**, so wrap only the daily DB backup: wrapping all four adds three billed monitors (3 × ~$0.78 × 12 ≈ $28/yr) with no warning. MonSpark (section 3) can cover the rest; for any given job use it or Sentry Crons, not both.

2. **Set performance sampling + PII stripping** in `.env`.

   ```dotenv
   SENTRY_TRACES_SAMPLE_RATE=0.1     # 10% of requests
   SENTRY_PROFILES_SAMPLE_RATE=0.1   # 10% profiling on traced requests
   SENTRY_SEND_DEFAULT_PII=false     # GDPR: strip IP + email by default
   ```

   - ✅ Traces/profiles sample at 10% and default PII is stripped.

3. **Tag each release with the git SHA** so a regression alert points at the commit that caused it. In `config/sentry.php` set `'release' => env('SENTRY_RELEASE')` and `'environment' => env('APP_ENV')`, then set `SENTRY_RELEASE` in a Deployer task (`git rev-parse --short HEAD`) and finalize the release with `sentry-cli` (`releases new` → `set-commits --auto` → `finalize`). Generate the auth token (👤) at Sentry → Settings → Auth Tokens (scope `project:releases`); store `SENTRY_ORG` / `SENTRY_PROJECT` / `SENTRY_AUTH_TOKEN` in the server `shared/.env`.

   - ✅ Releases are tagged with the SHA and finalized via `sentry-cli`.

4. **Upload JS source maps** (only if you ship a bundle). If the frontend has a build step (Vite/Webpack), add `sentry-cli sourcemaps upload` to the production `npm run build` flow with `SENTRY_AUTH_TOKEN` set in CI. Skip this entirely for Blade-only apps.

   - ✅ JS errors show readable stack traces (or this is skipped for a Blade-only app).

</Steps>
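
The release flow in step 3 can be sketched as a small helper that builds the three `sentry-cli` calls for one SHA. This is a minimal sketch, not the playbook's own Deployer task: the function name `sentryReleaseCommands` is hypothetical, and it assumes `sentry-cli` is installed where the commands run and picks up `SENTRY_ORG`, `SENTRY_PROJECT`, and `SENTRY_AUTH_TOKEN` from the environment.

```php
<?php
/**
 * Build the sentry-cli calls that create, annotate, and finalize a release.
 * Hypothetical helper for illustration; run the returned commands from your
 * deploy tooling (for example a Deployer task) inside the git checkout.
 *
 * @return string[]
 */
function sentryReleaseCommands(string $sha): array
{
    // Accept only an abbreviated or full hex SHA so the value is safe to
    // interpolate into a shell command.
    if (preg_match('/^[0-9a-f]{4,40}$/', $sha) !== 1) {
        throw new InvalidArgumentException("Not a git SHA: {$sha}");
    }

    return [
        "sentry-cli releases new {$sha}",
        "sentry-cli releases set-commits {$sha} --auto",
        "sentry-cli releases finalize {$sha}",
    ];
}

// Example with a placeholder SHA; in a deploy, use the output of
// `git rev-parse --short HEAD` instead.
foreach (sentryReleaseCommands('abc1234') as $command) {
    echo $command, PHP_EOL;
}
```

Whatever value is passed here should be the same string written to `SENTRY_RELEASE`; otherwise events are reported against a different release than the one that was finalized.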

### 2. Log aggregation + rotation

Keep a rotated `daily` channel plus a `sentry_logs` channel so warnings/errors flow to Sentry.

<Steps>

1. **Confirm the log channels exist** in `config/logging.php` (the vendor or Phase 3 may already have them).

   ```php
   'daily' => ['driver' => 'daily', 'path' => storage_path('logs/laravel.log'),
               'level' => env('LOG_LEVEL', 'debug'), 'days' => env('LOG_DAILY_DAYS', 14)],
   'security' => ['driver' => 'daily', 'path' => storage_path('logs/security.log'), 'level' => 'info', 'days' => 30],
   'payments' => ['driver' => 'daily', 'path' => storage_path('logs/payments.log'), 'level' => 'info', 'days' => 90],
   'sentry_logs' => ['driver' => 'sentry', 'level' => env('SENTRY_LOG_LEVEL', 'warning'), 'bubble' => true],
   ```

   - ✅ The `daily`, `security`, `payments`, and `sentry_logs` channels are defined.

2. **Set the per-environment log stack + level.**

   | Env | `LOG_STACK` | `LOG_LEVEL` |
   |---|---|---|
   | Local | `single` | `debug` |
   | Production / staging | `daily,sentry_logs` | `warning` |

   - ✅ Production runs `LOG_STACK=daily,sentry_logs` at `warning`, so warnings reach Sentry.

</Steps>
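
`LOG_STACK` only has an effect when the default channel is `stack` (`LOG_CHANNEL=stack`) and the `stack` channel reads the variable. A minimal sketch of that channel, assuming the Laravel 11+ skeleton layout; older skeletons and some vendor builds hardcode the `channels` array, so check the project's `config/logging.php` before relying on the env var:

```php
// config/logging.php, inside the 'channels' array (assumed Laravel 11+ layout)
'stack' => [
    'driver' => 'stack',
    // "daily,sentry_logs" becomes ['daily', 'sentry_logs']
    'channels' => explode(',', (string) env('LOG_STACK', 'single')),
    'ignore_exceptions' => false,
],
```

If the existing file lists the channels as a fixed array instead, adding `daily` and `sentry_logs` to that array has the same result as setting `LOG_STACK`.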

Prefer **structured** logging — `Log::channel('payments')->info('Stripe webhook', ['event' => $event->id])` — so fields stay searchable. On servers with sudo, add a `logrotate.d` rule (`daily`, `rotate 14`, `compress`); on shared hosting, Laravel's `daily` driver handles rotation. For centralized shipping, the `sentry_logs` channel (free with your existing Sentry plan) is the zero-new-vendor default; add Better Stack / Papertrail / Axiom only if volume exceeds Sentry's quota.

### 3. Health endpoint + uptime monitoring

Expose a health check (or use Laravel 11+'s built-in `/up`), then point an external monitor at it.

:::note[Who does this]
🔀 **User + Agent** — the agent adds the `/health` route; 👤 **User** creates the external uptime monitor and alert rules in the provider dashboard (defer to production).
:::

<Steps>

1. **Add a `/health` route** to `routes/web.php` if missing.

   ```php
   use Illuminate\Support\Facades\{DB, Cache, Route};

   Route::get('/health', function (\Illuminate\Http\Request $request) {
       $checks = ['app' => true, 'db' => false, 'cache' => false];
       try { DB::connection()->getPdo(); $checks['db'] = true; } catch (\Throwable $e) {}
       try { Cache::put('hc', '1', 10); $checks['cache'] = Cache::get('hc') === '1'; } catch (\Throwable $e) {}
       $healthy = !in_array(false, $checks, true);

       // Detailed subsystem breakdown only with the monitor token; anonymous
       // callers get a bare status so /health can't fingerprint internals.
       $token = config('services.healthcheck.token');
       $authed = $token && hash_equals($token, (string) $request->query('token'));
       $body = $authed ? ['status' => $healthy ? 'healthy' : 'unhealthy', 'checks' => $checks]
                       : ['status' => $healthy ? 'healthy' : 'unhealthy'];
       return response()->json($body, $healthy ? 200 : 503);
   })->middleware('throttle:30,1'); // 30 req/min/IP — a hit-the-DB route must be rate-limited
   ```

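   - ✅ `/health` returns 200 with a bare `{status}` to anonymous callers and 503 when the DB or cache check fails; the full `checks` breakdown appears only when `?token=` matches `HEALTHCHECK_TOKEN` (read via `config('services.healthcheck.token')`). The route is throttled to 30 requests/min per IP so it can't be used as a DB-DoS amplifier.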

2. **Pick one external monitor** (👤) and point it at the app every 1–5 min (defer this to production).

   | Provider | Free tier | Interval | Best for |
   |---|---|---|---|
   | **UptimeRobot** | 50 monitors | 5 min | Most monitors for free |
   | **HetrixTools** | 15 monitors | 1 min | Uptime + server RAM/CPU agent |
   | **MonSpark** | 2–4 monitors | 1 min | All-in-one: status page, cron monitor, phone calls |
   | **Better Stack** | 10 monitors | 3 min | On-call rotation + incident timeline |

   - ✅ External monitors watch homepage + `/health` (P0, 1–5 min), login path (P1, 5 min), and SSL-cert + domain expiry (P1, daily). Optionally publish a public status page at `status.[DOMAIN]`.

   :::note[MonSpark as the all-in-one path]
   If you standardize on **MonSpark**, one vendor can cover uptime plus the rest:

   - Cron monitoring (heartbeat URLs) can replace Sentry Crons for backup monitoring — it is either/or, not both.
   - Server agent — RAM / CPU / disk monitoring, installed via SSH after production deploy.
   - Four monitor categories — availability, security, SEO, and integrity — plus alert channels, escalation policies, and a status page.
   :::

   :::tip[Status-page component map]
   If you publish a public status page at `status.[DOMAIN]`, map these components so each line corresponds to a real check:

   | Component | Backed by |
   |---|---|
   | **Application** | Homepage + `/health` endpoint |
   | **Authentication** | Login endpoint |
   | **Email** | Your ESP's third-party status page |
   | **Payments** | Stripe / PayPal status page |
   :::

</Steps>
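
The `/health` route reads its token from `config('services.healthcheck.token')`, so `HEALTHCHECK_TOKEN` needs a matching entry in `config/services.php`. The sketch below shows that assumed mapping in a comment and mirrors the route's decision logic as a framework-free function, so each case (anonymous, authenticated, token unset) can be reasoned about in isolation. The function name `healthResponse` is hypothetical and exists only for this illustration.

```php
<?php
// Assumed wiring (not shown in the route itself): config/services.php maps the env var,
//   'healthcheck' => ['token' => env('HEALTHCHECK_TOKEN')],
// so config('services.healthcheck.token') returns the monitor's secret.

/**
 * Framework-free mirror of the /health decision logic.
 *
 * @param array<string,bool> $checks subsystem results, e.g. ['app' => true, 'db' => false]
 * @return array{code:int, body:array}
 */
function healthResponse(array $checks, ?string $configuredToken, ?string $presentedToken): array
{
    $healthy = !in_array(false, $checks, true);

    // An unset token must never authenticate: hash_equals('', '') is true,
    // so the emptiness guard has to come first.
    $authed = !empty($configuredToken)
        && hash_equals($configuredToken, (string) $presentedToken);

    $body = ['status' => $healthy ? 'healthy' : 'unhealthy'];
    if ($authed) {
        $body['checks'] = $checks;
    }

    return ['code' => $healthy ? 200 : 503, 'body' => $body];
}

// Anonymous caller while the database check fails: 503 with no breakdown.
print_r(healthResponse(['app' => true, 'db' => false, 'cache' => true], 'secret', null));
```

The guard on the configured token matters in practice: if `HEALTHCHECK_TOKEN` is missing on a server, a request with an empty `?token=` must still get only the bare status.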

### 4. Alert escalation + restore drills

Define who gets paged, how, and when — routed through one shared webhook config (the same one the backup and Sentry alerts use), never hardcoded URLs.

```mermaid
flowchart TD
  EV["Alert fires"] --> SEV{Severity}
  SEV -->|P0| P0["SMS + phone + Slack #alerts-p0<br/>5 min · 24/7"]
  SEV -->|P1| P1["Slack #alerts-p1 + email<br/>30 min · business hours"]
  SEV -->|P2| P2["Slack team channel<br/>4 hr"]
  SEV -->|P3| P3["Weekly email digest"]
```

<Steps>

1. **Document the severity → channel matrix**, routed via the shared webhook config.

   | Severity | Example trigger | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails · error rate > 10× baseline · homepage/health down > 2 min | SMS + phone + Slack |
   | **P1** | New unresolved production issue · warning-level log pattern | Slack + email |
   | **P2** | P95 latency regression > 2× baseline | Slack team channel |
   | **P3** | Crash-free sessions < 99% | Email digest |

   - ✅ Each severity maps to a channel, routed through one shared webhook config.

2. **Create these five rules in Sentry → Alerts** (👤). The generic matrix above defines who gets paged; these are the concrete Sentry rules that fire the P0–P3 signals.

   | Severity | Trigger (create in Sentry → Alerts) | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails (any) | SMS + phone + Slack `#alerts-p0` |
   | **P0** | Error rate > 10× the 1 h baseline | Slack `#alerts-p0` + email |
   | **P1** | Any new unresolved issue in `production` | Slack `#alerts-p1` + email |
   | **P2** | P95 transaction duration regression > 2× baseline | Slack team channel |
   | **P3** | Release health: crash-free sessions < 99% | Email digest |

   - ✅ All five Sentry alert rules are active and routed through the shared webhook config.

3. **Schedule the quarterly restore drill** and capture the runbook.

   - ✅ A quarterly restore drill is scheduled, the first drill passed, and provider + bucket, Sentry URL + cron slugs, log retention, uptime monitors, on-call, and next drill date are captured in a `monitoring-state.md` bus-factor document written as if someone else must use it without your help.

</Steps>
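
One way to keep the severity matrix in a single place is a small lookup table that alerting code consults instead of embedding channel names or URLs. The sketch below is an assumption about shape only: the array keys, the `routeFor` function, and the `ALERTS_WEBHOOK_URL` variable name are all hypothetical and are not taken from the shared webhook config set up in earlier phases. The channels and response times are the ones from the flowchart and matrix above.

```php
<?php
// Hypothetical shape for the severity matrix; key names are illustrative.
$alerts = [
    'P0' => ['channels' => ['sms', 'phone', 'slack:#alerts-p0'], 'respond_within' => '5 min',  'hours' => '24/7'],
    'P1' => ['channels' => ['slack:#alerts-p1', 'email'],        'respond_within' => '30 min', 'hours' => 'business hours'],
    'P2' => ['channels' => ['slack:team'],                       'respond_within' => '4 hr',   'hours' => null],
    'P3' => ['channels' => ['email-digest'],                     'respond_within' => 'weekly', 'hours' => null],
];

// The webhook URL comes from the environment, never from source code.
// ALERTS_WEBHOOK_URL is a placeholder name for whatever the shared config uses.
$webhookUrl = getenv('ALERTS_WEBHOOK_URL') ?: null;

/** Look up the route for a severity, failing loudly on anything unmapped. */
function routeFor(array $alerts, string $severity): array
{
    if (!isset($alerts[$severity])) {
        throw new InvalidArgumentException("Unknown severity: {$severity}");
    }

    return $alerts[$severity];
}

echo implode(', ', routeFor($alerts, 'P0')['channels']), PHP_EOL;
```

Throwing on an unknown severity is a deliberate choice in this sketch: an alert that silently routes nowhere defeats the purpose of the matrix.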

:::tip[Solo operator path]
No team to escalate to? Collapse the table: P0 → phone call (MonSpark gives ~10 calls/mo) or the Sentry mobile app; P1 → email; P2–P3 → a weekly review. Skip on-call rotation entirely.
:::

:::note[Weekly monitoring review]
Set a recurring calendar event to keep alerting honest and catch slow drift:

- Review last week's alerts — tune false positives.
- Check the backup success rate.
- Check the Sentry issue backlog — any P1 issues older than 48 h?
- Review the off-site storage cost trend for the `backup-offsite` disk.
:::

Whatever the escalation path, the backups behind it still have to be provably restorable:

:::danger[A backup you never restore is not a backup]
Schedule a **quarterly restore drill**: pull a random recent backup from the off-site disk, import it into a throwaway database, and verify row counts, a sample query, and that `APP_KEY` decryption works. Log the result (date, backup used, time-to-restore). **If a drill fails, STOP and fix the cause**, then re-run the drill until it passes. Solo operators: set a recurring calendar reminder, because this is the most-skipped step.
:::
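
The pass/fail decision and the log entry for a drill can be sketched as one function. This is an illustration under stated assumptions: `evaluateDrill`, its parameters, and the field names of the log entry are hypothetical. The row counts are expected to come from `COUNT(*)` queries against the throwaway database, and the `APP_KEY` check is performed separately and passed in as a boolean.

```php
<?php
/**
 * Compare row counts from the restored throwaway database against minimum
 * expectations and build the drill log entry.
 *
 * @param array<string,int> $minimumRows  table => lowest acceptable row count
 * @param array<string,int> $restoredRows table => COUNT(*) from the restored copy
 */
function evaluateDrill(
    string $backupFile,
    array $minimumRows,
    array $restoredRows,
    float $secondsToRestore,
    bool $appKeyDecrypts
): array {
    $failures = [];
    foreach ($minimumRows as $table => $minimum) {
        $actual = $restoredRows[$table] ?? null;
        if ($actual === null) {
            $failures[] = "{$table}: missing from restore";
        } elseif ($actual < $minimum) {
            $failures[] = "{$table}: {$actual} rows, expected at least {$minimum}";
        }
    }
    if (!$appKeyDecrypts) {
        $failures[] = 'APP_KEY could not decrypt a sample encrypted value';
    }

    return [
        'date' => date('Y-m-d'),
        'backup' => $backupFile,
        'time_to_restore_s' => $secondsToRestore,
        'passed' => $failures === [],
        'failures' => $failures,
    ];
}

// Placeholder inputs: the "orders" table is absent from the restored copy.
$entry = evaluateDrill('example-backup.zip', ['users' => 100, 'orders' => 1], ['users' => 250], 412.0, true);
echo json_encode($entry, JSON_PRETTY_PRINT), PHP_EOL;
```

Appending each entry to `monitoring-state.md` (or a file next to it) gives the next person a dated record of which backup was restored and how long it took.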

## Checklist

Do not mark this step done until **every** box below is checked.

- [ ] **🔀 Sentry extended** — cron monitors green; performance data visible; releases tagged with the git SHA (👤 auth token generated).
- [ ] **🤖 Logs aggregated** — `LOG_STACK=daily,sentry_logs` + `LOG_LEVEL=warning` on the server; warnings reach Sentry.
- [ ] **🔀 Health + uptime wired** — `/health` returns 200 with JSON (or Laravel's built-in `/up` returns 200; it renders a page, not JSON); external uptime monitors wired (👤, production).
- [ ] **🤖 Escalation documented** — severity → channel matrix routed via the shared webhook config.
- [ ] **🔀 Restore drill done** — quarterly drill scheduled; first drill passed; `monitoring-state.md` current.

:::next[Next step]
[Legal, privacy & GDPR](/tech-stack/laravel/codecanyon/build/playbooks/setup-new/07-security-monitoring/06-legal-gdpr/)
:::

# 5 · Observability

> Source: https://library.zajapps.com/tech-stack/laravel/codecanyon/build/playbooks/setup-new/07-security-monitoring/05-observability/

## TL;DR

**Objective** — make every failure loud: extend the basic Sentry DSN into cron monitors, performance, and release health; add rotated + structured logs; expose a health endpoint with external uptime monitoring; and define alert escalation backed by a quarterly restore drill.

:::note[User part + done-when]
**👤 User part** — create external uptime monitors and alert rules in the provider dashboard, generate the Sentry auth token, and decide the on-call escalation. The code, env, log channels, and health route are terminal-runnable.

**Done when** Sentry cron monitors are green, releases carry the git SHA, `LOG_STACK=daily,sentry_logs` ships warnings to Sentry, `/health` returns 200, external uptime monitors are wired (production), the escalation matrix is documented, and the first quarterly restore drill has passed. &nbsp;·&nbsp; **⏱ ~2 hr** &nbsp;·&nbsp; 🔀 mixed (👤 dashboards + monitors, 🤖 code + config).
:::

## Background

With the app hardened and backups landing off-server, make every failure **loud**. Each section here is a "should," not a "must" — but a failed backup that screams beats one that fails silently for weeks. These extend the basic Sentry DSN + incident runbook from the Phase 4 deploy pipeline; they don't recreate them.

:::note[Wire on staging, monitor on production]
You can rehearse this whole stack on staging, but **defer creating external uptime monitors and alert rules until production** — staging URLs churn and deliberate deploy downtime creates false alarms.
:::

## Steps

### 1. Full Sentry — crons, performance, release health

Confirm the SDK + DSN are already in place (Phase 3 installed the SDK, Phase 4 set the production DSN). Everything below is what comes *after* that — do not re-create the project or re-install the SDK.

:::note[Who does this]
🔀 **User + Agent** — the agent wires cron monitors, sample rates, and the release config; 👤 **User** generates the Sentry auth token at Sentry → Settings → Auth Tokens (the agent can't sign in to the Sentry dashboard).
:::

:::note[Alert rules are dashboard-only]
Sentry alert-rule creation has no CLI or MCP equivalent in this workflow — it is a dashboard task (Sentry → Alerts). The agent can wire crons, sample rates, and release config from the terminal, but the five alert rules in section 4 must be created by hand in the Sentry UI.
:::

<Steps>

1. **Wrap each critical scheduled command in a cron monitor.** Without heartbeats, a failed `backup:run` fails silently.

   ```php
   // app/Console/Kernel.php
   $schedule->command('backup:run --only-db')->dailyAt('02:00')->sentryMonitor('daily-db-backup');
   // Sentry's free tier = ONLY 1 cron monitor (each extra bills ~$0.78/mo). Wrap just the
   // single most-critical job — the daily DB backup — and leave the rest unmonitored here:
   $schedule->command('backup:run')->weeklyOn(0, '02:30');
   $schedule->command('backup:monitor')->dailyAt('04:00');
   $schedule->command('queue:prune-batches --hours=48')->daily();
   ```

   - ✅ The one monitor auto-registers in Sentry → **Crons** on first run; expect a green checkmark within 24 h. **Free tier = 1 monitor**, so wrap only the daily DB backup (wrapping all four silently bills ~$28/yr). MonSpark (section 3) can cover the rest — pick one, not both.

2. **Set performance sampling + PII stripping** in `.env`.

   ```dotenv
   SENTRY_TRACES_SAMPLE_RATE=0.1     # 10% of requests
   SENTRY_PROFILES_SAMPLE_RATE=0.1   # 10% profiling on traced requests
   SENTRY_SEND_DEFAULT_PII=false     # GDPR: strip IP + email by default
   ```

   - ✅ Traces/profiles sample at 10% and default PII is stripped.

3. **Tag each release with the git SHA** so a regression alert points at the commit that caused it. In `config/sentry.php` set `'release' => env('SENTRY_RELEASE')` and `'environment' => env('APP_ENV')`, then set `SENTRY_RELEASE` in a Deployer task (`git rev-parse --short HEAD`) and finalize the release with `sentry-cli` (`releases new` → `set-commits --auto` → `finalize`). Generate the auth token (👤) at Sentry → Settings → Auth Tokens (scope `project:releases`); store `SENTRY_ORG` / `SENTRY_PROJECT` / `SENTRY_AUTH_TOKEN` in the server `shared/.env`.

   - ✅ Releases are tagged with the SHA and finalized via `sentry-cli`.

4. **Upload JS source maps** (only if you ship a bundle). If the frontend has a build step (Vite/Webpack), add `sentry-cli sourcemaps upload` to the production `npm run build` flow with `SENTRY_AUTH_TOKEN` set in CI. Skip this entirely for Blade-only apps.

   - ✅ JS errors show readable stack traces (or this is skipped for a Blade-only app).

</Steps>

### 2. Log aggregation + rotation

Keep a rotated `daily` channel plus a `sentry_logs` channel so warnings/errors flow to Sentry.

<Steps>

1. **Confirm the log channels exist** in `config/logging.php` (the vendor or Phase 3 may already have them).

   ```php
   'daily' => ['driver' => 'daily', 'path' => storage_path('logs/laravel.log'),
               'level' => env('LOG_LEVEL', 'debug'), 'days' => env('LOG_DAILY_DAYS', 14)],
   'security' => ['driver' => 'daily', 'path' => storage_path('logs/security.log'), 'level' => 'info', 'days' => 30],
   'payments' => ['driver' => 'daily', 'path' => storage_path('logs/payments.log'), 'level' => 'info', 'days' => 90],
   'sentry_logs' => ['driver' => 'sentry', 'level' => env('SENTRY_LOG_LEVEL', 'warning'), 'bubble' => true],
   ```

   - ✅ The `daily`, `security`, `payments`, and `sentry_logs` channels are defined.

2. **Set the per-environment log stack + level.**

   | Env | `LOG_STACK` | `LOG_LEVEL` |
   |---|---|---|
   | Local | `single` | `debug` |
   | Production / staging | `daily,sentry_logs` | `warning` |

   - ✅ Production runs `LOG_STACK=daily,sentry_logs` at `warning`, so warnings reach Sentry.

</Steps>

Prefer **structured** logging — `Log::channel('payments')->info('Stripe webhook', ['event' => $event->id])` — so fields stay searchable. On servers with sudo, add a `logrotate.d` rule (`daily`, `rotate 14`, `compress`); on shared hosting, Laravel's `daily` driver handles rotation. For centralized shipping, the `sentry_logs` channel (free with your existing Sentry plan) is the zero-new-vendor default; add Better Stack / Papertrail / Axiom only if volume exceeds Sentry's quota.

### 3. Health endpoint + uptime monitoring

Expose a health check (or use Laravel 11+'s built-in `/up`), then point an external monitor at it.

:::note[Who does this]
🔀 **User + Agent** — the agent adds the `/health` route; 👤 **User** creates the external uptime monitor and alert rules in the provider dashboard (defer to production).
:::

<Steps>

1. **Add a `/health` route** to `routes/web.php` if missing.

   ```php
   use Illuminate\Support\Facades\{DB, Cache, Route};

   Route::get('/health', function (\Illuminate\Http\Request $request) {
       $checks = ['app' => true, 'db' => false, 'cache' => false];
       try { DB::connection()->getPdo(); $checks['db'] = true; } catch (\Throwable $e) {}
       try { Cache::put('hc', '1', 10); $checks['cache'] = Cache::get('hc') === '1'; } catch (\Throwable $e) {}
       $healthy = !in_array(false, $checks, true);

       // Detailed subsystem breakdown only with the monitor token; anonymous
       // callers get a bare status so /health can't fingerprint internals.
       $token = config('services.healthcheck.token');
       $authed = $token && hash_equals($token, (string) $request->query('token'));
       $body = $authed ? ['status' => $healthy ? 'healthy' : 'unhealthy', 'checks' => $checks]
                       : ['status' => $healthy ? 'healthy' : 'unhealthy'];
       return response()->json($body, $healthy ? 200 : 503);
   })->middleware('throttle:30,1'); // 30 req/min/IP — a hit-the-DB route must be rate-limited
   ```

   - ✅ `/health` returns 200 (bare `{status}`) to anonymous callers, 503 when down, full `checks` only with `?token=` matching `HEALTHCHECK_TOKEN`, and is throttled to 30/min so it can't be used as a DB-DoS amplifier.

2. **Pick one external monitor** (👤) and point it at the app every 1–5 min (defer this to production).

   | Provider | Free tier | Interval | Best for |
   |---|---|---|---|
   | **UptimeRobot** | 50 monitors | 5 min | Most monitors for free |
   | **HetrixTools** | 15 monitors | 1 min | Uptime + server RAM/CPU agent |
   | **MonSpark** | 2–4 monitors | 1 min | All-in-one: status page, cron monitor, phone calls |
   | **Better Stack** | 10 monitors | 3 min | On-call rotation + incident timeline |

   - ✅ External monitors watch homepage + `/health` (P0, 1–5 min), login path (P1, 5 min), and SSL-cert + DNS expiry (P1, daily). Optionally publish a public status page at `status.[DOMAIN]`.

   :::note[MonSpark as the all-in-one path]
   If you standardize on **MonSpark**, one vendor can cover uptime plus the rest:

   - Cron monitoring (heartbeat URLs) can replace Sentry Crons for backup monitoring — it is either/or, not both.
   - Server agent — RAM / CPU / disk monitoring, installed via SSH after production deploy.
   - Four monitor categories — availability, security, SEO, and integrity — plus alert channels, escalation policies, and a status page.
   :::

   :::tip[Status-page component map]
   If you publish a public status page at `status.[DOMAIN]`, map these components so each line corresponds to a real check:

   | Component | Backed by |
   |---|---|
   | **Application** | Homepage + `/health` endpoint |
   | **Authentication** | Login endpoint |
   | **Email** | Your ESP's third-party status page |
   | **Payments** | Stripe / PayPal status page |
   :::

</Steps>
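
Before wiring an external monitor, the `/health` contract from step 1 can be probed from a shell. The sketch below is illustrative only: `interpret_health_code`, `health_code`, and `health_details` are hypothetical helper names, the base URL is whatever your app answers on, and the `429` branch assumes Laravel's `throttle` middleware responds with "Too Many Requests" once the 30/min limit is exceeded.

```bash
#!/usr/bin/env bash
# Smoke-test helpers for the /health route (names are illustrative assumptions).

# Map an HTTP status code to the state the route is documented to signal.
interpret_health_code() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "unhealthy" ;;
    429) echo "throttled" ;;   # assumed response of throttle:30,1 when exceeded
    *)   echo "unexpected" ;;
  esac
}

# Print only the HTTP status code of an anonymous request. $1 = base URL.
health_code() {
  curl -s -o /dev/null -w '%{http_code}' "$1/health"
}

# Print the JSON body of an authenticated request.
# $1 = base URL, $2 = the HEALTHCHECK_TOKEN value (sent as ?token=...).
health_details() {
  curl -s --get --data-urlencode "token=$2" "$1/health"
}
```

For example, `interpret_health_code "$(health_code "https://[DOMAIN]")"` should print `healthy` on a working deployment, and `health_details` should include the `checks` key only when the token matches.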

### 4. Alert escalation + restore drills

Define who gets paged, how, and when — routed through one shared webhook config (the same one the backup and Sentry alerts use), never hardcoded URLs.

```mermaid
flowchart TD
  EV["Alert fires"] --> SEV{Severity}
  SEV -->|P0| P0["SMS + phone + Slack #alerts-p0<br/>5 min · 24/7"]
  SEV -->|P1| P1["Slack #alerts-p1 + email<br/>30 min · business hours"]
  SEV -->|P2| P2["Slack team channel<br/>4 hr"]
  SEV -->|P3| P3["Weekly email digest"]
```

<Steps>

1. **Document the severity → channel matrix**, routed via the shared webhook config.

   | Severity | Example trigger | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails · error rate > 10× baseline · homepage/health down > 2 min | SMS + phone + Slack |
   | **P1** | New unresolved production issue · warning-level log pattern | Slack + email |
   | **P2** | P95 latency regression > 2× baseline | Slack team channel |
   | **P3** | Crash-free sessions < 99% | Email digest |

   - ✅ Each severity maps to a channel, routed through one shared webhook config.

2. **Create these five rules in Sentry → Alerts** (👤). The generic matrix above defines who gets paged; these are the concrete Sentry rules that fire the P0–P3 signals.

   | Severity | Trigger (create in Sentry → Alerts) | Channel |
   |---|---|---|
   | **P0** | Cron monitor fails (any) | SMS + phone + Slack `#alerts-p0` |
   | **P0** | Error rate > 10× the 1 h baseline | Slack `#alerts-p0` + email |
   | **P1** | Any new unresolved issue in `production` | Slack `#alerts-p1` + email |
   | **P2** | P95 transaction duration regression > 2× baseline | Slack team channel |
   | **P3** | Release health: crash-free sessions < 99% | Email digest |

   - ✅ All five Sentry alert rules are active and routed through the shared webhook config.

3. **Schedule the quarterly restore drill** and capture the runbook.

   - ✅ A quarterly restore drill is scheduled and the first drill has passed. A `monitoring-state.md` bus-factor document records the backup provider + bucket, Sentry URL + cron slugs, log retention, uptime monitors, on-call contact, and next drill date, written so that someone else can use it without your help.

</Steps>
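
The severity → channel matrix can be kept in one place and read by every sender, which is what "one shared webhook config" amounts to in practice. The following is a minimal sketch, not the playbook's actual config: `ALERT_WEBHOOK_URL`, the function names, the channel labels, and the JSON payload shape are all assumptions made for illustration, and the webhook receiver is assumed to accept a JSON POST.

```bash
#!/usr/bin/env bash
# Severity routing sketch (all names and the payload shape are assumptions).

# Look up the channels for a severity, mirroring the matrix above.
channels_for() {
  case "$1" in
    P0) echo "sms phone slack:#alerts-p0" ;;
    P1) echo "slack:#alerts-p1 email" ;;
    P2) echo "slack:team" ;;
    P3) echo "email-digest" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Build a JSON payload. The message must not contain quotes or backslashes;
# use a real JSON encoder (for example jq) for arbitrary text.
alert_payload() {
  local severity="$1" message="$2" channels
  channels="$(channels_for "$severity")" || return 1
  printf '{"severity":"%s","channels":"%s","message":"%s"}\n' \
    "$severity" "$channels" "$message"
}

# POST the payload to the single shared webhook; the URL comes from the
# environment so it is never hardcoded in scripts.
send_alert() {
  : "${ALERT_WEBHOOK_URL:?set ALERT_WEBHOOK_URL in the environment}"
  alert_payload "$1" "$2" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data @- "$ALERT_WEBHOOK_URL"
}
```

A backup script could then call `send_alert P0 "nightly backup failed"` without knowing where the webhook lives or who is paged.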

:::tip[Solo operator path]
No team to escalate to? Collapse the table: P0 → phone call (MonSpark gives ~10 calls/mo) or the Sentry mobile app; P1 → email; P2–P3 → a weekly review. Skip on-call rotation entirely.
:::

:::note[Weekly monitoring review]
Set a recurring calendar event to keep alerting honest and catch slow drift:

- Review last week's alerts — tune false positives.
- Check the backup success rate.
- Check the Sentry issue backlog — any P1 issues older than 48 h?
- Review the off-site storage cost trend for the `backup-offsite` disk.
:::

Whatever the escalation path, the backups behind it still have to be provably restorable:

:::danger[A backup you never restore is not a backup]
Schedule a **quarterly restore drill**: pull a random recent backup from the off-site disk, import it into a throwaway database, and verify row counts, a sample query, and that `APP_KEY` decryption works. Log the result (date, backup used, time-to-restore). **If a drill fails, STOP and fix the cause**, then re-run the drill until it passes. Solo operators: set a recurring calendar reminder, because this is the most-skipped step.
:::
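
The logging half of the drill (date, backup used, time-to-restore, outcome) is easy to automate even when the import itself stays manual. This is a rough sketch under stated assumptions: `run_drill`, `log_drill`, and `DRILL_LOG` are hypothetical names, and the restore command passed to `run_drill` is whatever your backup tooling requires; nothing here performs the import or the verification queries.

```bash
#!/usr/bin/env bash
# Restore-drill bookkeeping sketch (names and log format are assumptions).

# Append one tab-separated record: date, backup used, time-to-restore, outcome.
# $1 = backup name, $2 = elapsed seconds, $3 = pass|fail
log_drill() {
  printf '%s\t%s\t%ss\t%s\n' "$(date -u +%F)" "$1" "$2" "$3" \
    >> "${DRILL_LOG:-restore-drills.log}"
}

# Time an arbitrary restore command and record whether it succeeded.
# $1 = backup name; remaining arguments = the command to run.
run_drill() {
  local backup="$1"; shift
  local start outcome=pass
  start="$(date +%s)"
  "$@" || outcome=fail
  log_drill "$backup" "$(( $(date +%s) - start ))" "$outcome"
  [ "$outcome" = pass ]   # non-zero exit so a failed drill is not silently ignored
}
```

A drill might then be invoked as `run_drill "<backup file>" ./your-restore-script.sh "<backup file>"`, where the restore script is yours to supply and should exit non-zero if any verification fails.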

## Checklist

Do not mark this step done until **every** box below is checked.

- [ ] **🔀 Sentry extended** — cron monitors green; performance data visible; releases tagged with the git SHA (👤 auth token generated).
- [ ] **🤖 Logs aggregated** — `LOG_STACK=daily,sentry_logs` + `LOG_LEVEL=warning` on the server; warnings reach Sentry.
- [ ] **🔀 Health + uptime wired** — `/health` (or `/up`) returns 200 with JSON; external uptime monitors wired (👤, production).
- [ ] **🤖 Escalation documented** — severity → channel matrix routed via the shared webhook config.
- [ ] **🔀 Restore drill done** — quarterly drill scheduled; first drill passed; `monitoring-state.md` current.

:::next[Next step]
[Legal, privacy & GDPR](/tech-stack/laravel/codecanyon/build/playbooks/setup-new/07-security-monitoring/06-legal-gdpr/)
:::