5 · Observability

Objective — make every failure loud: extend the basic Sentry DSN into cron monitors, performance, and release health; add rotated + structured logs; expose a health endpoint with external uptime monitoring; and define alert escalation backed by a quarterly restore drill.

With the app hardened and backups landing off-server, make every failure loud. Each section here is a “should,” not a “must” — but a failed backup that screams beats one that fails silently for weeks. These extend the basic Sentry DSN + incident runbook from the Phase 4 deploy pipeline; they don’t recreate them.

1. Full Sentry — crons, performance, release health

Confirm the SDK + DSN are already in place (Phase 3 installed the SDK, Phase 4 set the production DSN). Everything below is what comes after that — do not re-create the project or re-install the SDK.

  1. Wrap each critical scheduled command in a cron monitor. Without heartbeats, a failed backup:run fails silently.

    app/Console/Kernel.php
    $schedule->command('backup:run --only-db')->dailyAt('02:00')->sentryMonitor('daily-db-backup');
    // Sentry's free tier = ONLY 1 cron monitor (each extra bills ~$0.78/mo). Wrap just the
    // single most-critical job — the daily DB backup — and leave the rest unmonitored here:
    $schedule->command('backup:run')->weeklyOn(0, '02:30');
    $schedule->command('backup:monitor')->dailyAt('04:00');
    $schedule->command('queue:prune-batches --hours=48')->daily();
    • ✅ The one monitor auto-registers in Sentry → Crons on first run; expect a green checkmark within 24 h. Free tier = 1 monitor, so wrap only the daily DB backup (wrapping all four silently bills ~$28/yr). MonSpark (section 3) can cover the rest — pick one, not both.
  2. Set performance sampling + PII stripping in .env.

    SENTRY_TRACES_SAMPLE_RATE=0.1 # 10% of requests
    SENTRY_PROFILES_SAMPLE_RATE=0.1 # 10% profiling on traced requests
    SENTRY_SEND_DEFAULT_PII=false # GDPR: strip IP + email by default
    • ✅ Traces/profiles sample at 10% and default PII is stripped.
  3. Tag each release with the git SHA so a regression alert points at the commit that caused it. In config/sentry.php set 'release' => env('SENTRY_RELEASE') and 'environment' => env('APP_ENV'). Set SENTRY_RELEASE in a Deployer task (git rev-parse --short HEAD), then create and finalize the release with sentry-cli (releases new, releases set-commits --auto, releases finalize). Generate the auth token (👤) at Sentry → Settings → Auth Tokens (scope project:releases); store SENTRY_ORG / SENTRY_PROJECT / SENTRY_AUTH_TOKEN in the server shared/.env.

    • ✅ Releases are tagged with the SHA and finalized via sentry-cli.
  4. Upload JS source maps (only if you ship a bundle). If the frontend has a build step (Vite/Webpack), add sentry-cli sourcemaps upload to the production npm run build flow with SENTRY_AUTH_TOKEN set in CI. Skip this entirely for Blade-only apps.

    • ✅ JS errors show readable stack traces (or this is skipped for a Blade-only app).
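
The release-tagging step can be sketched as a small helper that builds the sentry-cli calls from a short SHA, so a deploy task can run them in order. This is a minimal sketch, not the project's actual Deployer task: sentryReleaseCommands is a hypothetical name, and it assumes sentry-cli is on the PATH and picks up SENTRY_ORG, SENTRY_PROJECT, and SENTRY_AUTH_TOKEN from the environment.

```php
<?php
declare(strict_types=1);

/**
 * Build the sentry-cli calls that create a release, attach commits, and
 * finalize it. Hypothetical helper; not part of Sentry or Deployer.
 *
 * @return string[] shell commands, in the order they should run
 */
function sentryReleaseCommands(string $sha): array
{
    // Accept only a plain hex SHA so the value is safe to place in a shell command.
    if (preg_match('/\A[0-9a-f]{7,40}\z/', $sha) !== 1) {
        throw new InvalidArgumentException("Not a git SHA: {$sha}");
    }

    return [
        "sentry-cli releases new {$sha}",
        "sentry-cli releases set-commits {$sha} --auto",
        "sentry-cli releases finalize {$sha}",
    ];
}
```

A deploy task could feed it the output of git rev-parse --short HEAD and run each returned command in turn; the same SHA is the value to expose as SENTRY_RELEASE.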

Keep a rotated daily channel plus a sentry_logs channel so warnings/errors flow to Sentry.

  1. Confirm the log channels exist in config/logging.php (the vendor or Phase 3 may already have them).

    'daily' => ['driver' => 'daily', 'path' => storage_path('logs/laravel.log'),
    'level' => env('LOG_LEVEL', 'debug'), 'days' => env('LOG_DAILY_DAYS', 14)],
    'security' => ['driver' => 'daily', 'path' => storage_path('logs/security.log'), 'level' => 'info', 'days' => 30],
    'payments' => ['driver' => 'daily', 'path' => storage_path('logs/payments.log'), 'level' => 'info', 'days' => 90],
    'sentry_logs' => ['driver' => 'sentry', 'level' => env('SENTRY_LOG_LEVEL', 'warning'), 'bubble' => true],
    • ✅ The daily, security, payments, and sentry_logs channels are defined.
  2. Set the per-environment log stack + level.

    | Env                  | LOG_STACK         | LOG_LEVEL |
    | -------------------- | ----------------- | --------- |
    | Local                | single            | debug     |
    | Production / staging | daily,sentry_logs | warning   |
    • ✅ Production runs LOG_STACK=daily,sentry_logs at warning, so warnings reach Sentry.
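
For LOG_STACK to take effect, the stack channel has to read it. The fragment below is a sketch of how that wiring usually looks, assuming a Laravel 11-style config/logging.php; older skeletons hardcode the channels array instead, in which case this line would need to be added by hand.

```php
<?php
// config/logging.php (fragment; merge into the existing file)
return [
    'default' => env('LOG_CHANNEL', 'stack'),

    'channels' => [
        'stack' => [
            'driver' => 'stack',
            // LOG_STACK=daily,sentry_logs becomes ['daily', 'sentry_logs'].
            'channels' => explode(',', env('LOG_STACK', 'single')),
            'ignore_exceptions' => false,
        ],

        // ...the daily, security, payments, and sentry_logs channels stay as defined...
    ],
];
```

With this in place, switching environments only means changing LOG_STACK and LOG_LEVEL in .env; no code change is needed.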

Prefer structured logging — Log::channel('payments')->info('Stripe webhook', ['event' => $event->id]) — so fields stay searchable. On servers with sudo, add a logrotate.d rule (daily, rotate 14, compress); on shared hosting, Laravel’s daily driver handles rotation. For centralized shipping, the sentry_logs channel (free with your existing Sentry plan) is the zero-new-vendor default; add Better Stack / Papertrail / Axiom only if volume exceeds Sentry’s quota.

Expose a health check (or use Laravel 11+’s built-in /up), then point an external monitor at it.

  1. Add a /health route to routes/web.php if missing.

    use Illuminate\Support\Facades\{DB, Cache, Route};

    Route::get('/health', function (\Illuminate\Http\Request $request) {
        $checks = ['app' => true, 'db' => false, 'cache' => false];
        try { DB::connection()->getPdo(); $checks['db'] = true; } catch (\Throwable $e) {}
        try { Cache::put('hc', '1', 10); $checks['cache'] = Cache::get('hc') === '1'; } catch (\Throwable $e) {}
        $healthy = !in_array(false, $checks, true);

        // Detailed subsystem breakdown only with the monitor token; anonymous
        // callers get a bare status so /health can't fingerprint internals.
        $token = config('services.healthcheck.token');
        $authed = $token && hash_equals($token, (string) $request->query('token'));
        $body = $authed
            ? ['status' => $healthy ? 'healthy' : 'unhealthy', 'checks' => $checks]
            : ['status' => $healthy ? 'healthy' : 'unhealthy'];

        return response()->json($body, $healthy ? 200 : 503);
    })->middleware('throttle:30,1'); // 30 req/min/IP — a hit-the-DB route must be rate-limited
    • ✅ /health returns 200 with a bare {status} body to anonymous callers and 503 when any check fails; the full checks breakdown appears only with ?token= matching HEALTHCHECK_TOKEN, and the route is throttled to 30 requests/min per IP so it can’t be used as a DB-DoS amplifier.
  2. Pick one external monitor (👤) and point it at the app every 1–5 min (defer this to production).

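    | Provider     | Free tier    | Interval | Best for                                           |
    | ------------ | ------------ | -------- | -------------------------------------------------- |
    | UptimeRobot  | 50 monitors  | 5 min    | Most monitors for free                             |
    | HetrixTools  | 15 monitors  | 1 min    | Uptime + server RAM/CPU agent                      |
    | MonSpark     | 2–4 monitors | 1 min    | All-in-one: status page, cron monitor, phone calls |
    | Better Stack | 10 monitors  | 3 min    | On-call rotation + incident timeline               |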
    • ✅ External monitors watch homepage + /health (P0, 1–5 min), login path (P1, 5 min), and SSL-cert + DNS expiry (P1, daily). Optionally publish a public status page at status.[DOMAIN].
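
The /health route reads its token from config('services.healthcheck.token'), so that key has to exist. A minimal sketch of the entry, assuming the token is supplied through a HEALTHCHECK_TOKEN variable in .env:

```php
<?php
// config/services.php (fragment; merge into the existing array)
return [
    // ...existing service entries stay as they are...

    'healthcheck' => [
        // Read by config('services.healthcheck.token') in the /health route.
        // When unset, the route never returns the detailed checks breakdown.
        'token' => env('HEALTHCHECK_TOKEN'),
    ],
];
```

The external monitor would then request /health?token=<value> to see per-subsystem results, while anonymous callers keep getting only the bare status.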

Define who gets paged, how, and when — routed through one shared webhook config (the same one the backup and Sentry alerts use), never hardcoded URLs.

flowchart TD
EV["Alert fires"] --> SEV{Severity}
SEV -->|P0| P0["SMS + phone + Slack #alerts-p0<br/>5 min · 24/7"]
SEV -->|P1| P1["Slack #alerts-p1 + email<br/>30 min · business hours"]
SEV -->|P2| P2["Slack team channel<br/>4 hr"]
SEV -->|P3| P3["Weekly email digest"]
  1. Document the severity → channel matrix, routed via the shared webhook config.

    | Severity | Example trigger | Channel |
    | -------- | --------------- | ------- |
    | P0 | Cron monitor fails · error rate > 10× baseline · homepage/health down > 2 min | SMS + phone + Slack |
    | P1 | New unresolved production issue · warning-level log pattern | Slack + email |
    | P2 | P95 latency regression > 2× baseline | Slack team channel |
    | P3 | Crash-free sessions < 99% | Email digest |
    • ✅ Each severity maps to a channel, routed through one shared webhook config.
  2. Create these five rules in Sentry → Alerts (👤). The generic matrix above defines who gets paged; these are the concrete Sentry rules that fire the P0–P3 signals.

    | Severity | Trigger (create in Sentry → Alerts) | Channel |
    | -------- | ----------------------------------- | ------- |
    | P0 | Cron monitor fails (any) | SMS + phone + Slack #alerts-p0 |
    | P0 | Error rate > 10× the 1 h baseline | Slack #alerts-p0 + email |
    | P1 | Any new unresolved issue in production | Slack #alerts-p1 + email |
    | P2 | P95 transaction duration regression > 2× baseline | Slack team channel |
    | P3 | Release health: crash-free sessions < 99% | Email digest |
    • ✅ All five Sentry alert rules are active and routed through the shared webhook config.
  3. Schedule the quarterly restore drill and capture the runbook.

    • ✅ A quarterly restore drill is scheduled, the first drill passed, and provider + bucket, Sentry URL + cron slugs, log retention, uptime monitors, on-call, and next drill date are captured in a monitoring-state.md bus-factor document written as if someone else must use it without your help.
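
The severity → channel matrix from step 1 can be kept as data rather than scattered across alert scripts. The sketch below is one possible shape, not the project's actual shared webhook config: the channel keys, the ALERT_WEBHOOK_* variable names, and both function names are hypothetical, and the minute values are read off the flowchart as response windows.

```php
<?php
declare(strict_types=1);

// Hypothetical escalation map mirroring the P0–P3 matrix.
const ESCALATION = [
    'P0' => ['channels' => ['sms', 'phone', 'slack_p0'], 'respond_within_minutes' => 5],
    'P1' => ['channels' => ['slack_p1', 'email'],        'respond_within_minutes' => 30],
    'P2' => ['channels' => ['slack_team'],               'respond_within_minutes' => 240],
    'P3' => ['channels' => ['email_digest'],             'respond_within_minutes' => null],
];

// Webhook URLs are looked up by env-variable name, never hardcoded here.
const WEBHOOK_ENV = [
    'slack_p0'   => 'ALERT_WEBHOOK_SLACK_P0',
    'slack_p1'   => 'ALERT_WEBHOOK_SLACK_P1',
    'slack_team' => 'ALERT_WEBHOOK_SLACK_TEAM',
];

/** @return string[] channel keys to notify for the given severity */
function channelsFor(string $severity): array
{
    $key = strtoupper($severity);
    if (!array_key_exists($key, ESCALATION)) {
        throw new InvalidArgumentException("Unknown severity: {$severity}");
    }
    return ESCALATION[$key]['channels'];
}

/** Resolve a channel's webhook URL from the environment, or null if none is set. */
function webhookUrlFor(string $channel): ?string
{
    $envName = WEBHOOK_ENV[$channel] ?? null;
    if ($envName === null) {
        return null; // e.g. sms, phone, and email are not webhook-backed in this sketch
    }
    $url = getenv($envName);
    return ($url === false || $url === '') ? null : $url;
}
```

Keeping the map in one place means the backup notifier and any custom alert script resolve channels the same way, and rotating a webhook only touches the environment.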

Whatever the escalation path, the backups behind it still have to be provably restorable:

Do not mark this step done until every box below is checked.

  • 🔀 Sentry extended — cron monitor green; performance data visible; releases tagged with the git SHA (👤 auth token generated).
  • 🤖 Logs aggregated — LOG_STACK=daily,sentry_logs + LOG_LEVEL=warning on the server; warnings reach Sentry.
  • 🔀 Health + uptime wired — /health (or /up) returns 200 with JSON; external uptime monitors wired (👤, production).
  • 🤖 Escalation documented — severity → channel matrix routed via the shared webhook config.
  • 🔀 Restore drill done — quarterly drill scheduled; first drill passed; monitoring-state.md current.