How we built automated health-check monitoring for 34 namespaces

Functional correctness and uptime are two different questions. An endpoint can be perfectly correct today and silently break tomorrow because an upstream free API changed its response shape, hit a rate limit, or went down entirely. We needed a way to find out when that happens without re-testing every namespace by hand.

One real sample call per namespace, not per endpoint

Checking all 400+ endpoints twice a day would be excessive — most failure modes (an upstream API going down, a shared library breaking) affect an entire namespace at once, not one endpoint in isolation. So the health check hits one real, valid sample call per live namespace — the same list our public /status page already uses for its client-side checks, now shared in one module so the two can never drift apart.
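As a rough illustration of the shared module, here is a minimal sketch. The file name, the SampleCall shape, the namespaces, and the paths are all placeholders invented for this example, not the real list. The small helper shows one way to catch a namespace that was accidentally listed twice.

```typescript
// Hypothetical shared module (for example lib/health-samples.ts).
// The namespaces and paths below are placeholders, not the real list.
export interface SampleCall {
  namespace: string; // namespace this sample call stands in for
  endpoint: string; // path of one real, valid request in that namespace
}

export const SAMPLE_CALLS: readonly SampleCall[] = [
  { namespace: "weather", endpoint: "/api/weather/current?city=London" },
  { namespace: "currency", endpoint: "/api/currency/convert?from=USD&to=EUR&amount=1" },
];

// Returns every namespace that appears more than once in the list.
export function findDuplicateNamespaces(calls: readonly SampleCall[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const call of calls) {
    if (seen.has(call.namespace)) dupes.add(call.namespace);
    seen.add(call.namespace);
  }
  return Array.from(dupes);
}
```

Both the /status page and the server-side check would import SAMPLE_CALLS from this one place, which is what keeps the two from drifting apart.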

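// Possible outcomes of one sample call.
type HealthStatus = "operational" | "degraded" | "outage";

// Calls one sample endpoint and classifies the response. Never throws.
export async function checkServiceHealth(baseUrl: string, endpoint: string): Promise<{ status: HealthStatus; latencyMs: number }> {
  const t0 = Date.now();
  try {
    // Give up after 8 seconds and skip caching so the request reaches the live route.
    const res = await fetch(`${baseUrl}${endpoint}`, { signal: AbortSignal.timeout(8000), cache: "no-store" });
    const latencyMs = Date.now() - t0;
    // 503 is treated as degraded rather than a full outage.
    if (res.status === 503) return { status: "degraded", latencyMs };
    // Any other 5xx counts as an outage.
    if (res.status >= 500) return { status: "outage", latencyMs };
    // Everything below 500, including 4xx, counts as operational.
    return { status: "operational", latencyMs };
  } catch {
    // Network error or timeout: report how long we waited before giving up.
    return { status: "outage", latencyMs: Date.now() - t0 };
  }
}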
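
To connect the checker to the per-namespace list, a runner along these lines could fan out one call per namespace. This is a sketch under assumptions: checkAllNamespaces, SampleCall, and the Checker type are hypothetical names, and the checker is passed in as a parameter so the example stays self-contained.

```typescript
interface SampleCall {
  namespace: string;
  endpoint: string;
}

interface HealthResult {
  status: string;
  latencyMs: number;
}

// Any function with the same shape as checkServiceHealth fits here.
type Checker = (baseUrl: string, endpoint: string) => Promise<HealthResult>;

// Runs every sample call concurrently and labels each result with its namespace.
// Promise.all keeps the results in the same order as the input list.
// It rejects if any checker rejects, so the checker should catch its own errors,
// as checkServiceHealth does.
async function checkAllNamespaces(
  baseUrl: string,
  calls: readonly SampleCall[],
  check: Checker,
): Promise<Array<HealthResult & { namespace: string }>> {
  return Promise.all(
    calls.map(async ({ namespace, endpoint }) => {
      const result = await check(baseUrl, endpoint);
      return { namespace, status: result.status, latencyMs: result.latencyMs };
    }),
  );
}
```

Because the calls run concurrently, the whole pass should take roughly as long as the slowest single check, which the 8-second timeout bounds.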

Twice a day, via Vercel cron

A cron entry in vercel.json hits the health-check route at 6 AM and 6 PM UTC. When a CRON_SECRET environment variable is set on the project, Vercel automatically sends an Authorization: Bearer $CRON_SECRET header on cron-triggered requests, and the route checks that header before doing anything. The same route also accepts a manual trigger from an admin dashboard button, gated by a separate admin-token cookie check instead of the cron secret.
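
The two entry points can be sketched as a single authorization helper. Everything named here is an assumption for illustration: the route path in the comment, the authorizeTrigger function, and the ADMIN_TOKEN variable are hypothetical, and a production version would likely use a constant-time comparison and a proper session check for the admin cookie.

```typescript
// The vercel.json cron entry would look something like this (path is hypothetical):
//   { "crons": [{ "path": "/api/cron/health-check", "schedule": "0 6,18 * * *" }] }
// "0 6,18 * * *" means minute 0 of hours 6 and 18; Vercel evaluates cron schedules in UTC.

type Trigger = "cron" | "admin" | null;

// Decides whether a request may run the health check, and which path admitted it.
function authorizeTrigger(
  authorizationHeader: string | null,
  adminCookie: string | undefined,
  env: { CRON_SECRET?: string; ADMIN_TOKEN?: string },
): Trigger {
  // Require the secret to be set, so a missing variable can never match "Bearer undefined".
  if (env.CRON_SECRET && authorizationHeader === `Bearer ${env.CRON_SECRET}`) {
    return "cron";
  }
  // Manual trigger from the admin dashboard, gated by a separate cookie value.
  if (env.ADMIN_TOKEN && adminCookie === env.ADMIN_TOKEN) {
    return "admin";
  }
  return null; // the caller should answer 401 and do no work
}
```

In a route handler, the first argument would come from request.headers.get("authorization") and the second from the request's cookies.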

Results go to two places

Every check writes a row to a history table (namespace, status, latency, timestamp), so there's a queryable record over time, not just a current snapshot. If a check comes back as anything other than operational, it also writes an alert to our admin alerts table, which already handles payment and security events, and sends an email.
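
The split between "always record" and "alert only on failure" can be sketched as a pure function that turns check results into the rows to write. The row shapes, field names, and planWrites itself are hypothetical, and the actual database writes and email sending are left out.

```typescript
interface CheckResult {
  namespace: string;
  status: string;
  latencyMs: number;
}

interface HistoryRow extends CheckResult {
  checkedAt: string; // ISO 8601 timestamp of the run
}

interface AlertRow {
  kind: "health-check";
  namespace: string;
  message: string;
  createdAt: string;
}

// Every result becomes a history row; only non-operational results become alerts.
function planWrites(results: readonly CheckResult[], now: Date): { history: HistoryRow[]; alerts: AlertRow[] } {
  const timestamp = now.toISOString();
  const history = results.map((r) => ({
    namespace: r.namespace,
    status: r.status,
    latencyMs: r.latencyMs,
    checkedAt: timestamp,
  }));
  const alerts = results
    .filter((r) => r.status !== "operational")
    .map((r) => ({
      kind: "health-check" as const,
      namespace: r.namespace,
      message: `${r.namespace} is ${r.status} (${r.latencyMs} ms)`,
      createdAt: timestamp,
    }));
  return { history, alerts };
}
```

A caller would insert every history row, then insert each alert and send one email per alert, so a fully healthy run produces history rows and nothing else.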

Why alert-only, not a routine digest

We deliberately did not build a “here's your twice-daily status report” email. On a normal day, every namespace is operational, and an email saying so twice a day forever is exactly the kind of notification that trains you to stop reading notifications. Silence means healthy. An email only arrives when a namespace actually fails its check — which means every email is real signal, not noise to filter out.