ops: backup hardening + restore drill + self-hosted uptime monitor
All checks were successful
Deploy to Production / deploy (push) Successful in 1m10s
Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:
1. Backup hardening (backup.sh):
- Previous cron one-liner did pg_dump | gzip with no validation.
- Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
header sanity (scans first 5 lines — line 1 is just "--", actual
"PostgreSQL database dump" comment lands on line 2), size-warning
under 1KB, atomic move-into-place so partial backups never replace
the previous good file. 14-day retention preserved.
- Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
grep+cut, NOT `source` — the .env.production has unquoted text
values (e.g. ADMIN_NAME) that crash a sourced shell.
2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
- Restores the newest backup into a throwaway DB inside the same
Postgres container, verifies the core tables exist (users,
sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves
backups are actually restorable, not just byte-streams that look
like backups. Silent-corruption detector.
3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
- Probes homepage + /api/health + /robots.txt.
- Edge-triggered alerting: SMS via Twilio only on up→down and
down→up transitions (avoids SMS storm during sustained outages).
- Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
itself dies the heartbeat stops and the external watchdog alerts
(covers the gap that self-hosted monitors can't see their own
box failing).
notify.sh is the shared helper: Twilio SMS if all four creds set,
optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never
fails loudly — broken notification path still lands in journalctl
-t bmm-ops.
README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).
Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 2267daadd4
commit 591a1cb575
125
ops/bmm/README.md
Normal file
@@ -0,0 +1,125 @@
# BMM Ops: Backup & Uptime

Scripts live in `/opt/bmm-ops/` on the Hetzner box. They are intentionally
**outside** `/opt/buildmymcpserver/` so app deploys (tar overlay) don't
overwrite them.

## Scripts

| Script | Cron | Purpose |
|---|---|---|
| `backup.sh` | `15 3 * * *` (03:15 UTC daily) | pg_dump → gzip → integrity-check → offsite push → 14d retention |
| `restore-test.sh` | `30 4 * * 0` (Sun 04:30 UTC weekly) | Restores latest backup into temp DB, verifies core schema, drops temp DB |
| `uptime-check.sh` | `*/5 * * * *` (every 5 min) | Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog |
| `notify.sh` | (helper) | Sends Twilio SMS + optional webhook on alert; always syslogs |

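The schedules above map onto the three `/etc/cron.d/` files listed under "Cron files". A minimal sketch of how those files could be written, assuming the cron lines given in each script's header comment. The `CRON_DIR` override and the `install_cron` helper are hypothetical, added so the snippet can be tried against a scratch directory first:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override: defaults to a scratch directory so the sketch can be
# tried safely; point it at /etc/cron.d (as root) to install for real.
CRON_DIR="${CRON_DIR:-$(mktemp -d)}"

# install_cron <file-name> <schedule> <script>: write one cron.d entry.
install_cron() {
  local name="$1" schedule="$2" script="$3"
  # cron.d entries need a user field ("root") between schedule and command.
  printf '%s root %s\n' "$schedule" "$script" > "${CRON_DIR}/${name}"
  chmod 644 "${CRON_DIR}/${name}"
}

install_cron bmm-postgres-backup '15 3 * * *'  /opt/bmm-ops/backup.sh
install_cron bmm-restore-test    '30 4 * * 0'  /opt/bmm-ops/restore-test.sh
install_cron bmm-uptime          '*/5 * * * *' /opt/bmm-ops/uptime-check.sh

echo "cron entries written to ${CRON_DIR}"
```

Run it once without `CRON_DIR` set and inspect the scratch directory before installing for real.
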
## Backup Strategy

**3-2-1 rule applied:**
- **3 copies**: Postgres volume (live) + local gzip (`/var/backups/bmm/`) + offsite (rclone target)
- **2 different media**: Hetzner box SSD + external object storage (R2/B2/Hetzner Storage Box)
- **1 offsite**: not on the same machine

**Retention:**
- Local: 14 days rolling
- Offsite: `backup.sh` only runs `rclone copy`, which never deletes remote files, so set expiry with lifecycle rules on the bucket itself (recommend: 30 daily, 12 monthly)

**Integrity guarantees (every run):**
1. `pg_dump` pipefail-safe: a partial dump never overwrites the previous good backup
2. `gunzip -t` validates the compressed stream
3. Header sanity check: one of the first 5 decompressed lines must start with `-- PostgreSQL database dump` (line 1 of a dump is a bare `--`; the comment itself lands on line 2). This catches the case where pg_dump emitted an error message that still compressed cleanly
4. Size warning if backup drops below 1KB
5. Atomic mv-into-place: only swap the dated filename in once all checks pass

**Restore drill (proven, not assumed):**
- `restore-test.sh` runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies the core tables (`users`, `sessions`, `oauth_tokens`, `mcp_servers`) exist, then drops the temp DB.
- Failure here sends an SMS, which makes the drill a silent-corruption detector.

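Integrity checks 2 and 3 can also be run by hand, for example on a file pulled back from the offsite remote. A minimal sketch, assuming the same header rule as `backup.sh`; the `verify_backup` function name is hypothetical:

```bash
#!/usr/bin/env bash
set -uo pipefail

# verify_backup <file.sql.gz>: returns 0 when the file passes both checks.
verify_backup() {
  local file="$1" header
  # Check 2: the gzip stream must decompress cleanly end-to-end.
  gunzip -t "$file" 2>/dev/null || { echo "gzip check failed: $file" >&2; return 1; }
  # Check 3: the pg_dump banner must appear within the first 5 lines.
  header=$(gunzip -c "$file" 2>/dev/null | head -5)
  grep -q '^-- PostgreSQL database dump' <<<"$header" \
    || { echo "pg_dump header missing: $file" >&2; return 1; }
}

# Usage: verify_backup /var/backups/bmm/bmm-YYYYMMDD.sql.gz && echo "looks sane"
```

This only shows that the file looks like a dump. The weekly restore drill remains the real proof that it restores.
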
**Manual restore (full recovery procedure):**

```bash
# 1. Stop dependent services
docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator

# 2. Drop + recreate target DB (DANGER: destroys current data)
docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"
docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"

# 3. Restore
gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz | docker exec -i bmm-postgres psql -U bmm -d bmm

# 4. Restart services
docker compose --env-file .env.production -f docker-compose.prod.yml up -d
```

If the box itself is gone: spin up a fresh Postgres on a new box, fetch the
latest offsite backup (`rclone copy` from R2/B2, or `scp`/SFTP from a Hetzner
Storage Box), then run step 3 above against the new container.

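Step 3 needs a concrete filename in place of `bmm-YYYYMMDD.sql.gz`. A small sketch for picking the newest local backup, assuming the date-stamped naming used by `backup.sh`, which sorts chronologically by name. The `latest_backup` helper is hypothetical; `restore-test.sh` itself picks by modification time with `ls -t`:

```bash
#!/usr/bin/env bash
set -uo pipefail

# latest_backup [dir]: print the newest bmm-YYYYMMDD.sql.gz in dir.
# Names embed a zero-padded UTC date, so the last glob match is the newest.
latest_backup() {
  local dir="${1:-/var/backups/bmm}" files
  shopt -s nullglob            # an empty directory yields an empty array
  files=("$dir"/bmm-*.sql.gz)
  shopt -u nullglob
  [ "${#files[@]}" -gt 0 ] || { echo "no backup found in $dir" >&2; return 1; }
  printf '%s\n' "${files[${#files[@]}-1]}"
}

# Usage:
#   gunzip -c "$(latest_backup)" | docker exec -i bmm-postgres psql -U bmm -d bmm
```
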
## Offsite: to enable

Pick one of three:

### Option A: Cloudflare R2 (recommended, you're already on CF)
```bash
apt-get install -y rclone
# Run as root: cron runs backup.sh as root and rclone keeps its config per user.
rclone config   # → New remote (name it "r2") → s3 → Cloudflare → fill access_key/secret/endpoint
# Then in /opt/buildmymcpserver/.env.production:
BMM_BACKUP_REMOTE=r2:bmm-backups/postgres
```
10 GB free, no egress fees.

### Option B: Backblaze B2
Same `rclone config` but choose Backblaze B2. $6/TB/mo, 10 GB free.

### Option C: Hetzner Storage Box
Order one in Hetzner Robot. Uses SFTP via rclone (`sftp` remote type).
Cheapest for keeping data inside the same provider's network.

The `backup.sh` script picks up `BMM_BACKUP_REMOTE` automatically and runs
`rclone copy` after every successful local backup. No code changes needed.

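`backup.sh` reads `BMM_BACKUP_REMOTE` by grep-parsing `.env.production` rather than sourcing it, because unquoted values containing spaces break a sourced shell. A minimal sketch of that parsing on a scratch file, modelled on the `read_env` helper in `notify.sh`. The second argument and the sample values are illustrative assumptions:

```bash
#!/usr/bin/env bash
set -uo pipefail

# read_env <KEY> <env-file>: print the value of KEY without evaluating it.
# Surrounding double or single quotes are stripped, nothing else is touched.
read_env() {
  grep -E "^$1=" "$2" 2>/dev/null | head -1 | cut -d= -f2- \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}

# Scratch env file with made-up values in Docker compose style.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
ADMIN_NAME=Jane Example
BMM_BACKUP_REMOTE="r2:bmm-backups/postgres"
EOF

read_env ADMIN_NAME "$env_file"          # prints: Jane Example
read_env BMM_BACKUP_REMOTE "$env_file"   # prints: r2:bmm-backups/postgres

# Sourcing the same file would try to run "Example" as a command with
# ADMIN_NAME=Jane in its environment, which is why the scripts avoid it.
```
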
## Uptime Monitoring

**Two-layer strategy**, covering both app failure and box failure:

1. **Self-hosted probe** (`uptime-check.sh` every 5 min from this box):
   Detects app-layer outages: Postgres down, API returning 500, web container
   crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to
   avoid an SMS storm on sustained outages); sends an "uptime-recovered" SMS
   on the first tick that succeeds again.

2. **External watchdog** (healthchecks.io heartbeat):
   `uptime-check.sh` pings `HEALTHCHECKS_HEARTBEAT_URL` on every successful
   probe. If the box itself dies (network loss, hardware fail, kernel panic),
   no heartbeat arrives → healthchecks.io alerts via its own channel.
   Because the ping is sent only on success, a sustained app outage also
   stops the heartbeat.
   **Without this layer the self-hosted monitor cannot detect box-level
   failures**, because they kill the monitor itself.

**To enable external watchdog:**
1. Sign up at https://healthchecks.io (free, no credit card)
2. Create a new check with 5-minute period, 1-minute grace
3. Copy the ping URL (`https://hc-ping.com/<uuid>`)
4. In `/opt/buildmymcpserver/.env.production`:
   ```
   HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/<your-uuid>
   ```
5. Configure healthchecks.io to send email / SMS / Slack on failure

**Alert target (self-hosted layer):**
`ADMIN_PHONE=+41XXXXXXXXX` must be set in `.env.production` for Twilio SMS,
together with `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN` and `TWILIO_SMS_FROM`.
If any of the four is missing, alerts still land in syslog
(`journalctl -t bmm-ops`) but no SMS is sent. The external watchdog notifies
through whatever channels are configured on healthchecks.io.

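The edge-triggered behaviour of layer 1 comes down to comparing the previous state with the current one. A minimal sketch, assuming the same `up`/`down` state values as `uptime-check.sh`. The `alert_for` function is a hypothetical distillation, not the script's actual structure:

```bash
#!/usr/bin/env bash
set -uo pipefail

# alert_for <previous-state> <current-state>
# Prints the alert subject for a state change, or nothing when the state
# is unchanged (so a sustained outage produces exactly one SMS).
alert_for() {
  local prev="$1" now="$2"
  case "${prev}:${now}" in
    up:down) echo "uptime-down" ;;        # first failing tick
    down:up) echo "uptime-recovered" ;;   # first healthy tick after an outage
    *)       : ;;                         # up:up or down:down, stay quiet
  esac
}

# Simulate six 5-minute ticks: two healthy, three failing, one healthy.
prev="up"
for now in up up down down down up; do
  subject=$(alert_for "$prev" "$now")
  [ -n "$subject" ] && echo "tick ${prev}->${now}: send ${subject}"
  prev="$now"
done
```

Only the third and sixth ticks produce an alert; the two repeated `down` ticks stay silent.
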
## Logs

| File | Content |
|---|---|
| `/var/log/bmm-backup.log` | Backup + restore-test history |
| `/var/log/bmm-uptime.log` | One line per 5-min check |
| `journalctl -t bmm-ops` | All notify.sh events (syslog) |

## Cron files

| File | Purpose |
|---|---|
| `/etc/cron.d/bmm-postgres-backup` | runs backup.sh |
| `/etc/cron.d/bmm-restore-test` | runs restore-test.sh |
| `/etc/cron.d/bmm-uptime` | runs uptime-check.sh |
89
ops/bmm/backup.sh
Normal file
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
# Daily Postgres backup for BMM. Replaces the inline cron one-liner with:
#   - integrity check (gunzip -t + pg_dump header sanity)
#   - failure alert via notify.sh
#   - structured logging
#   - optional offsite push via rclone (if rclone configured + BMM_BACKUP_REMOTE set)
#   - 14-day local retention
#
# Cron: 15 3 * * * root /opt/bmm-ops/backup.sh
#
# Restore:
#   docker exec -i bmm-postgres psql -U bmm -d bmm_restore_test \
#     < <(gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz)

set -uo pipefail

BACKUP_DIR="/var/backups/bmm"
LOG_FILE="/var/log/bmm-backup.log"
NOTIFY="/opt/bmm-ops/notify.sh"
RETENTION_DAYS=14
PG_USER="bmm"
PG_DB="bmm"
CONTAINER="bmm-postgres"

mkdir -p "$BACKUP_DIR"
DATE=$(date -u +%Y%m%d)
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
OUT="${BACKUP_DIR}/bmm-${DATE}.sql.gz"

log() { echo "[${TIMESTAMP}] $*" >> "$LOG_FILE"; }
fail() { log "FAIL: $*"; "$NOTIFY" "backup-failed" "$*"; exit 1; }

log "starting backup"

# pg_dump → gzip in one pipeline; pipefail catches a dump failure mid-stream
if ! docker exec "$CONTAINER" pg_dump -U "$PG_USER" "$PG_DB" 2>>"$LOG_FILE" | gzip > "$OUT.tmp"; then
  rm -f "$OUT.tmp"
  fail "pg_dump pipeline failed"
fi

# Integrity check 1: gzip stream must be valid end-to-end
if ! gunzip -t "$OUT.tmp" 2>>"$LOG_FILE"; then
  rm -f "$OUT.tmp"
  fail "gzip integrity check failed for $OUT.tmp"
fi

# Integrity check 2: decompressed content must contain the pg_dump header
# in the first few lines. pg_dump emits "--" on line 1 and the actual
# "-- PostgreSQL database dump" comment on line 2, so we scan the first 5
# lines rather than only line 1.
HEADER_BLOCK=$(gunzip -c "$OUT.tmp" | head -5)
if ! echo "$HEADER_BLOCK" | grep -q "^-- PostgreSQL database dump"; then
  rm -f "$OUT.tmp"
  fail "pg_dump output missing expected header (first 5 lines: $(echo "$HEADER_BLOCK" | tr '\n' '|' | cut -c1-120))"
fi

# Size sanity: backups have grown to ~8KB. A sub-1KB dump means schema-only
# or empty. Likely broken: alert but keep the file for inspection. The file
# is still moved into place below, so the alert names the final path.
SIZE=$(stat -c%s "$OUT.tmp")
if [ "$SIZE" -lt 1024 ]; then
  log "WARN: backup unusually small (${SIZE} bytes)"
  "$NOTIFY" "backup-suspicious" "backup is only ${SIZE} bytes, investigate $OUT"
fi

# Atomic move: only swap into place once all checks passed
mv "$OUT.tmp" "$OUT"
log "backup written: $OUT (${SIZE} bytes)"

# Optional offsite push: set BMM_BACKUP_REMOTE=<rclone-remote>:<path> in
# /opt/buildmymcpserver/.env.production once rclone is configured. We
# grep-parse rather than sourcing the env file because the env file is
# managed for Docker compose (KEY=value, sometimes unquoted text values
# like names) and `source` evaluates unquoted RHS as shell, which breaks
# on any value containing whitespace.
ENV_FILE="/opt/buildmymcpserver/.env.production"
BMM_BACKUP_REMOTE="$(grep -E '^BMM_BACKUP_REMOTE=' "$ENV_FILE" 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
if [ -n "${BMM_BACKUP_REMOTE:-}" ] && command -v rclone >/dev/null 2>&1; then
  if rclone copy "$OUT" "$BMM_BACKUP_REMOTE" --quiet 2>>"$LOG_FILE"; then
    log "offsite copy ok: $BMM_BACKUP_REMOTE"
  else
    # Log as well as notify so the failure shows up in the backup history.
    log "WARN: offsite copy failed: $BMM_BACKUP_REMOTE"
    "$NOTIFY" "backup-offsite-failed" "rclone copy to $BMM_BACKUP_REMOTE failed"
  fi
fi

# Retention: keep last 14 days
find "$BACKUP_DIR" -maxdepth 1 -name "bmm-*.sql.gz" -mtime "+${RETENTION_DAYS}" -delete 2>>"$LOG_FILE"

log "done"
exit 0
61
ops/bmm/notify.sh
Normal file
@@ -0,0 +1,61 @@
#!/usr/bin/env bash
# Shared notification helper for BMM ops scripts.
#
# Sends an alert via:
#   - Twilio SMS (if TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_SMS_FROM,
#     ADMIN_PHONE are all set in /opt/buildmymcpserver/.env.production)
#   - HEALTHCHECKS_FAIL_URL (if set; generic webhook fallback)
#   - syslog (always)
#
# Usage: notify.sh "subject" "body"
#
# Designed to never fail loudly: if Twilio is misconfigured we still log
# to syslog so failures aren't silent. The backup/uptime scripts rely on
# this helper to handle its own delivery failures and always exit 0.

set -uo pipefail

SUBJECT="${1:-bmm-alert}"
BODY="${2:-}"

# Always syslog: covers the case where notification channels are broken
logger -t bmm-ops "$SUBJECT: $BODY"

# Grep-parse the env file rather than `source`-ing it: the file is managed
# for Docker compose (KEY=value, often unquoted text values like names),
# and `source` evaluates unquoted RHS as shell, breaking on any value
# with whitespace or shell metachars. This pulls only the keys we need.
ENV_FILE="/opt/buildmymcpserver/.env.production"
read_env() {
  grep -E "^$1=" "$ENV_FILE" 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/'
}
if [ -f "$ENV_FILE" ]; then
  TWILIO_ACCOUNT_SID="${TWILIO_ACCOUNT_SID:-$(read_env TWILIO_ACCOUNT_SID)}"
  TWILIO_AUTH_TOKEN="${TWILIO_AUTH_TOKEN:-$(read_env TWILIO_AUTH_TOKEN)}"
  TWILIO_SMS_FROM="${TWILIO_SMS_FROM:-$(read_env TWILIO_SMS_FROM)}"
  ADMIN_PHONE="${ADMIN_PHONE:-$(read_env ADMIN_PHONE)}"
  HEALTHCHECKS_FAIL_URL="${HEALTHCHECKS_FAIL_URL:-$(read_env HEALTHCHECKS_FAIL_URL)}"
fi

# Twilio SMS: only if all four vars set.
# -f makes curl exit non-zero on HTTP errors (e.g. 401 for bad credentials),
# so a rejected request is logged as a failure instead of passing silently.
if [ -n "${TWILIO_ACCOUNT_SID:-}" ] && \
   [ -n "${TWILIO_AUTH_TOKEN:-}" ] && \
   [ -n "${TWILIO_SMS_FROM:-}" ] && \
   [ -n "${ADMIN_PHONE:-}" ]; then
  curl -fsS -o /dev/null --max-time 10 \
    -X POST "https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/Messages.json" \
    --data-urlencode "From=${TWILIO_SMS_FROM}" \
    --data-urlencode "To=${ADMIN_PHONE}" \
    --data-urlencode "Body=[BMM] ${SUBJECT}: ${BODY}" \
    -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}" \
    || logger -t bmm-ops "twilio-sms-failed: $SUBJECT"
fi

# Generic webhook (for healthchecks.io, BetterStack, etc.): POST body
if [ -n "${HEALTHCHECKS_FAIL_URL:-}" ]; then
  curl -fsS -o /dev/null --max-time 10 --retry 2 \
    --data "${SUBJECT}: ${BODY}" "${HEALTHCHECKS_FAIL_URL}" \
    || logger -t bmm-ops "healthcheck-webhook-failed"
fi

exit 0
57
ops/bmm/restore-test.sh
Normal file
@@ -0,0 +1,57 @@
#!/usr/bin/env bash
# Weekly restore test: proves backups are actually restorable, not just
# byte-streams that look like backups. Restores latest backup into a
# temporary DB inside the same Postgres container, runs a schema check,
# then drops the temp DB.
#
# Cron: 30 4 * * 0 root /opt/bmm-ops/restore-test.sh (Sundays 04:30 UTC)

set -uo pipefail

BACKUP_DIR="/var/backups/bmm"
LOG_FILE="/var/log/bmm-backup.log"
NOTIFY="/opt/bmm-ops/notify.sh"
PG_USER="bmm"
CONTAINER="bmm-postgres"
TEMP_DB="bmm_restore_test_$(date +%s)"
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)

log() { echo "[${TS}] restore-test: $*" >> "$LOG_FILE"; }
fail() {
  log "FAIL: $*"
  docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE IF EXISTS ${TEMP_DB}" >/dev/null 2>&1
  "$NOTIFY" "restore-test-failed" "$*"
  exit 1
}

# Find newest backup
LATEST=$(ls -t "${BACKUP_DIR}"/bmm-*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST" ] || [ ! -f "$LATEST" ]; then
  fail "no backup found in ${BACKUP_DIR}"
fi

log "testing restore from: $LATEST"

# Create temp DB
if ! docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "CREATE DATABASE ${TEMP_DB}" >/dev/null 2>&1; then
  fail "could not create temp DB ${TEMP_DB}"
fi

# Restore: pipe through container stdin. ON_ERROR_STOP=1 makes psql exit
# non-zero on the first SQL error; without it psql reports success even
# when statements in the dump fail.
if ! gunzip -c "$LATEST" | docker exec -i "$CONTAINER" psql -v ON_ERROR_STOP=1 -U "$PG_USER" -d "$TEMP_DB" >/dev/null 2>>"$LOG_FILE"; then
  fail "psql restore failed for $LATEST"
fi

# Schema sanity: expect the core tables to exist (adjust if schema evolves)
EXPECTED_TABLES="users sessions oauth_tokens mcp_servers"
for tbl in $EXPECTED_TABLES; do
  COUNT=$(docker exec "$CONTAINER" psql -U "$PG_USER" -d "$TEMP_DB" -tAc "SELECT count(*) FROM information_schema.tables WHERE table_name='${tbl}'" 2>>"$LOG_FILE")
  if [ "$COUNT" != "1" ]; then
    fail "restored DB missing expected table: ${tbl}"
  fi
done

# Drop temp DB; a failed drop is logged so leftover databases get noticed
docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE ${TEMP_DB}" >/dev/null 2>&1 \
  || log "WARN: could not drop temp DB ${TEMP_DB}"
log "ok: $LATEST restores cleanly, schema validates"
exit 0
67
ops/bmm/uptime-check.sh
Normal file
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
# Self-hosted uptime monitor: probes homepage, /api/health and /robots.txt
# every 5 min. Sends SMS via notify.sh on transition into / out of failure
# state. Pings a healthchecks.io heartbeat (HEALTHCHECKS_HEARTBEAT_URL) on
# every success so that if THIS box dies the external service alerts.
#
# Cron: */5 * * * * root /opt/bmm-ops/uptime-check.sh
#
# State file tracks last-known status so repeated failures don't spam SMS.

set -uo pipefail

STATE_DIR="/var/lib/bmm-ops"
STATE_FILE="${STATE_DIR}/uptime.state"
LOG_FILE="/var/log/bmm-uptime.log"
NOTIFY="/opt/bmm-ops/notify.sh"

mkdir -p "$STATE_DIR"
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Probe targets. Expected HTTP status code in column 2. Each probe is
# independent: partial failure (web up, api down) still flags as "down".
TARGETS=(
  "https://buildmymcpserver.com/|200"
  "https://buildmymcpserver.com/api/health|200"
  "https://buildmymcpserver.com/robots.txt|200"
)

failures=()
for target in "${TARGETS[@]}"; do
  url="${target%|*}"
  want="${target##*|}"
  # On a transport error curl still prints a status via -w, so overwrite the
  # value instead of echoing a second "000" into the same capture.
  got=$(curl -sS -o /dev/null --max-time 8 -w "%{http_code}" "$url" 2>/dev/null) || got="000"
  if [ "$got" != "$want" ]; then
    failures+=("${url} expected ${want} got ${got}")
  fi
done

PREV="up"
if [ -f "$STATE_FILE" ]; then
  PREV=$(cat "$STATE_FILE")
fi

if [ "${#failures[@]}" -eq 0 ]; then
  echo "[${TS}] up" >> "$LOG_FILE"
  echo "up" > "$STATE_FILE"
  if [ "$PREV" = "down" ]; then
    "$NOTIFY" "uptime-recovered" "all probes healthy at ${TS}"
  fi
  # Heartbeat for external watchdog (signals "box itself is alive"). Use
  # grep-parse to avoid `source` evaluating unquoted env values as shell.
  HEALTHCHECKS_HEARTBEAT_URL="$(grep -E '^HEALTHCHECKS_HEARTBEAT_URL=' /opt/buildmymcpserver/.env.production 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
  if [ -n "${HEALTHCHECKS_HEARTBEAT_URL:-}" ]; then
    curl -fsS -o /dev/null --max-time 8 "${HEALTHCHECKS_HEARTBEAT_URL}" 2>/dev/null || true
  fi
else
  echo "[${TS}] down: ${failures[*]}" >> "$LOG_FILE"
  echo "down" > "$STATE_FILE"
  if [ "$PREV" = "up" ]; then
    # Transition up→down: alert immediately (first failure tick)
    "$NOTIFY" "uptime-down" "${failures[*]}"
  fi
  # Intentionally do NOT alert again on subsequent ticks while still down:
  # avoids SMS storm during a sustained incident. Recovery edge re-notifies.
fi

exit 0