All checks were successful
Deploy to Production / deploy (push) Successful in 1m10s
Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:
1. Backup hardening (backup.sh):
- Previous cron one-liner did pg_dump | gzip with no validation.
- Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
header sanity (scans first 5 lines — line 1 is just "--", actual
"PostgreSQL database dump" comment lands on line 2), size-warning
under 1KB, atomic move-into-place so partial backups never replace
the previous good file. 14-day retention preserved.
- Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
grep+cut, NOT `source` — the .env.production has unquoted text
values (e.g. ADMIN_NAME) that crash a sourced shell.
2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
- Restores the newest backup into a throwaway DB inside the same
Postgres container, verifies the core tables exist (users,
sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves
backups are actually restorable, not just byte-streams that look
like backups. Silent-corruption detector.
3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
- Probes homepage + /api/health + /robots.txt.
- Edge-triggered alerting: SMS via Twilio only on up→down and
down→up transitions (avoids SMS storm during sustained outages).
- Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
itself dies the heartbeat stops and the external watchdog alerts
(covers the gap that self-hosted monitors can't see their own
box failing).
notify.sh is the shared helper: Twilio SMS if all four creds set,
optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never
fails loudly — broken notification path still lands in journalctl
-t bmm-ops.
README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).
Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
68 lines
2.3 KiB
Bash
68 lines
2.3 KiB
Bash
#!/usr/bin/env bash
|
|
# Self-hosted uptime monitor — pings homepage + API health every 5 min.
|
|
# Sends SMS via notify.sh on transition into / out of failure state. Pings
|
|
# a healthchecks.io heartbeat (HEALTHCHECKS_HEARTBEAT_URL) on every success
|
|
# so that if THIS box dies the external service alerts.
|
|
#
|
|
# Cron: */5 * * * * root /opt/bmm-ops/uptime-check.sh
|
|
#
|
|
# State file tracks last-known status so repeated failures don't spam SMS.
|
|
|
|
set -uo pipefail
|
|
|
|
STATE_DIR="/var/lib/bmm-ops"
|
|
STATE_FILE="${STATE_DIR}/uptime.state"
|
|
LOG_FILE="/var/log/bmm-uptime.log"
|
|
NOTIFY="/opt/bmm-ops/notify.sh"
|
|
|
|
mkdir -p "$STATE_DIR"
|
|
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|
|
|
|
# Probe targets. Expected HTTP status code in column 2. Each probe is
|
|
# independent — partial failure (web up, api down) still flags as "down".
|
|
TARGETS=(
|
|
"https://buildmymcpserver.com/|200"
|
|
"https://buildmymcpserver.com/api/health|200"
|
|
"https://buildmymcpserver.com/robots.txt|200"
|
|
)
|
|
|
|
failures=()
|
|
for target in "${TARGETS[@]}"; do
|
|
url="${target%|*}"
|
|
want="${target##*|}"
|
|
got=$(curl -sS -o /dev/null --max-time 8 -w "%{http_code}" "$url" 2>/dev/null || echo "000")
|
|
if [ "$got" != "$want" ]; then
|
|
failures+=("${url} expected ${want} got ${got}")
|
|
fi
|
|
done
|
|
|
|
PREV="up"
|
|
if [ -f "$STATE_FILE" ]; then
|
|
PREV=$(cat "$STATE_FILE")
|
|
fi
|
|
|
|
if [ "${#failures[@]}" -eq 0 ]; then
|
|
echo "[${TS}] up" >> "$LOG_FILE"
|
|
echo "up" > "$STATE_FILE"
|
|
if [ "$PREV" = "down" ]; then
|
|
"$NOTIFY" "uptime-recovered" "all probes healthy at ${TS}"
|
|
fi
|
|
# Heartbeat for external watchdog (signals "box itself is alive"). Use
|
|
# grep-parse to avoid `source` evaluating unquoted env values as shell.
|
|
HEALTHCHECKS_HEARTBEAT_URL="$(grep -E '^HEALTHCHECKS_HEARTBEAT_URL=' /opt/buildmymcpserver/.env.production 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
|
|
if [ -n "${HEALTHCHECKS_HEARTBEAT_URL:-}" ]; then
|
|
curl -fsS -o /dev/null --max-time 8 "${HEALTHCHECKS_HEARTBEAT_URL}" 2>/dev/null || true
|
|
fi
|
|
else
|
|
echo "[${TS}] down: ${failures[*]}" >> "$LOG_FILE"
|
|
echo "down" > "$STATE_FILE"
|
|
if [ "$PREV" = "up" ]; then
|
|
# Transition up→down: alert immediately (first failure tick)
|
|
"$NOTIFY" "uptime-down" "${failures[*]}"
|
|
fi
|
|
# Intentionally do NOT alert again on subsequent ticks while still down —
|
|
# avoids SMS storm during a sustained incident. Recovery edge re-notifies.
|
|
fi
|
|
|
|
exit 0
|