Commit Graph

1 Commits

Author SHA1 Message Date
Marco Sadjadi
591a1cb575 ops: backup hardening + restore drill + self-hosted uptime monitor
All checks were successful
Deploy to Production / deploy (push) Successful in 1m10s
Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:

1. Backup hardening (backup.sh):
   - Previous cron one-liner did pg_dump | gzip with no validation.
   - Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
     header sanity (scans first 5 lines — line 1 is just "--", actual
     "PostgreSQL database dump" comment lands on line 2), size-warning
     under 1KB, atomic move-into-place so partial backups never replace
     the previous good file. 14-day retention preserved.
   - Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
     grep+cut, NOT `source` — the .env.production has unquoted text
     values (e.g. ADMIN_NAME) that crash a sourced shell.

2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
   - Restores the newest backup into a throwaway DB inside the same
     Postgres container, verifies the core tables exist (users,
     sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves
     backups are actually restorable, not just byte-streams that look
     like backups. Silent-corruption detector.

3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
   - Probes homepage + /api/health + /robots.txt.
   - Edge-triggered alerting: SMS via Twilio only on up→down and
     down→up transitions (avoids SMS storm during sustained outages).
   - Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
     itself dies the heartbeat stops and the external watchdog alerts
     (covers the gap that self-hosted monitors can't see their own
     box failing).

notify.sh is the shared helper: Twilio SMS if all four creds set,
optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never
fails loudly — broken notification path still lands in journalctl
-t bmm-ops.

README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).

Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:46:42 +02:00