buildmymcpserver/ops/bmm
Marco Sadjadi 591a1cb575
All checks were successful
Deploy to Production / deploy (push) Successful in 1m10s
ops: backup hardening + restore drill + self-hosted uptime monitor
Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:

1. Backup hardening (backup.sh):
   - Previous cron one-liner did pg_dump | gzip with no validation.
   - Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
     header sanity (scans first 5 lines — line 1 is just "--", actual
     "PostgreSQL database dump" comment lands on line 2), size-warning
     under 1KB, atomic move-into-place so partial backups never replace
     the previous good file. 14-day retention preserved.
   - Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
     grep+cut, NOT `source` — the .env.production has unquoted text
     values (e.g. ADMIN_NAME) that crash a sourced shell.

2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
   - Restores the newest backup into a throwaway DB inside the same
     Postgres container, verifies the core tables exist (users,
     sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves
     backups are actually restorable, not just byte-streams that look
     like backups. Silent-corruption detector.

3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
   - Probes homepage + /api/health + /robots.txt.
   - Edge-triggered alerting: SMS via Twilio only on up→down and
     down→up transitions (avoids SMS storm during sustained outages).
   - Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
     itself dies the heartbeat stops and the external watchdog alerts
     (covers the gap that self-hosted monitors can't see their own
     box failing).

notify.sh is the shared helper: Twilio SMS if all four creds set,
optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never
fails loudly — broken notification path still lands in journalctl
-t bmm-ops.

README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).

Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 23:46:42 +02:00
..
backup.sh ops: backup hardening + restore drill + self-hosted uptime monitor 2026-05-26 23:46:42 +02:00
notify.sh ops: backup hardening + restore drill + self-hosted uptime monitor 2026-05-26 23:46:42 +02:00
README.md ops: backup hardening + restore drill + self-hosted uptime monitor 2026-05-26 23:46:42 +02:00
restore-test.sh ops: backup hardening + restore drill + self-hosted uptime monitor 2026-05-26 23:46:42 +02:00
uptime-check.sh ops: backup hardening + restore drill + self-hosted uptime monitor 2026-05-26 23:46:42 +02:00

BMM Ops — Backup & Uptime

Scripts live in /opt/bmm-ops/ on the Hetzner box. They are intentionally outside /opt/buildmymcpserver/ so app deploys (tar overlay) don't overwrite them.

Scripts

Script Cron Purpose
backup.sh 15 3 * * * (03:15 UTC daily) pg_dump → gzip → integrity-check → offsite push → 14d retention
restore-test.sh 30 4 * * 0 (Sun 04:30 UTC weekly) Restores latest backup into temp DB, verifies core schema, drops temp DB
uptime-check.sh */5 * * * * (every 5 min) Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog
notify.sh (helper) Sends Twilio SMS + optional webhook on alert; always syslogs

Backup Strategy

3-2-1 rule applied:

  • 3 copies — Postgres volume (live) + local gzip (/var/backups/bmm/) + offsite (rclone target)
  • 2 different media — Hetzner box SSD + external object storage (R2/B2/Hetzner Storage Box)
  • 1 offsite — not on the same machine

Retention:

  • Local: 14 days rolling
  • Offsite: configure via rclone lifecycle on the bucket (recommend: 30 daily, 12 monthly)

Integrity guarantees (every run):

  1. pg_dump pipefail-safe — partial dump never overwrites previous good backup
  2. gunzip -t validates the compressed stream
  3. Header sanity check — decompressed first line must start with -- PostgreSQL database dump (catches the case where pg_dump emitted an error message that still compressed cleanly)
  4. Size warning if backup drops below 1KB
  5. Atomic mv-into-place — only swap the dated filename in once all checks pass

Restore drill (proven, not assumed):

  • restore-test.sh runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies the core tables (users, sessions, oauth_tokens, mcp_servers) exist, then drops the temp DB.
  • Failure here sends an SMS — silent-corruption detection.

Manual restore (full recovery procedure):

# 1. Stop dependent services
docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator

# 2. Drop + recreate target DB (DANGER — destroys current data)
docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"
docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"

# 3. Restore
gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz | docker exec -i bmm-postgres psql -U bmm -d bmm

# 4. Restart services
docker compose --env-file .env.production -f docker-compose.prod.yml up -d

If the box itself is gone: spin up a fresh Postgres on a new box, scp the latest offsite backup from R2/B2, then run step 3 above against the new container.

Offsite — to enable

Pick one of three:

apt-get install -y rclone
rclone config  # → New remote → s3 → Cloudflare → fill access_key/secret/endpoint
# Then in /opt/buildmymcpserver/.env.production:
BMM_BACKUP_REMOTE=r2:bmm-backups/postgres

10 GB free, no egress fees.

Option B — Backblaze B2

Same rclone config but choose Backblaze B2. $6/TB/mo, 10 GB free.

Option C — Hetzner Storage Box

Order one in Hetzner Robot. Uses SFTP via rclone (sftp remote type). Cheapest for keeping data inside the same provider's network.

The backup.sh script picks up BMM_BACKUP_REMOTE automatically and runs rclone copy after every successful local backup. No code changes needed.

Uptime Monitoring

Two-layer strategy — covers both app failure and box failure:

  1. Self-hosted probe (uptime-check.sh every 5 min from this box): Detects app-layer outages — Postgres down, API returning 500, web container crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to avoid SMS-storm on sustained outages); sends an "uptime-recovered" SMS when the next tick succeeds.

  2. External watchdog (healthchecks.io heartbeat): uptime-check.sh pings HEALTHCHECKS_HEARTBEAT_URL on every successful probe. If the box itself dies (network loss, hardware fail, kernel panic), no heartbeat arrives → healthchecks.io alerts via its own channel. Without this layer the self-hosted monitor cannot detect box-level failures — they kill the monitor itself.

To enable external watchdog:

  1. Sign up at https://healthchecks.io (free, no credit card)
  2. Create a new check with 5-minute period, 1-minute grace
  3. Copy the ping URL (https://hc-ping.com/<uuid>)
  4. In /opt/buildmymcpserver/.env.production:
    HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/<your-uuid>
    
  5. Configure healthchecks.io to send email / SMS / Slack on failure

Alert target (for both layers): ADMIN_PHONE=+41XXXXXXXXX must be set in .env.production for Twilio SMS. Without it, alerts still land in syslog (journalctl -t bmm-ops) but no SMS.

Logs

File Content
/var/log/bmm-backup.log Backup + restore-test history
/var/log/bmm-uptime.log One line per 5-min check
journalctl -t bmm-ops All notify.sh events (syslog)

Cron files

| /etc/cron.d/bmm-postgres-backup | runs backup.sh | | /etc/cron.d/bmm-restore-test | runs restore-test.sh | | /etc/cron.d/bmm-uptime | runs uptime-check.sh |