From 591a1cb575be2e83eb20141ce6194ef71dd87d9a Mon Sep 17 00:00:00 2001
From: Marco Sadjadi
Date: Tue, 26 May 2026 23:46:42 +0200
Subject: [PATCH] ops: backup hardening + restore drill + self-hosted uptime monitor
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:

1. Backup hardening (backup.sh):
   - Previous cron one-liner did pg_dump | gzip with no validation.
   - Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump header
     sanity (scans first 5 lines — line 1 is just "--", actual
     "PostgreSQL database dump" comment lands on line 2), size-warning
     under 1KB, atomic move-into-place so partial backups never replace
     the previous good file. 14-day retention preserved.
   - Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
     grep+cut, NOT `source` — the .env.production has unquoted text
     values (e.g. ADMIN_NAME) that crash a sourced shell.

2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
   - Restores the newest backup into a throwaway DB inside the same
     Postgres container, verifies the core tables exist (users, sessions,
     oauth_tokens, mcp_servers), drops the temp DB. Proves backups are
     actually restorable, not just byte-streams that look like backups.
     Silent-corruption detector.

3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
   - Probes homepage + /api/health + /robots.txt.
   - Edge-triggered alerting: SMS via Twilio only on up→down and down→up
     transitions (avoids SMS storm during sustained outages).
   - Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
     itself dies the heartbeat stops and the external watchdog alerts
     (covers the gap that self-hosted monitors can't see their own box
     failing).

notify.sh is the shared helper: Twilio SMS if all four creds set, optional
webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog.
Never fails loudly — broken notification path still lands in
journalctl -t bmm-ops.

README.md documents the 3-2-1 strategy, manual full-recovery procedure,
and how to enable offsite (R2 / B2 / Hetzner Storage Box).

Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 ops/bmm/README.md       | 125 ++++++++++++++++++++++++++++++++++++++++
 ops/bmm/backup.sh       |  89 ++++++++++++++++++++++++++++
 ops/bmm/notify.sh       |  61 ++++++++++++++++++++
 ops/bmm/restore-test.sh |  57 ++++++++++++++++++
 ops/bmm/uptime-check.sh |  67 +++++++++++++++++++++
 5 files changed, 399 insertions(+)
 create mode 100644 ops/bmm/README.md
 create mode 100644 ops/bmm/backup.sh
 create mode 100644 ops/bmm/notify.sh
 create mode 100644 ops/bmm/restore-test.sh
 create mode 100644 ops/bmm/uptime-check.sh

diff --git a/ops/bmm/README.md b/ops/bmm/README.md
new file mode 100644
index 0000000..4874e46
--- /dev/null
+++ b/ops/bmm/README.md
@@ -0,0 +1,125 @@
+# BMM Ops — Backup & Uptime
+
+Scripts live in `/opt/bmm-ops/` on the Hetzner box. They are intentionally
+**outside** `/opt/buildmymcpserver/` so app deploys (tar overlay) don't
+overwrite them.
+
+## Scripts
+
+| Script | Cron | Purpose |
+|---|---|---|
+| `backup.sh` | `15 3 * * *` (03:15 UTC daily) | pg_dump → gzip → integrity-check → offsite push → 14d retention |
+| `restore-test.sh` | `30 4 * * 0` (Sun 04:30 UTC weekly) | Restores latest backup into temp DB, verifies core schema, drops temp DB |
+| `uptime-check.sh` | `*/5 * * * *` (every 5 min) | Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog |
+| `notify.sh` | (helper) | Sends Twilio SMS + optional webhook on alert; always syslogs |
+
+## Backup Strategy
+
+**3-2-1 rule applied:**
+- **3 copies** — Postgres volume (live) + local gzip (`/var/backups/bmm/`) + offsite (rclone target)
+- **2 different media** — Hetzner box SSD + external object storage (R2/B2/Hetzner Storage Box)
+- **1 offsite** — not on the same machine
+
+**Retention:**
+- Local: 14 days rolling
+- Offsite: configure via rclone lifecycle on the bucket (recommend: 30 daily, 12 monthly)
+
+**Integrity guarantees (every run):**
+1. `pg_dump` pipefail-safe — partial dump never overwrites previous good backup
+2. `gunzip -t` validates the compressed stream
+3. Header sanity check — one of the first 5 decompressed lines must start with `-- PostgreSQL database dump` (catches the case where pg_dump emitted an error message that still compressed cleanly)
+4. Size warning if backup drops below 1KB
+5. Atomic mv-into-place — only swap the dated filename in once all checks pass
+
+**Restore drill (proven, not assumed):**
+- `restore-test.sh` runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies the core tables (`users`, `sessions`, `oauth_tokens`, `mcp_servers`) exist, then drops the temp DB.
+- Failure here sends an SMS — silent-corruption detection.
+
+**Manual restore (full recovery procedure):**
+
+```bash
+# 1. Stop dependent services
+docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator
+
+# 2. Drop + recreate target DB (DANGER — destroys current data)
+docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"
+docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"
+
+# 3. Restore
+gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz | docker exec -i bmm-postgres psql -U bmm -d bmm
+
+# 4. Restart services
+docker compose --env-file .env.production -f docker-compose.prod.yml up -d
+```
+
+If the box itself is gone: spin up a fresh Postgres on a new box, `scp` the
+latest offsite backup from R2/B2, then run step 3 above against the new
+container.
+
+## Offsite — to enable
+
+Pick one of three:
+
+### Option A — Cloudflare R2 (recommended, you're already on CF)
+```bash
+apt-get install -y rclone
+rclone config  # → New remote → s3 → Cloudflare → fill access_key/secret/endpoint
+# Then in /opt/buildmymcpserver/.env.production:
+BMM_BACKUP_REMOTE=r2:bmm-backups/postgres
+```
+10 GB free, no egress fees.
+
+### Option B — Backblaze B2
+Same `rclone config` but choose Backblaze B2. $6/TB/mo, 10 GB free.
+
+### Option C — Hetzner Storage Box
+Order one in Hetzner Robot. Uses SFTP via rclone (`sftp` remote type).
+Cheapest for keeping data inside the same provider's network.
+
+The `backup.sh` script picks up `BMM_BACKUP_REMOTE` automatically and runs
+`rclone copy` after every successful local backup. No code changes needed.
+
+## Uptime Monitoring
+
+**Two-layer strategy** — covers both app failure and box failure:
+
+1. **Self-hosted probe** (`uptime-check.sh` every 5 min from this box):
+   Detects app-layer outages — Postgres down, API returning 500, web container
+   crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to
+   avoid SMS-storm on sustained outages); sends an "uptime-recovered" SMS
+   on the first tick that succeeds again.
+
+2. **External watchdog** (healthchecks.io heartbeat):
+   `uptime-check.sh` pings `HEALTHCHECKS_HEARTBEAT_URL` on every successful
+   probe. If the box itself dies (network loss, hardware fail, kernel panic),
+   no heartbeat arrives → healthchecks.io alerts via its own channel.
+   **Without this layer the self-hosted monitor cannot detect box-level
+   failures** — they kill the monitor itself.
+
+**To enable external watchdog:**
+1. Sign up at https://healthchecks.io (free, no credit card)
+2. Create a new check with 5-minute period, 1-minute grace
+3. Copy the ping URL (`https://hc-ping.com/`)
+4. In `/opt/buildmymcpserver/.env.production`:
+   ```
+   HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/
+   ```
+5. Configure healthchecks.io to send email / SMS / Slack on failure
+
+**Alert target (for both layers):**
+`ADMIN_PHONE=+41XXXXXXXXX` must be set in `.env.production` for Twilio SMS.
+Without it, alerts still land in syslog (`journalctl -t bmm-ops`) but no SMS.
+
+## Logs
+
+| File | Content |
+|---|---|
+| `/var/log/bmm-backup.log` | Backup + restore-test history |
+| `/var/log/bmm-uptime.log` | One line per 5-min check |
+| `journalctl -t bmm-ops` | All notify.sh events (syslog) |
+
+## Cron files
+
+- `/etc/cron.d/bmm-postgres-backup`: runs `backup.sh`
+- `/etc/cron.d/bmm-restore-test`: runs `restore-test.sh`
+- `/etc/cron.d/bmm-uptime`: runs `uptime-check.sh`
diff --git a/ops/bmm/backup.sh b/ops/bmm/backup.sh
new file mode 100644
index 0000000..3a916b8
--- /dev/null
+++ b/ops/bmm/backup.sh
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+# Daily Postgres backup for BMM — replaces the inline cron one-liner with:
+#   - integrity check (gunzip -t + pg_dump header sanity)
+#   - failure alert via notify.sh
+#   - structured logging
+#   - optional offsite push via rclone (if rclone configured + BMM_BACKUP_REMOTE set)
+#   - 14-day local retention
+#
+# Cron: 15 3 * * * root /opt/bmm-ops/backup.sh
+#
+# Restore:
+#   docker exec -i bmm-postgres psql -U bmm -d bmm_restore_test \
+#     < <(gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz)
+
+set -uo pipefail
+
+BACKUP_DIR="/var/backups/bmm"
+LOG_FILE="/var/log/bmm-backup.log"
+NOTIFY="/opt/bmm-ops/notify.sh"
+RETENTION_DAYS=14
+PG_USER="bmm"
+PG_DB="bmm"
+CONTAINER="bmm-postgres"
+
+mkdir -p "$BACKUP_DIR"
+DATE=$(date -u +%Y%m%d)
+TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+OUT="${BACKUP_DIR}/bmm-${DATE}.sql.gz"
+
+log() { echo "[${TIMESTAMP}] $*" >> "$LOG_FILE"; }
+fail() { log "FAIL: $*"; "$NOTIFY" "backup-failed" "$*"; exit 1; }
+
+log "starting backup"
+
+# pg_dump → gzip in one pipeline; pipefail catches a dump failure mid-stream
+if ! docker exec "$CONTAINER" pg_dump -U "$PG_USER" "$PG_DB" 2>>"$LOG_FILE" | gzip > "$OUT.tmp"; then
+  rm -f "$OUT.tmp"
+  fail "pg_dump pipeline failed"
+fi
+
+# Integrity check 1 — gzip stream must be valid end-to-end
+if ! gunzip -t "$OUT.tmp" 2>>"$LOG_FILE"; then
+  rm -f "$OUT.tmp"
+  fail "gzip integrity check failed for $OUT.tmp"
+fi
+
+# Integrity check 2 — decompressed content must contain the pg_dump header
+# in the first few lines. pg_dump emits "--" on line 1 and the actual
+# "-- PostgreSQL database dump" comment on line 2, so we scan the first 5
+# lines rather than only line 1.
+HEADER_BLOCK=$(gunzip -c "$OUT.tmp" | head -5)
+if ! echo "$HEADER_BLOCK" | grep -q "^-- PostgreSQL database dump"; then
+  rm -f "$OUT.tmp"
+  fail "pg_dump output missing expected header (first 5 lines: $(echo "$HEADER_BLOCK" | tr '\n' '|' | cut -c1-120))"
+fi
+
+# Size sanity — backups have grown to ~8KB. A sub-1KB dump means schema-only
+# or empty. Likely-broken: alert but keep file for inspection.
+SIZE=$(stat -c%s "$OUT.tmp")
+if [ "$SIZE" -lt 1024 ]; then
+  log "WARN: backup unusually small (${SIZE} bytes)"
+  "$NOTIFY" "backup-suspicious" "backup is only ${SIZE} bytes — investigate $OUT"
+fi
+
+# Atomic move — only swap into place once all checks passed
+mv "$OUT.tmp" "$OUT"
+log "backup written: $OUT (${SIZE} bytes)"
+
+# Optional offsite push — set BMM_BACKUP_REMOTE=: in
+# /opt/buildmymcpserver/.env.production once rclone is configured. We
+# grep-parse rather than sourcing the env file because the env file is
+# managed for Docker compose (KEY=value, sometimes unquoted text values
+# like names) and `source` evaluates unquoted RHS as shell, which breaks
+# on any value containing whitespace.
+ENV_FILE="/opt/buildmymcpserver/.env.production"
+BMM_BACKUP_REMOTE="$(grep -E '^BMM_BACKUP_REMOTE=' "$ENV_FILE" 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
+if [ -n "${BMM_BACKUP_REMOTE:-}" ] && command -v rclone >/dev/null 2>&1; then
+  if rclone copy "$OUT" "$BMM_BACKUP_REMOTE" --quiet 2>>"$LOG_FILE"; then
+    log "offsite copy ok: $BMM_BACKUP_REMOTE"
+  else
+    "$NOTIFY" "backup-offsite-failed" "rclone copy to $BMM_BACKUP_REMOTE failed"
+  fi
+fi
+
+# Retention — keep last 14 days
+find "$BACKUP_DIR" -maxdepth 1 -name "bmm-*.sql.gz" -mtime "+${RETENTION_DAYS}" -delete 2>>"$LOG_FILE"
+
+log "done"
+exit 0
diff --git a/ops/bmm/notify.sh b/ops/bmm/notify.sh
new file mode 100644
index 0000000..029ed63
--- /dev/null
+++ b/ops/bmm/notify.sh
@@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+# Shared notification helper for BMM ops scripts.
+#
+# Sends an alert via:
+#   - Twilio SMS (if TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_SMS_FROM,
+#     ADMIN_PHONE are all set in /opt/buildmymcpserver/.env.production)
+#   - HEALTHCHECKS_FAIL_URL (if set — generic webhook fallback)
+#   - syslog (always)
+#
+# Usage: notify.sh "subject" "body"
+#
+# Designed to never fail loudly: if Twilio is misconfigured we still log
+# to syslog so failures aren't silent. Backup/uptime scripts trust this
+# helper to handle its own delivery failures gracefully.
+
+set -uo pipefail
+
+SUBJECT="${1:-bmm-alert}"
+BODY="${2:-}"
+
+# Always syslog — covers the case where notification channels are broken
+logger -t bmm-ops "$SUBJECT: $BODY"
+
+# Grep-parse the env file rather than `source`-ing it: the file is managed
+# for Docker compose (KEY=value, often unquoted text values like names),
+# and `source` evaluates unquoted RHS as shell — breaking on any value
+# with whitespace or shell metachars. This pulls only the keys we need.
+ENV_FILE="/opt/buildmymcpserver/.env.production"
+read_env() {
+  grep -E "^$1=" "$ENV_FILE" 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/'
+}
+if [ -f "$ENV_FILE" ]; then
+  TWILIO_ACCOUNT_SID="${TWILIO_ACCOUNT_SID:-$(read_env TWILIO_ACCOUNT_SID)}"
+  TWILIO_AUTH_TOKEN="${TWILIO_AUTH_TOKEN:-$(read_env TWILIO_AUTH_TOKEN)}"
+  TWILIO_SMS_FROM="${TWILIO_SMS_FROM:-$(read_env TWILIO_SMS_FROM)}"
+  ADMIN_PHONE="${ADMIN_PHONE:-$(read_env ADMIN_PHONE)}"
+  HEALTHCHECKS_FAIL_URL="${HEALTHCHECKS_FAIL_URL:-$(read_env HEALTHCHECKS_FAIL_URL)}"
+fi
+
+# Twilio SMS — only if all four vars set (-f so HTTP 4xx/5xx reaches the logger fallback)
+if [ -n "${TWILIO_ACCOUNT_SID:-}" ] && \
+   [ -n "${TWILIO_AUTH_TOKEN:-}" ] && \
+   [ -n "${TWILIO_SMS_FROM:-}" ] && \
+   [ -n "${ADMIN_PHONE:-}" ]; then
+  curl -fsS -o /dev/null --max-time 10 \
+    -X POST "https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/Messages.json" \
+    --data-urlencode "From=${TWILIO_SMS_FROM}" \
+    --data-urlencode "To=${ADMIN_PHONE}" \
+    --data-urlencode "Body=[BMM] ${SUBJECT}: ${BODY}" \
+    -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}" \
+    || logger -t bmm-ops "twilio-sms-failed: $SUBJECT"
+fi
+
+# Generic webhook (for healthchecks.io, BetterStack, etc.) — POST body
+if [ -n "${HEALTHCHECKS_FAIL_URL:-}" ]; then
+  curl -fsS -o /dev/null --max-time 10 --retry 2 \
+    --data "${SUBJECT}: ${BODY}" "${HEALTHCHECKS_FAIL_URL}" \
+    || logger -t bmm-ops "healthcheck-webhook-failed"
+fi
+
+exit 0
diff --git a/ops/bmm/restore-test.sh b/ops/bmm/restore-test.sh
new file mode 100644
index 0000000..ee5e2c5
--- /dev/null
+++ b/ops/bmm/restore-test.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env bash
+# Weekly restore test — proves backups are actually restorable, not just
+# byte-streams that look like backups. Restores latest backup into a
+# temporary DB inside the same Postgres container, runs a schema check,
+# then drops the temp DB.
+#
+# Cron: 30 4 * * 0 root /opt/bmm-ops/restore-test.sh (Sundays 04:30 UTC)
+
+set -uo pipefail
+
+BACKUP_DIR="/var/backups/bmm"
+LOG_FILE="/var/log/bmm-backup.log"
+NOTIFY="/opt/bmm-ops/notify.sh"
+PG_USER="bmm"
+CONTAINER="bmm-postgres"
+TEMP_DB="bmm_restore_test_$(date +%s)"
+TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+
+log() { echo "[${TS}] restore-test: $*" >> "$LOG_FILE"; }
+fail() {
+  log "FAIL: $*"
+  docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE IF EXISTS ${TEMP_DB}" >/dev/null 2>&1
+  "$NOTIFY" "restore-test-failed" "$*"
+  exit 1
+}
+
+# Find newest backup
+LATEST=$(ls -t "${BACKUP_DIR}"/bmm-*.sql.gz 2>/dev/null | head -1)
+if [ -z "$LATEST" ] || [ ! -f "$LATEST" ]; then
+  fail "no backup found in ${BACKUP_DIR}"
+fi
+
+log "testing restore from: $LATEST"
+
+# Create temp DB
+if ! docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "CREATE DATABASE ${TEMP_DB}" >/dev/null 2>&1; then
+  fail "could not create temp DB ${TEMP_DB}"
+fi
+
+# Restore — pipe through container stdin; ON_ERROR_STOP=1 makes psql exit non-zero on the first SQL error
+if ! gunzip -c "$LATEST" | docker exec -i "$CONTAINER" psql -v ON_ERROR_STOP=1 -U "$PG_USER" -d "$TEMP_DB" >/dev/null 2>>"$LOG_FILE"; then
+  fail "psql restore failed for $LATEST"
+fi
+
+# Schema sanity — expect the core tables to exist (adjust if schema evolves)
+EXPECTED_TABLES="users sessions oauth_tokens mcp_servers"
+for tbl in $EXPECTED_TABLES; do
+  COUNT=$(docker exec "$CONTAINER" psql -U "$PG_USER" -d "$TEMP_DB" -tAc "SELECT count(*) FROM information_schema.tables WHERE table_name='${tbl}'" 2>>"$LOG_FILE")
+  if [ "$COUNT" != "1" ]; then
+    fail "restored DB missing expected table: ${tbl}"
+  fi
+done
+
+# Drop temp DB
+docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE ${TEMP_DB}" >/dev/null 2>&1
+log "ok — $LATEST restores cleanly, schema validates"
+exit 0
diff --git a/ops/bmm/uptime-check.sh b/ops/bmm/uptime-check.sh
new file mode 100644
index 0000000..841f4ab
--- /dev/null
+++ b/ops/bmm/uptime-check.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+# Self-hosted uptime monitor — probes homepage, API health, and robots.txt every 5 min.
+# Sends SMS via notify.sh on transition into / out of failure state. Pings
+# a healthchecks.io heartbeat (HEALTHCHECKS_HEARTBEAT_URL) on every success
+# so that if THIS box dies the external service alerts.
+#
+# Cron: */5 * * * * root /opt/bmm-ops/uptime-check.sh
+#
+# State file tracks last-known status so repeated failures don't spam SMS.
+
+set -uo pipefail
+
+STATE_DIR="/var/lib/bmm-ops"
+STATE_FILE="${STATE_DIR}/uptime.state"
+LOG_FILE="/var/log/bmm-uptime.log"
+NOTIFY="/opt/bmm-ops/notify.sh"
+
+mkdir -p "$STATE_DIR"
+TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+
+# Probe targets. Expected HTTP status code in column 2. Each probe is
+# independent — partial failure (web up, api down) still flags as "down".
+TARGETS=(
+  "https://buildmymcpserver.com/|200"
+  "https://buildmymcpserver.com/api/health|200"
+  "https://buildmymcpserver.com/robots.txt|200"
+)
+
+failures=()
+for target in "${TARGETS[@]}"; do
+  url="${target%|*}"
+  want="${target##*|}"
+  got=$(curl -sS -o /dev/null --max-time 8 -w "%{http_code}" "$url" 2>/dev/null) || got="000"
+  if [ "$got" != "$want" ]; then
+    failures+=("${url} expected ${want} got ${got}")
+  fi
+done
+
+PREV="up"
+if [ -f "$STATE_FILE" ]; then
+  PREV=$(cat "$STATE_FILE")
+fi
+
+if [ "${#failures[@]}" -eq 0 ]; then
+  echo "[${TS}] up" >> "$LOG_FILE"
+  echo "up" > "$STATE_FILE"
+  if [ "$PREV" = "down" ]; then
+    "$NOTIFY" "uptime-recovered" "all probes healthy at ${TS}"
+  fi
+  # Heartbeat for external watchdog (signals "box itself is alive"). Use
+  # grep-parse to avoid `source` evaluating unquoted env values as shell.
+  HEALTHCHECKS_HEARTBEAT_URL="$(grep -E '^HEALTHCHECKS_HEARTBEAT_URL=' /opt/buildmymcpserver/.env.production 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
+  if [ -n "${HEALTHCHECKS_HEARTBEAT_URL:-}" ]; then
+    curl -fsS -o /dev/null --max-time 8 "${HEALTHCHECKS_HEARTBEAT_URL}" 2>/dev/null || true
+  fi
+else
+  echo "[${TS}] down: ${failures[*]}" >> "$LOG_FILE"
+  echo "down" > "$STATE_FILE"
+  if [ "$PREV" = "up" ]; then
+    # Transition up→down: alert immediately (first failure tick)
+    "$NOTIFY" "uptime-down" "${failures[*]}"
+  fi
+  # Intentionally do NOT alert again on subsequent ticks while still down —
+  # avoids SMS storm during a sustained incident. Recovery edge re-notifies.
+fi
+
+exit 0
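
Review note (not part of the patch): the grep-parse env reading shared by `backup.sh`, `notify.sh`, and `uptime-check.sh` can be tried in isolation. The sketch below is a minimal illustration under stated assumptions: it writes a throwaway env file with `mktemp`, every value in that file is invented sample data, and `read_env` takes the file path as a second argument, which the version in `notify.sh` does not. The grep/cut/sed pipeline itself is copied from the patch.

```bash
#!/usr/bin/env bash
# Sketch only: read single keys from a Docker-compose-style env file without
# sourcing it. All values below are invented for illustration.
set -uo pipefail

ENV_FILE="$(mktemp)"
trap 'rm -f "$ENV_FILE"' EXIT

# Sourcing this file would treat "Doe" as a command to run, because the
# unquoted value contains a space. Grep-parsing never evaluates the value.
cat > "$ENV_FILE" <<'EOF'
ADMIN_NAME=Jane Doe
BMM_BACKUP_REMOTE="r2:bmm-backups/postgres"
ADMIN_PHONE='+41XXXXXXXXX'
EOF

# Same pipeline as read_env in notify.sh; the second (file) argument is an
# addition made for this sketch.
read_env() {
  grep -E "^$1=" "$2" 2>/dev/null | head -1 | cut -d= -f2- |
    sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/'
}

admin_name="$(read_env ADMIN_NAME "$ENV_FILE")"
remote="$(read_env BMM_BACKUP_REMOTE "$ENV_FILE")"
phone="$(read_env ADMIN_PHONE "$ENV_FILE")"
# A key that is absent yields an empty string; "|| true" keeps the
# assignment from failing under stricter shell options.
missing="$(read_env NOT_SET "$ENV_FILE" || true)"

printf 'ADMIN_NAME=%s\n' "$admin_name"
printf 'BMM_BACKUP_REMOTE=%s\n' "$remote"
printf 'ADMIN_PHONE=%s\n' "$phone"
printf 'NOT_SET=[%s]\n' "$missing"
```

The sed step strips one pair of surrounding double or single quotes, so quoted and unquoted values come back in the same form. Values containing `=` survive because `cut -d= -f2-` keeps everything after the first separator.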