Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:
1. Backup hardening (backup.sh):
   - Previous cron one-liner did pg_dump | gzip with no validation.
   - Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
     header sanity (scans the first 5 lines, because line 1 is just
     "--" and the actual "PostgreSQL database dump" comment lands on
     line 2), a size warning under 1KB, and an atomic move into place
     so partial backups never replace the previous good file. 14-day
     retention preserved.
   - Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
     grep+cut, NOT `source`: the .env.production has unquoted text
     values (e.g. ADMIN_NAME) that crash a sourced shell.
2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
   - Restores the newest backup into a throwaway DB inside the same
     Postgres container, verifies the core tables exist (users,
     sessions, oauth_tokens, mcp_servers), then drops the temp DB.
     Proves backups are actually restorable, not just byte-streams
     that look like backups. Silent-corruption detector.
3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
   - Probes homepage + /api/health + /robots.txt.
   - Edge-triggered alerting: SMS via Twilio only on up→down and
     down→up transitions (avoids an SMS storm during sustained
     outages).
   - Pings HEALTHCHECKS_HEARTBEAT_URL on every success. When the box
     itself dies, the heartbeat stops and the external watchdog alerts
     (covers the gap that self-hosted monitors can't see their own
     box failing).
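
The edge-triggered alerting described for uptime-check.sh can be sketched as a small state machine. This is a minimal illustration, not the contents of the real script: the function names `transition_action` and `record_state`, the state-file argument, and the choice to treat a missing state file as "up" are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of edge-triggered alerting; names are illustrative only.

# transition_action PREV CURR
# Prints "alert-down" or "alert-up" on a state change, "none" otherwise.
transition_action() {
  local prev="$1" curr="$2"
  if [ "$prev" = "$curr" ]; then
    echo "none"            # sustained state: stay quiet
  elif [ "$curr" = "down" ]; then
    echo "alert-down"      # up -> down edge
  else
    echo "alert-up"        # down -> up edge
  fi
}

# record_state STATE_FILE CURR
# Compares CURR with the last persisted state, prints the action,
# then persists CURR for the next run.
record_state() {
  local state_file="$1" curr="$2" prev="up"   # assumption: no file means "up"
  if [ -f "$state_file" ]; then
    prev="$(cat "$state_file")"
  fi
  transition_action "$prev" "$curr"
  printf '%s\n' "$curr" > "$state_file"
}
```

With a five-minute cron interval, a one-hour outage produces twelve failed probes but only two actions under this scheme: one `alert-down` at the start and one `alert-up` at recovery.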
notify.sh is the shared helper: it sends a Twilio SMS if all four creds
are set, optionally calls the HEALTHCHECKS_FAIL_URL webhook, and always
logs to syslog. It never fails loudly: even with a broken notification
path, the event still lands in journalctl -t bmm-ops.
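
A rough sketch of how such a helper could be laid out follows. notify.sh itself is not shown in this commit view, so everything here is an assumption: the four variable names (`TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, `TWILIO_FROM`, `TWILIO_TO`), the function names, and the timeouts. Only the syslog tag and HEALTHCHECKS_FAIL_URL come from the message above.

```shell
#!/usr/bin/env bash
# Hypothetical notify helper; variable and function names are assumptions.

# Succeeds only when all four (assumed) Twilio settings are non-empty.
twilio_ready() {
  [ -n "${TWILIO_ACCOUNT_SID:-}" ] && [ -n "${TWILIO_AUTH_TOKEN:-}" ] &&
    [ -n "${TWILIO_FROM:-}" ] && [ -n "${TWILIO_TO:-}" ]
}

# notify EVENT DETAIL
notify() {
  local event="$1" detail="$2"

  # Syslog first, so the event is recorded even if every other path fails.
  logger -t bmm-ops "${event}: ${detail}" 2>/dev/null || true

  if twilio_ready; then
    curl -fsS --max-time 10 -o /dev/null \
      -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}" \
      --data-urlencode "From=${TWILIO_FROM}" \
      --data-urlencode "To=${TWILIO_TO}" \
      --data-urlencode "Body=[bmm] ${event}: ${detail}" \
      "https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/Messages.json" \
      2>/dev/null || logger -t bmm-ops "sms send failed for ${event}" 2>/dev/null || true
  fi

  if [ -n "${HEALTHCHECKS_FAIL_URL:-}" ]; then
    curl -fsS --max-time 10 -o /dev/null "$HEALTHCHECKS_FAIL_URL" 2>/dev/null || true
  fi

  return 0   # never propagate a notification failure to the caller
}
```

Returning 0 unconditionally is the design choice that matters: a caller such as backup.sh should not change its own exit path because an SMS could not be sent.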
README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).
Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.
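
For readers without access to restore-test.sh, the drill in item 2 might look roughly like the sketch below. It is an assumption-laden outline, not the shipped script: the function names, the `bmm_restore_test` database name (borrowed from the restore hint in backup.sh), and the assumption that the core tables live in the `public` schema are all illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical restore drill; not the actual restore-test.sh.

# Core tables named in the commit message.
REQUIRED_TABLES="users sessions oauth_tokens mcp_servers"

# missing_tables PRESENT
# PRESENT is a newline-separated list of table names.
# Prints the required tables that are absent, space-separated.
missing_tables() {
  local present="$1" t missing=""
  for t in $REQUIRED_TABLES; do
    if ! printf '%s\n' "$present" | grep -qx "$t"; then
      missing="${missing:+$missing }$t"
    fi
  done
  printf '%s' "$missing"
}

restore_drill() {
  local container="bmm-postgres" user="bmm" tmp_db="bmm_restore_test"
  local newest present missing rc=0

  newest="$(ls -1t /var/backups/bmm/bmm-*.sql.gz 2>/dev/null | head -1)"
  if [ -z "$newest" ]; then echo "no backup found" >&2; return 1; fi

  # Start from a clean throwaway database.
  docker exec "$container" dropdb -U "$user" --if-exists "$tmp_db" || return 1
  docker exec "$container" createdb -U "$user" "$tmp_db" || return 1

  if gunzip -c "$newest" | docker exec -i "$container" \
       psql -q -v ON_ERROR_STOP=1 -U "$user" -d "$tmp_db" >/dev/null; then
    # Assumption: application tables are in the public schema.
    present="$(docker exec "$container" psql -U "$user" -d "$tmp_db" -Atc \
      "SELECT tablename FROM pg_tables WHERE schemaname = 'public'")"
    missing="$(missing_tables "$present")"
    if [ -n "$missing" ]; then echo "missing tables: $missing" >&2; rc=1; fi
  else
    echo "restore failed" >&2; rc=1
  fi

  # Always drop the throwaway database, whatever the outcome.
  docker exec "$container" dropdb -U "$user" --if-exists "$tmp_db"
  return "$rc"
}
```

Keeping the table comparison in its own function means the check can be exercised without Docker, for example `missing_tables "$(printf '%s\n' users sessions)"` prints `oauth_tokens mcp_servers`.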
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
90 lines
3.3 KiB
Bash
#!/usr/bin/env bash
# Daily Postgres backup for BMM. Replaces the inline cron one-liner with:
#   - integrity check (gunzip -t + pg_dump header sanity)
#   - failure alert via notify.sh
#   - structured logging
#   - optional offsite push via rclone (if rclone configured + BMM_BACKUP_REMOTE set)
#   - 14-day local retention
#
# Cron: 15 3 * * * root /opt/bmm-ops/backup.sh
#
# Restore:
#   docker exec -i bmm-postgres psql -U bmm -d bmm_restore_test \
#     < <(gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz)

set -uo pipefail

BACKUP_DIR="/var/backups/bmm"
LOG_FILE="/var/log/bmm-backup.log"
NOTIFY="/opt/bmm-ops/notify.sh"
RETENTION_DAYS=14
PG_USER="bmm"
PG_DB="bmm"
CONTAINER="bmm-postgres"

mkdir -p "$BACKUP_DIR"
DATE=$(date -u +%Y%m%d)
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
OUT="${BACKUP_DIR}/bmm-${DATE}.sql.gz"

log() { echo "[${TIMESTAMP}] $*" >> "$LOG_FILE"; }
fail() { log "FAIL: $*"; "$NOTIFY" "backup-failed" "$*"; exit 1; }

log "starting backup"

# pg_dump → gzip in one pipeline; pipefail catches a dump failure mid-stream
if ! docker exec "$CONTAINER" pg_dump -U "$PG_USER" "$PG_DB" 2>>"$LOG_FILE" | gzip > "$OUT.tmp"; then
  rm -f "$OUT.tmp"
  fail "pg_dump pipeline failed"
fi

# Integrity check 1: gzip stream must be valid end-to-end
if ! gunzip -t "$OUT.tmp" 2>>"$LOG_FILE"; then
  rm -f "$OUT.tmp"
  fail "gzip integrity check failed for $OUT.tmp"
fi

# Integrity check 2: decompressed content must contain the pg_dump header
# in the first few lines. pg_dump emits "--" on line 1 and the actual
# "-- PostgreSQL database dump" comment on line 2, so we scan the first 5
# lines rather than only line 1.
HEADER_BLOCK=$(gunzip -c "$OUT.tmp" | head -5)
if ! echo "$HEADER_BLOCK" | grep -q "^-- PostgreSQL database dump"; then
  rm -f "$OUT.tmp"
  fail "pg_dump output missing expected header (first 5 lines: $(echo "$HEADER_BLOCK" | tr '\n' '|' | cut -c1-120))"
fi

# Size sanity: backups have grown to ~8KB. A sub-1KB dump means schema-only
# or empty. Likely broken: alert, but still move the file into place so it
# can be inspected.
SIZE=$(stat -c%s "$OUT.tmp")
if [ "$SIZE" -lt 1024 ]; then
  log "WARN: backup unusually small (${SIZE} bytes)"
  # Point at the final path: the .tmp file no longer exists after the move below.
  "$NOTIFY" "backup-suspicious" "backup is only ${SIZE} bytes; investigate $OUT"
fi

# Atomic move: only swap into place once all checks passed. Without
# `set -e`, a failed mv must be caught explicitly or the script would
# log a successful backup that was never written.
mv "$OUT.tmp" "$OUT" || fail "could not move $OUT.tmp into place"
log "backup written: $OUT (${SIZE} bytes)"

# Optional offsite push: set BMM_BACKUP_REMOTE=<rclone-remote>:<path> in
# /opt/buildmymcpserver/.env.production once rclone is configured. We
# grep-parse rather than sourcing the env file because the env file is
# managed for Docker compose (KEY=value, sometimes unquoted text values
# like names) and `source` evaluates unquoted RHS as shell, which breaks
# on any value containing whitespace.
ENV_FILE="/opt/buildmymcpserver/.env.production"
BMM_BACKUP_REMOTE="$(grep -E '^BMM_BACKUP_REMOTE=' "$ENV_FILE" 2>/dev/null | head -1 | cut -d= -f2- | sed 's/^"\(.*\)"$/\1/; s/^'"'"'\(.*\)'"'"'$/\1/')"
if [ -n "${BMM_BACKUP_REMOTE:-}" ] && command -v rclone >/dev/null 2>&1; then
  if rclone copy "$OUT" "$BMM_BACKUP_REMOTE" --quiet 2>>"$LOG_FILE"; then
    log "offsite copy ok: $BMM_BACKUP_REMOTE"
  else
    "$NOTIFY" "backup-offsite-failed" "rclone copy to $BMM_BACKUP_REMOTE failed"
  fi
fi

# Retention: keep last 14 days
find "$BACKUP_DIR" -maxdepth 1 -name "bmm-*.sql.gz" -mtime "+${RETENTION_DAYS}" -delete 2>>"$LOG_FILE"

log "done"
exit 0
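
The grep-and-cut parsing that backup.sh uses for BMM_BACKUP_REMOTE can be tried in isolation. The sketch below wraps the same pipeline in a hypothetical `read_env_value` function (the name and the two-argument interface are assumptions, not part of the script) to show why a value with whitespace is harmless here while `source` would try to execute part of it.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the grep+cut approach in backup.sh.

# read_env_value FILE KEY
# Prints the value of the first KEY=... line, with one pair of
# surrounding double or single quotes removed. Nothing is evaluated.
read_env_value() {
  local file="$1" key="$2"
  grep -E "^${key}=" "$file" 2>/dev/null | head -1 | cut -d= -f2- \
    | sed -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'\$/\1/"
}
```

For a file containing the line `ADMIN_NAME=Jane Doe`, `read_env_value file ADMIN_NAME` prints `Jane Doe`. Sourcing the same file would treat `Doe` as a command to run with `ADMIN_NAME=Jane` in its environment, which is the failure mode the commit message describes.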