All checks were successful
Deploy to Production / deploy (push) Successful in 1m10s
Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar
overlays don't clobber them) for three previously-missing production
readiness items:
1. Backup hardening (backup.sh):
- Previous cron one-liner did pg_dump | gzip with no validation.
- Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump
header sanity (scans first 5 lines — line 1 is just "--", actual
"PostgreSQL database dump" comment lands on line 2), size-warning
under 1KB, atomic move-into-place so partial backups never replace
the previous good file. 14-day retention preserved.
- Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via
grep+cut, NOT `source` — the .env.production has unquoted text
values (e.g. ADMIN_NAME) that crash a sourced shell.
2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly):
- Restores the newest backup into a throwaway DB inside the same
Postgres container, verifies the core tables exist (users,
sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves
backups are actually restorable, not just byte-streams that look
like backups. Silent-corruption detector.
3. Self-hosted uptime monitor (uptime-check.sh, every 5 min):
- Probes homepage + /api/health + /robots.txt.
- Edge-triggered alerting: SMS via Twilio only on up→down and
down→up transitions (avoids SMS storm during sustained outages).
- Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box
itself dies the heartbeat stops and the external watchdog alerts
(covers the gap that self-hosted monitors can't see their own
box failing).
notify.sh is the shared helper: Twilio SMS if all four creds set,
optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never
fails loudly — broken notification path still lands in journalctl
-t bmm-ops.
README.md documents the 3-2-1 strategy, manual full-recovery
procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box).
Smoke-tested all three on prod: backup wrote 8004 bytes with checks
passing, restore-test confirmed schema, uptime probe returned up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
58 lines
2.0 KiB
Bash
58 lines
2.0 KiB
Bash
#!/usr/bin/env bash
|
|
# Weekly restore test — proves backups are actually restorable, not just
|
|
# byte-streams that look like backups. Restores latest backup into a
|
|
# temporary DB inside the same Postgres container, runs a schema check,
|
|
# then drops the temp DB.
|
|
#
|
|
# Cron: 30 4 * * 0 root /opt/bmm-ops/restore-test.sh (Sundays 04:30 UTC)
|
|
|
|
set -uo pipefail
|
|
|
|
BACKUP_DIR="/var/backups/bmm"
|
|
LOG_FILE="/var/log/bmm-backup.log"
|
|
NOTIFY="/opt/bmm-ops/notify.sh"
|
|
PG_USER="bmm"
|
|
CONTAINER="bmm-postgres"
|
|
TEMP_DB="bmm_restore_test_$(date +%s)"
|
|
TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|
|
|
|
log() { echo "[${TS}] restore-test: $*" >> "$LOG_FILE"; }
|
|
fail() {
|
|
log "FAIL: $*"
|
|
docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE IF EXISTS ${TEMP_DB}" >/dev/null 2>&1
|
|
"$NOTIFY" "restore-test-failed" "$*"
|
|
exit 1
|
|
}
|
|
|
|
# Find newest backup
|
|
LATEST=$(ls -t "${BACKUP_DIR}"/bmm-*.sql.gz 2>/dev/null | head -1)
|
|
if [ -z "$LATEST" ] || [ ! -f "$LATEST" ]; then
|
|
fail "no backup found in ${BACKUP_DIR}"
|
|
fi
|
|
|
|
log "testing restore from: $LATEST"
|
|
|
|
# Create temp DB
|
|
if ! docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "CREATE DATABASE ${TEMP_DB}" >/dev/null 2>&1; then
|
|
fail "could not create temp DB ${TEMP_DB}"
|
|
fi
|
|
|
|
# Restore — pipe through container stdin
|
|
if ! gunzip -c "$LATEST" | docker exec -i "$CONTAINER" psql -U "$PG_USER" -d "$TEMP_DB" >/dev/null 2>>"$LOG_FILE"; then
|
|
fail "psql restore failed for $LATEST"
|
|
fi
|
|
|
|
# Schema sanity — expect the core tables to exist (adjust if schema evolves)
|
|
EXPECTED_TABLES="users sessions oauth_tokens mcp_servers"
|
|
for tbl in $EXPECTED_TABLES; do
|
|
COUNT=$(docker exec "$CONTAINER" psql -U "$PG_USER" -d "$TEMP_DB" -tAc "SELECT count(*) FROM information_schema.tables WHERE table_name='${tbl}'" 2>>"$LOG_FILE")
|
|
if [ "$COUNT" != "1" ]; then
|
|
fail "restored DB missing expected table: ${tbl}"
|
|
fi
|
|
done
|
|
|
|
# Drop temp DB
|
|
docker exec "$CONTAINER" psql -U "$PG_USER" -d postgres -c "DROP DATABASE ${TEMP_DB}" >/dev/null 2>&1
|
|
log "ok — $LATEST restores cleanly, schema validates"
|
|
exit 0
|