# BMM Ops: Backup & Uptime

Scripts live in `/opt/bmm-ops/` on the Hetzner box. They are intentionally
**outside** `/opt/buildmymcpserver/` so app deploys (tar overlay) don't
overwrite them.

## Scripts

| Script | Cron | Purpose |
|---|---|---|
| `backup.sh` | `15 3 * * *` (03:15 UTC daily) | pg_dump → gzip → integrity-check → offsite push → 14d retention |
| `restore-test.sh` | `30 4 * * 0` (Sun 04:30 UTC weekly) | Restores latest backup into temp DB, verifies core schema, drops temp DB |
| `uptime-check.sh` | `*/5 * * * *` (every 5 min) | Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog |
| `notify.sh` | (helper) | Sends Twilio SMS + optional webhook on alert; always syslogs |

## Backup Strategy

**3-2-1 rule applied:**
- **3 copies**: Postgres volume (live) + local gzip (`/var/backups/bmm/`) + offsite (rclone target)
- **2 different media**: Hetzner box SSD + external storage (R2/B2 object storage or a Hetzner Storage Box)
- **1 offsite**: not on the same machine

**Retention:**
- Local: 14 days rolling
- Offsite: configure a lifecycle rule on the bucket itself, since retention is enforced by the storage provider and not by rclone (recommend: 30 daily, 12 monthly)

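
Rolling local retention of this kind is commonly done with `find -mtime`. The following is a minimal sketch, not necessarily what `backup.sh` does: `prune_local_backups` is a hypothetical helper, and the `bmm-*.sql.gz` pattern is assumed from the filename used in the manual restore command. Note that `-mtime +N` matches files whose age, truncated to whole days, is greater than N, so the exact cutoff may need adjusting by one day.

```bash
# Hypothetical helper: delete local backups older than the retention window.
# Usage: prune_local_backups <dir> [days]
prune_local_backups() {
  local dir="$1" days="${2:-14}"
  # -maxdepth 1 keeps the search out of subdirectories;
  # -print lists each file before -delete removes it (useful in the log).
  find "$dir" -maxdepth 1 -type f -name 'bmm-*.sql.gz' \
    -mtime +"$days" -print -delete
}

# Example (assumed directory from the 3-2-1 list):
#   prune_local_backups /var/backups/bmm 14
```
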
**Integrity guarantees (every run):**
1. `pg_dump` runs under `pipefail`, so a partial dump never overwrites the previous good backup
2. `gunzip -t` validates the compressed stream
3. Header sanity check: the decompressed header must contain the `-- PostgreSQL database dump` banner (catches the case where pg_dump emitted an error message that still compressed cleanly)
4. Size warning if the backup is smaller than 1 KB
5. Atomic `mv` into place: the dated filename is swapped in only once all checks pass

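
The five checks above can be sketched as two small functions. This is an illustration under stated assumptions, not the contents of `backup.sh`: `verify_dump` and `run_backup` are hypothetical names, the `.partial` suffix is invented, and the dump command is passed in as arguments so the logic can be exercised without Postgres. Plain-format pg_dump output opens with a bare `--` line followed by the banner, so the sketch looks for the banner in the first five lines.

```bash
# verify_dump <file.gz>: checks 2-4. Returns non-zero on a hard failure.
verify_dump() {
  local f="$1" header size
  # Check 2: the gzip stream must be intact.
  gunzip -t "$f" 2>/dev/null || { echo "FAIL: corrupt gzip: $f" >&2; return 1; }
  # Check 3: the pg_dump banner must appear near the top of the file.
  header=$(gunzip -c "$f" 2>/dev/null | head -n 5)
  case "$header" in
    *"-- PostgreSQL database dump"*) ;;
    *) echo "FAIL: no pg_dump banner in $f" >&2; return 1 ;;
  esac
  # Check 4: a tiny file is suspicious but only warrants a warning.
  size=$(wc -c < "$f")
  [ "$size" -ge 1024 ] || echo "WARN: $f is only $size bytes" >&2
  return 0
}

# run_backup <dest.gz> <dump command...>: checks 1 and 5.
run_backup() {
  local dest="$1" tmp="$1.partial"
  shift
  # pipefail makes a failing dump command fail the pipeline even though
  # gzip itself exits 0; the subshell keeps the option local.
  ( set -o pipefail; "$@" | gzip -c > "$tmp" ) || { rm -f "$tmp"; return 1; }
  verify_dump "$tmp" || { rm -f "$tmp"; return 1; }
  # A rename within one filesystem is atomic, so readers see either the
  # old file or the complete new one.
  mv -f "$tmp" "$dest"
}

# Example (assumed pg_dump invocation, mirroring the psql commands in this doc):
#   run_backup "/var/backups/bmm/bmm-$(date +%Y%m%d).sql.gz" \
#     docker exec bmm-postgres pg_dump -U bmm bmm
```
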
**Restore drill (proven, not assumed):**
- `restore-test.sh` runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies that the core tables (`users`, `sessions`, `oauth_tokens`, `mcp_servers`) exist, then drops the temp DB.
- A failure here sends an SMS, so silent corruption is caught by the weekly drill rather than during a real recovery.

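
The schema check at the heart of the drill can be sketched as a pure function that compares a table listing against the required set. `missing_tables` is a hypothetical helper and the commented commands are assumptions modelled on the `docker exec bmm-postgres psql` calls used elsewhere in this document; the real `restore-test.sh` may verify the schema differently.

```bash
# Core tables the drill expects, taken from the list above.
REQUIRED_TABLES="users sessions oauth_tokens mcp_servers"

# missing_tables: reads one table name per line on stdin and prints every
# required table that is absent. Prints nothing when the schema is complete.
missing_tables() {
  local present t
  present=$(cat)
  for t in $REQUIRED_TABLES; do
    # -x matches whole lines only, -F treats the name as a literal string.
    printf '%s\n' "$present" | grep -qxF -- "$t" || echo "$t"
  done
}

# How the drill could feed it (assumed commands; needs the running container):
#   tmp_db="bmm_restore_test_$$"
#   docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE $tmp_db"
#   gunzip -c "$latest" | docker exec -i bmm-postgres psql -q -U bmm -d "$tmp_db"
#   docker exec bmm-postgres psql -U bmm -d "$tmp_db" -Atc \
#     "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" | missing_tables
#   docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE $tmp_db"
```
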
**Manual restore (full recovery procedure):**

```bash
# Run from the directory that contains docker-compose.prod.yml.

# 1. Stop dependent services
docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator

# 2. Drop + recreate target DB (DANGER: destroys current data)
docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"
docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"

# 3. Restore (replace YYYYMMDD with the date of the backup to restore)
gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz | docker exec -i bmm-postgres psql -U bmm -d bmm

# 4. Restart services
docker compose --env-file .env.production -f docker-compose.prod.yml up -d
```

If the box itself is gone: spin up a fresh Postgres on a new box, pull the
latest offsite backup with `rclone copy` (R2 and B2 are object stores and
cannot be reached with `scp`; a Storage Box can), then run step 3 above
against the new container.

## Offsite: to enable

Pick one of three:

### Option A: Cloudflare R2 (recommended, you're already on CF)
```bash
apt-get install -y rclone
rclone config    # → New remote (name it "r2") → s3 → Cloudflare → fill access_key/secret/endpoint
# Then in /opt/buildmymcpserver/.env.production:
BMM_BACKUP_REMOTE=r2:bmm-backups/postgres
```
10 GB free, no egress fees.

### Option B: Backblaze B2
Same `rclone config` but choose Backblaze B2. $6/TB/mo, 10 GB free.

### Option C: Hetzner Storage Box
Order one in Hetzner Robot. Uses SFTP via rclone (`sftp` remote type).
Cheapest for keeping data inside the same provider's network.

The `backup.sh` script picks up `BMM_BACKUP_REMOTE` automatically and runs
`rclone copy` after every successful local backup. No code changes needed.

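
The "picks up `BMM_BACKUP_REMOTE` automatically" behaviour could look roughly like the sketch below. `push_offsite` is a hypothetical function name and the skip message is invented; only `rclone copy` and the variable name come from this document. `rclone copy` does not delete anything on the destination, which is why offsite expiry is left to the bucket's lifecycle rule.

```bash
# push_offsite <file>: copy one backup to the configured remote, or skip
# when BMM_BACKUP_REMOTE is unset or empty.
push_offsite() {
  local file="$1"
  if [ -z "${BMM_BACKUP_REMOTE:-}" ]; then
    echo "offsite push skipped: BMM_BACKUP_REMOTE not set"
    return 0
  fi
  # Copies the single file into the remote path; existing remote files
  # are left untouched.
  rclone copy "$file" "$BMM_BACKUP_REMOTE"
}

# Example:
#   BMM_BACKUP_REMOTE=r2:bmm-backups/postgres
#   push_offsite /var/backups/bmm/bmm-YYYYMMDD.sql.gz
```
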
## Uptime Monitoring

**Two-layer strategy** that covers both app failure and box failure:

1. **Self-hosted probe** (`uptime-check.sh` every 5 min from this box):
   Detects app-layer outages: Postgres down, API returning 500, web container
   crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to
   avoid an SMS storm on sustained outages); sends an "uptime-recovered" SMS
   on the first tick that succeeds again.

2. **External watchdog** (healthchecks.io heartbeat):
   `uptime-check.sh` pings `HEALTHCHECKS_HEARTBEAT_URL` on every successful
   probe. If the box itself dies (network loss, hardware failure, kernel panic),
   no heartbeat arrives → healthchecks.io alerts via its own channel.
   **Without this layer the self-hosted monitor cannot detect box-level
   failures**, because they kill the monitor itself.

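
Edge-triggered alerting needs only one bit of state that survives between cron ticks. A minimal sketch, assuming a state file holding `up` or `down`: the `transition` function, the state-file path, and the way `notify.sh` is invoked in the comments are all hypothetical and not taken from `uptime-check.sh`. A missing state file is treated as `up`, so the very first failing tick still alerts.

```bash
# transition <state_file> <up|down>
# Prints "down" on the first failing tick, "recovered" on the first passing
# tick after an outage, and nothing while the status is unchanged.
transition() {
  local state_file="$1" now="$2" prev="up"
  [ -f "$state_file" ] && prev=$(cat "$state_file")
  printf '%s\n' "$now" > "$state_file"
  if [ "$prev" = "up" ] && [ "$now" = "down" ]; then
    echo "down"
  elif [ "$prev" = "down" ] && [ "$now" = "up" ]; then
    echo "recovered"
  fi
}

# How a tick could use it (assumed paths and URLs):
#   if curl -fsS --max-time 10 "$PROBE_URL" >/dev/null; then status=up; else status=down; fi
#   case "$(transition /var/lib/bmm-ops/uptime.state "$status")" in
#     down)      /opt/bmm-ops/notify.sh "uptime-down" ;;
#     recovered) /opt/bmm-ops/notify.sh "uptime-recovered" ;;
#   esac
#   # Heartbeat only on success, so a dead box or a failing probe goes silent.
#   [ "$status" = up ] && curl -fsS --max-time 10 "$HEALTHCHECKS_HEARTBEAT_URL" >/dev/null
```
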
**To enable external watchdog:**
1. Sign up at https://healthchecks.io (free, no credit card)
2. Create a new check with 5-minute period, 1-minute grace
3. Copy the ping URL (`https://hc-ping.com/<uuid>`)
4. In `/opt/buildmymcpserver/.env.production`:
   ```
   HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/<your-uuid>
   ```
5. Configure healthchecks.io to send email / SMS / Slack on failure

**Alert target (for both layers):**
`ADMIN_PHONE=+41XXXXXXXXX` must be set in `.env.production` for Twilio SMS.
Without it, alerts still land in syslog (`journalctl -t bmm-ops`) but no SMS is sent.

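
The "always syslog, SMS only if configured" behaviour could be sketched as follows. This is not the contents of `notify.sh`: the function name and the `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN` and `TWILIO_FROM` variable names are assumptions, and the optional webhook is omitted. The endpoint and form fields are those of Twilio's Messages REST resource.

```bash
# notify <message>: always write to syslog, then send an SMS if a phone
# number is configured.
notify() {
  local msg="$1"
  # -t sets the syslog tag that `journalctl -t bmm-ops` filters on.
  logger -t bmm-ops -- "$msg"
  if [ -z "${ADMIN_PHONE:-}" ]; then
    return 0   # no phone configured: syslog only
  fi
  # Assumed variable names for the Twilio credentials and sender number.
  curl -fsS --max-time 15 \
    -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}" \
    --data-urlencode "To=${ADMIN_PHONE}" \
    --data-urlencode "From=${TWILIO_FROM}" \
    --data-urlencode "Body=${msg}" \
    "https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/Messages.json" \
    >/dev/null
}

# Example:
#   notify "uptime-down: API health check failed"
```
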
## Logs

| Source | Content |
|---|---|
| `/var/log/bmm-backup.log` | Backup + restore-test history |
| `/var/log/bmm-uptime.log` | One line per 5-min check |
| `journalctl -t bmm-ops` | All notify.sh events (syslog) |

## Cron files

| File | Runs |
|---|---|
| `/etc/cron.d/bmm-postgres-backup` | runs backup.sh |
| `/etc/cron.d/bmm-restore-test` | runs restore-test.sh |
| `/etc/cron.d/bmm-uptime` | runs uptime-check.sh |