buildmymcpserver/ops/bmm/README.md

# BMM Ops — Backup & Uptime

Scripts live in `/opt/bmm-ops/` on the Hetzner box. They are intentionally
**outside** `/opt/buildmymcpserver/` so app deploys (tar overlay) don't
overwrite them.

## Scripts

| Script | Cron | Purpose |
|---|---|---|
| `backup.sh` | `15 3 * * *` (03:15 UTC daily) | pg_dump → gzip → integrity-check → offsite push → 14d retention |
| `restore-test.sh` | `30 4 * * 0` (Sun 04:30 UTC weekly) | Restores latest backup into temp DB, verifies core schema, drops temp DB |
| `uptime-check.sh` | `*/5 * * * *` (every 5 min) | Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog |
| `notify.sh` | (helper) | Sends Twilio SMS + optional webhook on alert; always syslogs |

## Backup Strategy

**3-2-1 rule applied:**
- **3 copies** — Postgres volume (live) + local gzip (`/var/backups/bmm/`) + offsite (rclone target)
- **2 different media** — Hetzner box SSD + external object storage (R2/B2/Hetzner Storage Box)
- **1 offsite** — not on the same machine

**Retention:**
- Local: 14 days rolling
- Offsite: configure via rclone lifecycle on the bucket (recommend: 30 daily, 12 monthly)

**Integrity guarantees (every run):**
1. `pg_dump` pipefail-safe — partial dump never overwrites previous good backup
2. `gunzip -t` validates the compressed stream
3. Header sanity check — decompressed first line must start with `-- PostgreSQL database dump` (catches the case where pg_dump emitted an error message that still compressed cleanly)
4. Size warning if backup drops below 1KB
5. Atomic mv-into-place — only swap the dated filename in once all checks pass

**Restore drill (proven, not assumed):**
- `restore-test.sh` runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies the core tables (`users`, `sessions`, `oauth_tokens`, `mcp_servers`) exist, then drops the temp DB.
- Failure here sends an SMS — silent-corruption detection.

**Manual restore (full recovery procedure):**

```bash
# 1. Stop dependent services
docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator

# 2. Drop + recreate target DB (DANGER — destroys current data)
docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"
docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"

# 3. Restore
gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz | docker exec -i bmm-postgres psql -U bmm -d bmm

# 4. Restart services
docker compose --env-file .env.production -f docker-compose.prod.yml up -d
```

If the box itself is gone: spin up a fresh Postgres on a new box, `scp` the
latest offsite backup from R2/B2, then run step 3 above against the new
container.

## Offsite — to enable

Pick one of three:

### Option A — Cloudflare R2 (recommended, you're already on CF)
```bash
apt-get install -y rclone
rclone config  # → New remote → s3 → Cloudflare → fill access_key/secret/endpoint
# Then in /opt/buildmymcpserver/.env.production:
BMM_BACKUP_REMOTE=r2:bmm-backups/postgres
```
10 GB free, no egress fees.

### Option B — Backblaze B2
Same `rclone config` but choose Backblaze B2. $6/TB/mo, 10 GB free.

### Option C — Hetzner Storage Box
Order one in Hetzner Robot. Uses SFTP via rclone (`sftp` remote type).
Cheapest for keeping data inside the same provider's network.

The `backup.sh` script picks up `BMM_BACKUP_REMOTE` automatically and runs
`rclone copy` after every successful local backup. No code changes needed.

## Uptime Monitoring

**Two-layer strategy** — covers both app failure and box failure:

1. **Self-hosted probe** (`uptime-check.sh` every 5 min from this box):
   Detects app-layer outages — Postgres down, API returning 500, web container
   crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to
   avoid SMS-storm on sustained outages); sends an "uptime-recovered" SMS
   when the next tick succeeds.

2. **External watchdog** (healthchecks.io heartbeat):
   `uptime-check.sh` pings `HEALTHCHECKS_HEARTBEAT_URL` on every successful
   probe. If the box itself dies (network loss, hardware fail, kernel panic),
   no heartbeat arrives → healthchecks.io alerts via its own channel.
   **Without this layer the self-hosted monitor cannot detect box-level
   failures** — they kill the monitor itself.

**To enable external watchdog:**
1. Sign up at https://healthchecks.io (free, no credit card)
2. Create a new check with 5-minute period, 1-minute grace
3. Copy the ping URL (`https://hc-ping.com/<uuid>`)
4. In `/opt/buildmymcpserver/.env.production`:
   ```
   HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/<your-uuid>
   ```
5. Configure healthchecks.io to send email / SMS / Slack on failure

**Alert target (for both layers):**
`ADMIN_PHONE=+41XXXXXXXXX` must be set in `.env.production` for Twilio SMS.
Without it, alerts still land in syslog (`journalctl -t bmm-ops`) but no SMS.

## Logs

| File | Content |
|---|---|
| `/var/log/bmm-backup.log` | Backup + restore-test history |
| `/var/log/bmm-uptime.log` | One line per 5-min check |
| `journalctl -t bmm-ops` | All notify.sh events (syslog) |

## Cron files

| `/etc/cron.d/bmm-postgres-backup` | runs backup.sh |
| `/etc/cron.d/bmm-restore-test`    | runs restore-test.sh |
| `/etc/cron.d/bmm-uptime`          | runs uptime-check.sh |
ops: backup hardening + restore drill + self-hosted uptime monitor Adds /opt/bmm-ops/ scripts (deployed separately from the app, so tar overlays don't clobber them) for three previously-missing production readiness items: 1. Backup hardening (backup.sh): - Previous cron one-liner did pg_dump \| gzip with no validation. - Now: pipefail-safe pg_dump, gunzip -t integrity check, pg_dump header sanity (scans first 5 lines — line 1 is just "--", actual "PostgreSQL database dump" comment lands on line 2), size-warning under 1KB, atomic move-into-place so partial backups never replace the previous good file. 14-day retention preserved. - Optional offsite via BMM_BACKUP_REMOTE (rclone). Reads env via grep+cut, NOT `source` — the .env.production has unquoted text values (e.g. ADMIN_NAME) that crash a sourced shell. 2. Restore drill (restore-test.sh, Sun 04:30 UTC weekly): - Restores the newest backup into a throwaway DB inside the same Postgres container, verifies the core tables exist (users, sessions, oauth_tokens, mcp_servers), drops the temp DB. Proves backups are actually restorable, not just byte-streams that look like backups. Silent-corruption detector. 3. Self-hosted uptime monitor (uptime-check.sh, every 5 min): - Probes homepage + /api/health + /robots.txt. - Edge-triggered alerting: SMS via Twilio only on up→down and down→up transitions (avoids SMS storm during sustained outages). - Pings HEALTHCHECKS_HEARTBEAT_URL on every success — when the box itself dies the heartbeat stops and the external watchdog alerts (covers the gap that self-hosted monitors can't see their own box failing). notify.sh is the shared helper: Twilio SMS if all four creds set, optional webhook to HEALTHCHECKS_FAIL_URL, always logs to syslog. Never fails loudly — broken notification path still lands in journalctl -t bmm-ops. README.md documents the 3-2-1 strategy, manual full-recovery procedure, and how to enable offsite (R2 / B2 / Hetzner Storage Box). Smoke-tested all three on prod: backup wrote 8004 bytes with checks passing, restore-test confirmed schema, uptime probe returned up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> 2026-05-26 23:46:42 +02:00			`# BMM Ops — Backup & Uptime`

			Scripts live in `/opt/bmm-ops/` on the Hetzner box. They are intentionally
			outside `/opt/buildmymcpserver/` so app deploys (tar overlay) don't
			`overwrite them.`

			`## Scripts`

			`\| Script \| Cron \| Purpose \|`
			`\|---\|---\|---\|`
			\| `backup.sh` \| `15 3 * * *` (03:15 UTC daily) \| pg_dump → gzip → integrity-check → offsite push → 14d retention \|
			\| `restore-test.sh` \| `30 4 * * 0` (Sun 04:30 UTC weekly) \| Restores latest backup into temp DB, verifies core schema, drops temp DB \|
			\| `uptime-check.sh` \| `/5 * * *` (every 5 min) \| Probes homepage + API health + robots.txt; alerts on edge transitions; heartbeats external watchdog \|
			\| `notify.sh` \| (helper) \| Sends Twilio SMS + optional webhook on alert; always syslogs \|

			`## Backup Strategy`

			`3-2-1 rule applied:`
			- 3 copies — Postgres volume (live) + local gzip (`/var/backups/bmm/`) + offsite (rclone target)
			`- 2 different media — Hetzner box SSD + external object storage (R2/B2/Hetzner Storage Box)`
			`- 1 offsite — not on the same machine`

			`Retention:`
			`- Local: 14 days rolling`
			`- Offsite: configure via rclone lifecycle on the bucket (recommend: 30 daily, 12 monthly)`

			`Integrity guarantees (every run):`
			1. `pg_dump` pipefail-safe — partial dump never overwrites previous good backup
			2. `gunzip -t` validates the compressed stream
			3. Header sanity check — decompressed first line must start with `-- PostgreSQL database dump` (catches the case where pg_dump emitted an error message that still compressed cleanly)
			`4. Size warning if backup drops below 1KB`
			`5. Atomic mv-into-place — only swap the dated filename in once all checks pass`

			`Restore drill (proven, not assumed):`
			- `restore-test.sh` runs weekly. It creates a throwaway DB inside the same Postgres container, restores the newest backup, verifies the core tables (`users`, `sessions`, `oauth_tokens`, `mcp_servers`) exist, then drops the temp DB.
			`- Failure here sends an SMS — silent-corruption detection.`

			`Manual restore (full recovery procedure):`

			```bash
			`# 1. Stop dependent services`
			`docker compose --env-file .env.production -f docker-compose.prod.yml stop api web generator`

			`# 2. Drop + recreate target DB (DANGER — destroys current data)`
			`docker exec bmm-postgres psql -U bmm -d postgres -c "DROP DATABASE IF EXISTS bmm"`
			`docker exec bmm-postgres psql -U bmm -d postgres -c "CREATE DATABASE bmm OWNER bmm"`

			`# 3. Restore`
			`gunzip -c /var/backups/bmm/bmm-YYYYMMDD.sql.gz \| docker exec -i bmm-postgres psql -U bmm -d bmm`

			`# 4. Restart services`
			`docker compose --env-file .env.production -f docker-compose.prod.yml up -d`
			```

			If the box itself is gone: spin up a fresh Postgres on a new box, `scp` the
			`latest offsite backup from R2/B2, then run step 3 above against the new`
			`container.`

			`## Offsite — to enable`

			`Pick one of three:`

			`### Option A — Cloudflare R2 (recommended, you're already on CF)`
			```bash
			`apt-get install -y rclone`
			`rclone config # → New remote → s3 → Cloudflare → fill access_key/secret/endpoint`
			`# Then in /opt/buildmymcpserver/.env.production:`
			`BMM_BACKUP_REMOTE=r2:bmm-backups/postgres`
			```
			`10 GB free, no egress fees.`

			`### Option B — Backblaze B2`
			Same `rclone config` but choose Backblaze B2. $6/TB/mo, 10 GB free.

			`### Option C — Hetzner Storage Box`
			Order one in Hetzner Robot. Uses SFTP via rclone (`sftp` remote type).
			`Cheapest for keeping data inside the same provider's network.`

			The `backup.sh` script picks up `BMM_BACKUP_REMOTE` automatically and runs
			`rclone copy` after every successful local backup. No code changes needed.

			`## Uptime Monitoring`

			`Two-layer strategy — covers both app failure and box failure:`

			1. Self-hosted probe (`uptime-check.sh` every 5 min from this box):
			`Detects app-layer outages — Postgres down, API returning 500, web container`
			`crashed. Sends SMS via Twilio on the first failing tick (edge-triggered to`
			`avoid SMS-storm on sustained outages); sends an "uptime-recovered" SMS`
			`when the next tick succeeds.`

			`2. External watchdog (healthchecks.io heartbeat):`
			`uptime-check.sh` pings `HEALTHCHECKS_HEARTBEAT_URL` on every successful
			`probe. If the box itself dies (network loss, hardware fail, kernel panic),`
			`no heartbeat arrives → healthchecks.io alerts via its own channel.`
			`**Without this layer the self-hosted monitor cannot detect box-level`
			`failures** — they kill the monitor itself.`

			`To enable external watchdog:`
			`1. Sign up at https://healthchecks.io (free, no credit card)`
			`2. Create a new check with 5-minute period, 1-minute grace`
			3. Copy the ping URL (`https://hc-ping.com/<uuid>`)
			4. In `/opt/buildmymcpserver/.env.production`:
			```
			`HEALTHCHECKS_HEARTBEAT_URL=https://hc-ping.com/<your-uuid>`
			```
			`5. Configure healthchecks.io to send email / SMS / Slack on failure`

			`Alert target (for both layers):`
			`ADMIN_PHONE=+41XXXXXXXXX` must be set in `.env.production` for Twilio SMS.
			Without it, alerts still land in syslog (`journalctl -t bmm-ops`) but no SMS.

			`## Logs`

			`\| File \| Content \|`
			`\|---\|---\|`
			\| `/var/log/bmm-backup.log` \| Backup + restore-test history \|
			\| `/var/log/bmm-uptime.log` \| One line per 5-min check \|
			\| `journalctl -t bmm-ops` \| All notify.sh events (syslog) \|

			`## Cron files`

			\| `/etc/cron.d/bmm-postgres-backup` \| runs backup.sh \|
			\| `/etc/cron.d/bmm-restore-test` \| runs restore-test.sh \|
			\| `/etc/cron.d/bmm-uptime` \| runs uptime-check.sh \|