Two overlapping bugs were killing OAuth discovery for every external
MCP client (Claude Desktop, Cursor, etc.):
1. worker.ts injected PUBLIC_URL=http://<RUNNER_HOST>:<port> into the
runner container even when MCP_DOMAIN was set. Result: the runner's
/.well-known/oauth-protected-resource advertised an unreachable URL
and the WWW-Authenticate header pointed at a non-HTTPS loopback
address. Claude Desktop refused to follow the discovery chain.
Now derives PUBLIC_URL from the same computePublicUrl() helper that
builds the user-visible URL stored in mcp_servers.public_url, so the
container's self-reported resource matches its actual route.
2. docker-compose.prod.yml never mounted /opt/buildmymcpserver/runner-map
into the api / generator containers. The .conf snippet written by
the generator landed in an ephemeral container path; the host
inotify watcher saw an empty directory and produced an empty
runner-map.combined. Result: nginx 404'd every /<slug>/* request,
the runner was unreachable from the public domain, and OAuth
discovery couldn't even begin. Mount added to both services.
Existing weather server has the wrong PUBLIC_URL baked in and must be
recreated after deploy. No customers yet.
export computePublicUrl from deploy.ts so worker.ts can call it.
The per-subdomain approach (*.mcp.buildmymcpserver.com) failed at the
Cloudflare edge — Universal SSL only covers ONE-level wildcards, so the
TLS handshake on slug.mcp.buildmymcpserver.com hits SSL alert 40
handshake_failure. The two paths to fix that (CF Advanced Cert Manager
at $10/mo, or a Let's-Encrypt wildcard via DNS-01 with certbot) both
trade either money or ops for the URL aesthetic.
Pivot to path-routing on the single subdomain mcp.buildmymcpserver.com,
which IS covered by free Universal SSL. publicUrl format changes from
https://<slug>.mcp.buildmymcpserver.com → https://mcp.buildmymcpserver.com/<slug>
No recurring cost, works with the existing CF setup, MCP clients don't
care about the URL shape (it comes from the wizard's install snippet).
Code changes:
- generator/lib/deploy.ts:
* publicUrl computed as `${MCP_DOMAIN}/${slug}` instead of `${slug}.${MCP_DOMAIN}`
* writeRunnerMapEntry writes one-line nginx snippet:
if ($bmm_slug = "<slug>") { set $bmm_port <port>; }
(was: a map-entry pair "<slug>.<MCP_DOMAIN> <port>;")
- setup-runner-tls.sh:
* nginx vhost is now single server_name mcp.buildmymcpserver.com
* regex location captures (?<bmm_slug>...)(?<bmm_path>/.*)?
* includes runner-map.combined inside the location block so the
generated if-snippets set $bmm_port; unknown slug → 404
* proxy_pass strips the slug prefix: /<slug>/foo → 127.0.0.1:port/foo
* Prereq docs updated: just A-record for mcp (no wildcard needed),
same Origin CA cert reused
* Added /health endpoint at vhost root for monitoring
Systemd watcher + map dir + volume mounts unchanged — same file paths,
just different snippet content. Re-running setup-runner-tls.sh on the
host overwrites the wildcard vhost with the new path-based one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OAUTH REFRESH-TOKEN
- oauth_tokens.subject column added (migration applied to prod DB): stores
the JWT sub claim from the original authorization so refreshes can
re-mint with the same identity without re-walking the (consumed) code.
- Authorization-code branch now writes subject AND uses a 30-day
expires_at for the row (was 1h — same as access token, which killed
refresh after 1h).
- New refresh_token grant branch:
* looks up token by refresh-hash + expiry
* client_id must match, client_secret verified if confidential
* RFC 8707: requested resource must equal stored resource
* OAuth 2.1 rotation: atomic UPDATE WHERE old_hash → new access JWT,
new refresh token, extended expiry; loser of a race sees invalid_grant
- Access TTL (1h) and refresh TTL (30d) extracted as constants.
Clients no longer have to re-authorize hourly. Closes Zb-001.
PER-RUNNER SUBDOMAIN TLS (Z1-002)
Code path:
- New MCP_DOMAIN env (e.g. "mcp.buildmymcpserver.com") + RUNNER_MAP_DIR
(default /var/runner-map) in generator config.
- deployContainer: writes /var/runner-map/<slug>.conf with content
"slug.MCP_DOMAIN port;" and computes publicUrl as
https://<slug>.<MCP_DOMAIN>. Falls back to http://host:port when
MCP_DOMAIN is unset (zero behaviour change until host is configured).
- stopContainer (both api/lib/docker.ts and generator/lib/deploy.ts) now
accepts an optional slug arg and removes the map fragment. Callers
(DELETE /v1/servers/:id, admin template takedown) updated.
Infra path (one-time host setup — Marco runs as root):
- scripts/setup-runner-tls.sh:
1. nginx vhost matching *.mcp.buildmymcpserver.com via regex →
reads slug→port from /opt/buildmymcpserver/runner-map.combined
2. systemd inotify service watches the map dir, combines fragments
on any change, reloads nginx
3. installs inotify-tools if missing, idempotent
- Prereqs documented at top: Cloudflare wildcard DNS proxied, Origin CA
cert for *.mcp.buildmymcpserver.com, SSL mode Full (strict).
- After running: edit docker-compose.prod.yml to mount the map dir into
api + generator, set MCP_DOMAIN in env, recreate containers.
Closes Zb-001 fully. Closes Z1-002 on the code side; one Marco-on-host
action away from closing it on the infra side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five confirmed findings from the sovereign-audit pass, ordered by severity:
Z3-001 CRITICAL — Fastify now trustProxy:true so req.ip resolves to the
real visitor IP via X-Forwarded-For instead of always being the nginx /
docker-bridge peer. Every per-IP rate-limit in the codebase was silently
collapsed into one global counter; this restores them.
Z1-001 CRITICAL — runner container hardening flags (--read-only,
--cap-drop=ALL, --security-opt=no-new-privileges:true, --pids-limit=100,
--memory=512m, --cpus=0.5, tmpfs /tmp) were sitting commented-out as a
TODO despite /security promising them. Now applied unconditionally on
production/staging; opt-out flag RUNNER_DISABLE_HARDENING=1 for Win-dev.
Z2-001 + Z2-002 CRITICAL / MEDIUM — banned-pattern blacklist tightened
(Function(...) without `new`, process.binding, process.dlopen,
.constructor.constructor, _load, vm.runIn*Context, globalThis['..'],
"system prompt override"). scanForInjection now also walks tool.name and
every inputSchema property description, not only implementation +
description — closes the prompt-injection-into-AI-client surface that
downstream clients (Claude Desktop, Cursor) read verbatim. The duplicate
BANNED_PATTERNS in apps/api/src/routes/servers.ts deleted in favour of
the single shared scanForInjection export from @bmm/llm.
Z4-001 HIGH — /v1/auth/magic-link gained the two-axis daily rate-limit
the SMS endpoint already had: 10/IP/day + 5/email/day. Combined with the
trustProxy fix above these are now real per-visitor limits.
Z4-002 MEDIUM — magic-link callback URL no longer printed to stdout in
production. In dev it still prints (so devs can click the link); in
production we log only "issued, URL withheld" and a loud error if no
email sender is wired (Resend integration is the actual launch
blocker — left as a TODO).
Z6-001 MEDIUM — /v1/builds/:id/stream WebSocket now refuses cross-origin
upgrades. SameSite=Lax already mitigates in modern browsers; this is the
defense-in-depth against browser bugs and non-browser clients.
FALSE POSITIVES dismissed: slug path-traversal (schema regex
^[a-z][a-z0-9-]*$ in @bmm/types catches it); session-after-promote
(getSession re-fetches isAdmin from DB on every request).
DEFERRED (not blockers, tracked):
- Z1-002 generated-server HTTPS — needs nginx wildcard subdomain TLS
- Z1-003 docker image cleanup cron
- Z2-001 v2 — real sandbox runtime (multi-week refactor)
- Z3-002 rawBody-per-request memory — branch on webhook path only
- Z5-001 multi-user org RBAC for billing — gated on Team feature
- Email sender integration (Resend) — launch blocker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sovereign-audit follow-up. The audit's finding pass missed this: every
Iterate (version > 1) ran allocatePort -> a NEW port and deployContainer -> a
NEW container, then pointed the DB row at it — and never stopped the old
container. The previous version kept running forever, holding a host port,
with the old secrets baked into its env, untracked (its containerId was
overwritten in the DB by deployContainer). Same bug class as API-SERVERS-001
but on the iterate path.
Fix: the worker captures the server's current containerId before the build
mutates the row, and after the new container is confirmed live + the DB
updated, it stops the old one. This also makes the 'rolling deploy' the UI
promises actually true — the old version stays up until the new one is live,
then is retired.
deploy.ts stopContainer now returns { ok, detail } (was void) so the worker
can log the outcome.
Verified: generator typecheck clean.