Auth chain finally landed but tool calls crashed in the wetter server
with "Error: params is not defined". The MCP SDK passes the validated
tool args as a single parameter; our template names that parameter
`args` but the model frequently writes `params.location` / `input.x`
because that's how OpenAPI and JSON-RPC reference docs read.
Two-sided fix:
- render.ts wraps every implementation with `const params = args; const
input = args;` inside the try block. Whichever alias the model
picked, the variable resolves to the same validated object.
- SYSTEM_PROMPT now states the variable name EXPLICITLY ("variable
named EXACTLY `args`, e.g. args.location") so new generations stop
drifting on that detail.
Existing wetter runner needs a rebuild to pick up the alias shim.
Architectural fix for "spec_too_large" / preview_timeout — the sync
endpoint had to fit the whole model run into Cloudflare's ~100s edge
window, which made the system fragile against any prompt that produced
a verbose spec. The new streaming path pipes Anthropic's token deltas
as Server-Sent Events; every chunk resets CF's idle timer and a 15s
keepalive comment guarantees activity even during slow first-token
windows.
@bmm/llm: new streamSpecFromAnthropic() exposes the SDK's .stream()
flow with the same typed-error contract as generateSpec — same
SpecTruncatedError / SpecValidationError / SpecTimeoutError raised from
the relevant moment.
API: POST /v1/servers/preview/stream returns text/event-stream with
events 'text' (deltas), 'spec' (final success payload, same shape as
the sync endpoint), 'error' (typed). Anthropic-only — GLM/hobby falls
back to the sync route via 409 streaming_unavailable.
Frontend: apiSseStream() handles the POST + ReadableStream + SSE
parser. The wizard's analyze() prefers the stream and only uses the
sync endpoint on the explicit 409 fallback.
nginx (api.buildmymcpserver.com): the /v1/builds/ location block (which
already had proxy_buffering off + 600s read timeout for the WS build
stream) now also matches /v1/servers/preview/stream so the SSE
response isn't buffered.
Sonnet 4.6 was still hitting max_tokens on ambitious prompts like
"WorldWeather MCP for any location" because the implementation bodies
ballooned with defensive scaffolding. Two changes:
1. SYSTEM_PROMPT now imposes hard limits the model can self-enforce:
- at most 6 tools (combine related capabilities with a mode param)
- implementation body <= 40 lines, no comments, no overengineering
- descriptions <= 100 chars
These keep a typical preview under ~7k output tokens.
2. team/enterprise maxTokens 8192 -> 12288. At ~130 tok/s that fits in
~94s, still under Cloudflare's 100s edge cap. Hobby (GLM) and pro
(Haiku) keep their existing limits — they were not hitting the
ceiling.
SpecTruncatedError still fires + surfaces 422 spec_too_large when even
12288 isn't enough, so the user gets actionable feedback instead of an
opaque zod error.
Root cause of repeat 422s: 4096 was too tight for ambitious prompts
(Marco's research-assistant prompt produces ~12kB of JSON before the
model gets cut off mid-string). The error then surfaced as an opaque
"Unterminated string in JSON" zod failure instead of pointing the user
at the real problem.
Two fixes:
- maxTokens back to 8192 (the original) for all Claude tiers, 4096 for
GLM. Timeouts bumped to 95s — Sonnet 4.6 at ~130 tok/s does 8192 in
~63s, ~30s headroom for cold starts, still under Cloudflare's 100s
edge cap.
- Detect stop_reason === 'max_tokens' on the Anthropic response BEFORE
parsing and throw the new SpecTruncatedError. /preview catches it
and returns 422 spec_too_large with a clear "split the prompt"
message instead of leaking the zod parse failure.
422s from /preview hid the actual reason: zod_message tells which field
was wrong and a 400-char preview of the model output reveals refusals
or non-JSON returns. Both stay in the api log only — never surfaced
to the client unchanged.
Enterprise plan was hitting SpecTimeoutError exactly at 60s because the
Sonnet 4.6 preview was budgeted for 8192 tokens at ~80 tok/s (≈102s
worst case) inside a 60s window. The frontend then rolled back to step
1 with no spec.
A real spec is small (<= ~10 tools, ~1.5–2.5k output tokens in practice)
so 4096 is plenty and lets even Sonnet finish in ~51s worst case. The
90s timeout buys headroom for cold starts while staying under
Cloudflare's 100s edge cap. Hobby/GLM bumped to 90s too — same
headroom argument.
Six confirmed findings closed (3 MEDIUM, 3 LOW). Tier-1 surfaces from
Pass-1 re-verified non-regressed; this pass deepened the audit on the
auth library, OAuth issuer, and template marketplace.
Za-002 MEDIUM (scrypt cost) — bump SCRYPT_N from 2^14 → 2^17 (131072)
matching current OWASP guidance for password hashing in 2026. Hash
format embeds N (`scrypt$N$salt$hash`), so the existing admin
password at the old cost still verifies — backward-compatible. Also
added explicit maxmem ceilings since Node's default (~32MiB) is
insufficient for the new N.
Za-003 MEDIUM (single-use race) — consumeMagicLink was SELECT-then-
UPDATE; two parallel redemptions could both win and mint two
sessions from the same token. Now uses the same atomic
`UPDATE … WHERE id = ? AND consumedAt IS NULL RETURNING id` pattern
/oauth/token already had — loser of the race gets
invalid_or_expired_token.
Za-004 LOW (membership ordering) — `.orderBy(memberships.createdAt)`
added so when org-invites eventually let a user belong to multiple
orgs, the same one wins every login instead of insertion-order
roulette. Latent-bug pre-empt.
Zb-002 LOW (OAuth register spam) — /oauth/register now per-IP daily
rate-limited at 20/day (well above any legitimate MCP-client
bootstrap pattern). Prevents DB-row spam.
Zc-001 MEDIUM (banned-pattern drift) — three separate copies of
BANNED_PATTERNS had drifted apart. The publish-time scanner in
templates.ts was MISSING the 7 new patterns added in Pass-1
(process.binding, dlopen, .constructor.constructor, vm.runIn*,
globalThis['..']). Single source of truth in @bmm/llm now exports
SHARED_BANNED_PATTERNS; templates.ts composes PUBLISH_BANNED_PATTERNS
= SHARED ∪ code-only-extras (dynamic import, fs.rm, setTimeout-with-
string, process.kill, jailbreak markers).
Zc-002 LOW (N+1) — /v1/templates list was issuing one COUNT(*) per
template (101 queries for a 100-row page). Now one grouped query
with templateId GROUP BY, merged in JS. p95 doesn't degrade with
marketplace growth.
DEFERRED (documented, scoped for next sprint):
Za-001 HIGH — Account takeover via cross-provider email lookup.
Requires schema change (users.primaryProvider). Mitigation in
/settings/account banner planned.
Zb-001 MEDIUM — /oauth/token refresh_token grant: advertised in
AS metadata but unsupported_grant_type. Either implement (~40
LOC) or strip from metadata.
Zc-003 LOW — Admin takedown partial-failure consistency.
Zd-001 IMPROVE — DEK cache invalidation across replicas (single-
instance today).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five confirmed findings from the sovereign-audit pass, ordered by severity:
Z3-001 CRITICAL — Fastify now trustProxy:true so req.ip resolves to the
real visitor IP via X-Forwarded-For instead of always being the nginx /
docker-bridge peer. Every per-IP rate-limit in the codebase was silently
collapsed into one global counter; this restores them.
Z1-001 CRITICAL — runner container hardening flags (--read-only,
--cap-drop=ALL, --security-opt=no-new-privileges:true, --pids-limit=100,
--memory=512m, --cpus=0.5, tmpfs /tmp) were sitting commented-out as a
TODO despite /security promising them. Now applied unconditionally on
production/staging; opt-out flag RUNNER_DISABLE_HARDENING=1 for Win-dev.
Z2-001 + Z2-002 CRITICAL / MEDIUM — banned-pattern blacklist tightened
(Function(...) without `new`, process.binding, process.dlopen,
.constructor.constructor, _load, vm.runIn*Context, globalThis['..'],
"system prompt override"). scanForInjection now also walks tool.name and
every inputSchema property description, not only implementation +
description — closes the prompt-injection-into-AI-client surface that
downstream clients (Claude Desktop, Cursor) read verbatim. The duplicate
BANNED_PATTERNS in apps/api/src/routes/servers.ts deleted in favour of
the single shared scanForInjection export from @bmm/llm.
Z4-001 HIGH — /v1/auth/magic-link gained the two-axis daily rate-limit
the SMS endpoint already had: 10/IP/day + 5/email/day. Combined with the
trustProxy fix above these are now real per-visitor limits.
Z4-002 MEDIUM — magic-link callback URL no longer printed to stdout in
production. In dev it still prints (so devs can click the link); in
production we log only "issued, URL withheld" and a loud error if no
email sender is wired (Resend integration is the actual launch
blocker — left as a TODO).
Z6-001 MEDIUM — /v1/builds/:id/stream WebSocket now refuses cross-origin
upgrades. SameSite=Lax already mitigates in modern browsers; this is the
defense-in-depth against browser bugs and non-browser clients.
FALSE POSITIVES dismissed: slug path-traversal (schema regex
^[a-z][a-z0-9-]*$ in @bmm/types catches it); session-after-promote
(getSession re-fetches isAdmin from DB on every request).
DEFERRED (not blockers, tracked):
- Z1-002 generated-server HTTPS — needs nginx wildcard subdomain TLS
- Z1-003 docker image cleanup cron
- Z2-001 v2 — real sandbox runtime (multi-week refactor)
- Z3-002 rawBody-per-request memory — branch on webhook path only
- Z5-001 multi-user org RBAC for billing — gated on Team feature
- Email sender integration (Resend) — launch blocker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The free tier was hemorrhaging Anthropic cost with no abuse cap (no rate
limit on /preview, Opus default in the build worker, 5-min cache TTL that
made cache-miss the common case). This switches free users to GLM, paid
users to Claude tiers, and tightens every leak found in the audit.
Backend:
- @bmm/llm: GLM provider via Zhipu's OpenAI-compatible endpoint, pickPreviewModel
+ pickBuildModel helpers, plan-aware ModelChoice
- preview-cache TTL 5min -> 24h (kills the cache-miss path)
- /v1/servers/preview: picks model from caller's plan, returns model name to UI
- /v1/servers POST: enforces SERVER_LIMITS per plan (402), rate-limits builds
- daily rate-limit on preview (5/40/150/1000) and build (3/20/100/500)
- /v1/auth/me returns plan so the wizard can show the right model name
- generator worker: GLM default, Anthropic Sonnet fallback if GLM errors
Frontend:
- Wizard fetches plan, shows "<model> is drafting the tool spec" pre-emptively,
upgrade hint for hobby users, friendly errors for 402 / 429
- Pricing page: AI-model line per tier (Open-tier / Haiku / Sonnet / Opus),
Team €149 -> €199, Enterprise €499 -> €999, daily-preview limit per tier
- Privacy + Security: explicit subprocessor disclosure for Anthropic (US) /
Zhipu (CN) and which tier uses which
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /v1/servers/preview route ran claude-opus-4-7 synchronously; full spec
generation routinely exceeded Cloudflare's ~100s proxy cap, so the browser
received a headerless 524 and reported it as a CORS failure.
- preview now uses claude-sonnet-4-6 with a 45s per-attempt timeout and one
retry — comfortably inside the proxy budget
- generateSpec maps an exhausted timeout to SpecTimeoutError; the route
returns a clean 504 (with CORS headers) instead of a stalled connection
- analyze step: live elapsed-seconds counter as freeze-proof, plus a
reduced-motion exception so the loading spinner keeps spinning (a status
indicator, which WCAG exempts from reduced-motion)
- textarea resize grip restyled to dark theme (light hatch on dark square)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>