The llm package called the user-supplied onSpec/onError handlers
without awaiting them. In the /preview/stream route onSpec is async
(it does `await cacheSpec(...)` then writes the SSE `spec` event), so
the api handler's `await streamSpecFromAnthropic(...)` returned BEFORE
the terminal event had been written. The route's finally block then
ran `reply.raw.end()`, the queued `send('spec', ...)` hit a closed
stream and silently no-op'd, and the browser saw zero terminal
events. The frontend then fell through to the "Spec generation failed."
fallback even though Anthropic had delivered a perfectly valid spec.
Verified against prod log: req-8 ran 66s with 200 and produced no
preview_spec_* log line, which is exactly the success-but-event-lost
signature.
Fix:
- StreamHandlers.onSpec / onError typed as Promise<void> | void
- Both call sites in streamSpecFromAnthropic now `await` them
- /preview/stream sets `resolved = true` at the END of each handler
(after the SSE write completes) so the post-stream "unresolved"
fallback only fires on a genuine programming bug
- Added preview_spec_ready info log on the happy path so future
diagnosis doesn't have to infer success from the absence of error
logs
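A minimal sketch of the ordering this fix restores. The types and the
runRoute / streamSpec helpers are simplified stand-ins invented for
illustration, not the real @bmm/llm or route code; the point is only that
awaiting the handler makes the terminal event land before the stream ends.

```typescript
// Hypothetical, simplified shapes; the real StreamHandlers has more fields.
type Spec = { tools: string[] };

interface StreamHandlers {
  onSpec: (spec: Spec) => Promise<void> | void;
  onError: (err: Error) => Promise<void> | void;
}

// Stand-in for streamSpecFromAnthropic: both handler calls are awaited.
async function streamSpec(produce: () => Spec, handlers: StreamHandlers): Promise<void> {
  let spec: Spec;
  try {
    spec = produce();
  } catch (err) {
    await handlers.onError(err instanceof Error ? err : new Error(String(err)));
    return;
  }
  await handlers.onSpec(spec);
}

// Stand-in for the route: records the order of side effects.
async function runRoute(produce: () => Spec): Promise<string[]> {
  const events: string[] = [];
  let resolved = false;
  try {
    await streamSpec(produce, {
      onSpec: async () => {
        await Promise.resolve(); // stands in for `await cacheSpec(...)`
        events.push("spec");     // stands in for the SSE `spec` write
        resolved = true;         // set last, after the write
      },
      onError: () => {
        events.push("error");
        resolved = true;
      },
    });
    if (!resolved) events.push("unresolved"); // only on a genuine bug
  } finally {
    events.push("end"); // stands in for reply.raw.end()
  }
  return events;
}
```

Dropping the `await` inside streamSpec reproduces the bug in this model:
"end" is recorded before "spec".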
Auth chain finally landed but tool calls crashed in the wetter server
with "Error: params is not defined". The MCP SDK passes the validated
tool args as a single parameter; our template names that parameter
`args` but the model frequently writes `params.location` / `input.x`
because that's how OpenAPI and JSON-RPC reference docs read.
Two-sided fix:
- render.ts wraps every implementation with
`const params = args; const input = args;` inside the try block.
Whichever alias the model picked, the variable resolves to the same
validated object.
- SYSTEM_PROMPT now states the variable name EXPLICITLY ("variable
named EXACTLY `args`, e.g. args.location") so new generations stop
drifting on that detail.
Existing wetter runner needs a rebuild to pick up the alias shim.
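A rough sketch of the alias shim, assuming the template wraps the
model-written body in a try block. wrapImplementation and compile are
hypothetical names, and `new Function` is used here only to make the
sketch executable; it is not a claim about how render.ts emits code.

```typescript
// Hypothetical sketch of the alias shim; the real render.ts template differs.
function wrapImplementation(body: string): string {
  return [
    "try {",
    "  const params = args; const input = args;", // aliases for the same object
    `  ${body}`,
    "} catch (err) {",
    "  return { isError: true, message: String(err) };",
    "}",
  ].join("\n");
}

// Compile the wrapped body into a handler taking the single `args` parameter.
function compile(body: string): (args: Record<string, unknown>) => unknown {
  return new Function("args", wrapImplementation(body)) as (
    args: Record<string, unknown>,
  ) => unknown;
}

const args = { location: "Berlin" };
const viaArgs = compile("return args.location;")(args);
const viaParams = compile("return params.location;")(args);
const viaInput = compile("return input.location;")(args);
```

All three bodies read from the same validated object, so a generation
that drifted to `params` or `input` no longer throws a ReferenceError.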
Architectural fix for "spec_too_large" / preview_timeout — the sync
endpoint had to fit the whole model run into Cloudflare's ~100s edge
window, which made the system fragile against any prompt that produced
a verbose spec. The new streaming path pipes Anthropic's token deltas
as Server-Sent Events; every chunk resets CF's idle timer and a 15s
keepalive comment guarantees activity even during slow first-token
windows.
@bmm/llm: new streamSpecFromAnthropic() exposes the SDK's .stream()
flow with the same typed-error contract as generateSpec: the same
SpecTruncatedError / SpecValidationError / SpecTimeoutError are raised
at the corresponding points of the run.
API: POST /v1/servers/preview/stream returns text/event-stream with
events 'text' (deltas), 'spec' (final success payload, same shape as
the sync endpoint), 'error' (typed). Anthropic-only — GLM/hobby falls
back to the sync route via 409 streaming_unavailable.
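For reference, a sketch of how the three events and the keepalive could
be framed on the wire. sseEvent and sseKeepalive are illustrative helper
names and the comment text "keepalive" is an assumption; only the event
names come from this change.

```typescript
// Frame one named SSE event. JSON.stringify never emits raw newlines,
// so the payload always fits on a single `data:` line.
function sseEvent(event: "text" | "spec" | "error", data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Lines starting with ":" are SSE comments: clients ignore them, but they
// still count as traffic for an idle timer in front of the origin.
function sseKeepalive(): string {
  return ": keepalive\n\n";
}

const frames = [
  sseEvent("text", { delta: '{"tools":[' }),
  sseKeepalive(),
  sseEvent("spec", { tools: [] }),
].join("");
```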
Frontend: apiSseStream() handles the POST + ReadableStream + SSE
parser. The wizard's analyze() prefers the stream and only uses the
sync endpoint on the explicit 409 fallback.
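The parsing half of apiSseStream() can be sketched as an incremental
parser fed with decoded chunks. createSseParser is a hypothetical name,
and this version assumes "\n" line endings and ignores the `id:` and
`retry:` fields, which is narrower than the full SSE format.

```typescript
type SseEvent = { event: string; data: string };

// Feed decoded text chunks in; get back every event completed so far.
function createSseParser(): (chunk: string) => SseEvent[] {
  let buffer = "";
  return (chunk: string): SseEvent[] => {
    buffer += chunk;
    const events: SseEvent[] = [];
    let sep = buffer.indexOf("\n\n");
    while (sep !== -1) {
      const block = buffer.slice(0, sep);
      buffer = buffer.slice(sep + 2);
      let event = "message"; // default event name when none is given
      const data: string[] = [];
      for (const line of block.split("\n")) {
        if (line.startsWith(":")) continue; // comment, e.g. a keepalive
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
      }
      if (data.length > 0) events.push({ event, data: data.join("\n") });
      sep = buffer.indexOf("\n\n");
    }
    return events;
  };
}
```

Chunks may split an event anywhere, which is why the parser keeps a
buffer and only emits once it has seen the blank line that ends a block.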
nginx (api.buildmymcpserver.com): the /v1/builds/ location block (which
already had proxy_buffering off + 600s read timeout for the WS build
stream) now also matches /v1/servers/preview/stream so the SSE
response isn't buffered.
Sonnet 4.6 was still hitting max_tokens on ambitious prompts like
"WorldWeather MCP for any location" because the implementation bodies
ballooned with defensive scaffolding. Two changes:
1. SYSTEM_PROMPT now imposes hard limits the model can self-enforce:
- at most 6 tools (combine related capabilities with a mode param)
- implementation body <= 40 lines, no comments, no overengineering
- descriptions <= 100 chars
These keep a typical preview under ~7k output tokens.
2. team/enterprise maxTokens 8192 -> 12288. At ~130 tok/s that fits in
~94s, still under Cloudflare's 100s edge cap. Hobby (GLM) and pro
(Haiku) keep their existing limits — they were not hitting the
ceiling.
SpecTruncatedError still fires + surfaces 422 spec_too_large when even
12288 isn't enough, so the user gets actionable feedback instead of an
opaque zod error.
Root cause of repeat 422s: 4096 was too tight for ambitious prompts
(Marco's research-assistant prompt produces ~12kB of JSON before the
model gets cut off mid-string). The error then surfaced as an opaque
"Unterminated string in JSON" zod failure instead of pointing the user
at the real problem.
Two fixes:
- maxTokens back to 8192 (the original) for all Claude tiers, 4096 for
GLM. Timeouts bumped to 95s — Sonnet 4.6 at ~130 tok/s does 8192 in
~63s, ~30s headroom for cold starts, still under Cloudflare's 100s
edge cap.
- Detect stop_reason === 'max_tokens' on the Anthropic response BEFORE
parsing and throw the new SpecTruncatedError. /preview catches it
and returns 422 spec_too_large with a clear "split the prompt"
message instead of leaking the zod parse failure.
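A sketch of the check-before-parse order described above. The
ModelResponse shape, parseSpec, toHttp, and the 500 fallback are
simplified assumptions for illustration; only stop_reason ===
'max_tokens', SpecTruncatedError, and 422 spec_too_large come from this
change.

```typescript
class SpecTruncatedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SpecTruncatedError";
    Object.setPrototypeOf(this, SpecTruncatedError.prototype);
  }
}

// Hypothetical, flattened view of the model response.
interface ModelResponse {
  stop_reason: string | null;
  text: string;
}

function parseSpec(res: ModelResponse): unknown {
  // Check truncation BEFORE parsing: a cut-off body is never valid JSON.
  if (res.stop_reason === "max_tokens") {
    throw new SpecTruncatedError("model output hit max_tokens");
  }
  return JSON.parse(res.text);
}

// Hypothetical route-side mapping from error type to HTTP response.
function toHttp(err: unknown): { status: number; code: string } {
  if (err instanceof SpecTruncatedError) return { status: 422, code: "spec_too_large" };
  return { status: 500, code: "internal_error" };
}
```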
422s from /preview hid the actual reason. The api log now records
zod_message, which names the field that was wrong, and a 400-char
preview of the model output, which reveals refusals or non-JSON
returns. Both stay in the api log only and are never surfaced to the
client unchanged.
Enterprise plan was hitting SpecTimeoutError exactly at 60s because the
Sonnet 4.6 preview was budgeted for 8192 tokens at ~80 tok/s (≈102s
worst case) inside a 60s window. The frontend then rolled back to step
1 with no spec.
A real spec is small (<= ~10 tools, ~1.5–2.5k output tokens in practice)
so 4096 is plenty and lets even Sonnet finish in ~51s worst case. The
90s timeout buys headroom for cold starts while staying under
Cloudflare's 100s edge cap. Hobby/GLM bumped to 90s too — same
headroom argument.
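The budget arithmetic above, written out. worstCaseSeconds and
fitsBudget are illustrative helpers; the ~80 tok/s rate is the estimate
quoted in this message, not a measured constant.

```typescript
// Worst case: the model emits every allowed token at the given rate.
function worstCaseSeconds(maxTokens: number, tokensPerSecond: number): number {
  return maxTokens / tokensPerSecond;
}

function fitsBudget(seconds: number, timeoutSeconds: number): boolean {
  return seconds <= timeoutSeconds;
}

const before = worstCaseSeconds(8192, 80); // 102.4 s, over the old 60 s window
const after = worstCaseSeconds(4096, 80);  // 51.2 s, inside the new 90 s window

const oldConfigFits = fitsBudget(before, 60); // false
const newConfigFits = fitsBudget(after, 90);  // true
```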
Six confirmed findings closed (3 MEDIUM, 3 LOW). Tier-1 surfaces from
Pass-1 re-verified non-regressed; this pass deepened the audit on the
auth library, OAuth issuer, and template marketplace.
Za-002 MEDIUM (scrypt cost) — bump SCRYPT_N from 2^14 → 2^17 (131072)
matching current OWASP guidance for password hashing in 2026. Hash
format embeds N (`scrypt$N$salt$hash`), so the existing admin
password at the old cost still verifies — backward-compatible. Also
added explicit maxmem ceilings since Node's default (~32MiB) is
insufficient for the new N.
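Why the maxmem ceiling matters, as arithmetic: scrypt needs roughly
128 * N * r bytes, and Node rejects parameters whose estimate exceeds
maxmem. The sketch assumes r = 8 (Node's default block size) and treats
salt and hash as opaque strings; parseStoredHash is a hypothetical
helper, not the auth library's parser.

```typescript
const MiB = 1024 * 1024;

// Approximate scrypt working-set size in bytes.
function scryptMemoryBytes(N: number, r: number): number {
  return 128 * N * r;
}

interface ParsedHash {
  N: number;
  salt: string;
  hash: string;
}

// Parse `scrypt$N$salt$hash`. Because N travels with the hash, a hash
// written at the old cost can still be verified at that cost.
function parseStoredHash(stored: string): ParsedHash {
  const parts = stored.split("$");
  const [scheme, cost, salt, hash] = parts;
  if (parts.length !== 4 || scheme !== "scrypt" || !cost || !salt || !hash) {
    throw new Error("unrecognised hash format");
  }
  const N = Number(cost);
  if (!Number.isInteger(N) || N < 2) throw new Error("invalid scrypt cost");
  return { N, salt, hash };
}

const oldCost = scryptMemoryBytes(1 << 14, 8) / MiB; // 16 MiB, under the 32 MiB default
const newCost = scryptMemoryBytes(1 << 17, 8) / MiB; // 128 MiB, over the 32 MiB default
```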
Za-003 MEDIUM (single-use race) — consumeMagicLink was SELECT-then-
UPDATE; two parallel redemptions could both win and mint two
sessions from the same token. Now uses the same atomic
`UPDATE … WHERE id = ? AND consumedAt IS NULL RETURNING id` pattern
/oauth/token already had — loser of the race gets
invalid_or_expired_token.
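The race can be modelled without a database. FakeLinkTable below is an
in-memory stand-in invented for this sketch: consumeIfUnused plays the
role of the single conditional UPDATE with RETURNING, while
select + markConsumed plays the old two-step path.

```typescript
type LinkRow = { id: string; consumedAt: number | null };

class FakeLinkTable {
  private rows: Record<string, LinkRow> = {};

  insert(id: string): void {
    this.rows[id] = { id, consumedAt: null };
  }

  async select(id: string): Promise<LinkRow | undefined> {
    await Promise.resolve(); // simulated round trip
    const row = this.rows[id];
    return row ? { ...row } : undefined;
  }

  async markConsumed(id: string): Promise<void> {
    await Promise.resolve();
    const row = this.rows[id];
    if (row) row.consumedAt = Date.now();
  }

  // Check and write happen in one step, like the conditional UPDATE.
  async consumeIfUnused(id: string): Promise<string | null> {
    await Promise.resolve();
    const row = this.rows[id];
    if (!row || row.consumedAt !== null) return null;
    row.consumedAt = Date.now();
    return row.id;
  }
}

type Redeem = (table: FakeLinkTable, id: string) => Promise<boolean>;

const redeemRacy: Redeem = async (table, id) => {
  const row = await table.select(id);
  if (!row || row.consumedAt !== null) return false;
  await table.markConsumed(id); // both callers can reach this line
  return true;
};

const redeemAtomic: Redeem = async (table, id) =>
  (await table.consumeIfUnused(id)) !== null;

// Fire two redemptions of the same token in parallel and count winners.
async function countWinners(redeem: Redeem): Promise<number> {
  const table = new FakeLinkTable();
  table.insert("tok");
  const results = await Promise.all([redeem(table, "tok"), redeem(table, "tok")]);
  return results.filter(Boolean).length;
}
```

In this model the two-step path lets both callers win, while the
single-step path yields exactly one winner.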
Za-004 LOW (membership ordering) — `.orderBy(memberships.createdAt)`
added so when org-invites eventually let a user belong to multiple
orgs, the same one wins every login instead of insertion-order
roulette. Latent-bug pre-empt.
Zb-002 LOW (OAuth register spam) — /oauth/register now per-IP daily
rate-limited at 20/day (well above any legitimate MCP-client
bootstrap pattern). Prevents DB-row spam.
Zc-001 MEDIUM (banned-pattern drift) — three separate copies of
BANNED_PATTERNS had drifted apart. The publish-time scanner in
templates.ts was MISSING the 7 new patterns added in Pass-1
(process.binding, dlopen, .constructor.constructor, vm.runIn*,
globalThis['..']). Single source of truth in @bmm/llm now exports
SHARED_BANNED_PATTERNS; templates.ts composes PUBLISH_BANNED_PATTERNS
= SHARED ∪ code-only-extras (dynamic import, fs.rm, setTimeout-with-
string, process.kill, jailbreak markers).
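How the composition might look, heavily simplified. The regexes are
placeholders written for this sketch, and CODE_ONLY_EXTRAS and findBanned
are hypothetical names; the real lists in @bmm/llm and templates.ts are
longer and may be shaped differently.

```typescript
// Placeholder patterns only; not the project's actual expressions.
const SHARED_BANNED_PATTERNS: readonly RegExp[] = [
  /process\.binding/,
  /\.constructor\.constructor/,
];

const CODE_ONLY_EXTRAS: readonly RegExp[] = [
  /process\.kill/,
  /\bfs\.rm\b/,
];

// The publish-time list is derived from the shared one, so a pattern
// added to SHARED can no longer be forgotten at publish time.
const PUBLISH_BANNED_PATTERNS: readonly RegExp[] = [
  ...SHARED_BANNED_PATTERNS,
  ...CODE_ONLY_EXTRAS,
];

function findBanned(source: string, patterns: readonly RegExp[]): string[] {
  return patterns.filter((p) => p.test(source)).map((p) => p.source);
}
```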
Zc-002 LOW (N+1) — /v1/templates list was issuing one COUNT(*) per
template (101 queries for a 100-row page). Now one grouped query
with templateId GROUP BY, merged in JS. p95 doesn't degrade with
marketplace growth.
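The JS-side merge can be sketched as below. The row shapes and
mergeCounts are assumptions for illustration; what is being counted per
template is not spelled out here, so the field is just called `count`.

```typescript
interface Template {
  id: string;
  name: string;
}

// One row per templateId, as a grouped COUNT(*) query would return.
interface CountRow {
  templateId: string;
  count: number;
}

function mergeCounts(
  templates: Template[],
  rows: CountRow[],
): Array<Template & { count: number }> {
  const byId: Record<string, number> = {};
  for (const row of rows) byId[row.templateId] = row.count;
  // GROUP BY omits templates with zero matching rows, so default to 0.
  return templates.map((t) => ({ ...t, count: byId[t.id] ?? 0 }));
}

const merged = mergeCounts(
  [
    { id: "t1", name: "weather" },
    { id: "t2", name: "research" },
  ],
  [{ templateId: "t1", count: 7 }],
);
```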
DEFERRED (documented, scoped for next sprint):
Za-001 HIGH — Account takeover via cross-provider email lookup.
Requires schema change (users.primaryProvider). Mitigation in
/settings/account banner planned.
Zb-001 MEDIUM — /oauth/token refresh_token grant: advertised in
AS metadata but rejected with unsupported_grant_type. Either
implement (~40 LOC) or strip from metadata.
Zc-003 LOW — Admin takedown partial-failure consistency.
Zd-001 IMPROVE — DEK cache invalidation across replicas (single-
instance today).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven confirmed findings from the sovereign-audit pass, ordered by severity:
Z3-001 CRITICAL — Fastify now trustProxy:true so req.ip resolves to the
real visitor IP via X-Forwarded-For instead of always being the nginx /
docker-bridge peer. Every per-IP rate-limit in the codebase was silently
collapsed into one global counter; this restores them.
Z1-001 CRITICAL — runner container hardening flags (--read-only,
--cap-drop=ALL, --security-opt=no-new-privileges:true, --pids-limit=100,
--memory=512m, --cpus=0.5, tmpfs /tmp) were sitting commented-out as a
TODO despite /security promising them. Now applied unconditionally on
production/staging; RUNNER_DISABLE_HARDENING=1 opts out on Windows dev machines.
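A sketch of how the flag selection could be expressed. It assumes the
environment is read from NODE_ENV and that the opt-out is ignored on
production/staging (one reading of "unconditionally"); both are
assumptions, as is the runnerHardeningFlags name. The flags themselves
are the ones listed above.

```typescript
interface RunnerEnv {
  NODE_ENV?: string;
  RUNNER_DISABLE_HARDENING?: string;
}

function runnerHardeningFlags(env: RunnerEnv): string[] {
  const enforced = env.NODE_ENV === "production" || env.NODE_ENV === "staging";
  // The opt-out is only honoured outside production/staging (assumption).
  if (!enforced && env.RUNNER_DISABLE_HARDENING === "1") return [];
  return [
    "--read-only",
    "--cap-drop=ALL",
    "--security-opt=no-new-privileges:true",
    "--pids-limit=100",
    "--memory=512m",
    "--cpus=0.5",
    "--tmpfs",
    "/tmp", // writable scratch space despite the read-only root filesystem
  ];
}
```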
Z2-001 + Z2-002 CRITICAL / MEDIUM — banned-pattern blacklist tightened
(Function(...) without `new`, process.binding, process.dlopen,
.constructor.constructor, _load, vm.runIn*Context, globalThis['..'],
"system prompt override"). scanForInjection now also walks tool.name and
every inputSchema property description, not only implementation +
description — closes the prompt-injection-into-AI-client surface that
downstream clients (Claude Desktop, Cursor) read verbatim. The duplicate
BANNED_PATTERNS in apps/api/src/routes/servers.ts deleted in favour of
the single shared scanForInjection export from @bmm/llm.
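A simplified sketch of the wider walk. The ToolSpec shape, the single
placeholder pattern, and the return value (names of flagged tools) are
assumptions for illustration; the real scanForInjection export may
differ in all three.

```typescript
interface ToolSpec {
  name: string;
  description: string;
  implementation: string;
  inputSchema: { properties?: Record<string, { description?: string }> };
}

// Every string a downstream AI client may read verbatim, plus the code.
function collectScannableText(tool: ToolSpec): string[] {
  const texts = [tool.name, tool.description, tool.implementation];
  for (const prop of Object.values(tool.inputSchema.properties ?? {})) {
    if (prop.description) texts.push(prop.description);
  }
  return texts;
}

// Returns the names of tools with at least one hit.
function scanForInjection(tools: ToolSpec[], patterns: readonly RegExp[]): string[] {
  return tools
    .filter((tool) =>
      collectScannableText(tool).some((text) => patterns.some((p) => p.test(text))),
    )
    .map((tool) => tool.name);
}

// Placeholder pattern for the "system prompt override" marker.
const EXAMPLE_PATTERNS: readonly RegExp[] = [/system prompt override/i];
```

A tool whose code and description are clean but whose parameter
description carries an injection string is caught by this walk, which is
the gap the change closes.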
Z4-001 HIGH — /v1/auth/magic-link gained the two-axis daily rate-limit
the SMS endpoint already had: 10/IP/day + 5/email/day. Combined with the
trustProxy fix above these are now real per-visitor limits.
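A two-axis limiter can be sketched with two counters that must both have
room. DailyCounter and allowMagicLink are hypothetical names, the store
is in-memory, and counting only allowed requests is a design choice of
this sketch rather than something the change specifies.

```typescript
class DailyCounter {
  private counts: Record<string, number> = {};

  constructor(private readonly limit: number) {}

  wouldExceed(key: string, day: string): boolean {
    return (this.counts[`${day}|${key}`] ?? 0) >= this.limit;
  }

  record(key: string, day: string): void {
    const k = `${day}|${key}`;
    this.counts[k] = (this.counts[k] ?? 0) + 1;
  }
}

// Allow the request only if BOTH the IP axis and the email axis have room.
function allowMagicLink(
  ipLimiter: DailyCounter,
  emailLimiter: DailyCounter,
  ip: string,
  email: string,
  day: string,
): boolean {
  if (ipLimiter.wouldExceed(ip, day) || emailLimiter.wouldExceed(email, day)) {
    return false;
  }
  ipLimiter.record(ip, day);
  emailLimiter.record(email, day);
  return true;
}

// Limits from this change: 10 per IP per day, 5 per email per day.
const ipLimiter = new DailyCounter(10);
const emailLimiter = new DailyCounter(5);
```

The email axis stops one address being targeted from many IPs; the IP
axis stops one visitor spraying many addresses. The IP axis only means
something once req.ip is the real client address.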
Z4-002 MEDIUM — magic-link callback URL no longer printed to stdout in
production. In dev it still prints (so devs can click the link); in
production we log only "issued, URL withheld" and a loud error if no
email sender is wired (Resend integration is the actual launch
blocker — left as a TODO).
Z6-001 MEDIUM — /v1/builds/:id/stream WebSocket now refuses cross-origin
upgrades. SameSite=Lax already mitigates in modern browsers; this is the
defense-in-depth against browser bugs and non-browser clients.
FALSE POSITIVES dismissed: slug path-traversal (schema regex
^[a-z][a-z0-9-]*$ in @bmm/types catches it); session-after-promote
(getSession re-fetches isAdmin from DB on every request).
DEFERRED (tracked; not blockers for this pass unless marked):
- Z1-002 generated-server HTTPS — needs nginx wildcard subdomain TLS
- Z1-003 docker image cleanup cron
- Z2-001 v2 — real sandbox runtime (multi-week refactor)
- Z3-002 rawBody-per-request memory — branch on webhook path only
- Z5-001 multi-user org RBAC for billing — gated on Team feature
- Email sender integration (Resend) — launch blocker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The free tier was hemorrhaging Anthropic cost with no abuse cap (no rate
limit on /preview, Opus default in the build worker, 5-min cache TTL that
made cache-miss the common case). This switches free users to GLM, paid
users to Claude tiers, and tightens every leak found in the audit.
Backend:
- @bmm/llm: GLM provider via Zhipu's OpenAI-compatible endpoint, pickPreviewModel
+ pickBuildModel helpers, plan-aware ModelChoice
- preview-cache TTL 5min -> 24h (kills the cache-miss path)
- /v1/servers/preview: picks model from caller's plan, returns model name to UI
- /v1/servers POST: enforces SERVER_LIMITS per plan (402), rate-limits builds
- daily rate-limit on preview (5/40/150/1000) and build (3/20/100/500)
- /v1/auth/me returns plan so the wizard can show the right model name
- generator worker: GLM default, Anthropic Sonnet fallback if GLM errors
Frontend:
- Wizard fetches plan, shows "<model> is drafting the tool spec" pre-emptively,
upgrade hint for hobby users, friendly errors for 402 / 429
- Pricing page: AI-model line per tier (Open-tier / Haiku / Sonnet / Opus),
Team €149 -> €199, Enterprise €499 -> €999, daily-preview limit per tier
- Privacy + Security: explicit subprocessor disclosure for Anthropic (US) /
Zhipu (CN) and which tier uses which
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /v1/servers/preview route ran claude-opus-4-7 synchronously; full spec
generation routinely exceeded Cloudflare's ~100s proxy cap, so the browser
received a headerless 524 and reported it as a CORS failure.
- preview now uses claude-sonnet-4-6 with a 45s per-attempt timeout and one
retry — comfortably inside the proxy budget
- generateSpec maps an exhausted timeout to SpecTimeoutError; the route
returns a clean 504 (with CORS headers) instead of a stalled connection
- analyze step: live elapsed-seconds counter as proof the page has not
frozen, plus a reduced-motion exception so the loading spinner keeps
spinning (a status indicator, which WCAG exempts from reduced-motion)
- textarea resize grip restyled to dark theme (light hatch on dark square)
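A sketch of the per-attempt timeout with one retry. withTimeout,
generateWithRetry, and AttemptTimedOut are illustrative names, retrying
only on timeouts is an assumption, and this version does not cancel the
underlying request the way a real client would.

```typescript
class AttemptTimedOut extends Error {
  constructor() {
    super("attempt timed out");
    Object.setPrototypeOf(this, AttemptTimedOut.prototype);
  }
}

class SpecTimeoutError extends Error {
  constructor(attempts: number) {
    super(`spec generation timed out after ${attempts} attempts`);
    this.name = "SpecTimeoutError";
    Object.setPrototypeOf(this, SpecTimeoutError.prototype);
  }
}

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new AttemptTimedOut()), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Two attempts of 45 s each total 90 s, inside the ~100 s proxy cap.
async function generateWithRetry<T>(
  attempt: () => Promise<T>,
  timeoutMs = 45000,
  maxAttempts = 2,
): Promise<T> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await withTimeout(attempt(), timeoutMs);
    } catch (err) {
      if (!(err instanceof AttemptTimedOut)) throw err; // only timeouts retry
    }
  }
  throw new SpecTimeoutError(maxAttempts);
}
```

The route can then map SpecTimeoutError to a 504 while the connection is
still healthy, instead of waiting for the proxy to cut it.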
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>