Closes structural weakness #4 from the audit (single global key, no rotation,
no KMS path). Customer secrets now use envelope encryption with a real
rotation story.
Model:
KEK — Key Encryption Key, 32 bytes from env (SECRETS_ENCRYPTION_KEY). Never
stored in the DB. Root of trust.
DEK — Data Encryption Key, 32 random bytes we generate, stored in the new
encryption_keys table *wrapped* (AES-256-GCM encrypted) with the KEK.
Secrets are encrypted with the DEK.
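The two-layer model above can be sketched with Node's built-in crypto module. This is a minimal illustration, not the code in crypto.ts: the seal/open helper names and the iv | tag | ciphertext blob layout are assumptions made for the example, and the KEK here is random rather than read from SECRETS_ENCRYPTION_KEY.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical blob layout for this sketch: iv (12 bytes) | auth tag (16 bytes) | ciphertext.
const IV_LEN = 12;
const TAG_LEN = 16;

// Encrypt plaintext under a 32-byte key with AES-256-GCM.
function seal(key: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(IV_LEN);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]);
}

// Decrypt a blob produced by seal(); throws if the key is wrong or the blob was modified.
function open(key: Buffer, blob: Buffer): Buffer {
  const iv = blob.subarray(0, IV_LEN);
  const tag = blob.subarray(IV_LEN, IV_LEN + TAG_LEN);
  const ct = blob.subarray(IV_LEN + TAG_LEN);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]);
}

// Stand-in KEK for the example; the real one would come from the environment.
const kek = randomBytes(32);

// The DEK is random and only ever persisted in wrapped form.
const dek = randomBytes(32);
const wrappedDek = seal(kek, dek);

// Secrets are encrypted with the DEK, never with the KEK directly.
const secretBlob = seal(dek, Buffer.from("s3cr3t", "utf8"));

// Reading a secret: unwrap the DEK with the KEK, then decrypt the secret with the DEK.
const recovered = open(open(kek, wrappedDek), secretBlob).toString("utf8");
```

Because the KEK only ever encrypts 32-byte DEKs, rotating a DEK never requires changing the environment variable.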
Schema:
- encryption_keys (version, wrappedDek, active, rotatedBy, createdAt, retiredAt)
- secrets.keyId — which DEK encrypted this row. NULL = legacy (KEK-direct,
pre-envelope); decryptSecret handles both and the first rotation migrates
legacy rows onto a DEK.
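As a rough illustration of how the NULL-keyId convention can be read in code, here is a sketch of the row shapes and the legacy check. The column types, the idea that keyId holds the key version, and the helper names keySourceFor, countLegacy and activeKey are all assumptions for the example, not taken from the actual schema.

```typescript
// Hypothetical row shapes inferred from the column names; the real column types may differ.
interface EncryptionKeyRow {
  version: number;
  wrappedDek: string;
  active: boolean;
  rotatedBy: string | null;
  createdAt: Date;
  retiredAt: Date | null;
}

interface SecretRow {
  value: string;
  keyId: number | null; // null = legacy row, encrypted directly with the KEK
}

type KeySource = { kind: "legacy-kek" } | { kind: "dek"; version: number };

// Decide which key a row needs for decryption, based only on its keyId.
function keySourceFor(row: SecretRow): KeySource {
  return row.keyId === null
    ? { kind: "legacy-kek" }
    : { kind: "dek", version: row.keyId };
}

// Count rows still on the pre-envelope path (the "legacy count" a status view could show).
function countLegacy(rows: SecretRow[]): number {
  return rows.filter((r) => r.keyId === null).length;
}

// Assumption for this sketch: exactly one key is active at any time.
function activeKey(keys: EncryptionKeyRow[]): EncryptionKeyRow {
  const active = keys.filter((k) => k.active);
  if (active.length !== 1) {
    throw new Error(`expected exactly one active key, found ${active.length}`);
  }
  return active[0];
}
```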
crypto.ts (full rewrite):
- ensureActiveKey() — boot-time, loads keys + creates v1 if none. Fail-closed:
index.ts calls process.exit(1) if it throws, so the API never serves traffic
when encryption cannot initialize.
- encryptSecret() — encrypts with the active DEK, returns { value, keyId }.
- decryptSecret(value, keyId) — DEK path or legacy KEK-direct path.
- rotateKeys() — mints a fresh DEK, re-encrypts EVERY secret under it inside a
single transaction (decrypt-old / encrypt-new per row), retires the old key,
activates the new one. A partial failure is recoverable because every row
carries its own keyId, so rows on the old and new DEK can coexist.
- encryptionStatus() — active version, key history, secret + legacy counts.
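The rotation flow described for rotateKeys() can be approximated in memory as follows. This is a sketch under stated assumptions, not the real implementation: arrays stand in for the encryption_keys and secrets tables, "build every replacement first, then commit" stands in for the database transaction, and the seal, open and rotate names and the blob layout are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

type Key = { version: number; wrappedDek: Buffer; active: boolean; retiredAt: Date | null };
type Secret = { name: string; value: Buffer; keyId: number | null };

// AES-256-GCM helpers; blob layout (an assumption): iv (12) | tag (16) | ciphertext.
function seal(key: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]);
}

function open(key: Buffer, blob: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  return Buffer.concat([decipher.update(blob.subarray(28)), decipher.final()]);
}

function rotate(kek: Buffer, keys: Key[], secrets: Secret[]): { newVersion: number; reEncrypted: number } {
  // Unwrap every known DEK so each row can be decrypted with the key it names.
  const deks = new Map<number, Buffer>();
  for (const k of keys) deks.set(k.version, open(kek, k.wrappedDek));

  const newVersion = Math.max(0, ...keys.map((k) => k.version)) + 1;
  const newDek = randomBytes(32);

  // Decrypt-old / encrypt-new per row. Any failure throws here, before state is
  // mutated, which plays the role of a transaction rollback in this sketch.
  const updated = secrets.map((s) => {
    const oldKey = s.keyId === null ? kek : deks.get(s.keyId); // null = legacy KEK-direct row
    if (!oldKey) throw new Error(`unknown keyId ${s.keyId}`);
    return { ...s, value: seal(newDek, open(oldKey, s.value)), keyId: newVersion };
  });

  // "Commit": retire the old key, activate the new one, swap in the re-encrypted rows.
  for (const k of keys) {
    if (k.active) {
      k.active = false;
      k.retiredAt = new Date();
    }
  }
  keys.push({ version: newVersion, wrappedDek: seal(kek, newDek), active: true, retiredAt: null });
  secrets.splice(0, secrets.length, ...updated);
  return { newVersion, reEncrypted: updated.length };
}

// Usage: one secret on DEK v1 and one legacy row encrypted directly with the KEK.
const kek = randomBytes(32);
const dek1 = randomBytes(32);
const keys: Key[] = [{ version: 1, wrappedDek: seal(kek, dek1), active: true, retiredAt: null }];
const secrets: Secret[] = [
  { name: "MY_API_KEY", value: seal(dek1, Buffer.from("abc123")), keyId: 1 },
  { name: "LEGACY", value: seal(kek, Buffer.from("old")), keyId: null },
];
const result = rotate(kek, keys, secrets);
```

In this sketch the legacy row is migrated onto the new DEK in the same pass, mirroring the note that the first rotation moves legacy rows onto a DEK.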
Admin:
- GET /v1/admin/encryption — status
- POST /v1/admin/encryption/rotate — triggers rotateKeys, audit-logged as
admin.encryption.rotate with { newVersion, reEncrypted }.
- /admin/encryption page — active-key/secret/legacy cards, Rotate button with
confirm, key-history table, plain-English how-it-works. Added to admin nav.
Verified end-to-end:
- boot → encryption_keys v1 active, '[crypto] envelope encryption ready'
- created a server with secret MY_API_KEY → stored ciphertext, keyId = v1
- POST rotate → { newVersion: 2, reEncrypted: 1 }; ciphertext changed, keyId
now v2, v1 retired, v2 active. The decrypt-then-reencrypt round-trip
succeeded (rotation throws otherwise), so the secret is demonstrably recoverable.
- admin UI renders the status + history correctly.
Deferred (not built in this iteration):
- worker reads secrets from the DB instead of the plaintext copy in BullMQ job
data, which would also remove plaintext secrets from Redis. Separate change
with its own risk surface on the iterate/fork flows.
- per-server secret-value rotation UI
- audit_log hash-chaining (tamper-evidence)
- rate limiting on auth endpoints