The bill is real, the governance is fictional
Enterprise spend on commercial LLM APIs crossed $8.4 billion in 2025 and is tracking toward $15 billion by the end of 2026. Most of that spend is going through one of three patterns:
- Direct API calls from app code, with the API key in environment variables on whatever runs the request. No cost attribution, no quota enforcement, no audit log.
- A single shared key behind a thin proxy. Better hygiene than (1), but no per-tenant boundary, no per-feature budgeting, and a noisy neighbor will exhaust the rate limit at exactly the wrong moment.
- A commercial gateway like Cloudflare AI Gateway, Portkey, or Kong AI Gateway. Better than (2), and well built for direct enterprise use. Almost none are designed for an MSP that needs to resell governed AI to multiple end-clients with chargeback.
Multi-tenant MSPs need a fourth pattern: a gateway that treats each client tenant as a first-class boundary — with its own virtual keys, budgets, models, region, retention policy, audit log, and PII redaction policy — and that the MSP itself owns and governs.
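A sketch of what that first-class tenant boundary can look like as a single policy object. Field names and values here are illustrative, not a production schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    """Illustrative per-tenant boundary object; field names are assumptions."""
    tenant_id: str
    monthly_budget_usd: float
    allowed_models: frozenset
    allowed_regions: frozenset
    retention_days: int
    pii_redaction: str = "match-then-log-then-route"
    cross_provider_failover: bool = False  # tenants must opt in explicitly


acme = TenantPolicy(
    tenant_id="acme-legal",
    monthly_budget_usd=1200.0,
    allowed_models=frozenset({"claude-haiku-4-5", "gpt-4o-mini"}),
    allowed_regions=frozenset({"us-east-1"}),
    retention_days=2555,  # e.g. a seven-year regulated-industry retention
)
```

Freezing the dataclass matters: a tenant boundary that request-handling code can mutate mid-flight is not a boundary.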
What a real gateway has to do
Concretely, before any production LLM workload should run through it:
- Virtual keys per tenant, with a hard ceiling on monthly spend, model whitelist, region whitelist, and rate limit. The tenant cannot exceed their cap. The MSP cannot accidentally bill one client’s spend to another.
- Prompt caching that respects tenant boundaries. Anthropic, OpenAI, and Bedrock all expose prompt caching primitives that yield meaningful cost wins (~73% reduction on repetitive workloads per Redis benchmarks). They also have well-documented cache-key collision modes that — if you ignore them — can leak one tenant’s response into another tenant’s session.
- Audit log per request. Tenant ID, agent ID, model, input token count, output token count, completion status, cost, latency, and a redacted request body. Streamed to an immutable store with retention policies that match the tenant’s industry.
- PII redaction at the edge. Before any prompt leaves the gateway, configurable detectors for PII, PHI, payment data, and trade secrets run. Redacted text is what leaves the network. The original is logged in the tenant’s store with the same retention policy as their other regulated data.
- OWASP LLM-class boundary checks. Prompt injection signatures, jailbreak fingerprints, output-class enforcement (no SSRF URLs, no executable code in chat responses by default).
- Failover and provider routing. If Anthropic returns 529 overloaded, the gateway routes to a configured fallback — OpenAI, Bedrock, Vertex — with the tenant’s explicit consent for cross-provider routing recorded. Many tenants opt out for compliance reasons. The gateway respects that.
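The last requirement is worth making concrete. A minimal sketch of consent-gated failover, assuming a hypothetical provider chain and exception types — nothing here is a real provider SDK call:

```python
class OverloadedError(Exception):
    """Stand-in for a provider 'overloaded' response, e.g. Anthropic HTTP 529."""


class AllProvidersUnavailable(Exception):
    """Raised when every permitted provider in the chain has failed."""


PROVIDER_CHAIN = ["anthropic", "openai", "bedrock"]  # illustrative fallback order


def route_with_failover(call, cross_provider_consent: bool):
    """Try providers in order; cross a provider boundary only with recorded consent."""
    errors = {}
    for i, provider in enumerate(PROVIDER_CHAIN):
        if i > 0 and not cross_provider_consent:
            break  # tenant opted out of cross-provider routing: fail closed
        try:
            return call(provider)
        except OverloadedError as exc:
            errors[provider] = str(exc)
    raise AllProvidersUnavailable(errors)


def flaky(provider):
    """Toy provider call: the primary is overloaded, fallbacks succeed."""
    if provider == "anthropic":
        raise OverloadedError("529 overloaded")
    return f"ok via {provider}"
```

With consent recorded, `route_with_failover(flaky, True)` falls through to the next provider; without it, the request fails rather than silently egressing to a provider the tenant never approved.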
Why we built rather than bought
We evaluated the major commercial gateways in late 2025 and again in early 2026:
- Cloudflare AI Gateway. Fast, well-priced, deeply integrated with Cloudflare’s edge. No RBAC. No tenant-scoped workspaces. Audit log is event-stream level, not request level. Built for the product owner, not the MSP reselling it.
- Portkey. Strong governance primitives. Sold direct to enterprises. The tenant model is "your team", not "your reseller’s clients". The pricing structure does not contemplate channel resale.
- Kong AI Gateway. Excellent for direct enterprise governance. Operationally heavy. The MSP packaging story is a roadmap item, not a product.
- LiteLLM (open source). The control-plane primitives are right. The edge layer (caching, WAF, rate-limiting closer to the user) is missing. The audit log is good but tenant-scoping requires careful configuration.
The decision: combine LiteLLM as the control plane (per-tenant virtual keys, budgets, spend logs) with Cloudflare AI Gateway as the edge layer (prompt caching, WAF, regional egress, rate limit). One control surface, layered correctly. We named the product AiT AI Gateway.
Tenant call → Cloudflare AI Gateway (cache, WAF, rate limit) → LiteLLM control plane (virtual key, budget, model whitelist, audit log) → provider (Anthropic / OpenAI / Bedrock / Vertex). PII redaction sits in front of the control plane. Spend logs land in a tenant-scoped Postgres in Supabase. Anthropic prompt caching cuts repeated-system-prompt cost ~70% in our own internal usage.
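The flow above can be sketched end to end with stand-in functions for each layer. The cache, the redaction detector, and the provider call are all stubs here; only the ordering is the point:

```python
import hashlib

CACHE: dict = {}   # stand-in for the edge cache layer
AUDIT: list = []   # stand-in for the tenant-scoped spend/audit log


def cache_key(tenant_id: str, prompt: str) -> str:
    # The tenant ID is part of the key: no cross-tenant cache hits
    return hashlib.sha256(f"{tenant_id}\x00{prompt}".encode()).hexdigest()


def handle(request: dict, policy: dict) -> dict:
    key = cache_key(policy["tenant_id"], request["prompt"])
    cached = CACHE.get(key)
    # Audit every request, cache hit or not: the audit log is what catches leaks
    AUDIT.append({"tenant": policy["tenant_id"], "model": request["model"],
                  "cache_hit": cached is not None})
    if cached is not None:
        return cached
    # PII redaction before anything leaves the network (detector is a stub)
    redacted = request["prompt"].replace("555-0100", "[REDACTED_PHONE]")
    # Control plane: model whitelist (budget and rate-limit checks elided)
    if request["model"] not in policy["allowed_models"]:
        raise PermissionError("model not whitelisted for this tenant")
    # Provider call (stubbed)
    response = {"text": f"echo:{redacted}", "model": request["model"]}
    CACHE[key] = response
    return response
```

The design choice worth noting: the audit write happens before the cache short-circuit, so a cached response still leaves a per-request trail.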
The cache-poisoning incident that proves the case
Six months before we built our own, we ran a thin custom proxy in front of Anthropic for one internal product. Two tenants. Same system prompt prefix. Cache key derived from a hash that did not include tenant ID. A coincidental request shape produced a cache hit across tenants.
The leak was discovered within minutes by the system that already audited every request. No PII left the boundary; the leaked content was an internal prompt template. We rotated keys, fixed the cache key derivation, and added a tenant-id salt that is now part of every cache derivation. The point is not the bug. The point is that the bug class is real, the audit log is what catches it, and you cannot rely on the upstream vendor to scope caching correctly for your tenants.
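The fix reduces to one rule: the tenant ID must participate in the cache key derivation itself. A minimal sketch, with length-prefixing so concatenation ambiguity cannot produce collisions either:

```python
import hashlib


def cache_key(tenant_id: str, model: str, prompt_prefix: str) -> str:
    """Derive a cache key that cannot collide across tenants.

    The tenant ID is hashed into the key itself, so two tenants with an
    identical system-prompt prefix still get distinct cache entries.
    """
    h = hashlib.sha256()
    for part in (tenant_id, model, prompt_prefix):
        encoded = part.encode("utf-8")
        h.update(len(encoded).to_bytes(4, "big"))  # length-prefix each field
        h.update(encoded)
    return h.hexdigest()
```

Without the length prefix, naive concatenation would make `("ab", "c")` and `("a", "bc")` hash identically; with it, field boundaries survive the hash.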
Compliance from the start
The gateway sits at exactly the layer where the new regulatory wave applies most cleanly. EU AI Act Article 50 transparency obligations — fully applicable 2 August 2026 — require that AI-generated content be machine-readably marked. The gateway is the right place to insert that marker. NIST AI RMF substantial-compliance posture — the affirmative defense for TRAIGA fines — can be evidenced in large part from gateway audit logs (input scoping, output filtering, retention, incident response). Treasury’s February 2026 framework that maps NIST onto SOC 2 controls is implemented at the gateway level.
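Article 50 mandates machine-readable marking of AI-generated content but does not fix a wire format, so the header names in this sketch are assumptions rather than a standard:

```python
def mark_ai_generated(headers: dict, provider: str, model: str) -> dict:
    """Attach a machine-readable AI-generation marker at the gateway layer.

    Header names here are illustrative assumptions; Article 50 mandates the
    marking, not this particular format.
    """
    marked = dict(headers)  # never mutate the upstream response in place
    marked["X-AI-Generated"] = "true"
    marked["X-AI-Provenance"] = f"{provider}/{model}"
    return marked
```

Because every tenant's traffic already transits the gateway, the marker is applied once, uniformly, instead of per-application.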
HIPAA-tier tenants additionally get optional Private AI containers in front of the gateway: PII/PHI redaction running on a dedicated container, no logs leaving the tenant’s region. We do not enable this by default because most tenants do not need it. Those that do, get it.
What this looks like in practice
A 250-person professional services firm signs the MSP agreement. Their requirements: every employee can use Claude and ChatGPT, but no client data may leave the United States, and the legal department needs an audit trail of every prompt that mentions any of forty-three specific client names. Monthly cap: $1,200.
We provision a tenant in the gateway. Virtual keys are issued for Claude Sonnet 4.6, Claude Haiku 4.5, GPT-4o, and GPT-4o-mini. Models that don’t support US-only egress are excluded. PII redaction is set to "match-then-log-then-route". A custom rule fires on the forty-three client names: every match is logged with a tenant-readable hash and surfaced in the legal department’s daily digest. Spend hits $980 in week three, tripping the gateway’s 75% threshold alert ($900 of the $1,200 cap); legal reviews two unusual spikes (one was a developer load-testing a prompt; the other was an actual client query). Spend hits $1,200 on the 27th; calls return a structured 429 with a clear "monthly cap reached, contact MSP to raise" message.
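The budget behavior in this walkthrough, a 75% threshold alert followed by a structured 429 at the cap, can be sketched as follows. Field names in the response body are illustrative:

```python
import json


def budget_check(spent_usd: float, cap_usd: float) -> tuple:
    """Illustrative: warn at 75% of the monthly cap, hard-stop at 100%."""
    if spent_usd >= cap_usd:
        body = {
            "error": "monthly_cap_reached",
            "message": "Monthly cap reached, contact MSP to raise",
            "cap_usd": cap_usd,
            "spent_usd": round(spent_usd, 2),
        }
        return 429, json.dumps(body)  # structured refusal, not a silent drop
    if spent_usd >= 0.75 * cap_usd:
        return 200, json.dumps({"warning": "75% of monthly cap consumed"})
    return 200, "{}"
```

The 429 carries a machine-parsable body so the tenant's own tooling can surface the cap to end users instead of showing a generic provider error.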
The auditor asks for evidence in February. The Trust Portal exports the SOC 2-aligned report in two clicks.
Where this fits
AiT AI Gateway is the platform layer beneath everything else. AiTAgent, AiT SOC Sentinel, AiT CRM, AiTBMS — every product in our portfolio uses the gateway for its LLM calls. The gateway is what makes the rest of the stack governable rather than just functional. Read about the related audit posture in WP-06.