Quota-Aware LLM Gateway

Run application traffic through Console with project-scoped quotas, model routing, and request-level visibility for cost control.

Console
Console SDK

Overview

The most practical Console use case for product teams is not just calling a model. It is putting every AI request behind a controlled gateway so teams can separate traffic, enforce quotas, and see where cost goes.

This is useful when you have multiple internal apps or customer-facing modules sharing the same AI platform but operating under different budgets and reliability rules.

Architecture

Console acts as the LLM gateway and control plane. Each project can have its own API key, model catalog, quota boundaries, and tracing visibility.

Console SDK keeps the app-side integration simple while still exposing request IDs, model access, and typed responses.

1. Separate Traffic By Project And Quota

Quota and rate policies are configured in Console, while each app only receives the project-scoped key it should use.

// Example project setup in Console:
//
// Project: support-assistant
// - apiKey: cp_support_xxx
// - monthly quota: 2M input tokens
// - allowed models: gpt-4o-mini, claude-3-7-sonnet
//
// Project: sales-enablement
// - apiKey: cp_sales_xxx
// - monthly quota: 500K input tokens
// - allowed models: gpt-4o-mini

import { ConsoleClient } from '@cognipeer/console-sdk';

const client = new ConsoleClient({
  apiKey: process.env.SUPPORT_PROJECT_API_KEY!,
  baseURL: 'https://console.example.com',
});
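Behind the project key, Console enforces the quota server-side; your app never counts tokens itself. As a mental model only (the names and shapes below are illustrative assumptions, not Console's API), the admission check amounts to:

```typescript
// Illustrative model of a per-project monthly quota check. Console performs
// this enforcement server-side; this is not app code you need to write.
interface ProjectQuota {
  monthlyInputTokens: number; // e.g. 2_000_000 for support-assistant
  usedInputTokens: number;    // tokens consumed so far this month
}

// A request is admitted only if it fits within the remaining allowance.
function admits(quota: ProjectQuota, requestTokens: number): boolean {
  return quota.usedInputTokens + requestTokens <= quota.monthlyInputTokens;
}
```

Because each app holds only its own project-scoped key, one team exhausting its allowance cannot consume another project's budget.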

2. Route Requests Through A Stable Model Key

Your application uses one model key while Console decides routing, fallback, and resiliency behind the scenes.

const response = await client.chat.completions.create({
  model: 'support-primary',
  messages: [
    { role: 'system', content: 'You are the support assistant for premium customers.' },
    { role: 'user', content: 'Summarize the open tickets for account A-104.' },
  ],
  temperature: 0.2,
});

console.log(response.choices[0].message.content);
console.log('request_id:', response.request_id);
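The stable key decouples app code from provider choice. Conceptually, routing tables and health tracking live in Console; the structures below are an illustrative sketch, not Console's internals:

```typescript
// Hypothetical sketch of server-side routing: a stable model key maps to an
// ordered candidate list, and the first healthy upstream model is chosen.
const routingTable: Record<string, string[]> = {
  'support-primary': ['gpt-4o-mini', 'claude-3-7-sonnet'], // primary, then fallback
};

function resolveModel(modelKey: string, healthy: Set<string>): string | undefined {
  return (routingTable[modelKey] ?? []).find((m) => healthy.has(m));
}
```

The app keeps sending `model: 'support-primary'`; reordering fallbacks or swapping providers becomes a Console configuration change rather than a code deploy.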

3. Use Request IDs For Operational Follow-Up

When a team hits a quota wall or sees cost spikes, request-level correlation becomes operationally important.

async function askSupportAssistant(prompt: string) {
  const response = await client.chat.completions.create({
    model: 'support-primary',
    messages: [{ role: 'user', content: prompt }],
  });

  // auditLog is the application's structured logger (e.g. pino or winston).
  auditLog.info({
    requestId: response.request_id,
    prompt,
    model: 'support-primary',
  });

  return response.choices[0].message.content;
}
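When a request fails, it also helps to distinguish a hard quota stop from a transient upstream error before deciding whether to retry. The error shape and `quota_exceeded` code below are hypothetical placeholders, not Console's documented error format:

```typescript
// Hypothetical gateway error payload — check Console's actual error format.
interface GatewayError {
  status: number; // HTTP status returned by the gateway
  code?: string;  // machine-readable error code, if provided
}

// Quota exhaustion is not transient: retrying will not help until the quota
// resets or is raised, so surface it to the team instead of retrying.
function classifyFailure(err: GatewayError): 'quota_exhausted' | 'retryable' | 'fatal' {
  if (err.status === 429 && err.code === 'quota_exceeded') return 'quota_exhausted';
  if (err.status === 429 || err.status >= 500) return 'retryable';
  return 'fatal';
}
```

Logging the classification alongside the `request_id` makes quota incidents and provider outages separable in the audit trail.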

4. Add Semantic Cache Through Console

Console can also serve repeated or semantically similar prompts from cache. That is especially useful for high-volume support, catalog, and policy lookup scenarios where the same intent appears with slightly different wording.

// Example Console model configuration:
//
// Model key: support-primary
// semanticCache:
//   enabled: true
//   vectorProviderKey: qdrant-main
//   vectorIndexKey: support-semantic-cache
//   embeddingModelKey: text-embedding-3-small
//   similarityThreshold: 0.93
//   ttlSeconds: 86400

const response = await client.chat.completions.create({
  model: 'support-primary',
  messages: [{ role: 'user', content: 'How do I reset my workspace password?' }],
});

console.log(response.request_id);
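The `similarityThreshold` above controls how close a new prompt's embedding must be to a cached one to count as a hit. Assuming cosine similarity (a common choice for this comparison; Console's actual metric may differ), the decision reduces to:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A cached answer is served only when the query embedding is at least as
// similar as the configured threshold (0.93 in the example config above).
function isCacheHit(query: number[], cached: number[], threshold = 0.93): boolean {
  return cosineSimilarity(query, cached) >= threshold;
}
```

Raising the threshold toward 1.0 trades hit rate for precision: fewer cached answers are served, but each one is a closer semantic match to the original prompt.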

Result

You get a gateway pattern that:

- Separates AI traffic by project and budget
- Controls which models each team can use
- Applies routing, fallback, and semantic caching without app-side complexity
- Improves cost and quota investigations through request-level visibility