Token Saving

Token Saving is a named plan that bundles two independent optimizations: prompt compression (reduce input tokens before the LLM call) and response caching (eliminate the LLM call entirely on repeated or near-duplicate prompts).

Both optimizations are managed centrally by admins and activated per-request by a single ID. No application logic changes.

Token Saving is about reducing compute — not buying cheaper tokens. The goal is eliminating redundant LLM calls and shrinking prompts, with the savings attributable as platform cost reduction in your spend reports.


Activation

response = client.chat.completions.create(
    model="smart/balanced",
    messages=[...],
    extra_body={"token_saving_plan_id": "my-plan"},
)

The plan is resolved from your workspace, applied as a pre-call hook, and stripped before the request reaches the upstream provider.

To opt out on a specific request: pass cache: {"no-store": true}.


Prompt compression

Compression runs before the cache-key calculation, so compressed prompts can share cache hits across callers with different history lengths.

Engine Method Use when
trim Deterministic truncation (removes oldest messages to fit max_input_tokens) Fast, zero dependencies, predictable
text_rank TextRank extractive summarisation Medium context, semantic fidelity matters
lex_rank LexRank extractive summarisation Similar to TextRank, often better on structured text
lsa LSA (Latent Semantic Analysis) summarisation Longer documents, topic-based extraction

Summarisation engines require sumy and nltk Python packages. Set max_input_tokens on the compression plan.


Response caching

A two-tier waterfall:

Tier 1 — Exact cache Checks Redis for an identical cache key (model + compressed messages + parameters). Cache namespace is always the plan ID — each workspace’s cache is private. Default TTL: 3600 s.

Tier 2 — Semantic cache (on exact-cache miss) Generates an embedding of the query and performs vector similarity search (default threshold: 0.85) against previously cached queries. If a semantically equivalent prior response is found, it is returned without calling the LLM.

Semantic cache backends: Redis-Stack (RediSearch vector similarity) or Qdrant. Embedding calls route back through the proxy via an internal service-account key — their cost is tracked as platform spend and never double-charged to the calling key.


Creating a plan

curl -X POST https://api.routero.ai/token-saving/plans \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "plan_name": "my-plan",
    "cache": {
      "backend": "redis_semantic",
      "similarity_threshold": 0.85,
      "ttl": 3600
    },
    "compression": {
      "engine": "text_rank",
      "max_input_tokens": 4096
    }
  }'

You can also configure plans in the Routero dashboard under Token Saving → Plans.


Management API

Endpoint Description
POST /token-saving/plans Create a plan
GET /token-saving/plans List all plans in workspace
GET /token-saving/plans/{id} Get plan details
PATCH /token-saving/plans/{id} Update a plan
DELETE /token-saving/plans/{id} Delete a plan
GET /token-saving/cache-engines List available cache backends
GET /token-saving/compression-engines List available compression engines

Dependencies

Feature Required packages Required infrastructure
Exact cache only Redis
Semantic cache (Redis-Stack) redis-stack client Redis-Stack (RediSearch + vector module)
Semantic cache (Qdrant) qdrant-client Qdrant instance
Summarisation compression sumy, nltk
Trim compression