Token Saving
Token Saving is a named plan that bundles two independent optimizations: prompt compression (reduce input tokens before the LLM call) and response caching (eliminate the LLM call entirely on repeated or near-duplicate prompts).
Both optimizations are managed centrally by admins and activated per-request by a single ID. No application logic changes.
Token Saving is about reducing compute — not buying cheaper tokens. The goal is eliminating redundant LLM calls and shrinking prompts, with the savings attributable as platform cost reduction in your spend reports.
Activation
response = client.chat.completions.create(
model="smart/balanced",
messages=[...],
extra_body={"token_saving_plan_id": "my-plan"},
)
The plan is resolved from your workspace, applied as a pre-call hook, and stripped before the request reaches the upstream provider.
To opt out on a specific request: pass cache: {"no-store": true}.
Prompt compression
Compression runs before the cache-key calculation, so compressed prompts can share cache hits across callers with different history lengths.
| Engine | Method | Use when |
|---|---|---|
trim |
Deterministic truncation (removes oldest messages to fit max_input_tokens) |
Fast, zero dependencies, predictable |
text_rank |
TextRank extractive summarisation | Medium context, semantic fidelity matters |
lex_rank |
LexRank extractive summarisation | Similar to TextRank, often better on structured text |
lsa |
LSA (Latent Semantic Analysis) summarisation | Longer documents, topic-based extraction |
Summarisation engines require sumy and nltk Python packages. Set max_input_tokens on the compression plan.
Response caching
A two-tier waterfall:
Tier 1 — Exact cache Checks Redis for an identical cache key (model + compressed messages + parameters). Cache namespace is always the plan ID — each workspace’s cache is private. Default TTL: 3600 s.
Tier 2 — Semantic cache (on exact-cache miss) Generates an embedding of the query and performs vector similarity search (default threshold: 0.85) against previously cached queries. If a semantically equivalent prior response is found, it is returned without calling the LLM.
Semantic cache backends: Redis-Stack (RediSearch vector similarity) or Qdrant. Embedding calls route back through the proxy via an internal service-account key — their cost is tracked as platform spend and never double-charged to the calling key.
Creating a plan
curl -X POST https://api.routero.ai/token-saving/plans \
-H "Authorization: Bearer $ADMIN_KEY" \
-H "Content-Type: application/json" \
-d '{
"plan_name": "my-plan",
"cache": {
"backend": "redis_semantic",
"similarity_threshold": 0.85,
"ttl": 3600
},
"compression": {
"engine": "text_rank",
"max_input_tokens": 4096
}
}'
You can also configure plans in the Routero dashboard under Token Saving → Plans.
Management API
| Endpoint | Description |
|---|---|
POST /token-saving/plans |
Create a plan |
GET /token-saving/plans |
List all plans in workspace |
GET /token-saving/plans/{id} |
Get plan details |
PATCH /token-saving/plans/{id} |
Update a plan |
DELETE /token-saving/plans/{id} |
Delete a plan |
GET /token-saving/cache-engines |
List available cache backends |
GET /token-saving/compression-engines |
List available compression engines |
Dependencies
| Feature | Required packages | Required infrastructure |
|---|---|---|
| Exact cache only | — | Redis |
| Semantic cache (Redis-Stack) | redis-stack client |
Redis-Stack (RediSearch + vector module) |
| Semantic cache (Qdrant) | qdrant-client |
Qdrant instance |
| Summarisation compression | sumy, nltk |
— |
| Trim compression | — | — |