Routing & Load Balancing

The Routero Router sends each request to one of the provider deployments configured for the requested model group, using a pluggable strategy. You configure model groups (a named alias → list of provider deployments); the Router picks one deployment per request based on real-time health and your chosen strategy.


Routing strategies

Strategy                  How it picks                                    Best for
simple_shuffle (default)  Random weighted selection                       Even distribution, simple setups
least_busy                Deployment with lowest in-flight request count  Throughput-limited providers
lowest_latency            Deployment with lowest recent p50 latency       Latency-sensitive applications
lowest_cost               Deployment with lowest per-token cost           Cost optimisation
lowest_tpm_rpm            Deployment furthest from its TPM/RPM limit      Rate-limit avoidance
usage_based_routing_v2    Tracks real-time usage against provider limits  High-volume, mixed rate limits
tag_based_routing         Matches request tags to deployment tags         Residency, capability routing
auto_router               Routero-managed, adapts over time               Hands-off optimisation

Set the strategy per model group in your router configuration or policy YAML.
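
To make the selection rules in the table concrete, here is a minimal Python sketch of three of the strategies. It is an illustration only, not Routero's implementation: the Deployment class, its fields, and the pick() helper are hypothetical names introduced for this example, and the weight field is an assumed input for simple_shuffle.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Deployment:
    """Hypothetical view of one provider deployment (not a Routero type)."""
    name: str
    weight: float = 1.0          # assumed input for simple_shuffle
    in_flight: int = 0           # current in-flight requests, for least_busy
    p50_latency_ms: float = 0.0  # recent median latency, for lowest_latency


def simple_shuffle(deployments: List[Deployment],
                   rng: Optional[random.Random] = None) -> Deployment:
    """Random weighted selection."""
    source = rng or random
    weights = [d.weight for d in deployments]
    return source.choices(deployments, weights=weights, k=1)[0]


def least_busy(deployments: List[Deployment],
               rng: Optional[random.Random] = None) -> Deployment:
    """Deployment with the lowest in-flight request count."""
    return min(deployments, key=lambda d: d.in_flight)


def lowest_latency(deployments: List[Deployment],
                   rng: Optional[random.Random] = None) -> Deployment:
    """Deployment with the lowest recent p50 latency."""
    return min(deployments, key=lambda d: d.p50_latency_ms)


# Strategy names match the table; the dispatch mechanism is an assumption.
STRATEGIES: Dict[str, Callable[..., Deployment]] = {
    "simple_shuffle": simple_shuffle,
    "least_busy": least_busy,
    "lowest_latency": lowest_latency,
}


def pick(strategy: str, deployments: List[Deployment],
         rng: Optional[random.Random] = None) -> Deployment:
    """Pick one deployment from a non-empty list using the named strategy."""
    if not deployments:
        raise ValueError("no deployments available")
    return STRATEGIES[strategy](deployments, rng)
```

Passing a seeded random.Random as rng makes simple_shuffle reproducible in tests; the other two strategies are deterministic given the same counters.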


Model groups

A model group maps a named alias to an ordered list of provider deployments. Example:

# The three entries below share the alias smart/balanced, so they form one model group.
model_list:
  - model_name: smart/balanced
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      mode: chat

  - model_name: smart/balanced
    litellm_params:
      model: anthropic/claude-sonnet-4-6-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: smart/balanced
    litellm_params:
      model: bedrock/anthropic.claude-sonnet-4-6-20250514-v1:0

router_settings:
  routing_strategy: least_busy
  num_retries: 3
  timeout: 30

Requests that specify model: "smart/balanced" are distributed across all three deployments. On failure, the Router automatically retries on the next available deployment.
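
The retry behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Router's actual logic: call_with_retries and the send callback are hypothetical, and the sketch assumes that num_retries means "one initial attempt plus up to num_retries further attempts", each on the next deployment in the list.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_retries(deployments: Sequence[str],
                      send: Callable[[str], T],
                      num_retries: int = 3) -> T:
    """Try deployments in turn until one succeeds or attempts run out.

    `send` is a hypothetical callback that performs the request against
    one deployment and raises an exception on failure.
    """
    if not deployments:
        raise ValueError("no deployments available")

    last_error = None
    # One initial attempt plus num_retries retries (assumed semantics).
    for attempt in range(num_retries + 1):
        # Move to the next deployment on each attempt, wrapping around.
        deployment = deployments[attempt % len(deployments)]
        try:
            return send(deployment)
        except Exception as exc:  # a real router would filter retryable errors
            last_error = exc

    raise RuntimeError(f"all {num_retries + 1} attempts failed") from last_error
```

A real implementation would also skip deployments that are currently cooled down and would only retry on errors that are safe to retry; both are left out here to keep the sketch short.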


Health monitoring

The Router tracks each deployment’s health in Redis:

  • Error rate — tracks 5xx, 429, and content-filter trip rates.
  • Cooldown — a deployment that crosses an error threshold is cooled down (removed from rotation) for a configurable period.
  • Latency percentiles — p50/p95 rolling window, used by lowest_latency strategy.
  • TPM/RPM proximity — tracks usage against declared provider limits for usage_based_routing_v2.
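
As a rough illustration of how error-rate tracking, cooldown, and a rolling latency window fit together, here is an in-memory Python sketch. It is not the Router's implementation (which keeps this state in Redis); the DeploymentHealth class, its parameters, and the default threshold values are assumptions made for the example.

```python
import time
from collections import deque
from statistics import median
from typing import Callable, Optional


class DeploymentHealth:
    """Hypothetical in-memory health tracker for a single deployment."""

    def __init__(self,
                 error_threshold: float = 0.5,   # assumed default
                 window: int = 20,               # rolling window size
                 min_samples: int = 5,           # avoid tripping on one error
                 cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.error_threshold = error_threshold
        self.min_samples = min_samples
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.outcomes = deque(maxlen=window)      # True means the call failed
        self.latencies_ms = deque(maxlen=window)  # successful calls only
        self.cooldown_until = 0.0

    def error_rate(self) -> float:
        """Fraction of failed calls in the rolling window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, latency_ms: float, is_error: bool) -> None:
        """Record one call and start a cooldown if the threshold is crossed."""
        self.outcomes.append(is_error)
        if not is_error:
            self.latencies_ms.append(latency_ms)
        enough = len(self.outcomes) >= self.min_samples
        if enough and self.error_rate() >= self.error_threshold:
            self.cooldown_until = self.clock() + self.cooldown_seconds
            self.outcomes.clear()  # start fresh after the cooldown

    def is_available(self) -> bool:
        """False while the deployment is cooled down (out of rotation)."""
        return self.clock() >= self.cooldown_until

    def p50_latency_ms(self) -> Optional[float]:
        """Median latency over the rolling window, or None with no data."""
        return median(self.latencies_ms) if self.latencies_ms else None
```

Injecting the clock makes the cooldown testable without sleeping. In a shared-state setup the same counters would live in a store that every replica reads and writes, rather than in process memory.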

Routing state

All routing state (cooldowns, usage counters, latency windows) is stored in Redis. In a multi-instance deployment, all proxy replicas share the same routing state — no cross-instance coordination required beyond the shared Redis.

→ Failover & Fallbacks for how the Router handles provider errors mid-request.
→ Policy Routing for declarative YAML rules that influence routing decisions.