For PMs approving routing architecture. How to route across models for cost, quality, and latency; when routing earns its complexity; and three worked routing patterns.
Routing matters when one model size doesn't fit all queries. Use heuristic routers for shape-based decisions (fast and predictable), classifier routers when heuristics miss subtle cases, cascade routing for confidence-based escalation. Always include a fallback chain for provider outages and instrument routing decisions for cost attribution. Most teams over-engineer routing early; start with heuristics and let observed traffic drive the decision.
Multi-model routing makes sense when one model size doesn't fit all queries. Easy queries route to a small fast model; hard queries route to a frontier model. The hybrid captures most of the cost savings of small models with most of the quality of frontier models.
For most teams, heuristic routing (rules based on query length, keywords, or source) is the right starting point. Add a classifier router only if heuristics underperform on observed traffic. Reach for cascade routing (run the cheap model first, escalate on low confidence) when quality matters but the cheap model handles most cases well.
Always include a fallback chain. Multi-provider availability beats single-provider availability. The provider you depend on will have an outage; plan for it.
Most teams over-engineer routing early. Start with heuristics, instrument heavily, evolve based on observed traffic.
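As a concrete sketch, a heuristic router of this kind can be a few lines of code. The keyword sets, length threshold, and model labels below are illustrative assumptions, not recommendations:

```python
# Minimal heuristic router: shape-based rules pick the model for each query.
# Keyword lists, the 30-word threshold, and model labels are illustrative.
FAQ_KEYWORDS = {"refund", "password", "pricing", "cancel"}
COMPLEX_KEYWORDS = {"integration", "compliance", "migration"}

def route(query: str) -> str:
    # Normalize: lowercase and strip trailing punctuation from each word.
    words = [w.strip("?.,!") for w in query.lower().split()]
    if any(w in COMPLEX_KEYWORDS for w in words):
        return "frontier-model"      # flagged as hard: send to the big model
    if len(words) <= 30 and any(w in FAQ_KEYWORDS for w in words):
        return "small-model"         # short, FAQ-shaped: cheap model
    return "frontier-model"          # default to quality when unsure

print(route("How do I reset my password?"))   # -> small-model
```

Note the default: when no rule fires, route to the stronger model. Defaulting to quality makes heuristic gaps a cost problem rather than a quality problem.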
| Your situation | Routing approach | Why |
|---|---|---|
| Single model handles workload well | No routing layer | Routing complexity isn't free; only add when needed |
| Easy queries dominate, hard queries occasional | Cascade routing (cheap first, escalate) | Captures most savings with quality safety net |
| Distinct query categories with different optimal models | Classifier router | Categorization-driven routing is the right shape |
| Budget pressure at sustained volume | Heuristic routing to smaller fine-tuned model | Capture cost savings on consistent queries |
| Latency-critical with mixed workloads | Heuristic routing by latency budget | Hard time budgets need explicit routing |
| Multi-tenant SaaS with per-tenant cost models | Cost-aware routing | Bound costs per tenant; protect overall margins |
| Provider-outage tolerance critical | Fallback chain across providers | Multi-provider beats single-provider availability |
| Routing logic changes frequently | Heuristic with feature flags | Easier to update than retraining a classifier |
| Routing logic is multi-dimensional and stable | Classifier router | Better signal-to-noise on complex decisions |
| Below 10M tokens per day | Probably no routing needed | Routing engineering overhead exceeds savings |
| Above 100M tokens per day | Routing is required | Cost compounds; can't ignore |
| New product, traffic unknown | Single model + observability | Add routing only when patterns emerge |
A SaaS company processes 50M tokens per day of customer support queries. Most queries (80%) are simple FAQ-style; the remaining 20% need deeper reasoning.
The right approach: heuristic router based on query length and keyword detection. Short queries with FAQ keywords route to fine-tuned Llama 3.1 8B (about $0.20 per million tokens). Longer queries or queries flagged with complex keywords route to Claude 3.5 Haiku (about $1 per million tokens). Combined cost: about $4,500 per month vs $25,000 per month for pure-Claude.
What worked: simple rules. Three keyword lists (FAQ patterns, complex patterns, escalation patterns) plus a length threshold. Took one engineer two days to implement.
What they nearly got wrong: starting with a learned classifier. The team initially proposed training a routing model. The heuristic version worked well enough that the classifier wasn't needed; it would have been weeks of work for marginal gains.
What to remember: start with heuristics. They're often sufficient and dramatically simpler than learned routers. Move to classifiers only if observed traffic shows the heuristics missing cases that matter.
A legal-tech product analyzes contracts. Quality matters; even occasional bad answers damage trust.
The right approach: cascade router. Fine-tuned Llama 3.1 70B handles all queries first. The model returns a confidence score; queries below 0.85 confidence escalate to GPT-4. About 12% of queries escalate. Combined cost: $8,000 per month vs $35,000 per month for pure-GPT-4.
What worked: cascade preserves quality on hard cases (where confidence is low) while capturing cost savings on confident cases. The escalation rate (12%) was within the budget for 2x cost on those queries.
What they nearly got wrong: pure heuristic routing. Without the confidence signal, the team would have either over-escalated (defeating cost savings) or under-escalated (degrading quality on edge cases). The model's own confidence is the better signal.
What to remember: cascade routing is the right pattern when the small model handles most cases well but you need a quality safety net for the rest.
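The escalate-on-low-confidence pattern can be sketched as below. The 0.85 threshold matches the case study above; the model callables are hypothetical stand-ins for real API clients:

```python
# Cascade routing sketch: try the cheap model first, escalate on low confidence.
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.85   # below this, escalate to the frontier model

def cascade(query: str,
            cheap_model: Callable[[str], Tuple[str, float]],
            frontier_model: Callable[[str], str]) -> Tuple[str, str]:
    answer, confidence = cheap_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, "cheap"                 # confident: keep cheap answer
    return frontier_model(query), "escalated"  # low confidence: pay for quality

# Stand-in models for demonstration only.
def fake_cheap(q):    return ("draft answer", 0.9 if "simple" in q else 0.6)
def fake_frontier(q): return "careful answer"

print(cascade("simple clause lookup", fake_cheap, fake_frontier))
print(cascade("novel indemnity question", fake_cheap, fake_frontier))
```

Tracking the second element of the return value ("cheap" vs "escalated") gives you the escalation rate to monitor; if it drifts above budget, the threshold or the small model needs attention.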
A high-traffic platform depends on AI for core user-facing features. A multi-hour outage at the AI provider would damage the business.
The right approach: routing layer with fallback chain. Primary: Claude 3.5 Sonnet via Anthropic API. Secondary: GPT-4o via OpenAI API. Tertiary: self-hosted Llama 3.1 70B. The routing layer attempts primary, falls back to secondary on errors or timeouts (over 5 seconds), tertiary if both fail.
What worked: when Anthropic had a 4-hour outage in Q2, the platform stayed up. Users saw slightly different response styles (the secondary model has different conventions) but the product remained functional.
What they nearly got wrong: single-provider deployment. The original architecture had no fallback; an outage would have meant 4 hours of broken core product features.
What to remember: provider availability is real risk at scale. Multi-provider with fallback is the production-grade approach. Plan for the outage that will happen, not the outage that "shouldn't."
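A minimal version of such a fallback chain might look like the sketch below. Provider names and callables are stand-ins; a production version would enforce the 5-second budget at the HTTP-client level rather than checking elapsed time after the call returns:

```python
# Fallback chain sketch: attempt providers in order, advancing on failure.
import time
from typing import Callable, List, Tuple

def call_with_fallback(query: str,
                       providers: List[Tuple[str, Callable[[str], str]]],
                       timeout_s: float = 5.0) -> Tuple[str, str]:
    last_error = None
    for name, call in providers:
        start = time.monotonic()
        try:
            answer = call(query)
            # Post-hoc check only; real clients should time out mid-request.
            if time.monotonic() - start > timeout_s:
                raise TimeoutError(f"{name} exceeded {timeout_s}s")
            return name, answer
        except Exception as exc:   # real code would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all providers failed: {last_error}")

# Stand-ins: primary is down, secondary answers.
def primary(q):   raise ConnectionError("provider outage")
def secondary(q): return "answer from secondary"
def tertiary(q):  return "answer from self-hosted model"

print(call_with_fallback("hello", [("anthropic", primary),
                                   ("openai", secondary),
                                   ("self-hosted", tertiary)]))
```

The return value includes which provider actually answered, which matters for the observability and cost-attribution points below.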
What it looks like: routing architecture proposals that lead with ML models for routing decisions.
Why it's wrong: most routing problems have heuristic solutions. Learned routers bring more complexity, training-data requirements, and operational surface, and they're slower at inference time.
How to redirect: prototype with heuristic routing. Measure. Move to classifier only if observed traffic shows heuristics missing meaningful cases.
What it looks like: avoiding routing complexity by sticking with one model and one provider.
Why it's wrong: at scale, this is fragile. Provider outages happen. One model rarely fits all queries optimally. The simplicity has hidden costs.
How to redirect: add a fallback chain even if you don't add routing logic. The two-line change (try provider A, fall back to provider B) buys substantial reliability.
What it looks like: routing decisions scattered across application services.
Why it's wrong: when routing logic changes, you have to update every service. Cost attribution becomes hard. Observability is fragmented.
How to redirect: centralize routing in a dedicated layer (a thin service or a shared library). All AI calls go through it. Routing logic, fallback chains, and cost telemetry all live in one place.
What it looks like: deferring routing as an "optimization."
Why it's wrong: routing isn't optimization, it's architecture. Retrofitting it onto a production system is harder than designing for it. Caching, monitoring, and the fallback chain all assume specific routing behavior.
How to redirect: build a thin routing layer from day one even if it just routes everything to one model. The abstraction means you can add real routing later without re-architecting.
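A day-one routing layer can be as thin as the sketch below; class and model names are illustrative. The point is the seam, not the logic: call sites depend on this interface, so real routing slots in later without touching application code.

```python
# Day-one routing layer: everything routes to one model for now,
# but routing logic, fallbacks, and telemetry have one home from the start.
class RoutingLayer:
    def __init__(self, default_model: str = "claude-3-5-sonnet"):
        self.default_model = default_model
        self.decisions = []              # cheap in-memory telemetry for now

    def choose_model(self, query: str) -> str:
        # Single rule today; swap in heuristics, a cascade, or a classifier here.
        return self.default_model

    def complete(self, query: str) -> str:
        model = self.choose_model(query)
        self.decisions.append({"model": model, "query_len": len(query)})
        return self._call(model, query)

    def _call(self, model: str, query: str) -> str:
        return f"[{model}] response"     # stand-in for a real provider call

router = RoutingLayer()
print(router.complete("summarize this contract"))   # -> [claude-3-5-sonnet] response
```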
What it looks like: routing layers without observability.
Why it's wrong: you'll need to debug routing decisions. Without observability, you're flying blind. Cost surprises will be unexplained.
How to redirect: log every routing decision with: user, query (or hash), chosen model, fallback chain attempted, latency, cost. Build dashboards on top. Routing observability is non-negotiable.
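One way to structure such a log event is sketched below; field names and the choice to hash the query (rather than log raw text) are assumptions for illustration:

```python
# Routing-decision log record: one structured event per call, carrying the
# fields listed above. Ships as JSON to whatever log pipeline you already run.
import hashlib
import json
import time

def log_routing_decision(user_id: str, query: str, chosen_model: str,
                         attempted: list, latency_ms: float,
                         cost_usd: float) -> str:
    event = {
        "ts": time.time(),
        "user": user_id,
        # Hash instead of raw text: keeps queries joinable without storing PII.
        "query_hash": hashlib.sha256(query.encode()).hexdigest()[:16],
        "model": chosen_model,
        "fallback_chain": attempted,     # every provider tried, in order
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    return json.dumps(event)

line = log_routing_decision("u123", "reset password?", "small-model",
                            ["small-model"], 142.0, 0.00003)
print(line)
```

With `cost_usd` on every event, per-user and per-model cost attribution becomes a dashboard query rather than a forensic exercise.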
Sometimes the answer is don't route:

- A single model handles the workload well.
- Volume is below roughly 10M tokens per day, so routing overhead exceeds the savings.
- The product is new and traffic patterns are still unknown.

In these cases, a single model with a fallback chain (in case of provider outage) is sufficient.
Realistic ranges for routing layer projects:
| Routing approach | Engineering setup | Ongoing maintenance | Best for |
|---|---|---|---|
| Heuristic with fallback chain | 1 to 2 weeks | Low; rule updates as needed | Most teams; clear-cut routing rules |
| Cost-aware heuristic with tenant rules | 2 to 4 weeks | Low to medium | Multi-tenant SaaS |
| Cascade (cheap first, escalate) | 2 to 3 weeks | Medium; threshold tuning | Quality-sensitive workloads |
| Classifier router | 4 to 8 weeks (incl. training) | Medium; retrain quarterly | Multi-category workloads with subtle distinctions |
| Multi-provider with full fallback | 1 to 2 weeks (added to any) | Low; mostly observability | Production-grade availability |
| Cost / latency / quality multi-dimensional | 6 to 10 weeks | Medium-high | Sophisticated workloads only |
Routing infrastructure pays back via cost savings; the engineering investment amortizes within 3 to 6 months at meaningful volume.
Routing is the right call when one model size doesn't fit all queries. Heuristic routing is the right starting point; learned routers only when observed traffic justifies the complexity.
Always include a fallback chain. Multi-provider availability beats single-provider. The provider you depend on will have an outage; plan for it.
Don't over-engineer routing early. Start simple, instrument heavily, evolve based on real traffic. The teams that route well aren't the ones with the most sophisticated routers; they're the ones with the right router for their actual workload.
Boolean & Beyond
AI Model Fine-Tuning, Deployment & Evaluation Systems · Updated 8 May 2026