For PMs deciding inference architecture for AI features. When real-time wins, when streaming makes sense, when batch saves 5x. The cost and UX implications of each.
Real-time when latency is part of the product (sub-1-second budget). Streaming for long generations where time-to-first-token matters more than total time. Batch for analytics, content generation, and high-volume processing where latency is irrelevant. Most production systems use all three. Common mistake: defaulting to real-time everywhere when 30% of workload could be batch at 5x lower cost.
Inference falls into three latency regimes, each with different cost and UX implications:

- Real-time: the user is waiting on the full response, and latency is part of the product (sub-1-second budget).
- Streaming: tokens are delivered as they are generated; time-to-first-token matters more than total time.
- Batch: no one is waiting; cost and throughput matter, latency is irrelevant.
Most production systems use all three for different parts of the workflow. The single most common mistake we see: defaulting to real-time everywhere when 30% of workload could be batch at much lower cost. Don't fight latency requirements you don't actually have.
| Your situation | Inference mode | Why |
|---|---|---|
| Interactive chat, search, voice | Streaming with real-time fallback | Time-to-first-token is the user experience |
| Live recommendations, real-time decisions | Real-time | Latency is part of the product |
| Sub-200ms latency requirement | Real-time on dedicated hardware | Network and queueing become the constraint |
| Long-form generation (analyses, reports) | Streaming | Total time is long; perceived progress matters |
| Analytics or business intelligence | Batch | Latency is irrelevant; cost matters |
| Bulk content generation (catalogs, descriptions) | Batch | High volume, low urgency |
| Embeddings generation for indexing | Batch | Cost dominates; real-time wastes money |
| Content moderation at scale | Hybrid: real-time for live posts, batch for backfill | Latency for live, cost for reprocessing |
| Email or notification generation | Near-real-time queue (seconds OK) | Some latency tolerable; reduces cost |
| Periodic data enrichment | Batch (nightly or hourly) | Predictable; cheapest option |
| User-triggered exports or reports | Async with notification | Don't block UI; async UX is fine |
| Below 10M tokens per day | Real-time often acceptable | Cost difference negligible at this scale |
| Above 100M tokens per day | Batch where possible | Cost compounds; can't afford real-time everywhere |
A chat product uses Claude 3.5 Sonnet for customer-facing conversations. Average response: about 200 tokens. Generated as a single blocking response, total response time would be 4 to 8 seconds.
The right approach: streaming. First token arrives in 300ms; full response completes in 4 to 8 seconds, but the user sees progress immediately.
What worked: streaming UX. Users perceive the system as fast even when total generation takes seconds. The same generation, delivered as a single non-streaming response, would feel broken.
What they nearly got wrong: implementing real-time without streaming. The product would have felt unresponsive despite being technically "real-time."
What to remember: for any user-facing generation longer than 1 second, streaming beats real-time on UX even when latency is identical.
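For reference, a minimal streaming loop with the Anthropic Python SDK; the prompt and max_tokens are placeholders, not recommendations.

```python
# Minimal streaming sketch with the Anthropic Python SDK.
# The prompt and max_tokens are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize my last three orders."}],
) as stream:
    for text in stream.text_stream:
        # Forward each chunk to the UI as it arrives; the user sees progress
        # within ~300ms instead of waiting 4 to 8 seconds for the full reply.
        print(text, end="", flush=True)
```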
An e-commerce platform needs AI-generated product descriptions for 200,000 catalog items. Quality matters; speed doesn't (this is a one-time generation effort).
The right approach: batch processing via Anthropic's Batch API at 50% off real-time pricing. Generation ran for 8 hours overnight. Total cost: about $4,500. Real-time generation would have cost about $9,000.
What worked: matching the inference mode to the actual requirement. There was no user waiting; saving 50% by tolerating overnight latency was an obvious win.
What they nearly got wrong: pushing the generation through the same real-time API used for live features. The team was about to spend $9K because "we already have the integration." The Batch API integration took 2 days; saved $4.5K immediately and continues to save on every backfill.
What to remember: when no user is waiting, batch is the right answer. The cost difference is too large to ignore at scale.
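As a sketch of what that backfill looks like with the Anthropic Message Batches API: the catalog data and prompt are placeholders, and the exact `batches` namespace may vary with your SDK version, so verify against current docs.

```python
# Sketch of a batch backfill via the Anthropic Message Batches API.
# catalog_items is a hypothetical list of {"sku", "title"} dicts.
import time
import anthropic

client = anthropic.Anthropic()
catalog_items = [{"sku": "A100", "title": "Ceramic pour-over kettle"}]  # illustrative

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"desc-{item['sku']}",
            "params": {
                "model": "claude-3-5-sonnet-latest",
                "max_tokens": 400,
                "messages": [{
                    "role": "user",
                    "content": f"Write a product description for: {item['title']}",
                }],
            },
        }
        for item in catalog_items
    ],
)

# No user is waiting, so polling every minute is fine.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text[:80])
```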
A social platform moderates posts at 1B tokens per day. Speed matters (post should be hidden quickly if violating); cost also matters at this scale.
The right approach: hybrid. Live posts go through a real-time, fine-tuned 7B classifier, which decides 90% of posts immediately. Borderline cases (the remaining 10%) queue for a near-real-time frontier API check (within 5 seconds). Periodic re-evaluation of older posts runs as a nightly batch job.
Combined cost: about $35,000 per month. Pure real-time frontier API would have been $2.5M per month.
What worked: three modes for three different needs. Real-time for the speed-critical path, near-real-time for the borderline cases where small latency is acceptable, batch for re-evaluation where speed is irrelevant.
What they nearly got wrong: pure real-time everywhere. The cost would have been impossible at platform scale. The hybrid wasn't optional; it was the only architecture that scaled.
What to remember: at very high volume, hybrid is the only viable architecture. Pure real-time doesn't scale to billions of tokens per day.
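A sketch of that routing logic: the classifier call, queue, and moderation action below are hypothetical stubs standing in for the platform's own systems, and the 0.9 confidence threshold is illustrative.

```python
# Hybrid moderation routing sketch. All three helpers are hypothetical
# stand-ins; the confidence threshold is an illustrative choice.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.9

@dataclass
class Verdict:
    violating: bool
    confidence: float

def classify_with_7b(text: str) -> Verdict:
    """Real-time call to the fine-tuned 7B classifier (stubbed)."""
    return Verdict(violating=False, confidence=0.97)

def hide_post(post_id: str) -> None:
    print(f"hiding {post_id}")  # stand-in for the real moderation action

def enqueue_frontier_check(post_id: str, text: str) -> None:
    print(f"queueing {post_id} for frontier review")  # stand-in for a real queue

def moderate(post_id: str, text: str) -> None:
    verdict = classify_with_7b(text)
    if verdict.confidence >= CONFIDENCE_THRESHOLD:
        # ~90% of posts: decided immediately on the cheap real-time path.
        if verdict.violating:
            hide_post(post_id)
    else:
        # ~10% borderline: near-real-time frontier check within ~5 seconds.
        enqueue_frontier_check(post_id, text)
    # Nightly batch re-evaluation of older posts runs separately (not shown).
```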
What it looks like: defaulting to real-time inference for all AI features.
Why it's wrong: simpler architecturally, but more expensive operationally. At scale, the cost difference between real-time and batch is dramatic (5 to 10x).
How to redirect: audit each feature for actual latency requirement. Where there's no user waiting, batch wins. Where there is, real-time or streaming.
What it looks like: avoiding batch where it would actually fit.
Why it's wrong: many "real-time" workloads aren't actually user-blocking. Analytics, content generation, periodic enrichment all have hours-tolerance.
How to redirect: distinguish between user-perceived latency and operational latency. Async-with-notification UX is often acceptable and dramatically cheaper.
What it looks like: avoiding streaming because of implementation complexity.
Why it's wrong: streaming is now standard. Server-Sent Events (SSE) or WebSocket implementations are well-understood. The UX win is substantial for any generation longer than 1 second.
How to redirect: stream all user-facing generations. The implementation cost is a few days; the UX value is permanent.
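To make "well-understood" concrete, here is a minimal SSE endpoint sketch, assuming FastAPI in front of the Anthropic SDK; the route, query parameter, and `[DONE]` sentinel are illustrative choices, not a standard.

```python
# Minimal SSE relay sketch, assuming FastAPI in front of the Anthropic SDK.
import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()

@app.get("/chat")
def chat(q: str) -> StreamingResponse:
    def event_stream():
        with client.messages.stream(
            model="claude-3-5-sonnet-latest",
            max_tokens=512,
            messages=[{"role": "user", "content": q}],
        ) as stream:
            for text in stream.text_stream:
                # One SSE frame per chunk; a production version would split
                # multi-line chunks across multiple "data:" lines.
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```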
What it looks like: deferring batch architecture as an optimization.
Why it's wrong: batch is architecture, not optimization. Retrofitting batch UX (async notifications, polling, queue management) is harder than designing for it.
How to redirect: identify batch-eligible workloads early. Build the async UX patterns from day one. The cost savings start immediately.
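One of those async UX patterns, sketched: submit returns a job id immediately and the client polls (or receives a notification) instead of blocking. The routes and in-memory store are illustrative, not a prescribed design.

```python
# Async-with-notification sketch: submit never blocks the UI.
# Routes and the in-memory store are illustrative; use a real queue/DB.
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
jobs: dict[str, dict] = {}

def generate_report(job_id: str, query: str) -> None:
    # Stand-in for the actual batch or async inference call.
    jobs[job_id] = {"status": "done", "result": f"(report for: {query})"}

@app.post("/reports")
def submit(query: str, background: BackgroundTasks) -> dict:
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending"}
    background.add_task(generate_report, job_id, query)
    return {"job_id": job_id}  # UI shows "we'll notify you" and moves on

@app.get("/reports/{job_id}")
def status(job_id: str) -> dict:
    return jobs.get(job_id, {"status": "unknown"})
```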
What it looks like: separate inference layers for different modes, with no shared infrastructure.
Why it's wrong: most production systems benefit from a unified inference abstraction. Models can serve all three modes; the difference is the calling pattern.
How to redirect: build a single inference layer that supports all three modes (real-time, streaming, batch) via the same API. Mode selection becomes a parameter, not a separate system.
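As an illustration of mode-as-a-parameter, a minimal sketch; the `Mode` enum and `run_*` helpers are illustrative names, not a specific library's API.

```python
# One inference entry point; the mode is a parameter, not a separate system.
# Mode names and run_* helpers are illustrative stubs.
from enum import Enum
from typing import Iterator, Union

class Mode(Enum):
    REALTIME = "realtime"
    STREAMING = "streaming"
    BATCH = "batch"

def run_realtime(prompt: str) -> str:
    return f"(full response to: {prompt})"  # stub: blocking model call

def run_streaming(prompt: str) -> Iterator[str]:
    yield from ["(chunk 1) ", "(chunk 2)"]  # stub: chunk iterator

def submit_batch(prompts: list[str]) -> str:
    return "batch-123"  # stub: returns a job handle to poll later

def infer(prompt: str, mode: Mode = Mode.REALTIME) -> Union[str, Iterator[str]]:
    if mode is Mode.STREAMING:
        return run_streaming(prompt)
    if mode is Mode.BATCH:
        return submit_batch([prompt])
    return run_realtime(prompt)
```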
Specific cases where the answer is real-time:

- Live recommendations and real-time decision systems, where response latency is part of the product itself.
- Hard sub-200ms latency budgets, where network and queueing, not the model, become the constraint and dedicated hardware is warranted.
- Responses that are a single score or decision rather than generated text, where streaming offers no perceived-progress benefit.

In all these cases, real-time is non-negotiable. Don't try to fight it; pay for the latency.
Realistic relative cost differences:
| Mode | Relative cost (per 1M tokens) | Best for |
|---|---|---|
| Real-time, dedicated hardware | 1.0x (baseline) | Sub-1-second user-facing latency |
| Streaming | 1.0x (same as real-time) | Long generations, interactive UX |
| Async queue (seconds tolerance) | 0.7 to 0.9x | Email, notifications, near-real-time |
| Batch (nightly or hourly) | 0.3 to 0.5x | Analytics, backfill, content generation |
| Anthropic Batch API | 0.5x of API real-time | High-volume Anthropic workloads |
| Spot batch (interruptible) | 0.2 to 0.4x | Truly interruptible batch workloads |
Multiply these multipliers by your token volume to estimate monthly cost differences. At 100M tokens per month, pure real-time vs pure batch is typically $30K vs $10K per month. The mix you run determines the bill.
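A worked version of that arithmetic: the $300-per-1M-token real-time baseline below is a hypothetical chosen to reproduce the $30K figure, and the multipliers are midpoints from the table; substitute your own contracted rates.

```python
# Worked cost arithmetic. Baseline and multipliers are illustrative:
# $300/1M tokens real-time reproduces the $30K/month figure above.
MONTHLY_TOKENS = 100_000_000
BASELINE_PER_1M = 300.0  # hypothetical real-time cost per 1M tokens (USD)

MULTIPLIER = {"realtime": 1.0, "async_queue": 0.8, "batch": 0.33}

def monthly_cost(mix: dict[str, float]) -> float:
    units = MONTHLY_TOKENS / 1_000_000
    return sum(units * share * BASELINE_PER_1M * MULTIPLIER[mode]
               for mode, share in mix.items())

print(f"${monthly_cost({'realtime': 1.0}):,.0f}")  # $30,000 pure real-time
print(f"${monthly_cost({'batch': 1.0}):,.0f}")     # $9,900 (~$10K) pure batch
mixed = {"realtime": 0.4, "async_queue": 0.2, "batch": 0.4}
print(f"${monthly_cost(mixed):,.0f}")              # $20,760 for this mix
```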
Real-time when latency is part of the product. Streaming for long generations. Batch for everything else. Most production systems use all three, mixed based on actual user-facing requirements.
Don't default to real-time everywhere. Audit each AI feature for actual latency requirement; many "real-time" workloads are batch-eligible at 5x lower cost. The savings from getting this right at design time compound for the life of the product.