When to pre-compute recommendations offline vs. generate them in real-time, and how to build hybrid systems.
Batch recommendations pre-compute suggestions periodically, offering simplicity and cost-efficiency for stable preferences. Real-time systems update instantly based on session behavior, essential for short sessions and changing contexts. Most production systems combine both: batch-computed candidates filtered and re-ranked in real-time.
Recommendation freshness exists on a spectrum, not a binary. Three points on that spectrum drive most architectural decisions: daily batch, near-real-time, and per-request real-time.
The right answer is rarely all-or-nothing. Most production systems combine modes — batch for stable signals, near-real-time for emerging signals, real-time for context-dependent ranking.
Batch recommendations are dramatically cheaper than real-time. A daily batch job computes recommendations once for every active user; the marginal cost of serving them is a cache lookup. Real-time inference, by contrast, costs CPU/GPU per request.
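A minimal sketch of that cost shape, with a toy deterministic scorer standing in for a real model (all names here are hypothetical): the batch job pays the scoring cost once per user per refresh, and serving is a dictionary lookup.

```python
# Sketch: a nightly batch writes top-N recommendations per user into a
# key-value store; serving is a single lookup. The score() function is a
# placeholder for a real model (e.g. dot product of learned embeddings).

def score(user_id: int, item_id: int) -> float:
    # Toy deterministic stand-in for a model score.
    return ((user_id * 31 + item_id * 17) % 100) / 100.0

def run_batch(users, items, store, top_n=3):
    # Expensive part: runs once per refresh, not once per request.
    for u in users:
        ranked = sorted(items, key=lambda i: score(u, i), reverse=True)
        store[u] = ranked[:top_n]

def serve(user_id, store):
    # Cheap part: marginal serving cost is a cache lookup.
    return store.get(user_id, [])

store = {}
run_batch(users=[1, 2], items=list(range(10)), store=store)
print(serve(1, store))
```

The fallback to an empty list for unknown users is where a real system would route to a cold-start strategy.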
Batch is the right default when preferences are stable between refreshes and the right recommendation does not depend on in-session context.
Common failure mode: defaulting to real-time because "users expect fresh recommendations." Most users don't notice batch staleness if the recommendations were good when computed. Measure before optimizing for freshness.
Real-time inference is necessary when the right recommendation depends on signals that arrive after the last batch run: session intent, recent activity, location.
The threshold for "real-time required" is empirical. Run an A/B test: serve recommendations from a one-day-stale cache vs. real-time inference. If business metrics are the same, batch is sufficient. If real-time wins meaningfully, it justifies the cost.
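For the staleness test above, arm assignment should be deterministic so a user sees a consistent experience across requests. A sketch of stable hash-based bucketing (the salt and 50/50 split are hypothetical choices):

```python
import hashlib

# Sketch: stably hash each user into either the stale-cache arm or the
# real-time arm of the freshness A/B test described above.

def assign_arm(user_id: str, salt: str = "freshness-ab") -> str:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "stale_cache" if bucket < 50 else "real_time"

print(assign_arm("user-42"))
```

Changing the salt reshuffles all assignments, which is how you run independent experiments on the same population.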
The dominant production pattern is hybrid: batch-precomputed candidate sets, real-time reranking on top.
The architecture: a batch job, refreshed daily, generates a few hundred candidates per user; at request time, a lightweight ranker re-scores those candidates with session and context features and serves the top results.
This separates expensive personalization (candidate generation, often matrix factorization or two-tower retrieval) from cheap context adjustment (reranking, often a gradient-boosted tree over a few hundred candidates).
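The split can be sketched in a few lines. Here the batch output, the candidate format, and the context-boost rule are all toy stand-ins for real models (a retrieval model offline, a learned reranker online):

```python
# Sketch of the hybrid pattern: batch-precomputed candidates with base
# scores, reranked at request time using cheap session context.

# Pretend nightly batch output: (item_id, precomputed base score).
BATCH_CANDIDATES = {
    "user-1": [("item-a", 0.9), ("item-b", 0.7), ("item-c", 0.6)],
}

def context_boost(item_id: str, session_categories: set) -> float:
    # Toy context signal: boost items matching current session intent.
    category = item_id.split("-")[1]
    return 0.3 if category in session_categories else 0.0

def recommend(user_id: str, session_categories: set, k: int = 2):
    candidates = BATCH_CANDIDATES.get(user_id, [])
    rescored = [(item, base + context_boost(item, session_categories))
                for item, base in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in rescored[:k]]

print(recommend("user-1", session_categories={"c"}))
```

Note that the real-time path only ever touches the candidate set, never the full catalog, which is what keeps per-request cost low.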
Near-real-time recommendations require streaming infrastructure. The standard stack: an event stream feeding a stream processor that writes fresh features into a feature store, which the model reads at inference time.
The feature store is the linchpin. Without it, training-serving skew (the difference between features used at training time and at inference time) degrades production model quality silently. With it, the same feature definitions feed both training and inference, eliminating a class of bugs.
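The core idea can be shown without any feature-store product: define each feature as a single function and call it from both the training pipeline and the serving path. Function and field names below are hypothetical:

```python
from datetime import datetime, timezone

# Sketch of the "single feature definition" idea behind a feature store:
# one function computes the feature for both offline training and online
# serving, so the two paths cannot silently drift apart.

def days_since_last_purchase(last_purchase: datetime, now: datetime) -> float:
    # The one and only definition of this feature.
    return (now - last_purchase).total_seconds() / 86400.0

def build_training_row(event: dict, now: datetime) -> dict:
    # Offline path: used when materializing the training dataset.
    return {"days_since_last_purchase":
            days_since_last_purchase(event["last_purchase"], now)}

def build_serving_features(profile: dict, now: datetime) -> dict:
    # Online path: used at inference time. Same definition, by construction.
    return {"days_since_last_purchase":
            days_since_last_purchase(profile["last_purchase"], now)}
```

Training-serving skew typically creeps in when these two builders are written by different teams in different languages; sharing the definition is the fix the feature store institutionalizes.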
Production recommendation serving runs under tight latency budgets: candidate fetch, feature lookup, and reranking each get a slice of the end-to-end budget.
When budgets are exceeded, common interventions: cache user embeddings at session boundaries, reduce reranker candidate set size, downsize embedding dimensions, replace neural rerankers with gradient-boosted trees, parallelize independent fetches.
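The first intervention on that list can be sketched with a session-keyed cache. The embedding function here is a hypothetical stand-in for a model forward pass:

```python
import functools

# Sketch: cache the user embedding for the duration of a session so the
# expensive computation runs once per session, not once per request.

def compute_user_embedding(user_id: str, session_id: str) -> tuple:
    # Expensive in reality (a model forward pass); trivial here.
    return (float(len(user_id)), float(len(session_id)))

@functools.lru_cache(maxsize=10_000)
def cached_user_embedding(user_id: str, session_id: str) -> tuple:
    # Keyed on (user, session): a new session recomputes, while repeat
    # requests within a session hit the cache.
    return compute_user_embedding(user_id, session_id)
```

Keying on the session id (rather than a TTL) is what makes this a "session boundary" cache: freshness resets exactly when the context does.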
Rough cost intuition for a 10M-user platform with 1M items: a daily batch refresh pays a fixed compute cost regardless of traffic, while real-time inference cost scales with request volume.
The exact figures depend on traffic shape, model size, and infrastructure choices. The point is that the three modes have different cost shapes (capex-like for batch, opex-like for real-time) and the right architecture often combines them to minimize cost per useful recommendation.
In most engagements, the freshness decision is made in two stages.
First, we identify which signals genuinely change fast enough to require real-time handling. For most clients, this is a subset — session intent, recent activity, location — not the whole stack.
Second, we design the architecture to make those signals real-time while keeping the rest batch. The result is usually a hybrid: batch candidate generation refreshed daily, real-time reranking on the candidates with session and context features. This is dramatically cheaper than full real-time at competitive quality.
Where we see teams over-engineer is in defaulting to real-time everywhere: the cost compounds, and the quality lift over hybrid is usually small. Where we see teams under-engineer is in running pure batch on session-sensitive surfaces: users notice when recommendations don't reflect what they just searched for.
Freshness is a tradeoff with cost, complexity, and reliability. The right architecture is the one that delivers the freshness business outcomes require, not the one with the freshest architecture diagram.