When to pre-compute recommendations offline vs. generate them in real-time, and how to build hybrid systems.
Batch recommendations pre-compute suggestions periodically, offering simplicity and cost-efficiency for stable preferences. Real-time systems update instantly based on session behavior, essential for short sessions and changing contexts. Most production systems combine both: batch-computed candidates filtered and re-ranked in real-time.
Recommendation freshness exists on a spectrum, not a binary. Three points on that spectrum drive most architectural decisions: daily batch, near-real-time, and per-request real-time.
The right answer is rarely all-or-nothing. Most production systems combine modes — batch for stable signals, near-real-time for emerging signals, real-time for context-dependent ranking.
Batch recommendations are dramatically cheaper than real-time. A daily batch job computes recommendations once for every active user; the marginal cost of serving them is a cache lookup. Real-time inference, by contrast, costs CPU/GPU per request.
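A minimal sketch of that cost shape, with a toy deterministic scorer standing in for a real model (all names here are hypothetical): the batch job pays the scoring cost once per user per refresh, and serving is a dictionary lookup.

```python
# Sketch: a nightly batch writes top-N recommendations per user into a
# key-value store; serving is a single lookup. The score() function is a
# placeholder for a real model (e.g. dot product of learned embeddings).

def score(user_id: int, item_id: int) -> float:
    # Toy deterministic stand-in for a model score.
    return ((user_id * 31 + item_id * 17) % 100) / 100.0

def run_batch(users, items, store, top_n=3):
    # Expensive part: runs once per refresh, not once per request.
    for u in users:
        ranked = sorted(items, key=lambda i: score(u, i), reverse=True)
        store[u] = ranked[:top_n]

def serve(user_id, store):
    # Cheap part: marginal serving cost is a cache lookup.
    return store.get(user_id, [])

store = {}
run_batch(users=[1, 2], items=list(range(10)), store=store)
print(serve(1, store))
```

The fallback to an empty list for unknown users is where a real system would route to a cold-start strategy.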
Batch is the right default when preferences are stable between refreshes and the right recommendation does not depend on in-session context.
Common failure mode: defaulting to real-time because "users expect fresh recommendations." Most users don't notice batch staleness if the recommendations were good when computed. Measure before optimizing for freshness.
Real-time inference is necessary when the right recommendation depends on signals that arrive after the last batch run: session intent, recent activity, location.
The threshold for "real-time required" is empirical. Run an A/B test: serve recommendations from a one-day-stale cache vs. real-time inference. If business metrics are the same, batch is sufficient. If real-time wins meaningfully, it justifies the cost.
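For the staleness test above, arm assignment should be deterministic so a user sees a consistent experience across requests. A sketch of stable hash-based bucketing (the salt and 50/50 split are hypothetical choices):

```python
import hashlib

# Sketch: stably hash each user into either the stale-cache arm or the
# real-time arm of the freshness A/B test described above.

def assign_arm(user_id: str, salt: str = "freshness-ab") -> str:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "stale_cache" if bucket < 50 else "real_time"

print(assign_arm("user-42"))
```

Changing the salt reshuffles all assignments, which is how you run independent experiments on the same population.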
The dominant production pattern is hybrid: batch-precomputed candidate sets, real-time reranking on top.
The architecture: a batch job, refreshed daily, generates a few hundred candidates per user; at request time, a lightweight ranker re-scores those candidates with session and context features and serves the top results.
This separates expensive personalization (candidate generation, often matrix factorization or two-tower retrieval) from cheap context adjustment (reranking, often a gradient-boosted tree over a few hundred candidates).
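The split can be sketched in a few lines. Here the batch output, the candidate format, and the context-boost rule are all toy stand-ins for real models (a retrieval model offline, a learned reranker online):

```python
# Sketch of the hybrid pattern: batch-precomputed candidates with base
# scores, reranked at request time using cheap session context.

# Pretend nightly batch output: (item_id, precomputed base score).
BATCH_CANDIDATES = {
    "user-1": [("item-a", 0.9), ("item-b", 0.7), ("item-c", 0.6)],
}

def context_boost(item_id: str, session_categories: set) -> float:
    # Toy context signal: boost items matching current session intent.
    category = item_id.split("-")[1]
    return 0.3 if category in session_categories else 0.0

def recommend(user_id: str, session_categories: set, k: int = 2):
    candidates = BATCH_CANDIDATES.get(user_id, [])
    rescored = [(item, base + context_boost(item, session_categories))
                for item, base in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in rescored[:k]]

print(recommend("user-1", session_categories={"c"}))
```

Note that the real-time path only ever touches the candidate set, never the full catalog, which is what keeps per-request cost low.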
Near-real-time recommendations require streaming infrastructure. The standard stack: an event stream feeding a stream processor that writes fresh features into a feature store, which the model reads at inference time.
The feature store is the linchpin. Without it, training-serving skew (the difference between features used at training time and at inference time) degrades production model quality silently. With it, the same feature definitions feed both training and inference, eliminating a class of bugs.
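The core idea can be shown without any feature-store product: define each feature as a single function and call it from both the training pipeline and the serving path. Function and field names below are hypothetical:

```python
from datetime import datetime, timezone

# Sketch of the "single feature definition" idea behind a feature store:
# one function computes the feature for both offline training and online
# serving, so the two paths cannot silently drift apart.

def days_since_last_purchase(last_purchase: datetime, now: datetime) -> float:
    # The one and only definition of this feature.
    return (now - last_purchase).total_seconds() / 86400.0

def build_training_row(event: dict, now: datetime) -> dict:
    # Offline path: used when materializing the training dataset.
    return {"days_since_last_purchase":
            days_since_last_purchase(event["last_purchase"], now)}

def build_serving_features(profile: dict, now: datetime) -> dict:
    # Online path: used at inference time. Same definition, by construction.
    return {"days_since_last_purchase":
            days_since_last_purchase(profile["last_purchase"], now)}
```

Training-serving skew typically creeps in when these two builders are written by different teams in different languages; sharing the definition is the fix the feature store institutionalizes.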
Production recommendation serving runs under tight latency budgets: candidate fetch, feature lookup, and reranking each get a slice of the end-to-end budget.
When budgets are exceeded, common interventions: cache user embeddings at session boundaries, reduce reranker candidate set size, downsize embedding dimensions, replace neural rerankers with gradient-boosted trees, parallelize independent fetches.
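The first intervention on that list can be sketched with a session-keyed cache. The embedding function here is a hypothetical stand-in for a model forward pass:

```python
import functools

# Sketch: cache the user embedding for the duration of a session so the
# expensive computation runs once per session, not once per request.

def compute_user_embedding(user_id: str, session_id: str) -> tuple:
    # Expensive in reality (a model forward pass); trivial here.
    return (float(len(user_id)), float(len(session_id)))

@functools.lru_cache(maxsize=10_000)
def cached_user_embedding(user_id: str, session_id: str) -> tuple:
    # Keyed on (user, session): a new session recomputes, while repeat
    # requests within a session hit the cache.
    return compute_user_embedding(user_id, session_id)
```

Keying on the session id (rather than a TTL) is what makes this a "session boundary" cache: freshness resets exactly when the context does.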
Rough cost intuition for a 10M-user platform with 1M items: a daily batch refresh pays a fixed compute cost regardless of traffic, while real-time inference cost scales with request volume.
The exact figures depend on traffic shape, model size, and infrastructure choices. The point is that the three modes have different cost shapes (capex-like for batch, opex-like for real-time) and the right architecture often combines them to minimize cost per useful recommendation.
In most engagements, the freshness decision is made in two stages.
First, we identify which signals genuinely change fast enough to require real-time handling. For most clients, this is a subset — session intent, recent activity, location — not the whole stack.
Second, we design the architecture to make those signals real-time while keeping the rest batch. The result is usually a hybrid: batch candidate generation refreshed daily, real-time reranking on the candidates with session and context features. This is dramatically cheaper than full real-time at competitive quality.
Where we see teams over-engineer is in defaulting to real-time everywhere: the cost compounds, and the quality lift over hybrid is usually small. Where we see teams under-engineer is in running pure batch on session-sensitive surfaces: users notice when recommendations don't reflect what they just searched for.
Freshness is a tradeoff with cost, complexity, and reliability. The right architecture is the one that delivers the freshness business outcomes require, not the one with the freshest architecture diagram.