Architecture patterns for recommendation systems serving millions of users: candidate generation, ranking, and infrastructure.
Scaling requires approximate nearest neighbor search instead of brute force, two-stage retrieval (candidate generation + ranking), embedding pre-computation, feature stores with millisecond latency, and infrastructure that separates training from serving.
Production recommendation systems at scale almost always converge on a multi-stage funnel: retrieval (candidate generation) narrows the full catalog to hundreds of candidates, ranking orders those candidates with a heavier model, and reranking/post-processing applies diversity, freshness, and business rules before the slate goes out.
Each stage uses progressively heavier compute on progressively smaller candidate sets. This is the only architecture that hits sub-200ms latency on catalogs of millions of items. Skipping a stage either blows the latency budget (heavy model on full catalog) or degrades quality (light model on small candidate set).
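To make the funnel concrete, here is a minimal request-path skeleton. Every function is a hypothetical stand-in with a trivial body so the sketch runs; the real implementations are covered stage by stage below.

```python
# Hedged skeleton of the multi-stage funnel. The three stage functions
# are hypothetical stand-ins with trivial bodies so the sketch runs.
def retrieve(user_id: str, k: int) -> list[int]:
    return list(range(k))                    # millions of items -> k candidates

def rank(user_id: str, candidates: list[int]) -> list[int]:
    return sorted(candidates, reverse=True)  # heavier model scores a few hundred

def rerank(user_id: str, ranked: list[int]) -> list[int]:
    return ranked                            # diversity, freshness, business rules

def recommend(user_id: str, slate_size: int = 20) -> list[int]:
    candidates = retrieve(user_id, k=500)    # stage 1: tens of milliseconds
    ranked = rank(user_id, candidates)       # stage 2: 30-100 ms budget
    return rerank(user_id, ranked)[:slate_size]

print(recommend("user_42"))
```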
The retrieval stage must reduce a catalog of millions to a candidate set of hundreds in tens of milliseconds. The dominant approaches:
Embedding-based retrieval: train user and item embeddings, index items in an ANN store, query at request time. Sub-linear scaling (HNSW, IVF), well-understood operationally. Works for both collaborative and content-based signals. An indexing-and-query sketch follows this list.
Heuristic retrieval: business rules, popularity, recency, category-based filters. Fast, interpretable, often used as an additional retrieval source alongside embeddings.
Multi-source retrieval: combine multiple retrieval strategies. Examples: 100 from collaborative embeddings + 100 from content-based embeddings + 50 from popularity + 50 from "recently viewed by similar users" + 50 from explicit preferences. Each source contributes a different recommendation flavor; the ranker weighs them. A merge sketch appears after the next paragraph.
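A minimal sketch of the embedding-based approach, using FAISS with an HNSW index. The embeddings are random stand-ins for trained ones, and the dimension and index parameters are illustrative rather than tuned recommendations.

```python
# Embedding retrieval with FAISS HNSW. Embeddings are random stand-ins;
# dim, graph degree, and efSearch are illustrative, not tuned values.
import numpy as np
import faiss

dim, n_items = 64, 100_000             # scaled down so the index builds quickly
item_vecs = np.random.rand(n_items, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = HNSW graph degree (M)
index.hnsw.efSearch = 64               # query-time recall/latency trade-off
index.add(item_vecs)

# Request time: embed the user, pull a few hundred candidates.
user_vec = np.random.rand(1, dim).astype("float32")
distances, candidate_ids = index.search(user_vec, 200)
```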
Retrieval optimizes for recall: did we surface the items the user would actually engage with? Precision is the ranker's job. A retrieval stage that aggressively prunes for precision starves the ranker of good candidates.
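To make the multi-source pattern concrete, a minimal merge-and-dedup sketch. The source names and item ids are illustrative; keeping provenance lets the ranker use the retrieval source as a feature.

```python
# Hedged sketch of multi-source candidate merging. Source names and
# item ids are illustrative stand-ins for real retrieval outputs.
def merge_candidates(sources: dict[str, list[int]]) -> dict[int, list[str]]:
    """Union candidates from all sources, deduplicating by item id.

    Keeps provenance (which sources proposed each item) so the ranker
    can treat the retrieval source as a feature.
    """
    merged: dict[int, list[str]] = {}
    for source_name, item_ids in sources.items():
        for item_id in item_ids:
            merged.setdefault(item_id, []).append(source_name)
    return merged

candidates = merge_candidates({
    "collab_embeddings": [101, 102, 103],
    "content_embeddings": [102, 104],
    "popularity": [103, 105],
})
# -> {101: ['collab_embeddings'], 102: ['collab_embeddings', 'content_embeddings'], ...}
```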
The ranker's job is to take a few hundred to a few thousand candidates and order them by predicted engagement. With a smaller candidate set, you can afford a much heavier model.
Common ranker architectures: gradient-boosted decision trees (LightGBM, XGBoost) as the strong, cheap baseline; two-tower neural models; and deep feature-interaction models (DCN, DLRM-style) where the extra quality justifies the serving cost.
Features that matter: user-item interaction history, item popularity and freshness, request context (time, device, surface), user-candidate cross features, and which retrieval source proposed the candidate.
Latency budget for ranking: 30–100ms. Modern rankers on 200–500 candidates fit comfortably; rankers on 5000+ candidates push p99 over budget.
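As a concrete starting point, a hedged sketch of the GBDT baseline named above, trained with LambdaRank via LightGBM's ranker API. Features, labels, and sizes are random stand-ins.

```python
# LambdaRank baseline with LightGBM. All data here is a random stand-in.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_requests, cands, n_features = 200, 50, 20

X = rng.random((n_requests * cands, n_features))
y = rng.integers(0, 2, size=n_requests * cands)   # clicked / not clicked
group = [cands] * n_requests                      # candidates grouped per request

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# Serving: score one request's candidate set, sort descending.
scores = ranker.predict(X[:cands])
order = np.argsort(-scores)
```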
The final stage applies hard constraints and softer post-processing:
Diversity: prevent collapsing the result list to near-duplicates. MMR (Maximal Marginal Relevance) is the standard; DPP (Determinantal Point Processes) is more principled but heavier. An MMR sketch follows this list.
Freshness: boost recently added or recently updated items; decay items that have been shown too often.
Business rules: out-of-stock, region restrictions, age gates, content moderation, caps on items per category.
Personal constraints: items the user has explicitly hidden, items the user already owns, items recently shown that didn't engage.
Slot-aware optimization: the recommendation surface often has multiple slots with different constraints (top slot must be high-confidence, lower slots can be more exploratory). Optimize the slate, not individual items.
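A minimal MMR sketch: greedily pick items that trade relevance against redundancy with what has already been selected. Scores and similarities here are random stand-ins.

```python
# Maximal Marginal Relevance reranking. Inputs are random stand-ins.
import numpy as np

def mmr_rerank(scores: np.ndarray, sim: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedily select k items, trading relevance against redundancy.

    scores: (n,) ranker relevance scores.
    sim:    (n, n) pairwise item similarity matrix.
    lam:    1.0 = pure relevance, 0.0 = pure diversity.
    """
    selected: list[int] = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def mmr_value(i: int) -> float:
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
scores = rng.random(10)
emb = rng.random((10, 8))
norms = np.linalg.norm(emb, axis=1, keepdims=True)
sim = (emb @ emb.T) / (norms @ norms.T)   # cosine similarity
top5 = mmr_rerank(scores, sim, k=5)
```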
This stage rarely moves offline metrics meaningfully but materially affects the user-facing experience. Production systems that skip it produce technically strong rankings that nonetheless feel repetitive or off.
A feature store is non-optional at scale. Without it, training-serving skew silently degrades production model quality. The skew comes from features being computed differently at training time (in a batch job over historical data) and at serving time (in real-time over current data).
Feature stores enforce parity: each feature is defined once, materialized to an offline store for training and an online store for serving, and joined to training labels with point-in-time correctness so the model never trains on values it would not have had at serving time.
Tools: Feast (open source), Tecton, Hopsworks, Vertex AI Feature Store, SageMaker Feature Store. Build vs. buy depends on team scale. For most teams, buy. The engineering cost of a custom feature store is significant and the failure modes are subtle.
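A minimal Feast sketch of the parity idea: the same feature references serve both the online lookup and the point-in-time training join. It assumes a configured Feast repo in the working directory; the "user_stats" view and its field names are assumptions, not real definitions.

```python
# Hedged Feast sketch. Assumes a configured repo in the working
# directory; the "user_stats" feature view and fields are assumptions.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
feature_refs = ["user_stats:clicks_7d", "user_stats:purchases_30d"]

# Serving path: millisecond-latency lookup from the online store.
online = store.get_online_features(
    features=feature_refs,
    entity_rows=[{"user_id": 42}],
).to_dict()

# Training path: point-in-time-correct join against historical labels,
# so the model never sees feature values from after the label event.
labels_df = pd.DataFrame({
    "user_id": [42],
    "event_timestamp": [pd.Timestamp("2024-01-01", tz="UTC")],
})
training_df = store.get_historical_features(
    entity_df=labels_df,
    features=feature_refs,
).to_df()
```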
For production-scale recommenders, model serving is its own discipline.
Standard patterns: dedicated inference servers (Triton, TorchServe, TensorFlow Serving) behind the application tier, precomputed user and item embeddings refreshed on a schedule, request batching, and caching of anything that does not have to be computed per request.
For ANN retrieval: either embed a library (FAISS, HNSWlib, ScaNN) inside a retrieval service, or run a dedicated engine (Milvus, Vespa, OpenSearch k-NN). Plan for index rebuilds or incremental updates as the catalog changes.
Latency optimization is rarely about the model itself — it's about caching, parallelism, network, and serialization overhead. Profile end-to-end before assuming the model is the bottleneck.
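A sketch of the parallelism point: fan out feature fetches and retrieval sources concurrently so the request path pays the maximum of their latencies rather than the sum. All three functions are hypothetical stand-ins for real service calls.

```python
# Request-path parallelism sketch. The three async functions are
# hypothetical stand-ins for feature-store and retrieval calls.
import asyncio

async def fetch_user_features(user_id: int) -> dict:
    await asyncio.sleep(0.01)   # stand-in for a feature-store lookup
    return {"clicks_7d": 12}

async def retrieve_collab(user_id: int) -> list[int]:
    await asyncio.sleep(0.02)   # stand-in for an ANN query
    return [101, 102, 103]

async def retrieve_popular() -> list[int]:
    await asyncio.sleep(0.005)  # stand-in for a popularity-cache read
    return [103, 105]

async def handle_request(user_id: int):
    # Total wall time is roughly the max of the three, not the sum.
    features, collab, popular = await asyncio.gather(
        fetch_user_features(user_id),
        retrieve_collab(user_id),
        retrieve_popular(),
    )
    return features, collab, popular

asyncio.run(handle_request(42))
```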
A scaled recommender needs deep observability: per-stage latency percentiles, candidate-set sizes out of each stage, feature freshness, online score distributions compared against offline ones, catalog coverage, and alerts on drift in any of these.
A/B test infrastructure: deterministic user bucketing (sketched below), holdout groups, guardrail metrics tracked alongside the primary metric, and automated experiment readouts.
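A minimal sketch of deterministic bucketing: hash the experiment name and user id so the same user always sees the same variant, with no assignment state to store. The experiment and user ids are illustrative.

```python
# Deterministic A/B bucketing: stateless, stable per (experiment, user).
import hashlib

def assign_variant(experiment: str, user_id: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000          # 1000 fine-grained buckets
    return variants[bucket * len(variants) // 1000]

# The same user always lands in the same variant for a given experiment.
assert assign_variant("ranker_v2", "user_42") == assign_variant("ranker_v2", "user_42")
```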
Without this infrastructure, every experiment becomes a custom build, the team makes decisions on noisy data, and improvements drift backward over time.
In most engagements, we design the recommender system around the multi-stage funnel from week one. The first deliverable is a working pipeline at modest scale: retrieval (typically embedding-based), a simple ranker (often LightGBM as a baseline), and minimal reranking.
From there, the system grows: better embeddings, heavier rankers, diversity post-processing, full feature store, multi-source retrieval. The architecture stays the same; the components within each stage get more sophisticated.
This staged approach matters because building the "ideal" architecture in one shot is high-risk and slow. Shipping the pipeline early gives the team a working system to iterate on, real production telemetry to debug, and a baseline against which improvements can be measured.
The architecture is mature; the engineering is significant but well-scoped. Teams that follow the multi-stage pattern ship production-scale recommenders. Teams that try to skip stages or build "novel" architectures tend to ship slower and operate at lower quality.
From guide to production
Our team has hands-on experience implementing these systems. Book a free architecture call to discuss your specific requirements and get a clear delivery plan.