Architecture patterns for recommendation systems serving millions of users: candidate generation, ranking, and infrastructure.
Scaling requires approximate nearest neighbor search instead of brute force, two-stage retrieval (candidate generation + ranking), embedding pre-computation, feature stores with millisecond latency, and infrastructure that separates training from serving.
Production recommendation systems at scale almost always converge on a multi-stage funnel: retrieval (candidate generation) narrows the full catalog to hundreds of candidates, ranking orders those candidates with a heavier model, and reranking/post-processing applies diversity, freshness, and business rules before the slate goes out.
Each stage uses progressively heavier compute on progressively smaller candidate sets. This is the only architecture that hits sub-200ms latency on catalogs of millions of items. Skipping a stage either blows the latency budget (heavy model on full catalog) or degrades quality (light model on small candidate set).
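To make the funnel concrete, here is a minimal request-path skeleton. Every function is a hypothetical stand-in with a trivial body so the sketch runs; the real implementations are covered stage by stage below.

```python
# Hedged skeleton of the multi-stage funnel. The three stage functions
# are hypothetical stand-ins with trivial bodies so the sketch runs.
def retrieve(user_id: str, k: int) -> list[int]:
    return list(range(k))                    # millions of items -> k candidates

def rank(user_id: str, candidates: list[int]) -> list[int]:
    return sorted(candidates, reverse=True)  # heavier model scores a few hundred

def rerank(user_id: str, ranked: list[int]) -> list[int]:
    return ranked                            # diversity, freshness, business rules

def recommend(user_id: str, slate_size: int = 20) -> list[int]:
    candidates = retrieve(user_id, k=500)    # stage 1: tens of milliseconds
    ranked = rank(user_id, candidates)       # stage 2: 30-100 ms budget
    return rerank(user_id, ranked)[:slate_size]

print(recommend("user_42"))
```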
The retrieval stage must reduce a catalog of millions to a candidate set of hundreds in tens of milliseconds. The dominant approaches:
Embedding-based retrieval: train user and item embeddings, index items in an ANN store, query at request time. Sub-linear scaling (HNSW, IVF), well-understood operationally. Works for both collaborative and content-based signals. An indexing-and-query sketch follows this list.
Heuristic retrieval: business rules, popularity, recency, category-based filters. Fast, interpretable, often used as an additional retrieval source alongside embeddings.
Multi-source retrieval: combine multiple retrieval strategies. Examples: 100 from collaborative embeddings + 100 from content-based embeddings + 50 from popularity + 50 from "recently viewed by similar users" + 50 from explicit preferences. Each source contributes a different recommendation flavor; the ranker weighs them. A merge sketch appears after the next paragraph.
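A minimal sketch of the embedding-based approach, using FAISS with an HNSW index. The embeddings are random stand-ins for trained ones, and the dimension and index parameters are illustrative rather than tuned recommendations.

```python
# Embedding retrieval with FAISS HNSW. Embeddings are random stand-ins;
# dim, graph degree, and efSearch are illustrative, not tuned values.
import numpy as np
import faiss

dim, n_items = 64, 100_000             # scaled down so the index builds quickly
item_vecs = np.random.rand(n_items, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = HNSW graph degree (M)
index.hnsw.efSearch = 64               # query-time recall/latency trade-off
index.add(item_vecs)

# Request time: embed the user, pull a few hundred candidates.
user_vec = np.random.rand(1, dim).astype("float32")
distances, candidate_ids = index.search(user_vec, 200)
```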
Retrieval optimizes for recall: did we surface the items the user would actually engage with? Precision is the ranker's job. A retrieval stage that aggressively prunes for precision starves the ranker of good candidates.
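To make the multi-source pattern concrete, a minimal merge-and-dedup sketch. The source names and item ids are illustrative; keeping provenance lets the ranker use the retrieval source as a feature.

```python
# Hedged sketch of multi-source candidate merging. Source names and
# item ids are illustrative stand-ins for real retrieval outputs.
def merge_candidates(sources: dict[str, list[int]]) -> dict[int, list[str]]:
    """Union candidates from all sources, deduplicating by item id.

    Keeps provenance (which sources proposed each item) so the ranker
    can treat the retrieval source as a feature.
    """
    merged: dict[int, list[str]] = {}
    for source_name, item_ids in sources.items():
        for item_id in item_ids:
            merged.setdefault(item_id, []).append(source_name)
    return merged

candidates = merge_candidates({
    "collab_embeddings": [101, 102, 103],
    "content_embeddings": [102, 104],
    "popularity": [103, 105],
})
# -> {101: ['collab_embeddings'], 102: ['collab_embeddings', 'content_embeddings'], ...}
```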
The ranker's job is to take a few hundred to a few thousand candidates and order them by predicted engagement. With a smaller candidate set, you can afford a much heavier model.
Common ranker architectures: gradient-boosted decision trees (LightGBM, XGBoost) as the strong, cheap baseline; two-tower neural models; and deep feature-interaction models (DCN, DLRM-style) where the extra quality justifies the serving cost.
Features that matter: user-item interaction history, item popularity and freshness, request context (time, device, surface), user-candidate cross features, and which retrieval source proposed the candidate.
Latency budget for ranking: 30–100ms. Modern rankers on 200–500 candidates fit comfortably; rankers on 5000+ candidates push p99 over budget.
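As a concrete starting point, a hedged sketch of the GBDT baseline named above, trained with LambdaRank via LightGBM's ranker API. Features, labels, and sizes are random stand-ins.

```python
# LambdaRank baseline with LightGBM. All data here is a random stand-in.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_requests, cands, n_features = 200, 50, 20

X = rng.random((n_requests * cands, n_features))
y = rng.integers(0, 2, size=n_requests * cands)   # clicked / not clicked
group = [cands] * n_requests                      # candidates grouped per request

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# Serving: score one request's candidate set, sort descending.
scores = ranker.predict(X[:cands])
order = np.argsort(-scores)
```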
The final stage applies hard constraints and softer post-processing:
Diversity: prevent collapsing the result list to near-duplicates. MMR (Maximal Marginal Relevance) is the standard; DPP (Determinantal Point Processes) is more principled but heavier. An MMR sketch follows this list.
Freshness: boost recently added or recently updated items; decay items that have been shown too often.
Business rules: out-of-stock, region restrictions, age gates, content moderation, caps on items per category.
Personal constraints: items the user has explicitly hidden, items the user already owns, items recently shown that didn't engage.
Slot-aware optimization: the recommendation surface often has multiple slots with different constraints (top slot must be high-confidence, lower slots can be more exploratory). Optimize the slate, not individual items.
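A minimal MMR sketch: greedily pick items that trade relevance against redundancy with what has already been selected. Scores and similarities here are random stand-ins.

```python
# Maximal Marginal Relevance reranking. Inputs are random stand-ins.
import numpy as np

def mmr_rerank(scores: np.ndarray, sim: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedily select k items, trading relevance against redundancy.

    scores: (n,) ranker relevance scores.
    sim:    (n, n) pairwise item similarity matrix.
    lam:    1.0 = pure relevance, 0.0 = pure diversity.
    """
    selected: list[int] = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        def mmr_value(i: int) -> float:
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_value)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
scores = rng.random(10)
emb = rng.random((10, 8))
norms = np.linalg.norm(emb, axis=1, keepdims=True)
sim = (emb @ emb.T) / (norms @ norms.T)   # cosine similarity
top5 = mmr_rerank(scores, sim, k=5)
```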
This stage rarely moves offline metrics meaningfully but materially affects the user-facing experience. Production systems that skip it produce technically strong rankings that nonetheless feel repetitive or off.
A feature store is non-optional at scale. Without it, training-serving skew silently degrades production model quality. The skew comes from features being computed differently at training time (in a batch job over historical data) and at serving time (in real-time over current data).
Feature stores enforce parity: each feature is defined once, materialized to an offline store for training and an online store for serving, and joined to training labels with point-in-time correctness so the model never trains on values it would not have had at serving time.
Tools: Feast (open source), Tecton, Hopsworks, Vertex AI Feature Store, SageMaker Feature Store. Build vs. buy depends on team scale. For most teams, buy. The engineering cost of a custom feature store is significant and the failure modes are subtle.
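A minimal Feast sketch of the parity idea: the same feature references serve both the online lookup and the point-in-time training join. It assumes a configured Feast repo in the working directory; the "user_stats" view and its field names are assumptions, not real definitions.

```python
# Hedged Feast sketch. Assumes a configured repo in the working
# directory; the "user_stats" feature view and fields are assumptions.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
feature_refs = ["user_stats:clicks_7d", "user_stats:purchases_30d"]

# Serving path: millisecond-latency lookup from the online store.
online = store.get_online_features(
    features=feature_refs,
    entity_rows=[{"user_id": 42}],
).to_dict()

# Training path: point-in-time-correct join against historical labels,
# so the model never sees feature values from after the label event.
labels_df = pd.DataFrame({
    "user_id": [42],
    "event_timestamp": [pd.Timestamp("2024-01-01", tz="UTC")],
})
training_df = store.get_historical_features(
    entity_df=labels_df,
    features=feature_refs,
).to_df()
```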
For production-scale recommenders, model serving is its own discipline.
Standard patterns: dedicated inference servers (Triton, TorchServe, TensorFlow Serving) behind the application tier, precomputed user and item embeddings refreshed on a schedule, request batching, and caching of anything that does not have to be computed per request.
For ANN retrieval: either embed a library (FAISS, HNSWlib, ScaNN) inside a retrieval service, or run a dedicated engine (Milvus, Vespa, OpenSearch k-NN). Plan for index rebuilds or incremental updates as the catalog changes.
Latency optimization is rarely about the model itself — it's about caching, parallelism, network, and serialization overhead. Profile end-to-end before assuming the model is the bottleneck.
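A sketch of the parallelism point: fan out feature fetches and retrieval sources concurrently so the request path pays the maximum of their latencies rather than the sum. All three functions are hypothetical stand-ins for real service calls.

```python
# Request-path parallelism sketch. The three async functions are
# hypothetical stand-ins for feature-store and retrieval calls.
import asyncio

async def fetch_user_features(user_id: int) -> dict:
    await asyncio.sleep(0.01)   # stand-in for a feature-store lookup
    return {"clicks_7d": 12}

async def retrieve_collab(user_id: int) -> list[int]:
    await asyncio.sleep(0.02)   # stand-in for an ANN query
    return [101, 102, 103]

async def retrieve_popular() -> list[int]:
    await asyncio.sleep(0.005)  # stand-in for a popularity-cache read
    return [103, 105]

async def handle_request(user_id: int):
    # Total wall time is roughly the max of the three, not the sum.
    features, collab, popular = await asyncio.gather(
        fetch_user_features(user_id),
        retrieve_collab(user_id),
        retrieve_popular(),
    )
    return features, collab, popular

asyncio.run(handle_request(42))
```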
A scaled recommender needs deep observability: per-stage latency percentiles, candidate-set sizes out of each stage, feature freshness, online score distributions compared against offline ones, catalog coverage, and alerts on drift in any of these.
A/B test infrastructure: deterministic user bucketing (sketched below), holdout groups, guardrail metrics tracked alongside the primary metric, and automated experiment readouts.
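A minimal sketch of deterministic bucketing: hash the experiment name and user id so the same user always sees the same variant, with no assignment state to store. The experiment and user ids are illustrative.

```python
# Deterministic A/B bucketing: stateless, stable per (experiment, user).
import hashlib

def assign_variant(experiment: str, user_id: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000          # 1000 fine-grained buckets
    return variants[bucket * len(variants) // 1000]

# The same user always lands in the same variant for a given experiment.
assert assign_variant("ranker_v2", "user_42") == assign_variant("ranker_v2", "user_42")
```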
Without this infrastructure, every experiment becomes a custom build, the team makes decisions on noisy data, and improvements drift backward over time.
In most engagements, we design the recommender system around the multi-stage funnel from week one. The first deliverable is a working pipeline at modest scale: retrieval (typically embedding-based), a simple ranker (often LightGBM as a baseline), and minimal reranking.
From there, the system grows: better embeddings, heavier rankers, diversity post-processing, full feature store, multi-source retrieval. The architecture stays the same; the components within each stage get more sophisticated.
This staged approach matters because building the "ideal" architecture in one shot is high-risk and slow. Shipping the pipeline early gives the team a working system to iterate on, real production telemetry to debug, and a baseline against which improvements can be measured.
The architecture is mature; the engineering is significant but well-scoped. Teams that follow the multi-stage pattern ship production-scale recommenders. Teams that try to skip stages or build "novel" architectures tend to ship slower and operate at lower quality.
From guide to production
Our team has hands-on experience implementing these systems. Book a free architecture call to discuss your specific requirements and get a clear delivery plan.