A step-by-step guide to building production-grade RAG systems, covering document ingestion, chunking strategies, embedding selection, vector DB choice, retrieval patterns, evaluation frameworks, and deployment architecture.
Building a RAG proof-of-concept takes a weekend. You chunk some documents, embed them into a vector database, wire up a retrieval chain with LangChain, and the demo works impressively. Stakeholders get excited. Then the project enters a swamp that swallows months of engineering effort. The same system that answered 8 out of 10 demo questions correctly starts hallucinating on real user queries. Documents with tables break the chunking pipeline. The system retrieves irrelevant passages when questions are phrased even slightly differently from the wording in the source documents. Response latency spikes to 8 seconds when the context window fills up.
This guide is the distillation of 14 production RAG deployments we have built across insurance, legal tech, enterprise knowledge management, and developer documentation. It maps a 90-day path from validated POC to production system, covering every architectural decision point with specific recommendations and the reasoning behind them.
Document parsing is where most RAG projects silently fail. The quality of your retrieval system is bounded by the quality of the parsed text. If your parser drops table data, misorders columns, or loses section hierarchy, no amount of embedding optimization downstream can fix it. For PDFs, the landscape in 2026 offers three tiers. Basic extraction with PyPDF2 or pdfplumber works for text-heavy, single-column documents and costs nothing. It fails on multi-column layouts, embedded images, and complex tables. Mid-tier solutions like Unstructured.io or LlamaParse handle most document layouts correctly, including tables, headers, and multi-column text, at $0.01-0.03 per page. Enterprise solutions like Azure Document Intelligence or Google Document AI provide the highest accuracy on complex layouts at $0.01-0.05 per page with OCR included.
Our recommendation for most projects: start with Unstructured.io for the first version. It handles 80-85% of document types correctly out of the box. Build a fallback pipeline that routes parsing failures to Azure Document Intelligence for the remaining edge cases. This hybrid approach costs 60% less than using the enterprise parser for everything while covering 95%+ of documents correctly.
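The routing logic behind that hybrid approach is simple. A minimal sketch, assuming placeholder parser callables: in a real deployment the `primary` and `fallback` functions would wrap the Unstructured.io and Azure Document Intelligence clients, and the confidence field would come from your parser's own quality signal.

```python
# Sketch of the hybrid parsing pipeline: try the cheap parser first,
# route crashes and low-confidence results to the enterprise parser.
# The parser callables and ParseResult shape are illustrative, not a
# real library API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ParseResult:
    text: str
    confidence: float  # parser's own extraction-quality estimate, 0..1

def parse_with_fallback(
    path: str,
    primary: Callable[[str], ParseResult],
    fallback: Callable[[str], ParseResult],
    min_confidence: float = 0.8,
) -> ParseResult:
    """Return the primary parser's result when it succeeds with high
    confidence; otherwise fall back to the enterprise parser."""
    try:
        result = primary(path)
        if result.confidence >= min_confidence:
            return result
    except Exception:
        pass  # primary parser crashed on this layout
    return fallback(path)
```

The confidence threshold is the tuning knob: set it too high and you pay for the enterprise parser on documents the cheap one handled fine; too low and garbled extractions slip through to the index.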
Tables are the single most common failure mode in RAG document processing. A financial report where the answer lives in a table cell will return garbage if the table is flattened into a text string. The production solution is to detect tables during parsing, preserve them as structured markdown or HTML, and embed them as complete units rather than splitting across chunks. For tables wider than the chunk size limit, create a table summary alongside the raw table data and embed both. The summary captures the semantic meaning for retrieval while the raw data provides precise values for the LLM to reference in its answer.
Images with text content, such as diagrams, flowcharts, or screenshots, require OCR or multimodal processing. In 2026, the most effective approach is using a vision-language model like Claude or GPT-4o to generate text descriptions of images, then embedding those descriptions alongside the surrounding text context. This costs $0.002-0.005 per image but dramatically improves retrieval for queries about visual content.
Every document should carry metadata that enables filtering before vector search. At minimum, extract: document title, source file name, creation or publication date, document type or category, section headers as a hierarchical path, and page numbers. For enterprise deployments, add: access control tags mapping to your organization's permission model, department or team ownership, document version and whether it supersedes earlier versions, and confidence score from the parser indicating extraction quality. This metadata serves two critical purposes: pre-filtering the search space to improve precision, and providing the LLM with source attribution information so users can verify answers.
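One way to carry that metadata alongside each chunk is a plain dataclass; the field names below are illustrative rather than a fixed schema, and most vector databases would receive this as a flat metadata dict.

```python
# Per-chunk metadata sketch covering the fields described above.
# Field names are illustrative; adapt to your vector DB's metadata
# filtering syntax.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ChunkMetadata:
    title: str
    source_file: str
    doc_type: str
    section_path: list   # e.g. ["Financial Results", "Q3 2025"]
    page: int
    published: Optional[str] = None               # ISO date string
    acl_tags: list = field(default_factory=list)  # permission-model tags
    version: Optional[str] = None
    parse_confidence: float = 1.0                 # extraction quality

    def section_key(self) -> str:
        """Flatten the heading hierarchy for pre-filtering and for
        source attribution in answers."""
        return " > ".join(self.section_path)
```
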
Fixed-size chunking splits text into segments of N tokens with an overlap of M tokens. The standard starting point is 512 tokens with 50-token overlap. This approach is deterministic, easy to debug, and works surprisingly well for homogeneous document collections where content density is relatively uniform, like technical documentation or FAQ databases. The limitation appears with documents that have varying content density: a legal contract where one clause spans 2,000 tokens gets split across 4 chunks, destroying the semantic coherence that makes retrieval effective. For documents with consistent structure, fixed-size chunking at 400-600 tokens achieves retrieval precision within 3-5% of more sophisticated methods at a fraction of the implementation complexity.
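The mechanics of fixed-size chunking fit in a few lines. For clarity this sketch splits on whitespace "tokens"; a production version would count real model tokens with a tokenizer such as tiktoken.

```python
# Minimal fixed-size chunker with overlap. Whitespace tokens stand in
# for model tokens to keep the example dependency-free.
def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list:
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward, re-including `overlap` tokens
    return chunks
```

The overlap means each boundary sentence appears in two chunks, so a query matching text near a split still retrieves a chunk containing the surrounding context.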
Semantic chunking uses embedding similarity between consecutive sentences to detect topic boundaries. When the cosine similarity between adjacent sentence embeddings drops below a threshold (typically 0.7-0.8), the chunker creates a split. This produces chunks that align with natural topic boundaries, improving retrieval relevance for queries that target specific concepts within a document. The implementation uses a sliding window of sentence embeddings and detects similarity valleys. Libraries like LlamaIndex provide semantic chunking out of the box, but the threshold parameter needs tuning per document type. Technical documentation tends to need a lower threshold (0.65-0.72) because adjacent sections often share terminology, while narrative content works well at 0.78-0.85.
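The boundary detection can be sketched as follows, with the embedding function injected so any model can be plugged in; the 0.75 threshold is just a starting point, tuned per document type as described above.

```python
# Semantic chunking sketch: start a new chunk wherever cosine
# similarity between adjacent sentence embeddings drops below the
# threshold. `embed` is any sentence-embedding function.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.75):
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:  # similarity valley: topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

A production version would smooth over single-sentence dips with a sliding window rather than splitting on every local valley, but the core mechanism is the threshold comparison shown here.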
Document-aware chunking uses the document's own structure (headings, sections, paragraphs, and list items) as chunking boundaries. A section under an H2 heading becomes one chunk if it fits within the token limit, or gets split at paragraph boundaries if it exceeds it. Each chunk inherits the heading hierarchy as context: a paragraph under "Financial Results > Q3 2025 > Revenue Breakdown" carries that full path as a prefix. This approach produces the highest retrieval quality for well-structured documents because queries often reference structural elements: users ask about a specific section, clause, or topic that maps directly to the document hierarchy. The implementation cost is higher because you need the parser to preserve structural information, but the retrieval quality improvement of 8-15% over fixed-size chunking justifies it for most production systems.
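The heading-path prefixing can be sketched in a few lines, assuming the parser has already produced (heading path, section text) pairs; splitting oversized sections at paragraph boundaries is omitted for brevity.

```python
# Heading-aware chunking sketch: each section becomes one chunk
# prefixed with its full heading path, so the embedding captures the
# structural context the section lives in.
def structure_chunks(sections):
    """sections: list of (heading_path, text) pairs, where heading_path
    is a list like ["Financial Results", "Q3 2025", "Revenue Breakdown"].
    """
    chunks = []
    for path, text in sections:
        prefix = " > ".join(path)
        chunks.append(f"{prefix}\n{text}")
    return chunks
```
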
One of the most effective production patterns is parent-child chunking. Create small chunks of 200-300 tokens for embedding and retrieval, but link each small chunk to a larger parent chunk of 1,000-1,500 tokens. When retrieval finds a matching small chunk, the system fetches the parent chunk to include in the LLM context. This gives you the precision of small chunks for matching and the context richness of large chunks for answer generation. The implementation stores two layers in your vector database: the small chunks with embeddings for search, and the parent chunks as stored documents referenced by ID. The retrieval pipeline searches small chunks, deduplicates by parent ID, and returns the parent chunks. This pattern consistently outperforms single-layer chunking by 10-20% on answer quality metrics in our benchmarks.
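The retrieval side of the parent-child pattern reduces to a search-then-deduplicate step. A minimal sketch, with `search_children` standing in for your vector DB query over the small chunks:

```python
# Parent-child retrieval step: search small child chunks, deduplicate
# by parent ID, and return parent chunks for the LLM context.
def retrieve_parents(query, search_children, parents, top_k=5):
    """
    search_children(query) -> list of (child_id, parent_id, score),
        sorted by descending score (your vector store's query).
    parents: mapping of parent_id -> parent chunk text.
    """
    seen, results = set(), []
    for child_id, parent_id, score in search_children(query):
        if parent_id in seen:
            continue  # another child of this parent already matched
        seen.add(parent_id)
        results.append(parents[parent_id])
        if len(results) == top_k:
            break
    return results
```

Note the deduplication matters: when several children of the same parent match a query, returning the parent once keeps the context window free for other documents.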
Embedding model choice directly impacts retrieval quality, latency, and cost. The leading options in 2026 are: OpenAI text-embedding-3-large (3072 dimensions, $0.13 per million tokens, strong general-purpose performance), OpenAI text-embedding-3-small (1536 dimensions, $0.02 per million tokens, 90% of large model quality at 85% lower cost), Cohere embed-v3 (1024 dimensions, $0.10 per million tokens, best-in-class for multilingual retrieval), Voyage AI voyage-3 (1024 dimensions, $0.06 per million tokens, optimized for code and technical content), and open-source options like BGE-large-en-v1.5 or Nomic Embed (768 dimensions, free to self-host, strong performance on MTEB benchmarks). For most production RAG systems, OpenAI text-embedding-3-small offers the best balance of quality, cost, and ease of integration. It ranks within 2-4% of the large model on MTEB retrieval benchmarks while costing 85% less.
Self-hosting makes sense in three scenarios: when data cannot leave your infrastructure due to regulatory requirements, when embedding volume exceeds 50 million tokens per day making API costs prohibitive, or when you need to fine-tune the embedding model on domain-specific data. Running BGE-large on a single T4 GPU instance costs approximately $150/month on AWS and handles about 500 embedding requests per second. At that throughput, the break-even point versus OpenAI API pricing is roughly 15 million tokens per day. Below that volume, the operational complexity of maintaining a self-hosted model (GPU driver updates, model serving infrastructure, monitoring, and scaling) outweighs the cost savings.
If your application already uses PostgreSQL, pgvector is the correct starting point. It adds vector similarity search to your existing database with no additional infrastructure. At up to 1 million vectors with 1536 dimensions, pgvector on a db.r6g.xlarge RDS instance ($280/month) delivers sub-50ms query latency with HNSW indexing. The operational advantage is significant: your existing database backup, monitoring, and failover infrastructure covers the vector data automatically. The limitation is scale. Beyond 5 million vectors, pgvector query latency degrades unless you invest in index tuning, partitioning, and potentially dedicated hardware. At 10 million vectors, you should plan a migration to a purpose-built vector database or commit to PostgreSQL table partitioning and careful HNSW parameter tuning.
Pinecone offers a fully managed vector database with a serverless pricing model starting at $0.08 per million reads and $2 per million writes. For a RAG system processing 10,000 queries per day with 1 million vectors, monthly costs run approximately $70-120. Pinecone excels at zero-ops: no index tuning, no infrastructure management, no capacity planning. Query latency is consistently under 50ms at the 1-10 million vector scale. The trade-off is cost at higher volumes and limited configurability. You cannot tune HNSW parameters, choose index types, or optimize for specific query patterns. At 50 million vectors with high query volume, Pinecone costs can reach $800-1,500/month, significantly more than self-managed alternatives.
Weaviate provides hybrid search combining vector similarity and BM25 keyword search natively, which is critical for production RAG because pure vector search misses exact-match queries like product codes, policy numbers, and technical terms. Weaviate Cloud starts at $25/month for development and scales to $200-500/month for production workloads at the 1-10 million vector range. Self-hosted Weaviate on a 4-core, 16GB instance handles 1-5 million vectors at $120-200/month in compute costs. The built-in hybrid search saves 2-3 weeks of engineering compared to implementing it yourself with pgvector and Elasticsearch, which is a meaningful consideration for teams on tight timelines.
Pure vector search fails on exact-match queries. When a user asks about policy number POL-2024-38291, vector similarity may retrieve documents mentioning other policy numbers because the embedding captures the concept of policy numbers rather than the specific identifier. Hybrid search solves this by combining vector similarity scores with BM25 keyword scores using reciprocal rank fusion (RRF). The implementation retrieves the top-K results from both vector and keyword search independently, then merges: for each document d, score(d) = 1 / (k + rank_vector(d)) + 1 / (k + rank_keyword(d)), summing one reciprocal-rank term per ranking the document appears in, where k is typically 60. Hybrid search improves retrieval precision by 12-18% over pure vector search in our production benchmarks, with the largest gains on queries containing specific identifiers, dates, and domain terminology.
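The RRF merge itself is a few lines of dictionary arithmetic over the independently ranked result lists:

```python
# Reciprocal rank fusion: merge ranked doc-ID lists (e.g. one from
# vector search, one from BM25) into a single fused ranking.
def rrf_merge(rankings, k=60):
    """rankings: list of ranked doc-ID lists, best first.
    Returns doc IDs sorted by descending fused RRF score."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            # Each list contributes 1/(k + rank) for docs it contains.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because the fusion uses ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.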
Re-ranking is the highest-leverage improvement you can add to a RAG pipeline after hybrid search. The pattern: retrieve a broad set of 20-50 candidates from hybrid search, then use a cross-encoder model to re-score each candidate against the original query. Cross-encoders jointly encode the query and document, producing far more accurate relevance scores than the bi-encoder embeddings used for initial retrieval. Cohere Rerank costs $1 per 1,000 search queries and consistently improves top-5 precision by 15-25% in our evaluations. For self-hosted re-ranking, the cross-encoder/ms-marco-MiniLM-L-12-v2 model runs on CPU at 50-100 candidates per second, fast enough for real-time re-ranking of 50 candidates in under a second. The latency cost of re-ranking, typically 100-300ms, is almost always worth the precision improvement.
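The re-ranking stage itself is a thin wrapper around whatever scorer you choose. In the sketch below `score_pair` is pluggable: sentence-transformers' CrossEncoder loaded with cross-encoder/ms-marco-MiniLM-L-12-v2 is one concrete choice for it, and Cohere Rerank another, but any query-document scoring function fits the same shape.

```python
# Re-ranking step: re-score the broad candidate set from hybrid
# search with a cross-encoder and keep the top_k by the new score.
# `score_pair(query, doc) -> float` is any relevance scorer.
def rerank(query, candidates, score_pair, top_k=5):
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

Keeping the scorer injectable also makes the pipeline testable without a GPU: unit tests can use a cheap lexical-overlap stub while production swaps in the cross-encoder.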
Raw user queries often make poor search queries. A question like "what happened with the server issue last Tuesday" contains temporal references and vague language that vector search handles poorly. Query transformation uses an LLM to rewrite the user query into one or more optimized search queries before retrieval. The three most effective transformations are: query decomposition (splitting a complex question into 2-3 sub-queries that each target specific information), hypothetical document embedding or HyDE (asking the LLM to generate a hypothetical answer, then using that answer as the search query since it will be closer in embedding space to real answers), and query expansion (adding relevant terms and synonyms to broaden the search). Implementing all three and running retrieval for each transformed query, then merging results, improves recall by 20-30% at the cost of 3-5x more retrieval operations. In practice, HyDE alone provides 60-70% of the total improvement and adds only one additional retrieval pass.
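Since HyDE carries most of the benefit on its own, it is worth seeing how little code it takes. A sketch with the LLM client and vector search as injected placeholders; the prompt wording is illustrative:

```python
# HyDE sketch: ask an LLM for a hypothetical answer, then search with
# that answer instead of the raw query. `llm` and `search` stand in
# for your model client and vector store.
def hyde_retrieve(query, llm, search, top_k=10):
    prompt = (
        "Write a short passage that plausibly answers the question. "
        "Do not say you are unsure.\n\nQuestion: " + query
    )
    hypothetical = llm(prompt)  # one extra call to a small, cheap model
    # The hypothetical answer sits closer in embedding space to real
    # answer passages than the question itself does.
    return search(hypothetical, top_k)
```
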
Retrieved chunks often contain 70-80% irrelevant text surrounding the actual answer. Passing all of this to the LLM wastes context window tokens and can confuse the generation. Contextual compression uses a lightweight LLM call to extract only the relevant sentences from each retrieved chunk before passing them to the final generation step. This reduces the effective context by 50-70%, allowing you to include more retrieved chunks within the same context budget. The cost is an additional LLM call using a small model like Claude Haiku, adding $0.0001-0.0003 per query and 200-400ms of latency. The trade-off is worth it when your chunks are large (800+ tokens) or when you need to include 8+ chunks in context.
Retrieval evaluation requires a test set of query-document pairs where you know which documents should be retrieved for each query. Create this by having domain experts write 200-500 questions and annotate the source documents. Then measure: precision at K (what fraction of the top-K retrieved documents are relevant), recall at K (what fraction of all relevant documents appear in the top-K), and mean reciprocal rank (how high does the first relevant document rank on average). Target benchmarks for production RAG: precision@5 above 0.70, recall@10 above 0.85, MRR above 0.75. If your system hits these numbers on a representative test set, retrieval quality is unlikely to be the bottleneck in answer quality.
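The three retrieval metrics are straightforward to compute once the annotated test set exists:

```python
# Retrieval metrics over a labeled test set. `retrieved` is a ranked
# list of doc IDs; `relevant` is the annotated set of correct doc IDs.
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    top = retrieved[:k]
    return sum(1 for d in relevant if d in top) / len(relevant)

def mrr(all_retrieved, all_relevant):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run these after every chunking or embedding change: a regression here is far cheaper to catch than the downstream answer-quality degradation it would cause.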
Answer quality evaluation uses LLM-as-judge patterns where a separate LLM scores the generated answer on multiple dimensions. The RAGAS framework provides four core metrics: faithfulness (is the answer supported by the retrieved context, not hallucinated), answer relevancy (does the answer address the user's question), context precision (are the retrieved chunks relevant to the question), and context recall (do the retrieved chunks contain the information needed to answer). Run these evaluations on every test set query and track the metrics over time. A faithfulness score below 0.85 indicates a hallucination problem that needs prompt engineering or context management fixes. An answer relevancy score below 0.80 suggests the generation step is not focusing on the right information.
Production evaluation extends beyond test sets. Implement a feedback loop that captures: user thumbs up or down ratings on answers, query logs with retrieval results for offline analysis, LLM-as-judge scoring on a random sample of 5-10% of production queries, and drift detection that alerts when answer quality metrics drop below thresholds. The production evaluation pipeline should run as a background job processing sampled queries, not inline with user requests. Store all evaluation results in a dashboard that the team reviews weekly to identify patterns: specific document types with low retrieval quality, query categories with high hallucination rates, or user segments with lower satisfaction scores.
A production RAG system has five layers. The ingestion layer processes new documents through parsing, chunking, embedding, and indexing on an event-driven or scheduled basis. The retrieval layer handles query transformation, hybrid search, and re-ranking with sub-second latency requirements. The generation layer manages prompt construction, LLM calls, and response streaming. The evaluation layer runs continuous quality monitoring in the background. The serving layer handles API endpoints, authentication, rate limiting, and caching. Each layer should be independently deployable and scalable. The ingestion pipeline runs as async workers processing a document queue. The retrieval and generation layers run as synchronous API services with horizontal scaling. The evaluation layer runs as periodic batch jobs.
Caching in RAG is more nuanced than traditional API caching because exact query matches are rare. Implement two caching layers. First, semantic caching: embed the incoming query and check if any cached query has cosine similarity above 0.95. If so, return the cached response. This catches paraphrased versions of the same question and typically achieves a 15-25% cache hit rate, reducing LLM costs proportionally. Second, retrieval caching: cache the retrieval results for each query embedding for 15-60 minutes. If the same or very similar query arrives within the cache window, skip the vector search and re-ranking steps. This reduces retrieval latency from 200-500ms to under 5ms for repeated queries.
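The semantic-cache layer can be sketched as a linear scan over cached query embeddings; a production version would use the vector index itself for the lookup and add TTL eviction, but the mechanics are the threshold comparison shown here. The `embed` function is any embedding client.

```python
# Minimal semantic cache: reuse a cached answer when the incoming
# query's embedding is within 0.95 cosine similarity of a cached one.
import math

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    def get(self, query):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return response  # paraphrase hit: skip retrieval + LLM
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```
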
Total end-to-end latency for a RAG query, from user input to complete response, typically ranges from 3-8 seconds: 200-500ms for retrieval and re-ranking, 200-400ms for prompt construction and context assembly, and 2-6 seconds for LLM generation. Without streaming, users stare at a loading spinner for this entire duration. With streaming, the first token appears within 500-1,000ms of the query, and users begin reading while generation continues. Implementing streaming requires server-sent events or WebSocket connections, chunked response handling in the frontend, and careful error handling for mid-stream failures. The perceived latency improvement from streaming is dramatic: user satisfaction scores typically improve 30-40% compared to batch response delivery, even though the total completion time is identical.
Log every component of the RAG pipeline for every query: the raw user query, any query transformations applied, the retrieval results with scores, the re-ranked results with scores, the full prompt sent to the LLM (excluding sensitive user data), the LLM response, token counts for input and output, latency breakdown by pipeline stage, and any user feedback. Use structured logging with trace IDs that link all pipeline stages for a single query. This enables debugging specific failures: when a user reports a bad answer, you can trace from their query through retrieval results to the LLM prompt and identify exactly where the pipeline failed, whether retrieval returned irrelevant chunks, re-ranking failed to prioritize the right document, or the LLM hallucinated despite having correct context.
Set up alerts for: average retrieval score dropping below baseline by more than 10% (indicates index corruption or embedding model issues), LLM response latency exceeding 10 seconds at p95 (indicates model provider issues or context window bloat), user negative feedback rate exceeding 15% over a 24-hour window (indicates systemic quality degradation), and ingestion pipeline failures or backlogs exceeding 1 hour (indicates document processing issues that will lead to stale information). Weekly review of these metrics catches gradual degradation that daily monitoring misses: slow increases in latency as the vector index grows, gradual drops in retrieval relevance as the document corpus evolves, and seasonal patterns in user query types that reveal gaps in the knowledge base.
The phases overlap significantly in practice. While the ingestion pipeline is being built in weeks 1-2, the team should simultaneously be setting up the evaluation framework with initial test cases. Chunking and embedding experiments run in parallel with vector database setup. Re-ranking and query transformation are added iteratively during weeks 4-7 as baseline metrics reveal where the pipeline underperforms. Production hardening, caching, streaming, and monitoring occupy weeks 8-12, but the system should be serving internal beta users from week 6 onward to generate real usage data for optimization. The 90-day timeline assumes a team of 2-3 engineers working full-time. A single engineer can deliver the same system in 4-5 months. A team of 4-5 can compress it to 6-8 weeks by parallelizing more aggressively.
The best chunking strategy depends on your document types. For well-structured documents with clear headings, document-aware chunking that splits on section boundaries produces the highest retrieval quality. For homogeneous text content, fixed-size chunking at 400-600 tokens works well. The parent-child pattern, using small 200-300 token chunks for retrieval and larger 1,000-1,500 token parent chunks for context, consistently outperforms single-layer approaches by 10-20% on answer quality metrics.
Use pgvector if you already have PostgreSQL and your vector count will stay under 5 million. It adds zero infrastructure complexity and costs nothing extra. Use Pinecone if you want zero-ops management and your scale is 1-10 million vectors, at $70-120 per month. For hybrid search requirements, consider Weaviate which provides native BM25 plus vector search. Beyond 50 million vectors, self-hosted solutions become significantly more cost-effective.
Production RAG evaluation combines automated metrics and user feedback. Use the RAGAS framework to measure faithfulness, answer relevancy, context precision, and context recall on a sample of 5-10% of production queries. Track user thumbs-up/down ratings and negative feedback rates. Set alerts for retrieval score drops exceeding 10% and negative feedback rates above 15%. Review metrics weekly to catch gradual degradation.
OpenAI text-embedding-3-small offers the best balance of quality, cost, and ease of integration for most production RAG systems. It ranks within 2-4% of larger models on retrieval benchmarks while costing 85% less at $0.02 per million tokens. For multilingual content, Cohere embed-v3 is the strongest option. Self-host BGE-large or Nomic Embed when data cannot leave your infrastructure or embedding volume exceeds 50 million tokens per day.
A production RAG system takes approximately 90 days with a team of 2-3 engineers working full-time. This covers document ingestion, chunking, embedding, vector database setup, hybrid search, re-ranking, evaluation framework, production deployment with caching and streaming, and observability. A single engineer can deliver the same system in 4-5 months. A team of 4-5 can compress it to 6-8 weeks.
Hybrid search combines vector similarity search with BM25 keyword search using reciprocal rank fusion to merge results. It matters because pure vector search fails on exact-match queries like policy numbers, product codes, and specific technical terms. Hybrid search improves retrieval precision by 12-18% over pure vector search, with the largest gains on queries containing specific identifiers and domain terminology.