RAG · LangChain · Production

Taking RAG Systems to Production

Best practices for deploying RAG systems in production, including chunking strategies, vector stores, and retrieval optimization.

Jorge Luiz Gomes
January 18, 2024
15 min read

Introduction


RAG (Retrieval-Augmented Generation) systems have become essential for building knowledge-powered AI applications. Moving from prototype to production requires addressing challenges that don't appear in demos.


The Production RAG Stack


A production RAG system typically includes the components below; a minimal sketch of how they compose follows the list:


  • **Document Processor** - Handles ingestion and chunking
  • **Vector Store** - Stores and retrieves embeddings
  • **Retriever** - Finds relevant context
  • **Generator** - Produces final responses
  • **Evaluator** - Measures quality
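To make the composition concrete, here is a minimal sketch assuming simple `Retriever` and `Generator` protocols; the names and signatures are illustrative, not any specific framework's API:

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int = 10) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, context: list[str]) -> str: ...

def answer(query: str, retriever: Retriever, generator: Generator) -> str:
    # Fetch relevant context first, then ground the generation in it
    context = retriever.retrieve(query)
    return generator.generate(query, context)
```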

Chunking Strategies

Chunking significantly impacts retrieval quality.

Fixed-Size Chunking

Simple but often breaks semantic boundaries:

```python
def fixed_chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a fixed-size window over the text; the overlap keeps
    # context that would otherwise be cut at chunk boundaries
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks
```


Semantic Chunking

Respects document structure:

```python
import re

MAX_CHUNK = 1000  # maximum characters per chunk

def semantic_chunk(text: str) -> list[str]:
    # Split on natural boundaries (Markdown headings up to level 3)
    sections = re.split(r'\n#{1,3}\s', text)
    # Further split large sections on sentence boundaries
    chunks = []
    for section in sections:
        if len(section) > MAX_CHUNK:
            chunks.extend(split_on_sentences(section))
        else:
            chunks.append(section)
    return chunks
```
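The helper `split_on_sentences` is referenced but not defined above. A minimal sketch, assuming a naive regex-based splitter is acceptable (a production system would more likely use a proper sentence tokenizer):

```python
def split_on_sentences(section: str, max_chunk: int = 1000) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace;
    # relies on the `re` import above
    sentences = re.split(r'(?<=[.!?])\s+', section)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chunk:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```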


Hierarchical Chunking

Store both summaries and details:


  • **Level 1**: Document summary
  • **Level 2**: Section summaries
  • **Level 3**: Detailed paragraphs
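One way to represent this is to store each chunk with its level and a pointer to its parent, so a hit on a summary can be expanded into its details at query time. A minimal sketch (the `Chunk` dataclass and `drill_down` helper are illustrative names, not a specific library's API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    level: int                    # 1 = doc summary, 2 = section summary, 3 = paragraph
    parent_id: str | None = None  # link to the enclosing summary chunk

def drill_down(hit: Chunk, children_by_parent: dict[str, list[Chunk]]) -> list[Chunk]:
    # If retrieval matched a summary, hand the generator its detailed
    # children; otherwise just return the chunk itself
    return children_by_parent.get(hit.id, [hit])
```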

Vector Store Optimization

Index Selection


  • **HNSW** - Best for most use cases (fast, accurate)
  • **IVF** - Good for very large datasets (>10M vectors)
  • **Flat** - Perfect accuracy, slow for large datasets
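As a sketch of what this choice looks like in practice, here is how the three index types are constructed in FAISS (assuming FAISS as the vector library; the dimension and parameter values are illustrative):

```python
import faiss
import numpy as np

d = 768                                           # embedding dimension (model-dependent)
xb = np.random.rand(10000, d).astype('float32')   # stand-in for real embeddings

# Flat: exact search, O(n) per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW: graph-based approximate search, strong speed/recall trade-off
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = neighbors per graph node
hnsw.add(xb)

# IVF: clusters vectors and searches only the nearest cells; needs training
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 = number of clusters
ivf.train(xb)
ivf.add(xb)
```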

Hybrid Search

Combine vector similarity with keyword matching:

```python
def hybrid_search(query: str, alpha: float = 0.7) -> list:
    # Retrieve candidates from both the dense and the sparse index
    vector_results = vector_store.similarity_search(query, k=20)
    keyword_results = keyword_index.search(query, k=20)

    # Reciprocal Rank Fusion: merge the two ranked lists,
    # weighting the vector results by alpha
    combined = reciprocal_rank_fusion(
        vector_results,
        keyword_results,
        alpha=alpha,
    )
    return combined[:10]
```
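The snippet assumes `vector_store`, `keyword_index`, and `reciprocal_rank_fusion` exist elsewhere. For reference, a minimal weighted fusion could look like this (standard RRF has no `alpha`; the weighting here is one way to honor the signature above, and k=60 is the commonly used constant):

```python
def reciprocal_rank_fusion(vector_results: list[str],
                           keyword_results: list[str],
                           alpha: float = 0.7,
                           k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) per list it appears in;
    # alpha weights the dense list against the sparse one
    scores: dict[str, float] = {}
    for rank, doc in enumerate(vector_results):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(keyword_results):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```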


Retrieval Optimization

Query Expansion

Improve recall by expanding queries:

```python
def expand_query(query: str) -> list[str]:
    # Generate variations; `llm` is a stand-in client assumed to
    # return a list of alternative phrasings
    expansions = llm.generate(f"Generate 3 alternative phrasings for: {query}")
    return [query] + expansions
```


Re-ranking

Use a cross-encoder for better precision:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list[str]) -> list[str]:
    # Score each (query, document) pair jointly, then sort by score
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
```


Evaluation Metrics

Retrieval Metrics


  • **Recall@K** - Are relevant documents in top K?
  • **MRR** - How high is the first relevant result?
  • **NDCG** - Quality of ranking
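The first two are straightforward to compute once you have relevance judgments; a minimal sketch, assuming the relevant documents are known as a set of IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of relevant documents that appear in the top K results
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant result (0 if none found)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0
```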

Generation Metrics


  • **Faithfulness** - Does the answer match the context?
  • **Relevance** - Does it answer the question?
  • **Groundedness** - Is it supported by evidence?
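Generation metrics usually require an LLM judge or human labels. As a rough sketch of a faithfulness check (the `llm` client and prompt are illustrative; frameworks such as RAGAS implement these metrics more rigorously):

```python
FAITHFULNESS_PROMPT = """Given the context and answer below, check each claim
in the answer against the context.
Context: {context}
Answer: {answer}
Reply with only a score from 0 (unsupported) to 1 (fully supported)."""

def faithfulness(context: str, answer: str) -> float:
    # Ask a judge model to grade how well the answer sticks to the context
    reply = llm.generate(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return float(reply.strip())
```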

Production Considerations


  • **Caching** - Cache embeddings and common queries (sketched after this list)
  • **Monitoring** - Track latency, quality, and costs
  • **Fallbacks** - Handle retrieval failures gracefully
  • **Versioning** - Version your embeddings and chunks
  • **Updates** - Plan for document updates and deletions
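For the caching point, a minimal in-process embedding cache might look like this (`embed_model` is a placeholder for your embedding client):

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    # Key on a content hash so identical chunks are embedded only once
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_model.embed(text)
    return _embedding_cache[key]
```

A real deployment would typically back this with Redis or a database so the cache survives restarts and is shared across workers.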

Conclusion

Production RAG requires attention to chunking, retrieval optimization, and continuous evaluation. Start with simple approaches and iterate based on real user queries and feedback.

