Taking RAG Systems to Production
Best practices for deploying RAG systems in production, including chunking strategies, vector stores, and retrieval optimization.
Introduction
RAG (Retrieval-Augmented Generation) systems have become essential for building knowledge-powered AI applications. Moving from prototype to production requires addressing challenges that don't appear in demos.
The Production RAG Stack
A production RAG system typically includes an ingestion and chunking pipeline, an embedding model, a vector store with hybrid search, a retrieval and re-ranking layer, and continuous evaluation and monitoring. The sections below walk through each piece.
Chunking Strategies
Chunking significantly impacts retrieval quality.
Fixed-Size Chunking
Simple but often breaks semantic boundaries:
def fixed_chunk(text: str, size: int = 1000, overlap: int = 200):
    """Split text into fixed-size chunks, overlapping neighbors by `overlap`."""
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks
Semantic Chunking
Respects document structure:
import re

MAX_CHUNK = 2000  # characters; tune to your embedding model's context

def semantic_chunk(text: str):
    # Split on Markdown headings (levels 1-3) at the start of a line
    sections = re.split(r'\n#{1,3}\s', text)
    # Further split large sections on sentence boundaries
    # (split_on_sentences is a sentence-level splitter, e.g. built on nltk)
    chunks = []
    for section in sections:
        if len(section) > MAX_CHUNK:
            chunks.extend(split_on_sentences(section))
        else:
            chunks.append(section)
    return chunks
Hierarchical Chunking
Store both summaries and details:
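One common way to do this is a parent-child scheme: index small child chunks for precise retrieval, but map each hit back to its larger parent document so the generator gets full context. A minimal sketch (the 200-character child size is illustrative):

```python
def build_parent_child_index(documents: list[str], child_size: int = 200):
    """Index small child chunks for precise retrieval, mapping each
    child back to its full parent document for generation context."""
    child_to_parent = {}
    children = []
    for parent_id, doc in enumerate(documents):
        for start in range(0, len(doc), child_size):
            child_to_parent[len(children)] = parent_id
            children.append(doc[start:start + child_size])
    return children, child_to_parent
```

At query time you embed and search the `children`, then use `child_to_parent` to fetch the full documents (or their summaries) to pass to the LLM.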
Vector Store Optimization
Index Selection
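As common guidance: exact (flat) search is usually fast enough up to roughly a million vectors; graph indexes like HNSW trade memory for low-latency, high-recall approximate search; and IVF with product quantization suits memory-constrained, very large corpora. Whatever you choose, keep an exact-search baseline to measure ANN recall against. A minimal NumPy sketch of that baseline:

```python
import numpy as np

def exact_search(index_vecs: np.ndarray, query: np.ndarray, k: int = 5):
    """Exact (flat) inner-product search -- the reference that every
    approximate index (HNSW, IVF, PQ) is benchmarked against."""
    scores = index_vecs @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```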
Hybrid Search
Combine vector similarity with keyword matching:
def hybrid_search(query: str, alpha: float = 0.7):
    # Retrieve candidates from both indexes
    vector_results = vector_store.similarity_search(query, k=20)
    keyword_results = keyword_index.search(query, k=20)
    # Merge with weighted Reciprocal Rank Fusion; alpha favors vector hits
    combined = reciprocal_rank_fusion(
        vector_results,
        keyword_results,
        alpha=alpha,
    )
    return combined[:10]
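Classic RRF is unweighted; the `alpha` parameter above implies a weighted variant. One way to implement it, assuming each result list is an ordered list of document IDs (the constant `k=60` is the conventional RRF smoothing term):

```python
def reciprocal_rank_fusion(vector_results, keyword_results,
                           alpha: float = 0.7, k: int = 60):
    """Weighted RRF: each list contributes 1 / (k + rank), scaled by
    alpha for the vector list and (1 - alpha) for the keyword list."""
    scores = {}
    for weight, results in ((alpha, vector_results),
                            (1 - alpha, keyword_results)):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing in both lists accumulate score from each, which is what pushes genuinely relevant results to the top.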
Retrieval Optimization
Query Expansion
Improve recall by expanding queries:
def expand_query(query: str) -> list[str]:
    # Ask the LLM for paraphrases, one per line (`llm` is your LLM client)
    response = llm.generate(f"Generate 3 alternative phrasings for: {query}")
    expansions = [line.strip() for line in response.splitlines() if line.strip()]
    return [query] + expansions
Re-ranking
Use a cross-encoder for better precision:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list[str]) -> list[str]:
    # Score each (query, document) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
Evaluation Metrics
Retrieval Metrics
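Two standard retrieval metrics are recall@k (did the relevant documents make it into the top k?) and mean reciprocal rank (how high does the first relevant document land?). Both are straightforward to compute from a labeled query set:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```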
Generation Metrics
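For the generated answer, the key question is groundedness: is the answer supported by the retrieved context? A crude lexical proxy is token overlap between answer and context; in practice, an LLM-as-judge gives a much more reliable signal, but the heuristic below is cheap enough to run on every request as a hallucination tripwire:

```python
def groundedness(answer: str, context: str) -> float:
    """Crude lexical proxy: fraction of answer tokens that also appear
    in the retrieved context. Low scores flag possible hallucination."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```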
Production Considerations
Beyond retrieval quality, production systems typically need to manage latency, embedding and LLM cost, index freshness as documents change, and observability into what was retrieved for each answer.
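One simple, high-leverage optimization is caching embeddings by content hash, so re-ingesting unchanged documents (or re-embedding repeated queries) skips the embedding call entirely. A minimal sketch, assuming `embed_fn` is whatever embedding function you already use:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a content hash, so identical text is
    only ever embedded once."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}

    def embed(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```

In production you would back `_cache` with Redis or a database rather than an in-process dict, but the hashing scheme is the same.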
Conclusion
Production RAG requires attention to chunking, retrieval optimization, and continuous evaluation. Start with simple approaches and iterate based on real user queries and feedback.