RAG · LangChain · Production

Taking RAG Systems to Production

Best practices for deploying RAG systems in production, including chunking strategies, vector stores, and retrieval optimization.

Jorge Luiz Gomes
January 18, 2024
15 min read

Introduction


RAG (Retrieval-Augmented Generation) systems have become essential for building knowledge-powered AI applications. Moving from prototype to production requires addressing challenges that don't appear in demos.


The Production RAG Stack


A production RAG system typically includes the components below; a minimal sketch of how they compose follows the list:


  • **Document Processor** - Handles ingestion and chunking
  • **Vector Store** - Stores and retrieves embeddings
  • **Retriever** - Finds relevant context
  • **Generator** - Produces final responses
  • **Evaluator** - Measures quality
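To make the composition concrete, here is a minimal sketch assuming simple `Retriever` and `Generator` protocols; the names and signatures are illustrative, not any specific framework's API:

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int = 10) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, context: list[str]) -> str: ...

def answer(query: str, retriever: Retriever, generator: Generator) -> str:
    # Fetch relevant context first, then ground the generation in it
    context = retriever.retrieve(query)
    return generator.generate(query, context)
```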

Chunking Strategies

Chunking significantly impacts retrieval quality.

Fixed-Size Chunking

Simple but often breaks semantic boundaries:

```python
def fixed_chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Slide a fixed-size window over the text; the overlap keeps
    # context that would otherwise be cut at chunk boundaries
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks
```


Semantic Chunking

Respects document structure:

```python
import re

MAX_CHUNK = 1000  # maximum characters per chunk

def semantic_chunk(text: str) -> list[str]:
    # Split on natural boundaries (Markdown headings up to level 3)
    sections = re.split(r'\n#{1,3}\s', text)
    # Further split large sections on sentence boundaries
    chunks = []
    for section in sections:
        if len(section) > MAX_CHUNK:
            chunks.extend(split_on_sentences(section))
        else:
            chunks.append(section)
    return chunks
```
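The helper `split_on_sentences` is referenced but not defined above. A minimal sketch, assuming a naive regex-based splitter is acceptable (a production system would more likely use a proper sentence tokenizer):

```python
def split_on_sentences(section: str, max_chunk: int = 1000) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace;
    # relies on the `re` import above
    sentences = re.split(r'(?<=[.!?])\s+', section)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chunk:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```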


Hierarchical Chunking

Store both summaries and details:


  • **Level 1**: Document summary
  • **Level 2**: Section summaries
  • **Level 3**: Detailed paragraphs
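One way to represent this is to store each chunk with its level and a pointer to its parent, so a hit on a summary can be expanded into its details at query time. A minimal sketch (the `Chunk` dataclass and `drill_down` helper are illustrative names, not a specific library's API):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    level: int                    # 1 = doc summary, 2 = section summary, 3 = paragraph
    parent_id: str | None = None  # link to the enclosing summary chunk

def drill_down(hit: Chunk, children_by_parent: dict[str, list[Chunk]]) -> list[Chunk]:
    # If retrieval matched a summary, hand the generator its detailed
    # children; otherwise just return the chunk itself
    return children_by_parent.get(hit.id, [hit])
```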

Vector Store Optimization

Index Selection


  • **HNSW** - Best for most use cases (fast, accurate)
  • **IVF** - Good for very large datasets (>10M vectors)
  • **Flat** - Perfect accuracy, slow for large datasets
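As a sketch of what this choice looks like in practice, here is how the three index types are constructed in FAISS (assuming FAISS as the vector library; the dimension and parameter values are illustrative):

```python
import faiss
import numpy as np

d = 768                                           # embedding dimension (model-dependent)
xb = np.random.rand(10000, d).astype('float32')   # stand-in for real embeddings

# Flat: exact search, O(n) per query
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW: graph-based approximate search, strong speed/recall trade-off
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = neighbors per graph node
hnsw.add(xb)

# IVF: clusters vectors and searches only the nearest cells; needs training
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 = number of clusters
ivf.train(xb)
ivf.add(xb)
```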

Hybrid Search

Combine vector similarity with keyword matching:

```python
def hybrid_search(query: str, alpha: float = 0.7) -> list:
    # Retrieve candidates from both the dense and the sparse index
    vector_results = vector_store.similarity_search(query, k=20)
    keyword_results = keyword_index.search(query, k=20)

    # Reciprocal Rank Fusion: merge the two ranked lists,
    # weighting the vector results by alpha
    combined = reciprocal_rank_fusion(
        vector_results,
        keyword_results,
        alpha=alpha,
    )
    return combined[:10]
```
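The snippet assumes `vector_store`, `keyword_index`, and `reciprocal_rank_fusion` exist elsewhere. For reference, a minimal weighted fusion could look like this (standard RRF has no `alpha`; the weighting here is one way to honor the signature above, and k=60 is the commonly used constant):

```python
def reciprocal_rank_fusion(vector_results: list[str],
                           keyword_results: list[str],
                           alpha: float = 0.7,
                           k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) per list it appears in;
    # alpha weights the dense list against the sparse one
    scores: dict[str, float] = {}
    for rank, doc in enumerate(vector_results):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(keyword_results):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```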


Retrieval Optimization

Query Expansion

Improve recall by expanding queries:

```python
def expand_query(query: str) -> list[str]:
    # Generate variations; `llm` is a stand-in client assumed to
    # return a list of alternative phrasings
    expansions = llm.generate(f"Generate 3 alternative phrasings for: {query}")
    return [query] + expansions
```


Re-ranking

Use a cross-encoder for better precision:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list[str]) -> list[str]:
    # Score each (query, document) pair jointly, then sort by score
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
```


Evaluation Metrics

Retrieval Metrics


  • **Recall@K** - Are relevant documents in top K?
  • **MRR** - How high is the first relevant result?
  • **NDCG** - Quality of ranking
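The first two are straightforward to compute once you have relevance judgments; a minimal sketch, assuming the relevant documents are known as a set of IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of relevant documents that appear in the top K results
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant result (0 if none found)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0
```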

Generation Metrics


  • **Faithfulness** - Does the answer match the context?
  • **Relevance** - Does it answer the question?
  • **Groundedness** - Is it supported by evidence?
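Generation metrics usually require an LLM judge or human labels. As a rough sketch of a faithfulness check (the `llm` client and prompt are illustrative; frameworks such as RAGAS implement these metrics more rigorously):

```python
FAITHFULNESS_PROMPT = """Given the context and answer below, check each claim
in the answer against the context.
Context: {context}
Answer: {answer}
Reply with only a score from 0 (unsupported) to 1 (fully supported)."""

def faithfulness(context: str, answer: str) -> float:
    # Ask a judge model to grade how well the answer sticks to the context
    reply = llm.generate(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return float(reply.strip())
```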

Production Considerations


  • **Caching** - Cache embeddings and common queries (sketched after this list)
  • **Monitoring** - Track latency, quality, and costs
  • **Fallbacks** - Handle retrieval failures gracefully
  • **Versioning** - Version your embeddings and chunks
  • **Updates** - Plan for document updates and deletions
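For the caching point, a minimal in-process embedding cache might look like this (`embed_model` is a placeholder for your embedding client):

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    # Key on a content hash so identical chunks are embedded only once
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_model.embed(text)
    return _embedding_cache[key]
```

A real deployment would typically back this with Redis or a database so the cache survives restarts and is shared across workers.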

Conclusion

Production RAG requires attention to chunking, retrieval optimization, and continuous evaluation. Start with simple approaches and iterate based on real user queries and feedback.

