RAG Systems in Production: Chunking, Retrieval, and Reranking (2025)
Retrieval-Augmented Generation (RAG) is the backbone of most practical LLM systems. This guide is a deep, practitioner-focused walkthrough for building production-grade RAG: from chunking and metadata strategies to hybrid retrieval, reranking, evaluation, observability, and cost control.
Executive Summary
- Chunking is a product decision, not just an indexer parameter; optimize for question types and grounding.
- Use hybrid retrieval (BM25 + dense) with domain-aware query rewriting and filters; rerank aggressively for top-K.
- Evaluate continuously with golden sets and real traffic; gate deployments by win-rate and hallucination scores.
- Log every hop (rewrite → retrieve → rerank → assemble → generate) with cost/latency attribution and cache hits.
- Secure your pipeline: sanitize inputs, guard against prompt injection via vector stores, and sign your content.
Architecture Overview
graph LR
A[User Query] --> B[Query Rewrite/Expand]
B --> C[Hybrid Retrieval]
C --> D[Reranker]
D --> E[Context Assembler]
E --> F[LLM Generator]
F --> G[Response + Citations]
G --> H[Feedback + Telemetry]
- Query Rewrite: spelling fixes, acronym expansion, synonyms, intent routing.
- Hybrid Retrieval: BM25 for lexical match + vector for semantic similarity with field boosts and filters.
- Reranker: a cross-encoder that scores candidate passages; it often improves groundedness significantly.
- Context Assembler: dedupe, enforce diversity, compress and structure into cards with citations.
Chunking Strategies
Principles
- Optimize chunks for answerability: include titles, headings, and stable anchors.
- Prefer 300–800 token windows with 10–15% overlap for long prose; smaller windows for FAQs/code.
- Tag chunks with hierarchical metadata: doc_id, section, headings, author, version, published_at.
class Chunker:
    def chunk(self, html: str) -> list[dict]:
        # Split on headings, then slide ~600-token windows with 80-token overlap
        blocks = self.split_by_headings(html)
        return [self.enrich(b) for b in self.window(blocks, max_tokens=600, overlap=80)]
Specialized Chunking
- Code: function-level with import graph context, keep signatures and docstrings.
- Tables: extract as key-value and normalized JSON for structured lookup.
- PDFs: detect columns, figures, captions; attach OCR confidence.
Hybrid Retrieval
class HybridRetriever:
    def retrieve(self, query: str, k: int = 50) -> list[Candidate]:
        rewritten = self.rewrite(query)
        lexical = self.bm25.search(rewritten, k=200)               # lexical candidates (BM25)
        dense = self.vector.search(self.embed(rewritten), k=200)   # semantic candidates
        fused = self.reciprocal_rank_fusion(lexical, dense, top=k)
        return fused
- Query rewrite: spellcheck, acronym expansion, synonyms, noun-phrase extraction.
- Filters: product, version, language, recency windows; exact match boosts for IDs.
- Fusion: Reciprocal Rank Fusion (RRF) or weighted linear fusion; learn weights from feedback (a minimal RRF sketch follows).
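For reference, a minimal RRF sketch in TypeScript; the Candidate shape and the k = 60 damping constant are conventional assumptions, not a specific library's API:

// Minimal Reciprocal Rank Fusion sketch (assumed shapes, not a library API).
type Candidate = { id: string; score: number };

export function reciprocalRankFusion(
  lists: Candidate[][],
  top = 50,
  k = 60 // common RRF damping constant
): Candidate[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    list.forEach((c, rank) => {
      // Each list contributes 1 / (k + rank + 1) for the item's position
      fused.set(c.id, (fused.get(c.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, top);
}

Because RRF only uses ranks, not raw scores, it sidesteps calibrating BM25 scores against cosine similarities.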
Reranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.text) for doc in candidates])
# Sort by score only; passages themselves are not comparable when scores tie
ranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]
- Cross-encoders are compute-heavy; rerank top-50 to top-5/10.
- Calibrate thresholds to drop low-confidence passages (sketched after this list).
- Cache reranker results by (query_hash, doc_id) for common queries.
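A minimal thresholding sketch; the 0.35 cutoff is purely illustrative and should be calibrated on your golden set:

// Drop low-confidence passages after reranking; calibrate minScore offline.
export function filterByThreshold<T extends { score: number }>(
  ranked: T[],
  minScore = 0.35 // illustrative default; tune against golden-set evals
): T[] {
  return ranked.filter(c => c.score >= minScore);
}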
Context Assembly
type Card = { title: string; snippet: string; url: string; citationId: string; tokens: number };
function assemble(cards: Card[], maxTokens: number): Card[] {
const seen = new Set<string>();
const out: Card[] = [];
let budget = maxTokens;
for (const c of cards) {
const key = c.citationId;
if (seen.has(key)) continue;
if (c.tokens > budget) continue;
out.push(c); seen.add(key); budget -= c.tokens;
}
return out;
}
- Diversity: prefer 1 card per source initially; allow follow-ups to drill deeper.
- Compression: sentence-level extract with query-aware summarization; preserve citations.
Evaluation Framework
metrics:
- groundedness_hallucination_rate
- exact_match / F1 (for QA)
- coverage@k (did we retrieve the gold chunk?)
- click-through on citations
- response_time_ms and cost_usd
- Golden sets: hand-curated questions with authoritative answers and gold chunks.
- Shadow deploy: compare old vs new pipeline A/B, require win-rate > X% to promote.
- Red-team: jailbreak attempts, prompt-injection canaries in content.
Observability
interface TraceSpan {
name: string; start: number; end: number; attrs: Record<string, any>;
}
- Trace spans per stage: rewrite, retrieve, rerank, assemble, generate.
- Attach tokens, costs, cache hits, and candidate IDs to each span.
- Store minimal snippets; avoid PII; sample generously for error cases.
Security
- Sanitize inputs; strip HTML/JS; block suspicious patterns.
- Validate outbound tool calls; whitelist hosts; set timeouts.
- Sign indexed content; store hash; verify at retrieval to prevent content poisoning.
Cost Controls
- Cache embeddings and reranks; batch embeds; dedupe near-duplicates.
- Use smaller models for rewrite and rerank; reserve large models for generation only when needed.
- Token budgeting in assembler; favor citations over verbose prose.
Troubleshooting
- Low relevance: improve rewrite, adjust field boosts, add synonyms.
- Hallucinations: tighten thresholds, increase citations, refuse on low confidence.
- High latency: lower k, cache more, parallelize, precompute heavy steps.
FAQ
Q: How big should chunks be?
A: 300–800 tokens for prose; smaller for code/FAQs. Optimize on evaluation results.
Q: Should I dedupe similar chunks?
A: Yes—during indexing and retrieval. Use MinHash or cosine thresholding (see the sketch below).
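A minimal cosine-threshold dedupe sketch, assuming chunk embeddings are already computed; the 0.95 threshold is illustrative:

// Drop chunks whose embedding is near-identical to an already-kept chunk.
type Chunk = { id: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export function dedupe(chunks: Chunk[], threshold = 0.95): Chunk[] {
  const kept: Chunk[] = [];
  for (const c of chunks) {
    // O(n^2) pairwise check; bucket with MinHash/LSH at ingest scale
    if (!kept.some(k => cosine(k.embedding, c.embedding) >= threshold)) kept.push(c);
  }
  return kept;
}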
Related posts
- AI Agents Architecture: /blog/ai-agents-architecture-autonomous-systems-2025
- LLM Fine-Tuning (LoRA/QLoRA): /blog/llm-fine-tuning-complete-guide-lora-qlora-2025
- Vector Databases Comparison: /blog/vector-databases-comparison-pinecone-weaviate-qdrant
- LLM Security: /blog/llm-security-prompt-injection-jailbreaking-prevention
- LLM Observability: /blog/llm-observability-monitoring-langsmith-helicone-2025
Call to action
Need help productionizing RAG at scale? Get a free architecture review.
Contact: /contact • Newsletter: /newsletter
Production Cookbook (End-to-End Recipes)
Recipe 1 — SaaS Knowledge Base RAG (Multi-tenant, EU/US Residency)
- Requirements: tenant isolation, EU/US residency, low latency, cost caps
- Stack: Next.js App Router (server actions), LangChain, Qdrant (EU/US clusters), Redis cache, Helicone proxy, LangSmith evals
graph TB
subgraph EU
A[Next.js EU] --> B[Redis EU]
A --> C[Qdrant EU]
end
subgraph US
D[Next.js US] --> E[Redis US]
D --> F[Qdrant US]
end
A & D --> G[Helicone]
G --> H[LLM Provider]
A & D --> I[LangSmith]
// app/api/rag/route.ts (server-only)
import { kv } from "@vercel/kv"; // or ioredis
import { qdrantEU, qdrantUS } from "@/lib/qdrant";
import { rewriteQuery, retrieve, rerank, assemble, generate } from "@/lib/rag";
import { withTenant } from "@/lib/tenant";
import { withBudgetGuard } from "@/lib/cost";
import { trace } from "@/lib/otel";
export const POST = withTenant(withBudgetGuard(async (req) => {
return trace("rag.pipeline", async (span) => {
const { query, tenantId, region } = await req.json();
const cacheKey = `rag:${tenantId}:${region}:${hash(query)}`; // hash(): stable string digest helper (assumed)
const cached = await kv.get(cacheKey);
if (cached) return Response.json(cached);
const qdrant = region === "eu" ? qdrantEU : qdrantUS;
const q1 = await rewriteQuery(query, { tenantId });
const candidates = await retrieve(qdrant, q1, { tenantId, topK: 128 });
const ranked = await rerank(q1, candidates, { topK: 8 });
const context = await assemble(q1, ranked, { tokenBudget: 2000 });
const answer = await generate({ query: q1, context, tenantId });
const result = { answer, context, citations: context.map(c => c.url) };
await kv.set(cacheKey, result, { ex: 300 });
return Response.json(result);
});
}));
Recipe 2 — Developer Docs RAG with OpenAPI-calling Tool
- Expose a typed tool that calls internal OpenAPI endpoints when context indicates "how-to" queries
- Guard with allowlist, signature verification, and rate limits
type Tool = { name: string; params: any; run: (p: any) => Promise<any> };
export const getUserTool: Tool = {
name: "get_user",
params: { type: "object", properties: { id: { type: "string" } }, required: ["id"] },
run: async ({ id }) => fetch(`/internal/api/users/${id}`, { headers: signedHeaders() }).then(r => r.json())
};
export async function agent(query: string, ctx: any) {
  // llm.plan is pseudocode for an agent planner that yields tool-call steps
  const plan = await llm.plan(query, { tools: [getUserTool] });
for await (const step of plan) {
if (step.type === "tool" && step.name === "get_user") {
const resp = await getUserTool.run(step.params);
plan.observe(resp);
}
}
return plan.final();
}
Recipe 3 — Domain Router (Finance vs Support vs Legal)
const routes = [
{ name: "finance", match: [/invoice|receipt|tax|vat/i], kb: "kb_fin" },
{ name: "support", match: [/error|bug|troubleshoot|reset/i], kb: "kb_supp" },
{ name: "legal", match: [/terms|privacy|dpa|dpo/i], kb: "kb_leg" },
];
export function route(query: string) {
const r = routes.find(r => r.match.some(rx => rx.test(query)))?.kb ?? "kb_general";
return r;
}
Language Implementations (Python, Node, Go)
Python (FastAPI + Qdrant + Redis + Tenancy)
import json

from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from redis.asyncio import Redis

app = FastAPI()
redis = Redis.from_url("redis://...")
qdrant = QdrantClient(url="http://qdrant:6333")

class RAGRequest(BaseModel):
    query: str
    tenant_id: str

@app.post("/rag")
async def rag(req: RAGRequest):
    # Note: Python's hash() is per-process; use a stable digest (e.g. sha256) in production
    cache_key = f"rag:{req.tenant_id}:{hash(req.query)}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)
    q1 = rewrite(req.query)
    vec = await embed(q1)
    res = qdrant.search(collection_name=f"kb_{req.tenant_id}", query_vector=vec, limit=128)
    ranked = rerank(q1, res)
    context = assemble(q1, ranked)
    answer = await generate(q1, context)
    result = {"answer": answer, "context": context}
    await redis.set(cache_key, json.dumps(result), ex=300)
    return result
Node (Express + Pinecone + Helicone)
import express from "express";
import { Pinecone } from "@pinecone-database/pinecone";
import fetch from "node-fetch";
const app = express();
app.use(express.json());
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.Index("kb");
app.post("/rag", async (req, res) => {
const { query, tenantId } = req.body;
const vec = await embed(query);
const result = await index.query({ topK: 100, vector: vec, filter: { tenantId } });
const ranked = await rerank(query, result.matches);
const context = assemble(query, ranked);
const answer = await fetch(process.env.HELICONE_PROXY!, {
method: "POST",
headers: { "Content-Type": "application/json", "Helicone-Auth": process.env.HELICONE_KEY! },
body: JSON.stringify({ messages: makeMessages(query, context) })
}).then(r => r.json());
res.json({ answer: answer.choices?.[0]?.message?.content, context });
});
Go (Fiber + Weaviate)
package main
import (
	"log"

	"github.com/gofiber/fiber/v2"
	wv "github.com/weaviate/weaviate-go-client/v4/weaviate"
)
func main() {
app := fiber.New()
client := wv.NewClient(wv.Config{Scheme: "http", Host: "weaviate:8080"})
app.Post("/rag", func(c *fiber.Ctx) error {
var req struct{ Query string; Tenant string }
if err := c.BodyParser(&req); err != nil {
	return fiber.ErrBadRequest
}
// search and compose (pseudo)
return c.JSON(fiber.Map{"answer": "...", "tenant": req.Tenant})
})
log.Fatal(app.Listen(":8080"))
}
Retrieval Indexing and Ingestion
Ingestion Pipeline (Docs, PDFs, HTML, Structured)
steps:
- fetch: { kind: http, urls: ["https://docs.example.com/"] }
- extract: { kind: readability }
- chunk: { kind: headings, max_tokens: 600, overlap: 80 }
- enrich:
- title
- anchor
- section
- author
- published_at
- embed: { model: text-embedding-3-large, batch: 64 }
- upsert: { store: qdrant, collection: kb_tenant }
Content Hashing & Signing (Poisoning Defense)
import crypto from "crypto";
export function contentHash(s: string) {
return crypto.createHash("sha256").update(s).digest("hex");
}
export function sign(hash: string) {
return crypto.createHmac("sha256", process.env.SIGNING_KEY!).update(hash).digest("hex");
}
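And the retrieval-time check, reusing contentHash and sign (and the crypto import) from the snippet above; the stored signature is assumed to come from the chunk payload:

// Recompute and compare the HMAC before a chunk is allowed into context.
export function verifyChunk(text: string, storedSignature: string): boolean {
  const expected = sign(contentHash(text));
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(storedSignature, "hex");
  // timingSafeEqual avoids leaking signature prefixes via comparison timing
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}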
Reranking Strategies (Trade-offs and Models)
- MiniLM cross-encoders for speed; bge-reranker-large for quality
- Pairwise rerank vs pointwise scoring; calibration thresholds
- Cache on (queryHash, passageId, model) with TTL and LRU (caching sketch after the snippet below)
export async function crossEncode(q: string, docs: string[]) {
// call reranker service or HF Inference endpoints
return docs.map((_, i) => 1 - i / docs.length); // placeholder
}
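A sketch of the caching bullet above; the in-memory Map stands in for a real LRU with eviction, and the TTL is an assumption:

// Cache rerank scores on (queryHash, passageId, model) with TTL; Map as a stand-in for LRU.
const rerankCache = new Map<string, { score: number; expires: number }>();

export async function cachedScore(
  queryHash: string, passageId: string, model: string,
  compute: () => Promise<number>, ttlMs = 10 * 60 * 1000
): Promise<number> {
  const key = `${queryHash}:${passageId}:${model}`;
  const hit = rerankCache.get(key);
  if (hit && hit.expires > Date.now()) return hit.score;
  const score = await compute(); // e.g. one pair from crossEncode above
  rerankCache.set(key, { score, expires: Date.now() + ttlMs });
  return score;
}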
Context Assembly Strategies (Cards, Tables, and Code)
- Structured cards: title, snippet, URL, important fields
- Code-aware assembly: preserve code blocks; limit formatting churn
- Table-aware assembly: render as CSV/Markdown for clarity
export function compressToTokens(text: string, budget: number) {
  // Heuristic trimming by sentences, keeping citations attached to their sentence
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough 4-chars-per-token heuristic
  const sents = text.split(/(?<=[.!?])\s+/); // split after sentence punctuation, keeping it attached
  const out: string[] = [];
  let tokens = 0;
  for (const s of sents) {
    const t = approxTokens(s);
    if (tokens + t > budget) break;
    out.push(s);
    tokens += t;
  }
  return out.join(" ");
}
Evaluation at Scale (Offline + Online)
Golden Sets (Construction and Maintenance)
suites:
- name: faq_critical
items:
- id: faq-001
query: "How do I reset my SSO password?"
expected:
contains: ["Click 'Forgot password'", "SSO provider", "email"]
citations_required: true
- id: faq-002
query: "What is our DPA address for EU tenants?"
expected:
contains: ["Data Processing Addendum", "EU"],
citations_required: true
Online Evals (Shadow, A/B, Bandit)
type Arm = "baseline" | "candidate";
export function assignArm(userId: string): Arm {
return hash(userId) % 100 < 10 ? "candidate" : "baseline"; // 10% canary
}
Observability (Trace Spec)
{
"name": "rag.pipeline",
"attributes": {
"tenant.id": "abc",
"region": "eu",
"rewrite.ms": 12,
"retrieve.ms": 43,
"rerank.ms": 80,
"assemble.ms": 14,
"generate.ms": 900,
"cost.usd": 0.0123,
"tokens.in": 1234,
"tokens.out": 456
}
}
Security Policies (Guardrails)
policies:
prompt_injection:
block_patterns:
- "ignore previous instructions"
- "you are now"
- "system:"
tools:
http_request:
allow_hosts: ["api.internal", "docs.example.com"]
deny_ips: ["169.254.169.254"]
timeout_ms: 8000
max_body_kb: 256
content:
outbound_links:
allow_domains: ["example.com", "docs.example.com"]
require_citations: true
Playbooks (Ops & SRE)
Playbook — Latency Spike
- Symptoms: P95 > 3s in generate or rerank spans
- Actions: check cache hit %, reranker queue depth, model route changes; reduce topK; enable short context mode
- Rollback: switch to a smaller model route; disable the reranker temporarily; raise the refusal threshold (a circuit-breaker sketch follows these playbooks)
Playbook — Cost Spike
- Symptoms: cost.usd per request > budget
- Actions: enforce token budget, enable prompt compression, raise cache TTL, downshift model tier
Playbook — Quality Regression
- Symptoms: win-rate drop > 5% vs baseline
- Actions: freeze deploys, run backfill evals, analyze failures by category; revert last change
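Much of the rollback logic above can be automated behind a circuit breaker. A minimal sketch, with assumed failure-count and cooldown parameters:

// Trip after N consecutive failures; route around the primary until the cooldown expires.
export class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;
  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // degraded mode while open
    try {
      const result = await primary();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) this.openUntil = Date.now() + this.cooldownMs;
      return fallback();
    }
  }
}

Wrapping the rerank stage this way lets a trip fall back to fused retrieval order instead of failing the whole request.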
Benchmarks (Latency/Cost Profiles)
| route | model | input_tokens | output_tokens | latency_ms | cost_usd |
|---|---|---|---|---|---|
| small | gpt-4o-mini | 900 | 200 | 700 | 0.0041 |
| medium | gpt-4o | 1200 | 350 | 1200 | 0.0180 |
| large | claude-3-opus | 1400 | 500 | 1800 | 0.0315 |
Extended FAQ (Advanced)
Q: How do we prevent duplicate or near-duplicate chunks?
Use locality-sensitive hashing (MinHash/SimHash) at ingest; drop within-threshold items or downweight at retrieval.
Q: What if retrieval returns correct but low-quality sources?
Boost authoritative sources via per-source weights; penalize low-quality domains; add quality signals to ranking.
Q: How do we keep costs bounded under heavy load?
Hierarchical caches, token budgets, small-model rerankers, dynamic topK, surge control, and circuit breakers on LLM calls.
Q: How to localize RAG?
Segment indices by locale; prefer locale match in filters; translate queries before/after; re-embed localized corpora.
Q: How to prevent secret leakage via retrieval?
Pre-index DLP scans; exclude matches; at generation, scan outputs for secret regexes; redact and log events.
Q: How do we decide between Pinecone/Qdrant/Weaviate/pgvector?
Use managed (Pinecone) for turnkey and SLAs; Qdrant for cost/control; Weaviate for graph-like schemas; pgvector for SQL integration.
Q: Should we add graph edges to chunks?
Yes when relations help navigation (parent/child/see-also); improves diversity and follow-up retrieval.
Q: How big should the rerank set be?
Commonly 50–200 candidates; tune by latency/cost goals and reranker throughput.
Q: How to monitor hallucination rate?
Use rubric scoring with required citation coverage; sample answers and auto-check citation presence/consistency.
Q: Can we do RAG without embeddings?
Yes, lexical-only can work for structured FAQs; hybrid usually wins for broader corpora.
Glossary
- RRF: Reciprocal Rank Fusion — method to combine ranked lists
- Cross-encoder: model scoring (query, passage) pairs jointly
- Context card: structured snippet with source/citation ready for LLM
- TopK: number of items to keep at a stage (retrieve/rerank)
References and Further Reading
- OpenAI Evals and eval theory
- MS MARCO / BEIR benchmarks
- OTEL Semantic Conventions for AI
- Vector DB docs: Qdrant, Pinecone, Weaviate, pgvector
Integration Blueprints (Vendors and Stacks)
Blueprint — Pinecone + LangGraph + Next.js
// langgraph.ts (pseudo)
import { StateGraph } from "langgraph";
const g = new StateGraph()
.addNode("rewrite", rewriteNode)
.addNode("retrieve", retrieveNode)
.addNode("rerank", rerankNode)
.addNode("assemble", assembleNode)
.addNode("generate", generateNode)
.addEdge("rewrite","retrieve")
.addEdge("retrieve","rerank")
.addEdge("rerank","assemble")
.addEdge("assemble","generate");
export default g;
// pinecone.ts
import { Pinecone } from "@pinecone-database/pinecone";
export const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
export const index = pc.Index("kb");
Blueprint — Weaviate (Hybrid) + Cloudflare Workers
// worker.ts
export default {
async fetch(req: Request, env: any) {
const url = new URL(req.url);
if (url.pathname === "/rag") return handleRAG(req, env); // handleRAG: your pipeline entry (see Recipe 1)
return new Response("Not found", { status: 404 });
}
}
Full Config Samples
Qdrant Collections and Payload Indexes
{
"collection_name": "kb_tenant",
"vectors": { "size": 1536, "distance": "Cosine" },
"optimizers_config": { "default_segment_number": 6 },
"hnsw_config": { "ef_construct": 128, "m": 32 },
"quantization_config": { "product": { "compression": 8 } },
"on_disk_payload": true,
"shard_number": 2,
"replication_factor": 2
}
Weaviate Schema (Graph-Like)
{
"class": "Document",
"description": "Knowledge base entries",
"vectorizer": "none",
"properties": [
{ "name": "title", "dataType": ["text"] },
{ "name": "text", "dataType": ["text"] },
{ "name": "url", "dataType": ["text"] },
{ "name": "tenantId", "dataType": ["text"] },
{ "name": "locale", "dataType": ["text"] }
]
}
Security Matrices
| Layer | Risk | Control | Evidence |
|---|---|---|---|
| Input | Injection | Sanitizer + WAF | Regex hits, blocked count |
| Retrieve | Poisoning | Signed content | Hash/sign logs |
| Rerank | Model abuse | Rate limits | Span metrics |
| Assemble | PII leak | Redaction | Redaction logs |
| Generate | Hallucination | Citations required | Eval scores |
Governance SOPs
- Change management: proposal → review → shadow deploy → promote
- Dataset updates: lineage captured; consent; PII handling; audits
- Model changes: model card, eval diff ≥ +X% win‑rate, rollback plan
Localization and Accessibility
- Locale routing; language tags in payload; localized stopwords
- Accessibility: readable citations, keyboard focus for UI, high contrast highlights
Dataset Curation Playbook
- Source allowlist; crawler etiquette; license tracking
- Deduplication strategies (MinHash thresholds)
- Quality labels and reviewer guidelines
labels:
grounded: yes/no
authoritative: yes/no
stale: yes/no
sensitive: pii/secret/none
Comprehensive Testing Suites
Unit Tests (Assembler)
import { assemble } from "@/lib/rag";
test("dedupes by docId", () => {
const ranked = [
{ payload: { docId: "1", title: "A", text: "...", url: "u1" } },
{ payload: { docId: "1", title: "A", text: "...", url: "u1" } },
{ payload: { docId: "2", title: "B", text: "...", url: "u2" } }
];
const cards = assemble("q", ranked, { tokenBudget: 100 });
expect(cards.length).toBe(2);
});
Contract Tests (API)
- name: GET /api/rag returns citations
request: { method: POST, path: /api/rag, body: { query: "how to reset" } }
expect:
status: 200
json: { $.citations: present, $.answer: present }
SLOs and SLIs
- SLO: P95 latency ≤ 1.5s; Error rate ≤ 1%; Win‑rate ≥ baseline + 5%
- SLIs: trace spans per stage; cache hit rate; cost per request; citation coverage
Disaster Recovery
- Multi‑region replicas; snapshot embeddings/payloads; tested restore runbooks
- DNS or edge routing failover; low TTLs; warm caches on recovery
Capacity Planning
- Queries/minute projections; vector insert rates; storage growth; efSearch scaling
- Back‑of‑envelope: memory per vector with metadata; CPU per RPS for reranker (estimation sketch below)
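A back-of-envelope estimator sketch; the float32 size is exact, but the HNSW overhead factor and per-vector metadata bytes are assumptions to tune for your deployment:

// Rough memory estimate for a vector index; the constants are illustrative assumptions.
export function estimateIndexMemoryGB(opts: {
  vectors: number;        // number of stored vectors
  dims: number;           // embedding dimensionality
  metadataBytes?: number; // avg payload bytes per vector (assumed ~512)
  graphOverhead?: number; // HNSW links etc., assumed ~1.5x raw vectors
}): number {
  const { vectors, dims, metadataBytes = 512, graphOverhead = 1.5 } = opts;
  const raw = vectors * dims * 4; // float32 = 4 bytes per dimension
  const total = raw * graphOverhead + vectors * metadataBytes;
  return total / 1024 ** 3;
}

// Example: 10M vectors at 1536 dims ≈ 90 GB before quantization
// estimateIndexMemoryGB({ vectors: 10_000_000, dims: 1536 })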
Cost Calculators (Detailed)
export function costPerRequest({ tokensIn, tokensOut, model }: { tokensIn: number; tokensOut: number; model: "small" | "medium" | "large" }) {
  const price = { small: { in: 1e-6, out: 3e-6 }, medium: { in: 6e-6, out: 12e-6 }, large: { in: 12e-6, out: 24e-6 } };
  return tokensIn * price[model].in + tokensOut * price[model].out;
}
Advanced FAQ (Additional)
Q: What’s an effective cache key?
Hash of normalized query + tenant + locale + version + route (see the sketch after this FAQ).
Q: Should embeddings be encrypted at rest?
Yes; treat as sensitive if they may encode proprietary content.
Q: How to validate citations?
Automated link checkers + content hash verification against stored hash.
Q: How to schedule re‑embedding?
When model upgrades, content updates, or evaluation finds drift; incremental jobs.
Q: Do structured sources need chunking?
Often record‑level works; attach field semantics and consider entity linking.
Q: How to throttle expensive rerankers?
Queue with concurrency limits; fall back to faster reranker when under load.
Q: What if hybrid search returns conflicting results?
Prefer diversity; present options; let user disambiguate; improve rewrite.
Q: How to handle private vs public corpora?
Separate indices; strict auth at retrieval; do not mix payloads.
Q: What metrics detect poisoning?
Sudden topic drift, low quality flags, mismatch between link text and target, signature failures.
Q: How to keep token counts predictable?
Aggressive trimming by sentences; structured cards; strict token budgets per stage.
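Making the cache-key answer above concrete, a minimal sketch; the field order and digest truncation are assumptions:

// Build a stable cache key from the normalized query and routing dimensions.
import { createHash } from "crypto";

export function cacheKey(q: string, tenant: string, locale: string, version: string, route: string): string {
  const norm = q.normalize("NFKC").trim().toLowerCase();
  const digest = createHash("sha256").update(norm).digest("hex").slice(0, 16);
  return `rag:${tenant}:${locale}:${version}:${route}:${digest}`;
}

Keeping version and route in the key means a prompt or pipeline change naturally invalidates stale entries.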
Vendor Playbooks (Operational)
Pinecone Playbook
- Index sizing: start small, scale replicas on P95 > target
- Regions: minimize egress; colocate with app
- Filters: use metadata for tenant/locale, payload-only filtering for speed
runbooks:
scale:
trigger: p95_ms > 30 for 15m
steps:
- pinecone scale replicas +1
- verify health
- run smoke queries
incident-latency:
trigger: p99_ms > 60
steps:
- check routing errors
- reduce topK from 200->120
- enable response cache 5m
- notify oncall
Qdrant Playbook
- HNSW tuning: start M=32, ef=128, raise ef for recall; monitor CPU
- Segmenting: default_segment_number tuned per dataset; compact when fragments grow
# Maintenance window: updating optimizer settings triggers re-optimization (value illustrative)
curl -X PATCH http://qdrant:6333/collections/kb \
  -H 'Content-Type: application/json' \
  -d '{"optimizers_config": {"default_segment_number": 6}}'
Weaviate Playbook
- Modules: disable unused vectorizers; set replication; autoschema off for control
- Graph queries: keep shallow; precompute relationships for frequent paths
Infra-as-Code (IaC) Samples
Terraform (Qdrant + App)
resource "aws_instance" "qdrant" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.large"
user_data = file("cloud-init/qdrant.yaml")
tags = { Name = "qdrant" }
}
resource "aws_lb" "app" { # ... }
resource "aws_lb_target_group" "app" { # ... }
resource "aws_lb_listener" "app" { # ... }
Kubernetes (RAG API + Reranker)
apiVersion: apps/v1
kind: Deployment
metadata: { name: rag-api }
spec:
replicas: 3
selector: { matchLabels: { app: rag-api } }
template:
metadata: { labels: { app: rag-api } }
spec:
containers:
- name: api
image: registry/rag-api:latest
resources:
requests: { cpu: "250m", memory: "256Mi" }
limits: { cpu: "500m", memory: "512Mi" }
env:
- name: VECTOR_URL
valueFrom: { secretKeyRef: { name: rag-secrets, key: vector_url } }
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: reranker }
spec:
replicas: 2
selector: { matchLabels: { app: reranker } }
template:
metadata: { labels: { app: reranker } }
spec:
containers:
- name: reranker
image: registry/reranker:latest
resources:
requests: { cpu: "1", memory: "2Gi" }
limits: { cpu: "2", memory: "4Gi" }
Monitoring Dashboards (JSON)
{
"title": "RAG Pipeline",
"panels": [
{ "type": "graph", "title": "P95 Latency", "targets": [{ "expr": "histogram_quantile(0.95, sum(rate(rag_stage_latency_bucket[5m])) by (le))" }] },
{ "type": "stat", "title": "Cache Hit %", "targets": [{ "expr": "sum(rate(rag_cache_hit_total[5m])) / sum(rate(rag_cache_total[5m])) * 100" }] },
{ "type": "graph", "title": "Cost per Request", "targets": [{ "expr": "sum(rate(rag_cost_usd[5m])) / sum(rate(rag_requests_total[5m]))" }] },
{ "type": "table", "title": "Top Errors", "targets": [{ "expr": "topk(10, increase(rag_errors_total[24h]))" }] }
]
}
End-to-End Test Suites
Smoke Tests
it("answers FAQ with citation", async () => {
const res = await fetch("/api/rag", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ query: "reset password" }) });
const json = await res.json();
expect(json.citations?.length).toBeGreaterThan(0);
expect(json.answer).toMatch(/reset/);
});
Load Tests (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = { vus: 50, duration: '10m' };
export default function() {
const res = http.post(__ENV.BASE_URL+"/api/rag", JSON.stringify({ query: "billing" }), { headers: { 'Content-Type': 'application/json' } });
check(res, { 'status 200': r => r.status === 200 });
sleep(0.2);
}
Data Governance SOPs
- Data intake: license check, source allowlist, robots.txt compliance
- Retention: default 12 months; purge requests via DSR workflow
- Access: least privilege; audit every read/write; quarterly reviews
Localization Strategies
- Per-locale indices; locale-aware rewrite (stemming, synonyms)
- Fallback chains (fr-CA → fr → en); mark citation locales (resolution sketch below)
- UI: render citations with language tags and accessible labels
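A minimal fallback-chain resolver sketch; hasIndex stands in for whatever registry knows which locale indices exist:

// Walk the fallback chain (e.g. fr-CA → fr → en) until an index exists for the locale.
export function resolveLocale(
  requested: string,
  hasIndex: (locale: string) => boolean,
  defaultLocale = "en"
): string {
  const chain: string[] = [];
  let cur = requested;
  while (cur) {
    chain.push(cur);
    const i = cur.lastIndexOf("-");
    cur = i > 0 ? cur.slice(0, i) : ""; // strip region/script suffix one step at a time
  }
  chain.push(defaultLocale);
  return chain.find(hasIndex) ?? defaultLocale;
}

// resolveLocale("fr-CA", l => ["fr", "en"].includes(l)) === "fr"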
40 Advanced FAQs
Q: How do we handle multi‑KB contexts without cost explosions?
Use structured cards + smart trimming, token budgets per stage, and small-model-first routing.
Q: How to ensure determinism for compliance answers?
Fixed prompts, strict citation rules, frozen indices for compliance content, and versioned prompts.
Q: What if the reranker model is down?
Fall back to lexical boosts + diverse sampling; lower topK; note degraded mode in responses.
Q: How to solve "parroting" duplicate content?
Downweight duplicates at retrieval; dedupe in assembly; diversify by source.
Q: How to prevent long-tail latency spikes?
Cap tool-call time; limit max candidates; circuit-break on generation.
Q: Best way to log without leaking PII?
Hash + redact; store diffs; keep raw prompts in a quarantined lake with access approvals.
Q: How to enforce "citation required"?
A verifier function scans the answer for citation markers and matching context IDs; reject otherwise (see the verifier sketch after this list).
Q: Can we stream partial answers with citations?
Yes—stream the answer and append citations at the end; or stream footnote numbers and resolve them later.
Q: Do we need query understanding models?
Rewrite often suffices; for complex domains add intent classifiers trained on traffic.
Q: How to measure real ROI?
Deflection, time-to-resolution, cost/request, win-rate deltas vs baseline, and satisfaction scores.
Q: How to handle regulatory deletes (Right to be Forgotten)?
Track provenance; purge chunks by ID; reindex; invalidate caches; verify deletion reports.
Q: How to control stale content?
Add published_at; decay scores; exclude beyond TTL unless explicitly requested.
Q: How to balance precision vs recall?
Tune ef/topK and reranker thresholds; use per‑query-type configurations.
Q: Is BM25 still necessary?
Yes; hybrid consistently wins for precise ID queries and acronyms.
Q: How to reduce cold-start latency?
Warm caches, preload embeddings for frequent queries, keep the small-model route hot.
Q: How to secure tenant isolation?
Separate collections or strict payload filters plus auth checks; sign tokens with tenant claims.
Q: How to detect bad citations?
A parser validates URLs/anchors; a content-hash mismatch triggers exclusion; add QA tasks.
Q: How to do branch previews safely?
Use ephemeral indices; name them by branch; restrict to reviewers; auto-delete on merge.
Q: Recommended batch sizes?
Embeddings 32–128; reranker batches depend on the GPU; monitor throughput/latency.
Q: Logging schema?
Trace ID, tenant, region, route, tokens, cost, spans, selected candidates, citations; privacy flags.
Q: Prompt versioning?
Store hash + metadata; tie to the model route; roll back with config.
Q: Which reranker model?
Start with a MiniLM cross‑encoder; upgrade to bge-reranker when quality requires; test.
Q: Chunk overlap 10% or 20%?
Start at 10–15% for prose; less for FAQs; validate on evals.
Q: Tables vs text?
Extract structured tables and store them as JSON for targeted lookup; include both views.
Q: What about code docs?
Chunk by function/module; include signatures and imports; preserve formatting.
Q: Personalization?
Use tenant/user tags to bias retrieval; preserve privacy; avoid overfitting responses.
Q: How to compose evidence packs?
Bundle citations with hashes and timestamps; export as PDF/ZIP for audits.
Q: Abuse/attack telemetry?
Track blocked patterns, tool denials, WAF hits; alert on spikes.
Q: Model drift indicators?
Win‑rate drop, refusal misfires, tone/style changes; run periodic evals.
Q: Can we precompute answers?
Yes, for FAQs; invalidate on content updates; store in a fast KV.
Q: A/B test pitfalls?
Beware novelty effects; run long enough; segment by tenant and query type.
Q: Guardrail failure handling?
Refuse with an explanation; prompt the user for a safer query; log the event for tuning.
Q: How to attribute cost across teams?
Tag traces with team/tenant; chargeback reports; budgets.
Q: Which embedding model?
Use state-of-the-art with stable latency; ensure licensing allows domain use.
Q: How to prevent API metadata leakage?
Mask secrets in payloads; keep minimal metadata; audit fields.
Q: Disaster game days?
Simulate index outages and reranker failures; document recovery times.
Q: Pagination of results?
Provide a top answer and "More results" with ranked cards; allow user feedback.
Q: How to keep long-running conversations grounded?
Regularly refresh context from retrieval; trim thread memory; re-ask clarifying questions.
Q: Legal disclaimers in regulated domains?
Add domain-specific disclaimers; route high-risk queries to human escalation.
Q: What about multimodal (images/tables)?
Use multimodal embeddings; extract image captions; include alt text and OCR; link back to the source.
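Picking up the "citation required" item above, a minimal verifier sketch; the [^n] marker format matches the prompts library later in this post, and the shapes are assumptions:

// Reject answers whose citation markers don't resolve to assembled context cards.
type ContextCard = { citationId: string; url: string };

export function verifyCitations(answer: string, context: ContextCard[]): boolean {
  const markers = [...answer.matchAll(/\[\^(\d+)\]/g)].map(m => Number(m[1]));
  if (markers.length === 0) return false; // citation required
  // Footnote numbers must map 1..N onto the assembled cards
  return markers.every(n => n >= 1 && n <= context.length);
}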
Dataset Catalogs (Templates and Examples)
catalog:
title: "Company Knowledge Base"
owners: ["docs@company.com", "platform@company.com"]
sources:
- id: kb-product
type: website
url: https://docs.company.com/
license: proprietary
crawl:
depth: 3
include: ["/guides/", "/faq/"]
exclude: ["/admin/"]
preprocess:
readability: true
code_blocks: preserve
tables: extract
chunk:
strategy: headings
max_tokens: 600
overlap: 80
metadata:
product: core
locale: en
- id: kb-legal
type: pdf
path: s3://kb/legal/*.pdf
license: proprietary
ocr: true
preprocess:
detect_columns: true
captions: attach
chunk:
strategy: paragraphs
max_tokens: 450
overlap: 60
metadata:
category: legal
locale: en
- id: kb-support-fr
type: website
url: https://support.company.fr/
locale: fr
chunk:
strategy: headings
max_tokens: 500
overlap: 70
OpenAPI Tool Library (Function Calling Specs)
{
"tools": [
{
"name": "get_user",
"description": "Fetch user profile by ID",
"parameters": {
"type": "object",
"properties": { "id": { "type": "string" } },
"required": ["id"]
}
},
{
"name": "create_ticket",
"description": "Create support ticket",
"parameters": {
"type": "object",
"properties": {
"title": { "type": "string" },
"body": { "type": "string" },
"severity": { "type": "string", "enum": ["low","medium","high"] }
},
"required": ["title","body"]
}
},
{
"name": "list_invoices",
"description": "List invoices for account",
"parameters": {
"type": "object",
"properties": { "accountId": { "type": "string" }, "limit": { "type": "number" } },
"required": ["accountId"]
}
}
]
}
// validators.ts
export function enforceToolPolicy(name: string, params: Record<string,unknown>) {
if (name === "get_user") {
if (!/^usr_[a-z0-9]{8}$/.test(String((params as any).id))) throw new Error("invalid id");
}
if (name === "create_ticket") {
if (String((params as any).title).length < 5) throw new Error("title too short");
}
}
Prompts Library (Operations-Ready)
System (RAG):
You are a retrieval-augmented assistant. Always cite sources as [^n] with links.
If context is insufficient, say "I don’t have enough information" and propose next steps.
System (Rewrite):
You normalize queries: fix spelling, expand acronyms, add synonyms; keep meaning.
Do not fabricate facts.
System (Safety):
Refuse unsafe requests (self-harm, illegal, privacy violations). Be brief and offer safe alternatives.
System (Assembler):
Summarize cards into concise, non-redundant bullets with citations [^n]. Keep technical terms.
100-Case Evaluation Suite (YAML)
suite: kb_eval_v1
items:
- id: q001
query: "Reset MFA device"
expected:
contains: ["MFA", "reset", "admin"]
citations: 1
- id: q002
query: "Pricing for enterprise tier"
expected:
contains: ["enterprise", "pricing"]
citations: 1
- id: q003
query: "GDPR DPA address"
expected:
contains: ["DPA", "address", "EU"]
citations: 1
- id: q004
query: "Error code XY-1234 troubleshooting"
expected:
contains: ["XY-1234", "steps", "logs"]
citations: 1
# ... add q005–q100 with detailed expectations
// eval-runner.ts
// Assumes a bundler/loader that can import YAML as JSON.
import { runEvalItem } from "./eval-lib";
import items from "./kb_eval_v1.yaml";
(async function main() {
  let pass = 0;
  let total = 0;
  for (const it of items.items) {
    total++;
    const r = await runEvalItem(it);
    if (r.pass) pass++;
    console.log(it.id, r.pass ? "PASS" : "FAIL", r.metrics);
  }
  console.log("win-rate:", pass / total);
})();
Language Samples (Java)
@RestController
public class RagController {
@PostMapping("/rag")
public Map<String,Object> rag(@RequestBody Map<String,Object> body) {
String query = (String) body.get("query");
String q1 = Rewrite.normalize(query);
float[] vec = Embeddings.embed(q1);
List<Candidate> cands = VectorStore.search("kb", vec, 128);
List<Candidate> ranked = Reranker.rank(q1, cands, 10);
List<Card> context = Assembler.assemble(q1, ranked, 2000);
String answer = LLM.generate(q1, context);
return Map.of("answer", answer, "context", context);
}
}
CI Pipelines (GitHub Actions)
name: rag-ci
on:
push:
branches: [ main ]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm run lint
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm test -- --ci
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: node eval-runner.js
Governance Docs (Model Cards, Changes)
# Model Card — RAG Generator v0.7
- Provider: gpt-4o-mini
- Safety: refusal rate 4.1% (target 3–6%)
- Win‑rate vs baseline: +7.8%
- Known limitations: cites only included context; may refuse borderline requests
- Change log:
- 0.7: raised refusal threshold; tightened citation regex
- 0.6: improved assembly compression
Extra Advanced FAQs (Selection)
Q: How do we A/B test prompts safely?
Use a prompt registry with version IDs; assign arms per user; cap traffic; auto‑rollback on metric regression.
Q: What about PII in vector embeddings?
Redact before embedding; hash replacements; keep the mapping in a secure KMS; add a PII detector at ingest.
Q: Can we throttle per‑tenant cost?
Yes—track cost at the trace level; enforce budgets; return a graceful fallback on overage.
Q: How to trace across microservices?
Propagate W3C traceparent; attach tenant/route; emit spans at each stage.
Q: Is JSON-LD needed in responses?
Not necessary for RAG, but useful for SEO when rendering knowledge base answers on the web.
Q: How to combine code and prose context?
Keep code blocks intact; present them as separate cards; instruct the model to quote code carefully.
Q: Should we store raw HTML?
Store both HTML and extracted text; use the HTML for anchors and accurate citations.
Q: How to guard against cache poisoning?
Key by normalized query and tenant; sign values; short TTL; validate on read.
Q: What if the reranker disagrees with business priority?
Add source boosts/penalties; combine rerank score + source weight; document the policy.
Q: How to rapidly iterate?
Shadow-deploy pipeline changes; collect metrics; promote with change freezes for high‑risk periods.
Q: Multi‑cloud indices?
Prefer single‑cloud per region for simplicity; replicate across clouds only for critical DR.
Q: GPU or CPU for the reranker?
GPU for heavy cross‑encoders; CPU can suffice for MiniLM at modest QPS.
Q: Batch vs streaming ingest?
Both—CDC for change data, batch for rebuilds; ensure idempotency and backpressure handling.
Q: How to present uncertainty?
Include a confidence score; let the user view sources; provide "Was this helpful?" feedback.
Q: Are denial lists effective?
They help but are insufficient alone; combine detectors, allowlists, and policies.
Q: Can quantization reduce storage costs?
Yes—store compressed vectors when recall remains acceptable.
Q: Time‑boxed generation?
Set max_tokens and timeouts; degrade to a summary when time is exhausted.
Q: Proxy vendors?
Helicone/Langfuse can help with analytics and cost tracking; validate privacy and SLAs.
Q: Can we localize the reranker?
Use multilingual rerankers; language detection in rewrite; locale‑aware thresholds.
Q: What about images in RAG?
Use multimodal retrieval; generate alt text for accessibility; cite image sources clearly.
OpenAPI Specification (RAG API)
openapi: 3.0.3
info:
title: RAG API
version: 1.0.0
paths:
/rag:
post:
summary: Answer query using retrieval-augmented generation
requestBody:
required: true
content:
application/json:
schema:
type: object
properties:
query: { type: string }
tenantId: { type: string }
locale: { type: string }
topK: { type: integer, default: 10 }
required: [query, tenantId]
responses:
'200':
description: Success
content:
application/json:
schema:
type: object
properties:
answer: { type: string }
citations:
type: array
items: { type: string }
context:
type: array
items:
type: object
properties:
title: { type: string }
snippet: { type: string }
url: { type: string }
'400': { description: Invalid input }
'429': { description: Budget exceeded }
'500': { description: Server error }
Full Next.js API Implementation (Server Actions + Tracing)
// app/api/rag/route.ts
import { NextRequest } from "next/server";
import { kv } from "@vercel/kv";
import { index } from "@/lib/pinecone";
import { crossEncode } from "@/lib/rerank";
import { trace, span } from "@/lib/otel";
import { withBudget } from "@/lib/budget";
export const runtime = "nodejs";
function normalize(q: string){ return q.normalize("NFKC").trim(); }
// hash() below is assumed to be a stable string digest helper (e.g. sha256 hex)
async function embed(q: string){ /* call embedding */ return new Array(1536).fill(0); }
export async function POST(req: NextRequest){
  return withBudget(() => trace("rag", async () => {
const body = await req.json();
const query = normalize(String(body.query || ""));
const tenantId = String(body.tenantId || "");
const locale = String(body.locale || "en");
const topK = Number(body.topK || 10);
if (!query || !tenantId) return Response.json({ error: "invalid" }, { status: 400 });
const cacheKey = `rag:${tenantId}:${locale}:${topK}:${hash(query)}`;
const cached = await kv.get(cacheKey);
if (cached) return Response.json(cached);
const vec = await span("embed", () => embed(query));
const candidates = await span("retrieve", async () => {
const res = await index.query({ topK: 200, vector: vec, filter: { tenantId, locale } });
return res.matches?.map((m:any) => ({ id: m.id, score: m.score, payload: m.metadata })) || [];
});
const ranked = await span("rerank", async () => {
const texts = candidates.map((c:any) => c.payload.text);
const scores = await crossEncode(query, texts);
return candidates
.map((c:any, i:number) => ({ ...c, score: scores[i] }))
.sort((a:any,b:any)=>b.score-a.score)
.slice(0, topK);
});
const context = await span("assemble", async () => {
const seen = new Set<string>();
const out:any[] = [];
for (const r of ranked){
if (seen.has(r.payload.docId)) continue;
out.push({ title: r.payload.title, snippet: r.payload.text.slice(0, 800), url: r.payload.url });
seen.add(r.payload.docId);
}
return out;
});
const answer = await span("generate", async () => {
return `Answer (locale=${locale}):\n` + context.map((c,i)=>`[^${i+1}] ${c.title}`).join("\n");
});
const result = { answer, citations: context.map((c)=>c.url), context };
await kv.set(cacheKey, result, { ex: 300 });
return Response.json(result);
}));
}
Python Client (Typed)
from typing import List, TypedDict

import httpx

class Card(TypedDict):
    title: str
    snippet: str
    url: str

class RagResponse(TypedDict):
    answer: str
    citations: List[str]
    context: List[Card]

class RagClient:
    def __init__(self, base_url: str, tenant_id: str, timeout: float = 8.0):
        self.base_url = base_url.rstrip('/')
        self.tenant_id = tenant_id
        self.timeout = timeout

    async def ask(self, query: str, locale: str = "en", top_k: int = 10) -> RagResponse:
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            r = await client.post(f"{self.base_url}/rag", json={
                "query": query,
                "tenantId": self.tenant_id,
                "locale": locale,
                "topK": top_k
            })
            r.raise_for_status()
            return r.json()  # type: ignore
Helm Chart Values (Excerpt)
api:
image: registry/rag-api:1.2.3
replicaCount: 3
resources:
requests: { cpu: 250m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }
env:
VECTOR_URL: https://pinecone.io/...
HELICONE_PROXY: https://proxy.helicone.ai/
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 60
reranker:
image: registry/reranker:0.9.0
resources:
requests: { cpu: 1, memory: 2Gi }
limits: { cpu: 2, memory: 4Gi }
k6 Load Testing Variations
import http from 'k6/http';
import exec from 'k6/execution';
import { check, sleep } from 'k6';
export const options = {
scenarios: {
baseline: { executor: 'ramping-vus', startVUs: 1, stages: [ { duration: '2m', target: 50 } ] },
spike: { executor: 'constant-arrival-rate', rate: 100, timeUnit: '1s', duration: '5m', preAllocatedVUs: 200 },
}
};
export default function () {
const q = __ENV.QUERY || 'reset password';
const res = http.post(`${__ENV.BASE}/api/rag`, JSON.stringify({ query: q, tenantId: 't_demo' }), { headers: { 'Content-Type': 'application/json' } });
check(res, { '200': r => r.status === 200, 'has answer': r => !!r.json('answer') });
if (res.status !== 200) exec.test.abort('non-200');
sleep(0.2);
}
30 More Advanced FAQs
Q: Can we run reranker and retrieval in parallel?
Usually retrieve first, but you can prefetch alternate retrieval strategies in parallel to reduce time-to-first-token.
Q: How to maintain a consistent tone across answers?
Use tone guidelines in the system prompt; add a small post-process normalizer; sample answers in QA.
Q: Detect and remove dead links?
Nightly link-check jobs; drop or replace citations; add cached copies.
Q: How to minimize cross-region latency?
Anycast edge routing; per-region indices; geo-aware caches.
Q: How to record precise token accounting?
Read usage from the model response; track tokens.in/out in spans; budget-enforcement middleware.
Q: Multi-tenant traffic isolation?
Per-tenant queues and budgets; fair scheduling; tenant-level throttles.
Q: Can we stream context along with the answer?
Yes—progressively reveal which cards are currently influencing the answer.
Q: How to limit hallucinations on numeric answers?
Prefer retrieval of structured data; add validators for numeric ranges; refuse when confidence is low.
Q: Is LLM caching safe?
Cache only non-personalized, safe answers; include prompt hash + route + locale in the key.
Q: Should we attach JSON metadata to answers?
Helpful for clients—return citations, confidence, and route info.
Q: Versioning indices?
Tag every chunk with index_version; use it during assembly for stable citations.
Q: How to audit who saw which content?
Add source IDs to logs; get legal/privacy review for access-tracking compliance.
Q: How to autoscale the reranker?
HPA on queue depth and latency; GPU node pools with PDBs; pre-warm pods.
Q: Prevent path traversal in a file tool?
Normalize paths; enforce allowed prefixes; always read-only.
Q: Is batching dangerous?
Ensure per-query isolation; never leak context across batch items.
Q: Runtime feature flags?
Flags for topK, model route, cache TTL; safe to toggle; trace the flags.
Q: Quiet hours / surge control?
Throttle LLM calls; fall back to a lexical-only flow under extreme load.
Q: Secret management?
KMS + dynamic creds; never hardcode; rotate on incident or quarterly.
Q: Evaluate against user feedback?
Incorporate helpful/unhelpful votes; correlate with offline metrics; adjust thresholds.
Q: Legal holds?
Freeze indices and logs for the hold scope; track chain-of-custody.
Q: Can we share chunks across tenants?
Only public content; otherwise maintain strict isolation.
Q: Disable the reranker per tenant?
Expose per-tenant config; verify the SLO impact.
Q: Sanitizing markdown/script blocks?
Strip script tags; sanitize HTML; escape where rendering.
Q: Is model distillation applicable?
Yes—distill to a smaller generator with RAG prompts for cost savings.
Q: How to mitigate vendor lock-in?
Abstract embedding/reranker/generator behind adapters; keep data in open formats.
Q: Avoid an explosion of indices?
Group small tenants; shard per large tenant; automate lifecycle.
Q: Prompt rotation risks?
Treat prompts as code; code review; eval before rollout; canary.
Q: Negative queries?
Return a safe refusal with suggestions; log patterns to improve content.
Q: Use knowledge graphs?
Augment retrieval for relational queries; costly to build, but helpful for certain domains.
Q: Accessibility in UI?
Citations as accessible footnotes; keyboard navigation; screen-reader-friendly labels.