RAG Systems in Production: Chunking, Retrieval, and Reranking (2025)
Retrieval-Augmented Generation (RAG) is the backbone of most practical LLM systems. This guide is a deep, practitioner-focused walkthrough for building production-grade RAG: from chunking and metadata strategies to hybrid retrieval, reranking, evaluation, observability, and cost control.
Executive Summary
- Chunking is a product decision, not just an indexer parameter; optimize for question types and grounding.
- Use hybrid retrieval (BM25 + dense) with domain-aware query rewriting and filters; rerank aggressively for top-K.
- Evaluate continuously with golden sets and real traffic; gate deployments by win-rate and hallucination scores.
- Log every hop (rewrite → retrieve → rerank → assemble → generate) with cost/latency attribution and cache hits.
- Secure your pipeline: sanitize inputs, guard against prompt injection via vector stores, and sign your content.
Architecture Overview
graph LR
A[User Query] --> B[Query Rewrite/Expand]
B --> C[Hybrid Retrieval]
C --> D[Reranker]
D --> E[Context Assembler]
E --> F[LLM Generator]
F --> G[Response + Citations]
G --> H[Feedback + Telemetry]
- Query Rewrite: spelling fixes, acronym expansion, synonyms, intent routing.
- Hybrid Retrieval: BM25 for lexical match + vector for semantic similarity with field boosts and filters.
- Reranker: a cross-encoder that scores candidate passages; it often improves groundedness significantly.
- Context Assembler: dedupe, enforce diversity, compress and structure into cards with citations.
Chunking Strategies
Principles
- Optimize chunks for answerability: include titles, headings, and stable anchors.
- Prefer 300–800 token windows with 10–15% overlap for long prose; smaller windows for FAQs/code.
- Tag chunks with hierarchical metadata: doc_id, section, headings, author, version, published_at.
class Chunker:
    def chunk(self, html: str) -> list[dict]:
        # Split on headings, then slide ~600-token windows with 80-token overlap
        blocks = self.split_by_headings(html)
        return [self.enrich(b) for b in self.window(blocks, max_tokens=600, overlap=80)]
Specialized Chunking
- Code: function-level with import graph context, keep signatures and docstrings.
- Tables: extract as key-value and normalized JSON for structured lookup.
- PDFs: detect columns, figures, captions; attach OCR confidence.
Hybrid Retrieval
class HybridRetriever:
    def retrieve(self, query: str, k: int = 50) -> list[Candidate]:
        rewritten = self.rewrite(query)
        lexical = self.bm25.search(rewritten, k=200)               # lexical candidates (BM25)
        dense = self.vector.search(self.embed(rewritten), k=200)   # semantic candidates
        fused = self.reciprocal_rank_fusion(lexical, dense, top=k)
        return fused
- Query rewrite: spellcheck, acronym expansion, synonyms, noun-phrase extraction.
- Filters: product, version, language, recency windows; exact match boosts for IDs.
- Fusion: Reciprocal Rank Fusion (RRF) or weighted linear fusion; learn weights from feedback (a minimal RRF sketch follows).
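For reference, a minimal RRF sketch in TypeScript; the Candidate shape and the k = 60 damping constant are conventional assumptions, not a specific library's API:

// Minimal Reciprocal Rank Fusion sketch (assumed shapes, not a library API).
type Candidate = { id: string; score: number };

export function reciprocalRankFusion(
  lists: Candidate[][],
  top = 50,
  k = 60 // common RRF damping constant
): Candidate[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    list.forEach((c, rank) => {
      // Each list contributes 1 / (k + rank + 1) for the item's position
      fused.set(c.id, (fused.get(c.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, top);
}

Because RRF only uses ranks, not raw scores, it sidesteps calibrating BM25 scores against cosine similarities.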
Reranking
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.text) for doc in candidates])
# Sort by score only; passages themselves are not comparable when scores tie
ranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]
- Cross-encoders are compute-heavy; rerank top-50 to top-5/10.
- Calibrate thresholds to drop low-confidence passages (sketched after this list).
- Cache reranker results by (query_hash, doc_id) for common queries.
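A minimal thresholding sketch; the 0.35 cutoff is purely illustrative and should be calibrated on your golden set:

// Drop low-confidence passages after reranking; calibrate minScore offline.
export function filterByThreshold<T extends { score: number }>(
  ranked: T[],
  minScore = 0.35 // illustrative default; tune against golden-set evals
): T[] {
  return ranked.filter(c => c.score >= minScore);
}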
Context Assembly
type Card = { title: string; snippet: string; url: string; citationId: string; tokens: number };
function assemble(cards: Card[], maxTokens: number): Card[] {
const seen = new Set<string>();
const out: Card[] = [];
let budget = maxTokens;
for (const c of cards) {
const key = c.citationId;
if (seen.has(key)) continue;
if (c.tokens > budget) continue;
out.push(c); seen.add(key); budget -= c.tokens;
}
return out;
}
- Diversity: prefer 1 card per source initially; allow follow-ups to drill deeper.
- Compression: sentence-level extract with query-aware summarization; preserve citations.
Evaluation Framework
metrics:
- groundedness_hallucination_rate
- exact_match / F1 (for QA)
- coverage@k (did we retrieve the gold chunk?)
- click-through on citations
- response_time_ms and cost_usd
- Golden sets: hand-curated questions with authoritative answers and gold chunks.
- Shadow deploy: compare old vs new pipeline A/B, require win-rate > X% to promote.
- Red-team: jailbreak attempts, prompt-injection canaries in content.
Observability
interface TraceSpan {
name: string; start: number; end: number; attrs: Record<string, any>;
}
- Trace spans per stage: rewrite, retrieve, rerank, assemble, generate.
- Attach tokens, costs, cache hits, and candidate IDs to each span.
- Store minimal snippets; avoid PII; sample generously for error cases.
Security
- Sanitize inputs; strip HTML/JS; block suspicious patterns.
- Validate outbound tool calls; whitelist hosts; set timeouts.
- Sign indexed content; store hash; verify at retrieval to prevent content poisoning.
Cost Controls
- Cache embeddings and reranks; batch embeds; dedupe near-duplicates.
- Use smaller models for rewrite and rerank; reserve large models for generation only when needed.
- Token budgeting in assembler; favor citations over verbose prose.
Troubleshooting
- Low relevance: improve rewrite, adjust field boosts, add synonyms.
- Hallucinations: tighten thresholds, increase citations, refuse on low confidence.
- High latency: lower k, cache more, parallelize, precompute heavy steps.
FAQ
Q: How big should chunks be?
A: 300–800 tokens for prose; smaller for code/FAQs. Optimize on evaluation results.
Q: Should I dedupe similar chunks?
A: Yes—during indexing and retrieval. Use MinHash or cosine thresholding (see the sketch below).
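A minimal cosine-threshold dedupe sketch, assuming chunk embeddings are already computed; the 0.95 threshold is illustrative:

// Drop chunks whose embedding is near-identical to an already-kept chunk.
type Chunk = { id: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export function dedupe(chunks: Chunk[], threshold = 0.95): Chunk[] {
  const kept: Chunk[] = [];
  for (const c of chunks) {
    // O(n^2) pairwise check; bucket with MinHash/LSH at ingest scale
    if (!kept.some(k => cosine(k.embedding, c.embedding) >= threshold)) kept.push(c);
  }
  return kept;
}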
Related posts
- AI Agents Architecture: /blog/ai-agents-architecture-autonomous-systems-2025
- LLM Fine-Tuning (LoRA/QLoRA): /blog/llm-fine-tuning-complete-guide-lora-qlora-2025
- Vector Databases Comparison: /blog/vector-databases-comparison-pinecone-weaviate-qdrant
- LLM Security: /blog/llm-security-prompt-injection-jailbreaking-prevention
- LLM Observability: /blog/llm-observability-monitoring-langsmith-helicone-2025
Call to action
Need help productionizing RAG at scale? Get a free architecture review.
Contact: /contact • Newsletter: /newsletter
Production Cookbook (End-to-End Recipes)
Recipe 1 — SaaS Knowledge Base RAG (Multi-tenant, EU/US Residency)
- Requirements: tenant isolation, EU/US residency, low latency, cost caps
- Stack: Next.js App Router (server actions), LangChain, Qdrant (EU/US clusters), Redis cache, Helicone proxy, LangSmith evals
graph TB
subgraph EU
A[Next.js EU] --> B[Redis EU]
A --> C[Qdrant EU]
end
subgraph US
D[Next.js US] --> E[Redis US]
D --> F[Qdrant US]
end
A & D --> G[Helicone]
G --> H[LLM Provider]
A & D --> I[LangSmith]
// app/api/rag/route.ts (server-only)
import { kv } from "@vercel/kv"; // or ioredis
import { qdrantEU, qdrantUS } from "@/lib/qdrant";
import { rewriteQuery, retrieve, rerank, assemble, generate } from "@/lib/rag";
import { withTenant } from "@/lib/tenant";
import { withBudgetGuard } from "@/lib/cost";
import { trace } from "@/lib/otel";
export const POST = withTenant(withBudgetGuard(async (req) => {
return trace("rag.pipeline", async (span) => {
const { query, tenantId, region } = await req.json();
const cacheKey = `rag:${tenantId}:${region}:${hash(query)}`; // hash(): stable string digest helper (assumed)
const cached = await kv.get(cacheKey);
if (cached) return Response.json(cached);
const qdrant = region === "eu" ? qdrantEU : qdrantUS;
const q1 = await rewriteQuery(query, { tenantId });
const candidates = await retrieve(qdrant, q1, { tenantId, topK: 128 });
const ranked = await rerank(q1, candidates, { topK: 8 });
const context = await assemble(q1, ranked, { tokenBudget: 2000 });
const answer = await generate({ query: q1, context, tenantId });
const result = { answer, context, citations: context.map(c => c.url) };
await kv.set(cacheKey, result, { ex: 300 });
return Response.json(result);
});
}));
Recipe 2 — Developer Docs RAG with OpenAPI-calling Tool
- Expose a typed tool that calls internal OpenAPI endpoints when context indicates "how-to" queries
- Guard with allowlist, signature verification, and rate limits
type Tool = { name: string; params: any; run: (p: any) => Promise<any> };
export const getUserTool: Tool = {
name: "get_user",
params: { type: "object", properties: { id: { type: "string" } }, required: ["id"] },
run: async ({ id }) => fetch(`/internal/api/users/${id}`, { headers: signedHeaders() }).then(r => r.json())
};
export async function agent(query: string, ctx: any) {
  // llm.plan is pseudocode for an agent planner that yields tool-call steps
  const plan = await llm.plan(query, { tools: [getUserTool] });
for await (const step of plan) {
if (step.type === "tool" && step.name === "get_user") {
const resp = await getUserTool.run(step.params);
plan.observe(resp);
}
}
return plan.final();
}
Recipe 3 — Domain Router (Finance vs Support vs Legal)
const routes = [
{ name: "finance", match: [/invoice|receipt|tax|vat/i], kb: "kb_fin" },
{ name: "support", match: [/error|bug|troubleshoot|reset/i], kb: "kb_supp" },
{ name: "legal", match: [/terms|privacy|dpa|dpo/i], kb: "kb_leg" },
];
export function route(query: string) {
const r = routes.find(r => r.match.some(rx => rx.test(query)))?.kb ?? "kb_general";
return r;
}
Language Implementations (Python, Node, Go)
Python (FastAPI + Qdrant + Redis + Tenancy)
import json

from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from redis.asyncio import Redis

app = FastAPI()
redis = Redis.from_url("redis://...")
qdrant = QdrantClient(url="http://qdrant:6333")

class RAGRequest(BaseModel):
    query: str
    tenant_id: str

@app.post("/rag")
async def rag(req: RAGRequest):
    # Note: Python's hash() is per-process; use a stable digest (e.g. sha256) in production
    cache_key = f"rag:{req.tenant_id}:{hash(req.query)}"
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)
    q1 = rewrite(req.query)
    vec = await embed(q1)
    res = qdrant.search(collection_name=f"kb_{req.tenant_id}", query_vector=vec, limit=128)
    ranked = rerank(q1, res)
    context = assemble(q1, ranked)
    answer = await generate(q1, context)
    result = {"answer": answer, "context": context}
    await redis.set(cache_key, json.dumps(result), ex=300)
    return result
Node (Express + Pinecone + Helicone)
import express from "express";
import { Pinecone } from "@pinecone-database/pinecone";
import fetch from "node-fetch";
const app = express();
app.use(express.json());
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.Index("kb");
app.post("/rag", async (req, res) => {
const { query, tenantId } = req.body;
const vec = await embed(query);
const result = await index.query({ topK: 100, vector: vec, filter: { tenantId } });
const ranked = await rerank(query, result.matches);
const context = assemble(query, ranked);
const answer = await fetch(process.env.HELICONE_PROXY!, {
method: "POST",
headers: { "Content-Type": "application/json", "Helicone-Auth": process.env.HELICONE_KEY! },
body: JSON.stringify({ messages: makeMessages(query, context) })
}).then(r => r.json());
res.json({ answer: answer.choices?.[0]?.message?.content, context });
});
Go (Fiber + Weaviate)
package main
import (
	"log"

	"github.com/gofiber/fiber/v2"
	wv "github.com/weaviate/weaviate-go-client/v4/weaviate"
)
func main() {
app := fiber.New()
client := wv.NewClient(wv.Config{Scheme: "http", Host: "weaviate:8080"})
app.Post("/rag", func(c *fiber.Ctx) error {
var req struct{ Query string; Tenant string }
if err := c.BodyParser(&req); err != nil {
	return fiber.ErrBadRequest
}
// search and compose (pseudo)
return c.JSON(fiber.Map{"answer": "...", "tenant": req.Tenant})
})
log.Fatal(app.Listen(":8080"))
}
Retrieval Indexing and Ingestion
Ingestion Pipeline (Docs, PDFs, HTML, Structured)
steps:
- fetch: { kind: http, urls: ["https://docs.example.com/"] }
- extract: { kind: readability }
- chunk: { kind: headings, max_tokens: 600, overlap: 80 }
- enrich:
- title
- anchor
- section
- author
- published_at
- embed: { model: text-embedding-3-large, batch: 64 }
- upsert: { store: qdrant, collection: kb_tenant }
Content Hashing & Signing (Poisoning Defense)
import crypto from "crypto";
export function contentHash(s: string) {
return crypto.createHash("sha256").update(s).digest("hex");
}
export function sign(hash: string) {
return crypto.createHmac("sha256", process.env.SIGNING_KEY!).update(hash).digest("hex");
}
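And the retrieval-time check, reusing contentHash and sign (and the crypto import) from the snippet above; the stored signature is assumed to come from the chunk payload:

// Recompute and compare the HMAC before a chunk is allowed into context.
export function verifyChunk(text: string, storedSignature: string): boolean {
  const expected = sign(contentHash(text));
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(storedSignature, "hex");
  // timingSafeEqual avoids leaking signature prefixes via comparison timing
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}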
Reranking Strategies (Trade-offs and Models)
- MiniLM cross-encoders for speed; bge-reranker-large for quality
- Pairwise rerank vs pointwise scoring; calibration thresholds
- Cache on (queryHash, passageId, model) with TTL and LRU (caching sketch after the snippet below)
export async function crossEncode(q: string, docs: string[]) {
// call reranker service or HF Inference endpoints
return docs.map((_, i) => 1 - i / docs.length); // placeholder
}
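A sketch of the caching bullet above; the in-memory Map stands in for a real LRU with eviction, and the TTL is an assumption:

// Cache rerank scores on (queryHash, passageId, model) with TTL; Map as a stand-in for LRU.
const rerankCache = new Map<string, { score: number; expires: number }>();

export async function cachedScore(
  queryHash: string, passageId: string, model: string,
  compute: () => Promise<number>, ttlMs = 10 * 60 * 1000
): Promise<number> {
  const key = `${queryHash}:${passageId}:${model}`;
  const hit = rerankCache.get(key);
  if (hit && hit.expires > Date.now()) return hit.score;
  const score = await compute(); // e.g. one pair from crossEncode above
  rerankCache.set(key, { score, expires: Date.now() + ttlMs });
  return score;
}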
Context Assembly Strategies (Cards, Tables, and Code)
- Structured cards: title, snippet, URL, important fields
- Code-aware assembly: preserve code blocks; limit formatting churn
- Table-aware assembly: render as CSV/Markdown for clarity
export function compressToTokens(text: string, budget: number) {
  // Heuristic trimming by sentences, keeping citations attached to their sentence
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough 4-chars-per-token heuristic
  const sents = text.split(/(?<=[.!?])\s+/); // split after sentence punctuation, keeping it attached
  const out: string[] = [];
  let tokens = 0;
  for (const s of sents) {
    const t = approxTokens(s);
    if (tokens + t > budget) break;
    out.push(s);
    tokens += t;
  }
  return out.join(" ");
}
Evaluation at Scale (Offline + Online)
Golden Sets (Construction and Maintenance)
suites:
- name: faq_critical
items:
- id: faq-001
query: "How do I reset my SSO password?"
expected:
contains: ["Click 'Forgot password'", "SSO provider", "email"]
citations_required: true
- id: faq-002
query: "What is our DPA address for EU tenants?"
expected:
contains: ["Data Processing Addendum", "EU"],
citations_required: true
Online Evals (Shadow, A/B, Bandit)
type Arm = "baseline" | "candidate";
export function assignArm(userId: string): Arm {
return hash(userId) % 100 < 10 ? "candidate" : "baseline"; // 10% canary
}
Observability (Trace Spec)
{
"name": "rag.pipeline",
"attributes": {
"tenant.id": "abc",
"region": "eu",
"rewrite.ms": 12,
"retrieve.ms": 43,
"rerank.ms": 80,
"assemble.ms": 14,
"generate.ms": 900,
"cost.usd": 0.0123,
"tokens.in": 1234,
"tokens.out": 456
}
}
Security Policies (Guardrails)
policies:
prompt_injection:
block_patterns:
- "ignore previous instructions"
- "you are now"
- "system:"
tools:
http_request:
allow_hosts: ["api.internal", "docs.example.com"]
deny_ips: ["169.254.169.254"]
timeout_ms: 8000
max_body_kb: 256
content:
outbound_links:
allow_domains: ["example.com", "docs.example.com"]
require_citations: true
Playbooks (Ops & SRE)
Playbook — Latency Spike
- Symptoms: P95 > 3s in generate or rerank spans
- Actions: check cache hit %, reranker queue depth, model route changes; reduce topK; enable short context mode
- Rollback: switch to a smaller model route; disable the reranker temporarily; raise the refusal threshold (a circuit-breaker sketch follows these playbooks)
Playbook — Cost Spike
- Symptoms: cost.usd per request > budget
- Actions: enforce token budget, enable prompt compression, raise cache TTL, downshift model tier
Playbook — Quality Regression
- Symptoms: win-rate drop > 5% vs baseline
- Actions: freeze deploys, run backfill evals, analyze failures by category; revert last change
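Much of the rollback logic above can be automated behind a circuit breaker. A minimal sketch, with assumed failure-count and cooldown parameters:

// Trip after N consecutive failures; route around the primary until the cooldown expires.
export class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;
  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // degraded mode while open
    try {
      const result = await primary();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.maxFailures) this.openUntil = Date.now() + this.cooldownMs;
      return fallback();
    }
  }
}

Wrapping the rerank stage this way lets a trip fall back to fused retrieval order instead of failing the whole request.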
Benchmarks (Latency/Cost Profiles)
| route | model | input_tokens | output_tokens | latency_ms | cost_usd |
|---|---|---|---|---|---|
| small | gpt-4o-mini | 900 | 200 | 700 | 0.0041 |
| medium | gpt-4o | 1200 | 350 | 1200 | 0.0180 |
| large | claude-3-opus | 1400 | 500 | 1800 | 0.0315 |
Extended FAQ (Advanced)
Q: How do we prevent duplicate or near-duplicate chunks?
Use locality-sensitive hashing (MinHash/SimHash) at ingest; drop within-threshold items or downweight at retrieval.
Q: What if retrieval returns correct but low-quality sources?
Boost authoritative sources via per-source weights; penalize low-quality domains; add quality signals to ranking.
Q: How do we keep costs bounded under heavy load?
Hierarchical caches, token budgets, small-model rerankers, dynamic topK, surge control, and circuit breakers on LLM calls.
Q: How to localize RAG?
Segment indices by locale; prefer locale match in filters; translate queries before/after; re-embed localized corpora.
Q: How to prevent secret leakage via retrieval?
Pre-index DLP scans; exclude matches; at generation, scan outputs for secret regexes; redact and log events.
Q: How do we decide between Pinecone/Qdrant/Weaviate/pgvector?
Use managed (Pinecone) for turnkey and SLAs; Qdrant for cost/control; Weaviate for graph-like schemas; pgvector for SQL integration.
Q: Should we add graph edges to chunks?
Yes when relations help navigation (parent/child/see-also); improves diversity and follow-up retrieval.
Q: How big should the rerank set be?
Commonly 50–200 candidates; tune by latency/cost goals and reranker throughput.
Q: How to monitor hallucination rate?
Use rubric scoring with required citation coverage; sample answers and auto-check citation presence/consistency.
Q: Can we do RAG without embeddings?
Yes, lexical-only can work for structured FAQs; hybrid usually wins for broader corpora.
Glossary
- RRF: Reciprocal Rank Fusion — method to combine ranked lists
- Cross-encoder: model scoring (query, passage) pairs jointly
- Context card: structured snippet with source/citation ready for LLM
- TopK: number of items to keep at a stage (retrieve/rerank)
References and Further Reading
- OpenAI Evals and eval theory
- MS MARCO / BEIR benchmarks
- OTEL Semantic Conventions for AI
- Vector DB docs: Qdrant, Pinecone, Weaviate, pgvector
Integration Blueprints (Vendors and Stacks)
Blueprint — Pinecone + LangGraph + Next.js
// langgraph.ts (pseudo)
import { StateGraph } from "langgraph";
const g = new StateGraph()
.addNode("rewrite", rewriteNode)
.addNode("retrieve", retrieveNode)
.addNode("rerank", rerankNode)
.addNode("assemble", assembleNode)
.addNode("generate", generateNode)
.addEdge("rewrite","retrieve")
.addEdge("retrieve","rerank")
.addEdge("rerank","assemble")
.addEdge("assemble","generate");
export default g;
// pinecone.ts
import { Pinecone } from "@pinecone-database/pinecone";
export const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
export const index = pc.Index("kb");
Blueprint — Weaviate (Hybrid) + Cloudflare Workers
// worker.ts
export default {
async fetch(req: Request, env: any) {
const url = new URL(req.url);
if (url.pathname === "/rag") return handleRAG(req, env); // handleRAG: your pipeline entry (see Recipe 1)
return new Response("Not found", { status: 404 });
}
}
Full Config Samples
Qdrant Collections and Payload Indexes
{
"collection_name": "kb_tenant",
"vectors": { "size": 1536, "distance": "Cosine" },
"optimizers_config": { "default_segment_number": 6 },
"hnsw_config": { "ef_construct": 128, "m": 32 },
"quantization_config": { "product": { "compression": 8 } },
"on_disk_payload": true,
"shard_number": 2,
"replication_factor": 2
}
Weaviate Schema (Graph-Like)
{
"class": "Document",
"description": "Knowledge base entries",
"vectorizer": "none",
"properties": [
{ "name": "title", "dataType": ["text"] },
{ "name": "text", "dataType": ["text"] },
{ "name": "url", "dataType": ["text"] },
{ "name": "tenantId", "dataType": ["text"] },
{ "name": "locale", "dataType": ["text"] }
]
}
Security Matrices
| Layer | Risk | Control | Evidence |
|---|---|---|---|
| Input | Injection | Sanitizer + WAF | Regex hits, blocked count |
| Retrieve | Poisoning | Signed content | Hash/sign logs |
| Rerank | Model abuse | Rate limits | Span metrics |
| Assemble | PII leak | Redaction | Redaction logs |
| Generate | Hallucination | Citations required | Eval scores |
Governance SOPs
- Change management: proposal → review → shadow deploy → promote
- Dataset updates: lineage captured; consent; PII handling; audits
- Model changes: model card, eval diff ≥ +X% win‑rate, rollback plan
Localization and Accessibility
- Locale routing; language tags in payload; localized stopwords
- Accessibility: readable citations, keyboard focus for UI, high contrast highlights
Dataset Curation Playbook
- Source allowlist; crawler etiquette; license tracking
- Deduplication strategies (MinHash thresholds)
- Quality labels and reviewer guidelines
labels:
grounded: yes/no
authoritative: yes/no
stale: yes/no
sensitive: pii/secret/none
Comprehensive Testing Suites
Unit Tests (Assembler)
import { assemble } from "@/lib/rag";
test("dedupes by docId", () => {
const ranked = [
{ payload: { docId: "1", title: "A", text: "...", url: "u1" } },
{ payload: { docId: "1", title: "A", text: "...", url: "u1" } },
{ payload: { docId: "2", title: "B", text: "...", url: "u2" } }
];
const cards = assemble("q", ranked, { tokenBudget: 100 });
expect(cards.length).toBe(2);
});
Contract Tests (API)
- name: GET /api/rag returns citations
request: { method: POST, path: /api/rag, body: { query: "how to reset" } }
expect:
status: 200
json: { $.citations: present, $.answer: present }
SLOs and SLIs
- SLO: P95 latency ≤ 1.5s; Error rate ≤ 1%; Win‑rate ≥ baseline + 5%
- SLIs: trace spans per stage; cache hit rate; cost per request; citation coverage
Disaster Recovery
- Multi‑region replicas; snapshot embeddings/payloads; tested restore runbooks
- DNS or edge routing failover; low TTLs; warm caches on recovery
Capacity Planning
- Queries/minute projections; vector insert rates; storage growth; efSearch scaling
- Back‑of‑envelope: memory per vector with metadata; CPU per RPS for reranker (estimation sketch below)
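A back-of-envelope estimator sketch; the float32 size is exact, but the HNSW overhead factor and per-vector metadata bytes are assumptions to tune for your deployment:

// Rough memory estimate for a vector index; the constants are illustrative assumptions.
export function estimateIndexMemoryGB(opts: {
  vectors: number;        // number of stored vectors
  dims: number;           // embedding dimensionality
  metadataBytes?: number; // avg payload bytes per vector (assumed ~512)
  graphOverhead?: number; // HNSW links etc., assumed ~1.5x raw vectors
}): number {
  const { vectors, dims, metadataBytes = 512, graphOverhead = 1.5 } = opts;
  const raw = vectors * dims * 4; // float32 = 4 bytes per dimension
  const total = raw * graphOverhead + vectors * metadataBytes;
  return total / 1024 ** 3;
}

// Example: 10M vectors at 1536 dims ≈ 90 GB before quantization
// estimateIndexMemoryGB({ vectors: 10_000_000, dims: 1536 })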
Cost Calculators (Detailed)
export function costPerRequest({ tokensIn, tokensOut, model }: { tokensIn: number; tokensOut: number; model: "small" | "medium" | "large" }) {
  const price = { small: { in: 1e-6, out: 3e-6 }, medium: { in: 6e-6, out: 12e-6 }, large: { in: 12e-6, out: 24e-6 } };
  return tokensIn * price[model].in + tokensOut * price[model].out;
}
Advanced FAQ (Additional)
Q: What’s an effective cache key?
Hash of normalized query + tenant + locale + version + route (see the sketch after this FAQ).
Q: Should embeddings be encrypted at rest?
Yes; treat as sensitive if they may encode proprietary content.
Q: How to validate citations?
Automated link checkers + content hash verification against stored hash.
Q: How to schedule re‑embedding?
When model upgrades, content updates, or evaluation finds drift; incremental jobs.
Q: Do structured sources need chunking?
Often record‑level works; attach field semantics and consider entity linking.
Q: How to throttle expensive rerankers?
Queue with concurrency limits; fall back to faster reranker when under load.
Q: What if hybrid search returns conflicting results?
Prefer diversity; present options; let user disambiguate; improve rewrite.
Q: How to handle private vs public corpora?
Separate indices; strict auth at retrieval; do not mix payloads.
Q: What metrics detect poisoning?
Sudden topic drift, low quality flags, mismatch between link text and target, signature failures.
Q: How to keep token counts predictable?
Aggressive trimming by sentences; structured cards; strict token budgets per stage.
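Making the cache-key answer above concrete, a minimal sketch; the field order and digest truncation are assumptions:

// Build a stable cache key from the normalized query and routing dimensions.
import { createHash } from "crypto";

export function cacheKey(q: string, tenant: string, locale: string, version: string, route: string): string {
  const norm = q.normalize("NFKC").trim().toLowerCase();
  const digest = createHash("sha256").update(norm).digest("hex").slice(0, 16);
  return `rag:${tenant}:${locale}:${version}:${route}:${digest}`;
}

Keeping version and route in the key means a prompt or pipeline change naturally invalidates stale entries.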
Vendor Playbooks (Operational)
Pinecone Playbook
- Index sizing: start small, scale replicas on P95 > target
- Regions: minimize egress; colocate with app
- Filters: use metadata for tenant/locale, payload-only filtering for speed
runbooks:
scale:
trigger: p95_ms > 30 for 15m
steps:
- pinecone scale replicas +1
- verify health
- run smoke queries
incident-latency:
trigger: p99_ms > 60
steps:
- check routing errors
- reduce topK from 200->120
- enable response cache 5m
- notify oncall
Qdrant Playbook
- HNSW tuning: start M=32, ef=128, raise ef for recall; monitor CPU
- Segmenting: default_segment_number tuned per dataset; compact when fragments grow
# Maintenance window: updating optimizer settings triggers re-optimization (value illustrative)
curl -X PATCH http://qdrant:6333/collections/kb \
  -H 'Content-Type: application/json' \
  -d '{"optimizers_config": {"default_segment_number": 6}}'
Weaviate Playbook
- Modules: disable unused vectorizers; set replication; autoschema off for control
- Graph queries: keep shallow; precompute relationships for frequent paths
Infra-as-Code (IaC) Samples
Terraform (Qdrant + App)
resource "aws_instance" "qdrant" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.large"
user_data = file("cloud-init/qdrant.yaml")
tags = { Name = "qdrant" }
}
resource "aws_lb" "app" { # ... }
resource "aws_lb_target_group" "app" { # ... }
resource "aws_lb_listener" "app" { # ... }
Kubernetes (RAG API + Reranker)
apiVersion: apps/v1
kind: Deployment
metadata: { name: rag-api }
spec:
replicas: 3
selector: { matchLabels: { app: rag-api } }
template:
metadata: { labels: { app: rag-api } }
spec:
containers:
- name: api
image: registry/rag-api:latest
resources:
requests: { cpu: "250m", memory: "256Mi" }
limits: { cpu: "500m", memory: "512Mi" }
env:
- name: VECTOR_URL
valueFrom: { secretKeyRef: { name: rag-secrets, key: vector_url } }
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: reranker }
spec:
replicas: 2
selector: { matchLabels: { app: reranker } }
template:
metadata: { labels: { app: reranker } }
spec:
containers:
- name: reranker
image: registry/reranker:latest
resources:
requests: { cpu: "1", memory: "2Gi" }
limits: { cpu: "2", memory: "4Gi" }
Monitoring Dashboards (JSON)
{
"title": "RAG Pipeline",
"panels": [
{ "type": "graph", "title": "P95 Latency", "targets": [{ "expr": "histogram_quantile(0.95, sum(rate(rag_stage_latency_bucket[5m])) by (le))" }] },
{ "type": "stat", "title": "Cache Hit %", "targets": [{ "expr": "sum(rate(rag_cache_hit_total[5m])) / sum(rate(rag_cache_total[5m])) * 100" }] },
{ "type": "graph", "title": "Cost per Request", "targets": [{ "expr": "sum(rate(rag_cost_usd[5m])) / sum(rate(rag_requests_total[5m]))" }] },
{ "type": "table", "title": "Top Errors", "targets": [{ "expr": "topk(10, increase(rag_errors_total[24h]))" }] }
]
}
End-to-End Test Suites
Smoke Tests
it("answers FAQ with citation", async () => {
const res = await fetch("/api/rag", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ query: "reset password" }) });
const json = await res.json();
expect(json.citations?.length).toBeGreaterThan(0);
expect(json.answer).toMatch(/reset/);
});
Load Tests (k6)
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = { vus: 50, duration: '10m' };
export default function() {
const res = http.post(__ENV.BASE_URL+"/api/rag", JSON.stringify({ query: "billing" }), { headers: { 'Content-Type': 'application/json' } });
check(res, { 'status 200': r => r.status === 200 });
sleep(0.2);
}
Data Governance SOPs
- Data intake: license check, source allowlist, robots.txt compliance
- Retention: default 12 months; purge requests via DSR workflow
- Access: least privilege; audit every read/write; quarterly reviews
Localization Strategies
- Per-locale indices; locale-aware rewrite (stemming, synonyms)
- Fallback chains (fr-CA → fr → en); mark citation locales (resolution sketch below)
- UI: render citations with language tags and accessible labels
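A minimal fallback-chain resolver sketch; hasIndex stands in for whatever registry knows which locale indices exist:

// Walk the fallback chain (e.g. fr-CA → fr → en) until an index exists for the locale.
export function resolveLocale(
  requested: string,
  hasIndex: (locale: string) => boolean,
  defaultLocale = "en"
): string {
  const chain: string[] = [];
  let cur = requested;
  while (cur) {
    chain.push(cur);
    const i = cur.lastIndexOf("-");
    cur = i > 0 ? cur.slice(0, i) : ""; // strip region/script suffix one step at a time
  }
  chain.push(defaultLocale);
  return chain.find(hasIndex) ?? defaultLocale;
}

// resolveLocale("fr-CA", l => ["fr", "en"].includes(l)) === "fr"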
40 Advanced FAQs
Q: How do we handle multi‑KB contexts without cost explosions?
Use structured cards + smart trimming, token budgets per stage, and small-model-first routing.
Q: How to ensure determinism for compliance answers?
Fixed prompts, strict citation rules, frozen indices for compliance content, and versioned prompts.
Q: What if the reranker model is down?
Fall back to lexical boosts + diverse sampling; lower topK; note degraded mode in responses.
Q: How to solve "parroting" duplicate content?
Downweight duplicates at retrieval; dedupe in assembly; diversify by source.
Q: How to prevent long-tail latency spikes?
Cap tool-call time; limit max candidates; circuit-break on generation.
Q: Best way to log without leaking PII?
Hash + redact; store diffs; keep raw prompts in a quarantined lake with access approvals.
Q: How to enforce "citation required"?
A verifier function scans the answer for citation markers and matching context IDs; reject otherwise (see the verifier sketch after this list).
Q: Can we stream partial answers with citations?
Yes—stream the answer and append citations at the end; or stream footnote numbers and resolve them later.
Q: Do we need query understanding models?
Rewrite often suffices; for complex domains add intent classifiers trained on traffic.
Q: How to measure real ROI?
Deflection, time-to-resolution, cost/request, win-rate deltas vs baseline, and satisfaction scores.
Q: How to handle regulatory deletes (Right to be Forgotten)?
Track provenance; purge chunks by ID; reindex; invalidate caches; verify deletion reports.
Q: How to control stale content?
Add published_at; decay scores; exclude beyond TTL unless explicitly requested.
Q: How to balance precision vs recall?
Tune ef/topK and reranker thresholds; use per‑query-type configurations.
Q: Is BM25 still necessary?
Yes; hybrid consistently wins for precise ID queries and acronyms.
Q: How to reduce cold-start latency?
Warm caches, preload embeddings for frequent queries, keep the small-model route hot.
Q: How to secure tenant isolation?
Separate collections or strict payload filters plus auth checks; sign tokens with tenant claims.
Q: How to detect bad citations?
A parser validates URLs/anchors; a content-hash mismatch triggers exclusion; add QA tasks.
Q: How to do branch previews safely?
Use ephemeral indices; name them by branch; restrict to reviewers; auto-delete on merge.
Q: Recommended batch sizes?
Embeddings 32–128; reranker batches depend on the GPU; monitor throughput/latency.
Q: Logging schema?
Trace ID, tenant, region, route, tokens, cost, spans, selected candidates, citations; privacy flags.
Q: Prompt versioning?
Store hash + metadata; tie to the model route; roll back with config.
Q: Which reranker model?
Start with a MiniLM cross‑encoder; upgrade to bge-reranker when quality requires; test.
Q: Chunk overlap 10% or 20%?
Start at 10–15% for prose; less for FAQs; validate on evals.
Q: Tables vs text?
Extract structured tables and store them as JSON for targeted lookup; include both views.
Q: What about code docs?
Chunk by function/module; include signatures and imports; preserve formatting.
Q: Personalization?
Use tenant/user tags to bias retrieval; preserve privacy; avoid overfitting responses.
Q: How to compose evidence packs?
Bundle citations with hashes and timestamps; export as PDF/ZIP for audits.
Q: Abuse/attack telemetry?
Track blocked patterns, tool denials, WAF hits; alert on spikes.
Q: Model drift indicators?
Win‑rate drop, refusal misfires, tone/style changes; run periodic evals.
Q: Can we precompute answers?
Yes, for FAQs; invalidate on content updates; store in a fast KV.
Q: A/B test pitfalls?
Beware novelty effects; run long enough; segment by tenant and query type.
Q: Guardrail failure handling?
Refuse with an explanation; prompt the user for a safer query; log the event for tuning.
Q: How to attribute cost across teams?
Tag traces with team/tenant; chargeback reports; budgets.
Q: Which embedding model?
Use state-of-the-art with stable latency; ensure licensing allows domain use.
Q: How to prevent API metadata leakage?
Mask secrets in payloads; keep minimal metadata; audit fields.
Q: Disaster game days?
Simulate index outages and reranker failures; document recovery times.
Q: Pagination of results?
Provide a top answer and "More results" with ranked cards; allow user feedback.
Q: How to keep long-running conversations grounded?
Regularly refresh context from retrieval; trim thread memory; re-ask clarifying questions.
Q: Legal disclaimers in regulated domains?
Add domain-specific disclaimers; route high-risk queries to human escalation.
Q: What about multimodal (images/tables)?
Use multimodal embeddings; extract image captions; include alt text and OCR; link back to the source.
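Picking up the "citation required" item above, a minimal verifier sketch; the [^n] marker format matches the prompts library later in this post, and the shapes are assumptions:

// Reject answers whose citation markers don't resolve to assembled context cards.
type ContextCard = { citationId: string; url: string };

export function verifyCitations(answer: string, context: ContextCard[]): boolean {
  const markers = [...answer.matchAll(/\[\^(\d+)\]/g)].map(m => Number(m[1]));
  if (markers.length === 0) return false; // citation required
  // Footnote numbers must map 1..N onto the assembled cards
  return markers.every(n => n >= 1 && n <= context.length);
}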
Dataset Catalogs (Templates and Examples)
catalog:
title: "Company Knowledge Base"
owners: ["docs@company.com", "platform@company.com"]
sources:
- id: kb-product
type: website
url: https://docs.company.com/
license: proprietary
crawl:
depth: 3
include: ["/guides/", "/faq/"]
exclude: ["/admin/"]
preprocess:
readability: true
code_blocks: preserve
tables: extract
chunk:
strategy: headings
max_tokens: 600
overlap: 80
metadata:
product: core
locale: en
- id: kb-legal
type: pdf
path: s3://kb/legal/*.pdf
license: proprietary
ocr: true
preprocess:
detect_columns: true
captions: attach
chunk:
strategy: paragraphs
max_tokens: 450
overlap: 60
metadata:
category: legal
locale: en
- id: kb-support-fr
type: website
url: https://support.company.fr/
locale: fr
chunk:
strategy: headings
max_tokens: 500
overlap: 70
OpenAPI Tool Library (Function Calling Specs)
{
"tools": [
{
"name": "get_user",
"description": "Fetch user profile by ID",
"parameters": {
"type": "object",
"properties": { "id": { "type": "string" } },
"required": ["id"]
}
},
{
"name": "create_ticket",
"description": "Create support ticket",
"parameters": {
"type": "object",
"properties": {
"title": { "type": "string" },
"body": { "type": "string" },
"severity": { "type": "string", "enum": ["low","medium","high"] }
},
"required": ["title","body"]
}
},
{
"name": "list_invoices",
"description": "List invoices for account",
"parameters": {
"type": "object",
"properties": { "accountId": { "type": "string" }, "limit": { "type": "number" } },
"required": ["accountId"]
}
}
]
}
// validators.ts
export function enforceToolPolicy(name: string, params: Record<string,unknown>) {
if (name === "get_user") {
if (!/^usr_[a-z0-9]{8}$/.test(String((params as any).id))) throw new Error("invalid id");
}
if (name === "create_ticket") {
if (String((params as any).title).length < 5) throw new Error("title too short");
}
}
Prompts Library (Operations-Ready)
System (RAG):
You are a retrieval-augmented assistant. Always cite sources as [^n] with links.
If context is insufficient, say "I don’t have enough information" and propose next steps.
System (Rewrite):
You normalize queries: fix spelling, expand acronyms, add synonyms; keep meaning.
Do not fabricate facts.
System (Safety):
Refuse unsafe requests (self-harm, illegal, privacy violations). Be brief and offer safe alternatives.
System (Assembler):
Summarize cards into concise, non-redundant bullets with citations [^n]. Keep technical terms.
100-Case Evaluation Suite (YAML)
suite: kb_eval_v1
items:
- id: q001
query: "Reset MFA device"
expected:
contains: ["MFA", "reset", "admin"]
citations: 1
- id: q002
query: "Pricing for enterprise tier"
expected:
contains: ["enterprise", "pricing"]
citations: 1
- id: q003
query: "GDPR DPA address"
expected:
contains: ["DPA", "address", "EU"]
citations: 1
- id: q004
query: "Error code XY-1234 troubleshooting"
expected:
contains: ["XY-1234", "steps", "logs"]
citations: 1
# ... add q005–q100 with detailed expectations
// eval-runner.ts
// Assumes a bundler/loader that can import YAML as JSON.
import { runEvalItem } from "./eval-lib";
import items from "./kb_eval_v1.yaml";
(async function main() {
  let pass = 0;
  let total = 0;
  for (const it of items.items) {
    total++;
    const r = await runEvalItem(it);
    if (r.pass) pass++;
    console.log(it.id, r.pass ? "PASS" : "FAIL", r.metrics);
  }
  console.log("win-rate:", pass / total);
})();
Language Samples (Java)
@RestController
public class RagController {
@PostMapping("/rag")
public Map<String,Object> rag(@RequestBody Map<String,Object> body) {
String query = (String) body.get("query");
String q1 = Rewrite.normalize(query);
float[] vec = Embeddings.embed(q1);
List<Candidate> cands = VectorStore.search("kb", vec, 128);
List<Candidate> ranked = Reranker.rank(q1, cands, 10);
List<Card> context = Assembler.assemble(q1, ranked, 2000);
String answer = LLM.generate(q1, context);
return Map.of("answer", answer, "context", context);
}
}
CI Pipelines (GitHub Actions)
name: rag-ci
on:
push:
branches: [ main ]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm run lint
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm test -- --ci
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: node eval-runner.js
Governance Docs (Model Cards, Changes)
# Model Card — RAG Generator v0.7
- Provider: gpt-4o-mini
- Safety: refusal rate 4.1% (target 3–6%)
- Win‑rate vs baseline: +7.8%
- Known limitations: cites only included context; may refuse borderline requests
- Change log:
- 0.7: raised refusal threshold; tightened citation regex
- 0.6: improved assembly compression
Extra Advanced FAQs (Selection)
Q: How do we A/B test prompts safely?
Use a prompt registry with version IDs; assign arms per user; cap traffic; auto‑rollback on metric regression.
Q: What about PII in vector embeddings?
Redact before embedding; hash replacements; keep the mapping in a secure KMS; add a PII detector at ingest.
Q: Can we throttle per‑tenant cost?
Yes—track cost at the trace level; enforce budgets; return a graceful fallback on overage.
Q: How to trace across microservices?
Propagate W3C traceparent; attach tenant/route; emit spans at each stage.
Q: Is JSON-LD needed in responses?
Not necessary for RAG, but useful for SEO when rendering knowledge base answers on the web.
Q: How to combine code and prose context?
Keep code blocks intact; present them as separate cards; instruct the model to quote code carefully.
Q: Should we store raw HTML?
Store both HTML and extracted text; use the HTML for anchors and accurate citations.
Q: How to guard against cache poisoning?
Key by normalized query and tenant; sign values; short TTL; validate on read.
Q: What if the reranker disagrees with business priority?
Add source boosts/penalties; combine rerank score + source weight; document the policy.
Q: How to rapidly iterate?
Shadow-deploy pipeline changes; collect metrics; promote with change freezes for high‑risk periods.
Q: Multi‑cloud indices?
Prefer single‑cloud per region for simplicity; replicate across clouds only for critical DR.
Q: GPU or CPU for the reranker?
GPU for heavy cross‑encoders; CPU can suffice for MiniLM at modest QPS.
Q: Batch vs streaming ingest?
Both—CDC for change data, batch for rebuilds; ensure idempotency and backpressure handling.
Q: How to present uncertainty?
Include a confidence score; let the user view sources; provide "Was this helpful?" feedback.
Q: Are denial lists effective?
They help but are insufficient alone; combine detectors, allowlists, and policies.
Q: Can quantization reduce storage costs?
Yes—store compressed vectors when recall remains acceptable.
Q: Time‑boxed generation?
Set max_tokens and timeouts; degrade to a summary when time is exhausted.
Q: Proxy vendors?
Helicone/Langfuse can help with analytics and cost tracking; validate privacy and SLAs.
Q: Can we localize the reranker?
Use multilingual rerankers; language detection in rewrite; locale‑aware thresholds.
Q: What about images in RAG?
Use multimodal retrieval; generate alt text for accessibility; cite image sources clearly.
OpenAPI Specification (RAG API)
openapi: 3.0.3
info:
title: RAG API
version: 1.0.0
paths:
/rag:
post:
summary: Answer query using retrieval-augmented generation
requestBody:
required: true
content:
application/json:
schema:
type: object
properties:
query: { type: string }
tenantId: { type: string }
locale: { type: string }
topK: { type: integer, default: 10 }
required: [query, tenantId]
responses:
'200':
description: Success
content:
application/json:
schema:
type: object
properties:
answer: { type: string }
citations:
type: array
items: { type: string }
context:
type: array
items:
type: object
properties:
title: { type: string }
snippet: { type: string }
url: { type: string }
'400': { description: Invalid input }
'429': { description: Budget exceeded }
'500': { description: Server error }
Full Next.js API Implementation (Server Actions + Tracing)
// app/api/rag/route.ts
import { NextRequest } from "next/server";
import { kv } from "@vercel/kv";
import { index } from "@/lib/pinecone";
import { crossEncode } from "@/lib/rerank";
import { trace, span } from "@/lib/otel";
import { withBudget } from "@/lib/budget";
export const runtime = "nodejs";
function normalize(q: string){ return q.normalize("NFKC").trim(); }
// hash() below is assumed to be a stable string digest helper (e.g. sha256 hex)
async function embed(q: string){ /* call embedding */ return new Array(1536).fill(0); }
export async function POST(req: NextRequest){
  return withBudget(() => trace("rag", async () => {
const body = await req.json();
const query = normalize(String(body.query || ""));
const tenantId = String(body.tenantId || "");
const locale = String(body.locale || "en");
const topK = Number(body.topK || 10);
if (!query || !tenantId) return Response.json({ error: "invalid" }, { status: 400 });
const cacheKey = `rag:${tenantId}:${locale}:${topK}:${hash(query)}`;
const cached = await kv.get(cacheKey);
if (cached) return Response.json(cached);
const vec = await span("embed", () => embed(query));
const candidates = await span("retrieve", async () => {
const res = await index.query({ topK: 200, vector: vec, filter: { tenantId, locale } });
return res.matches?.map((m:any) => ({ id: m.id, score: m.score, payload: m.metadata })) || [];
});
const ranked = await span("rerank", async () => {
const texts = candidates.map((c:any) => c.payload.text);
const scores = await crossEncode(query, texts);
return candidates
.map((c:any, i:number) => ({ ...c, score: scores[i] }))
.sort((a:any,b:any)=>b.score-a.score)
.slice(0, topK);
});
const context = await span("assemble", async () => {
const seen = new Set<string>();
const out:any[] = [];
for (const r of ranked){
if (seen.has(r.payload.docId)) continue;
out.push({ title: r.payload.title, snippet: r.payload.text.slice(0, 800), url: r.payload.url });
seen.add(r.payload.docId);
}
return out;
});
const answer = await span("generate", async () => {
return `Answer (locale=${locale}):\n` + context.map((c,i)=>`[^${i+1}] ${c.title}`).join("\n");
});
const result = { answer, citations: context.map((c)=>c.url), context };
await kv.set(cacheKey, result, { ex: 300 });
return Response.json(result);
}));
}
Python Client (Typed)
from typing import List, TypedDict

import httpx

class Card(TypedDict):
    title: str
    snippet: str
    url: str

class RagResponse(TypedDict):
    answer: str
    citations: List[str]
    context: List[Card]

class RagClient:
    def __init__(self, base_url: str, tenant_id: str, timeout: float = 8.0):
        self.base_url = base_url.rstrip('/')
        self.tenant_id = tenant_id
        self.timeout = timeout

    async def ask(self, query: str, locale: str = "en", top_k: int = 10) -> RagResponse:
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            r = await client.post(f"{self.base_url}/rag", json={
                "query": query,
                "tenantId": self.tenant_id,
                "locale": locale,
                "topK": top_k
            })
            r.raise_for_status()
            return r.json()  # type: ignore
Helm Chart Values (Excerpt)
api:
image: registry/rag-api:1.2.3
replicaCount: 3
resources:
requests: { cpu: 250m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }
env:
VECTOR_URL: https://pinecone.io/...
HELICONE_PROXY: https://proxy.helicone.ai/
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 60
reranker:
image: registry/reranker:0.9.0
resources:
requests: { cpu: 1, memory: 2Gi }
limits: { cpu: 2, memory: 4Gi }
k6 Load Testing Variations
import http from 'k6/http';
import exec from 'k6/execution';
import { check, sleep } from 'k6';
export const options = {
scenarios: {
baseline: { executor: 'ramping-vus', startVUs: 1, stages: [ { duration: '2m', target: 50 } ] },
spike: { executor: 'constant-arrival-rate', rate: 100, timeUnit: '1s', duration: '5m', preAllocatedVUs: 200 },
}
};
export default function () {
const q = __ENV.QUERY || 'reset password';
const res = http.post(`${__ENV.BASE}/api/rag`, JSON.stringify({ query: q, tenantId: 't_demo' }), { headers: { 'Content-Type': 'application/json' } });
check(res, { '200': r => r.status === 200, 'has answer': r => !!r.json('answer') });
if (res.status !== 200) exec.test.abort('non-200');
sleep(0.2);
}
30 More Advanced FAQs
Q: Can we run reranker and retrieval in parallel?
Usually retrieve first, but you can prefetch alternate retrieval strategies in parallel to reduce time-to-first-token.
Q: How to maintain a consistent tone across answers?
Use tone guidelines in the system prompt; add a small post-process normalizer; sample answers in QA.
Q: Detect and remove dead links?
Nightly link-check jobs; drop or replace citations; add cached copies.
Q: How to minimize cross-region latency?
Anycast edge routing; per-region indices; geo-aware caches.
Q: How to record precise token accounting?
Read usage from the model response; track tokens.in/out in spans; budget-enforcement middleware.
Q: Multi-tenant traffic isolation?
Per-tenant queues and budgets; fair scheduling; tenant-level throttles.
Q: Can we stream context along with the answer?
Yes—progressively reveal which cards are currently influencing the answer.
Q: How to limit hallucinations on numeric answers?
Prefer retrieval of structured data; add validators for numeric ranges; refuse when confidence is low.
Q: Is LLM caching safe?
Cache only non-personalized, safe answers; include prompt hash + route + locale in the key.
Q: Should we attach JSON metadata to answers?
Helpful for clients—return citations, confidence, and route info.
Q: Versioning indices?
Tag every chunk with index_version; use it during assembly for stable citations.
Q: How to audit who saw which content?
Add source IDs to logs; get legal/privacy review for access-tracking compliance.
Q: How to autoscale the reranker?
HPA on queue depth and latency; GPU node pools with PDBs; pre-warm pods.
Q: Prevent path traversal in a file tool?
Normalize paths; enforce allowed prefixes; always read-only.
Q: Is batching dangerous?
Ensure per-query isolation; never leak context across batch items.
Q: Runtime feature flags?
Flags for topK, model route, cache TTL; safe to toggle; trace the flags.
Q: Quiet hours / surge control?
Throttle LLM calls; fall back to a lexical-only flow under extreme load.
Q: Secret management?
KMS + dynamic creds; never hardcode; rotate on incident or quarterly.
Q: Evaluate against user feedback?
Incorporate helpful/unhelpful votes; correlate with offline metrics; adjust thresholds.
Q: Legal holds?
Freeze indices and logs for the hold scope; track chain-of-custody.
Q: Can we share chunks across tenants?
Only public content; otherwise maintain strict isolation.
Q: Disable the reranker per tenant?
Expose per-tenant config; verify the SLO impact.
Q: Sanitizing markdown/script blocks?
Strip script tags; sanitize HTML; escape where rendering.
Q: Is model distillation applicable?
Yes—distill to a smaller generator with RAG prompts for cost savings.
Q: How to mitigate vendor lock-in?
Abstract embedding/reranker/generator behind adapters; keep data in open formats.
Q: Avoid an explosion of indices?
Group small tenants; shard per large tenant; automate lifecycle.
Q: Prompt rotation risks?
Treat prompts as code; code review; eval before rollout; canary.
Q: Negative queries?
Return a safe refusal with suggestions; log patterns to improve content.
Q: Use knowledge graphs?
Augment retrieval for relational queries; costly to build, but helpful for certain domains.
Q: Accessibility in UI?
Citations as accessible footnotes; keyboard navigation; screen-reader-friendly labels.