NLP with Transformers: Practical Guide (2025)
Transformers power modern NLP across tasks. This guide focuses on practical implementation using Hugging Face Transformers, with clear recipes for training, evaluation, and deployment.
Executive summary
- Start with strong baselines (BERT/RoBERTa/DeBERTa for encoders; T5/BART for seq2seq; Llama-class models for generative tasks)
- Prefer parameter-efficient fine-tuning (LoRA/QLoRA) and distilled models for latency
- Evaluate with task-specific metrics; monitor drift and toxicity/safety in production
Core tasks
Text classification
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
# tokenize + Trainer omitted for brevity
Question answering
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained("microsoft/deberta-base")
Summarization
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
Efficiency strategies
- Distillation (TinyBERT, DistilBERT), quantization (8/4-bit), pruning
- Batch inference, dynamic batching, sequence bucketing (see the bucketing sketch below)
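A minimal sketch of length-bucketed batch inference; tok and model stand for any loaded classification tokenizer/model pair, and the helper name is illustrative:
import torch
def bucketed_predict(texts, tok, model, batch_size=32):
    # Sort by length so each padded batch wastes fewer pad tokens (sequence bucketing).
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    preds = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        enc = tok([texts[i] for i in idx], return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        for i, p in zip(idx, logits.argmax(-1).tolist()):
            preds[i] = p
    return preds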
Evaluation
- Classification: accuracy/F1/AUROC; calibrate thresholds (a threshold-sweep sketch follows this list)
- QA: EM/F1; Summarization: ROUGE/BERTScore; Toxicity: perspective-like proxies
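A small threshold-calibration sketch for binary classification; y_val and p_val are assumed arrays of held-out labels and positive-class probabilities:
import numpy as np
from sklearn.metrics import f1_score
def best_threshold(y_val, p_val):
    # Sweep candidate thresholds and keep the one that maximizes F1 on the validation split.
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])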
Deployment
- REST/gRPC microservices; async batching; autoscale by QPS
- Content moderation and safety filters; rate limiting; audit
Monitoring and drift
- Track input distributions, label frequencies, error clusters (a drift-check sketch follows below)
- Add online feedback loops; periodic re-training triggers
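A minimal drift-check sketch using the population stability index (PSI) over a numeric input feature such as text length; bin edges come from the reference window:
import numpy as np
def psi(reference, current, bins=10, eps=1e-6):
    # Population Stability Index between a reference sample and a current sample of the same feature.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / max(1, len(reference)) + eps
    cur_pct = np.histogram(current, bins=edges)[0] / max(1, len(current)) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.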
FAQ
Q: Which model should I start with?
A: For encoders: RoBERTa/DeBERTa; for seq2seq: FLAN‑T5; for generative chat: Llama‑3/4‑class equivalents.
What This Guide Covers
This guide is a practical, production-focused playbook for modern NLP with Transformers: data preparation, tokenization, model selection, training and evaluation, inference optimization, deployment, monitoring, cost control, and governance, with ready-to-use code.
Tokenization (BPE / WordPiece)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
tok = Tokenizer(BPE(unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["<pad>","<s>","</s>","<unk>"])
tok.train(files=["corpus.txt"], trainer=trainer)
tok.save("bpe.json")
from transformers import AutoTokenizer
wp = AutoTokenizer.from_pretrained("bert-base-uncased")
wp("Transformers are great for NLP!")
Datasets and Preprocessing
from datasets import load_dataset
raw = load_dataset("imdb")
# For fine-tuning a pretrained checkpoint, use its own tokenizer (the custom BPE above is for training tokenizers from scratch)
hf_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess(batch):
    return hf_tok(batch["text"], truncation=True, padding="max_length", max_length=256)
proc = raw.map(preprocess, batched=True, remove_columns=["text"])
proc = proc.rename_column("label", "labels").with_format("torch")
Hugging Face Transformers (Trainer)
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
args = TrainingArguments(
output_dir="out",
per_device_train_batch_size=16,
per_device_eval_batch_size=32,
learning_rate=2e-5,
num_train_epochs=3,
eval_strategy="steps",
logging_steps=100,
save_steps=1000,
fp16=True
)
trainer = Trainer(model=model, args=args, train_dataset=proc["train"], eval_dataset=proc["test"])
trainer.train()
Accelerate (Multi-GPU)
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)
for batch in train_dl:
    with accelerator.autocast():
        out = model(**batch)
        loss = out.loss
    accelerator.backward(loss)
    optimizer.step(); optimizer.zero_grad()
Core Tasks
Text Classification
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)
Named Entity Recognition (NER)
from transformers import AutoModelForTokenClassification
ner = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)
Question Answering (Extractive)
from transformers import AutoModelForQuestionAnswering
qa = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased")
Summarization
from transformers import pipeline
summ = pipeline("summarization", model="facebook/bart-large-cnn")
summ("Long article...")
Translation
from transformers import pipeline
trans = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
trans("Hello world!")
PyTorch Training Loop
import torch
from torch.utils.data import DataLoader
train_dl = DataLoader(proc["train"], batch_size=16, shuffle=True)
model.train(); opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
for batch in train_dl:
    out = model(**{k: v.to(model.device) for k, v in batch.items() if k in ["input_ids","attention_mask","labels"]})
    loss = out.loss
    opt.zero_grad(); loss.backward(); opt.step()
Evaluation Metrics
from sklearn.metrics import accuracy_score, f1_score
def cls_metrics(pred):
    y_true = pred.label_ids
    y_pred = pred.predictions.argmax(-1)
    return {"acc": accuracy_score(y_true, y_pred), "f1": f1_score(y_true, y_pred, average="weighted")}
# ROUGE for summarization, BLEU/SacreBLEU for translation (datasets.load_metric has been removed; use the evaluate library)
import evaluate
rouge = evaluate.load("rouge")
sacre = evaluate.load("sacrebleu")
# Perplexity (eval_loss is the mean cross-entropy from the evaluation loop)
import math
ppl = math.exp(eval_loss)
Prompting and In-Context Learning (ICL)
- Provide few-shot examples with consistent formatting (see the prompt-builder sketch after this list)
- Use system messages to set behavior and constraints
- For structured outputs, request JSON and validate schema
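A small helper (hypothetical, not a library API) that assembles a few-shot prompt with consistent formatting and a JSON-output instruction:
def build_fewshot_prompt(instruction, examples, query):
    # examples: list of (input, output) pairs rendered with identical formatting
    lines = [instruction, 'Respond with JSON only: {"label": "..."}']
    for x, y in examples:
        lines.append(f"Input: {x}\nOutput: {y}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)
# e.g. build_fewshot_prompt("Classify sentiment.", [("great film", '{"label": "positive"}')], "dull plot")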
LoRA / PEFT Fine-Tuning
from peft import LoraConfig, get_peft_model
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj","v_proj"])
model = get_peft_model(model, config)
Distillation and Quantization
# Distillation sketch: teacher → student logits
student_out = student(**batch)
with torch.no_grad(): teacher_out = teacher(**batch)
loss = kd_loss(student_out.logits, teacher_out.logits) + ce_loss(student_out.logits, batch["labels"])
# Dynamic quantization (PyTorch)
from torch.quantization import quantize_dynamic
qmodel = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
Retrieval Augmentation Hooks (Simple)
def retrieve_context(query: str):
    # call vector store or BM25 index
    return "context snippets..."
def build_prompt(q: str):
    ctx = retrieve_context(q)
    return f"Use CONTEXT to answer. CONTEXT: {ctx}\nQ: {q}\nA:"
Deployment (FastAPI)
from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
app = FastAPI()
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mdl = AutoModelForSequenceClassification.from_pretrained("./out").eval()
@app.post("/classify")
def classify(body: dict):
enc = tok(body["text"], return_tensors="pt", truncation=True, padding=True)
with torch.no_grad(): out = mdl(**enc)
probs = out.logits.softmax(-1).tolist()[0]
return {"probs": probs}
ONNX Runtime / TFLite
import onnxruntime as ort
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("./saved")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
KServe / Triton Serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: nlp, namespace: ml }
spec:
  predictor:
    triton:
      storageUri: s3://bucket/models/nlp
      runtimeVersion: "23.09"
Streaming and Batching
// client-side streaming (SSE)
const src = new EventSource('/api/stream');
src.onmessage = (e) => render(e.data)
# server-side dynamic batching sketch
queue = []
if len(queue) >= BATCH or waited_ms > MAX_WAIT: run_batch(queue)
Monitoring (Prometheus/OTEL)
import client from 'prom-client'
const latency = new client.Histogram({ name: 'nlp_latency_seconds', help: 'latency', buckets: [0.01,0.05,0.1,0.2,0.5,1,2] })
span.setAttributes({ 'model': 'distilbert', 'route': '/classify' })
Cost Modeling
model,price_in,price_out,avg_in_tokens,avg_out_tokens,cost_usd_per_req
small,0.000001,0.000002,200,50,0.0003
medium,0.000004,0.000012,500,150,0.0038
Security and Privacy (PII Redaction)
const PII = [/\b\d{3}-\d{2}-\d{4}\b/, /\b\d{16}\b/, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/]
export function redact(s: string){ return PII.reduce((a,r)=>a.replace(r,'[REDACTED]'), s) }
Call to Action
Need help shipping NLP systems? We design data pipelines, train/optimize models, and deploy reliable NLP services with monitoring and guardrails.
Extended FAQ (Part 1)
- Which model for classification? Start with DistilBERT/roberta-base; fine-tune.
- Sequence length? Set to cover the 95th percentile; consider long-context models.
- Batch size vs GPU memory? Find the max without OOM; use gradient accumulation.
- Mixed precision? Enable FP16/bfloat16; validate stability.
- Tokenization mismatches? Use tokenizer and model from the same family.
- Learning rate? 2e-5 to 5e-5 is typical for BERT-like models.
- Warmup steps? 5%–10% of total steps.
- Early stopping? Monitor eval loss/F1.
- ROUGE vs BLEU? ROUGE for summarization; BLEU/SacreBLEU for translation.
- Perplexity usage? Language modeling evaluation.
- Class imbalance? Weighted loss or resampling.
- Data leakage? Split on entity/author; dedupe.
- Hyperparameter sweeps? Use Optuna/W&B sweeps.
- Domain adaptation? Continue pretraining on domain data.
- Inference speed? Use ONNX Runtime, quantization, and small models.
- Long documents? Sliding window or Longformer/BigBird.
- Caching? Memoize frequent requests; TTL caches.
- Retries/timeouts? Set client/server timeouts; idempotence.
- PII handling? Redact and avoid logging raw text.
- SLOs? Latency p95 and error rate.
Advanced Tokenizers (SentencePiece)
spm_train --input=corpus.txt --model_prefix=spm --vocab_size=32000 --character_coverage=0.9995 --model_type=bpe --input_sentence_size=10000000 --shuffle_input_sentence=true
import sentencepiece as spm
sp = spm.SentencePieceProcessor(); sp.load('spm.model')
ids = sp.encode('Transformers at scale', out_type=int)
Data Cleaning and Normalization
import re, unicodedata
def clean(text: str) -> str:
    t = unicodedata.normalize('NFKC', text)
    t = re.sub(r"\s+", " ", t).strip()
    t = re.sub(r"[\u200B-\u200D\uFEFF]", "", t)  # zero-width
    return t
Seq2Seq Training (T5/BART)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
model_name = 't5-base'
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def preprocess(ex):
    x = tok(ex['input_text'], truncation=True, padding=False)
    y = tok(ex['target_text'], truncation=True, padding=False)
    x['labels'] = y['input_ids']
    return x
collator = DataCollatorForSeq2Seq(tok, model=model)
args = TrainingArguments('out', per_device_train_batch_size=8, eval_strategy='steps', save_steps=1000, fp16=True)
trainer = Trainer(model=model, args=args, train_dataset=train, eval_dataset=val, data_collator=collator)
trainer.train()
Span Masking for Pretraining
# T5-style span corruption
def span_mask(ids, mask_ratio=0.15, mean_span=3):
# replace spans with single sentinel tokens; build targets of spans
pass
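A minimal sketch of the span corruption outlined above, assuming sentinel ids are allocated past the regular vocabulary (real T5 preprocessing also caps the number of sentinels and works on batches):
import random
def span_mask_simple(ids, mask_ratio=0.15, mean_span=3, first_sentinel_id=32000):
    # Replace random spans with a single sentinel id; targets list each sentinel followed by its span.
    inputs, targets, i, sentinel = [], [], 0, first_sentinel_id
    while i < len(ids):
        if random.random() < mask_ratio / mean_span:
            span = ids[i:i + mean_span]
            inputs.append(sentinel)
            targets.extend([sentinel] + span)
            sentinel += 1
            i += len(span)
        else:
            inputs.append(ids[i])
            i += 1
    return inputs, targets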
Contrastive Objectives (Text/Embed)
# InfoNCE for sentence embeddings
z = encoder(x)
z = torch.nn.functional.normalize(z, dim=1)
sim = z @ z.T / tau  # tau: temperature hyperparameter (e.g., 0.05)
labels = torch.arange(z.size(0), device=z.device)
loss = torch.nn.CrossEntropyLoss()(sim, labels)
Preference Optimization (DPO/ORPO) Sketch
# DPO-like fine-tuning: maximize log p(preferred) - log p(rejected) under reference KL
# ORPO-like: combine task loss with preference loss; monitor KL divergence
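A minimal DPO-style loss sketch under those assumptions; the inputs are summed token log-probabilities of the chosen and rejected completions under the policy and a frozen reference model:
import torch.nn.functional as F
def dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward margins measured against the frozen reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the chosen completion to score higher than the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()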
PEFT Guidance
- Target modules: q_proj, v_proj (common); include k_proj/o_proj for stronger adaptation
- r (rank): 8–16 (small), 32+ (larger)
- alpha: 16–64; dropout: 0.0–0.1; tune per task
- Merge vs adapters: merge for speed; adapters for flexibility (a QLoRA loading sketch follows below)
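A QLoRA-style loading sketch tying the bullets above to the parameter-efficient path from the executive summary (checkpoint name and hyperparameters are illustrative; 4-bit loading needs bitsandbytes and a CUDA GPU):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto")  # illustrative checkpoint
base = prepare_model_for_kbit_training(base)
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, cfg)
model.print_trainable_parameters()
After training, adapters can be kept separate or folded into the base weights (PEFT's merge_and_unload), which is the merge-vs-adapters trade-off noted above.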
Retrieval Hooks
from rank_bm25 import BM25Okapi
bm25 = BM25Okapi([doc.split() for doc in corpus])
def retrieve(q): return bm25.get_top_n(q.split(), corpus, n=10)
# vector search placeholder: use FAISS/Weaviate/Qdrant
Reranking (Cross-Encoder)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(q, candidates):
    scores = reranker.predict([(q, c) for c in candidates])
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]
Structured Generation with JSON Schema and Repair
import json
from jsonschema import validate, ValidationError
schema = {"type":"object","properties":{"title":{"type":"string"},"bullets":{"type":"array","items":{"type":"string"}}},"required":["title","bullets"],"additionalProperties":False}
def repair_loop(gen_fn, prompt, schema, retries=2):
    out = gen_fn(prompt)
    for _ in range(retries + 1):
        try:
            obj = json.loads(out)
            validate(obj, schema)
            return obj
        except (json.JSONDecodeError, ValidationError):
            out = gen_fn(f"Fix JSON to match schema: {schema}. Previous: {out}")
    raise ValueError('invalid json')
Streaming Generation APIs
// SSE stream endpoint
app.post('/stream', async (req, res) => {
res.setHeader('Content-Type', 'text/event-stream')
for await (const token of generate(req.body)) res.write(`data: ${token}\n\n`)
res.end()
})
FastAPI / gRPC Servers
from fastapi import FastAPI
from pydantic import BaseModel
class Body(BaseModel): text: str
app = FastAPI()
@app.post('/classify')
async def classify(b: Body):
    enc = tok(b.text, return_tensors='pt')
    with torch.no_grad():
        out = model(**enc)
    return {'probs': out.logits.softmax(-1).tolist()[0]}
# grpc server stub (sketch)
KServe Transformer / Explainer
from kserve import Model, ModelServer
class PrePost(Model):
    async def preprocess(self, payload): return payload
    async def postprocess(self, out): return out
ModelServer().start([PrePost('nlp')])
Optimization (ONNX/TensorRT)
python -m onnxsim model.onnx model_sim.onnx
trtexec --onnx=model_sim.onnx --saveEngine=model.plan --fp16
Cache Strategies
const cache = new Map<string, any>()
function key(txt: string){ return hash(txt) }
export async function cachedGen(txt: string){ const k = key(txt); if (cache.has(k)) return cache.get(k); const r = await gen(txt); cache.set(k,r); return r }
A/B Testing Harness
function bucket(id: string){ return hash(id) % 2 ? 'A' : 'B' }
Prometheus / Grafana / PromQL
const cost = new client.Counter({ name: 'nlp_cost_usd_total', help: 'cost' })
histogram_quantile(0.95, sum by (le) (rate(nlp_latency_seconds_bucket[5m])))
sum by (route) (rate(nlp_cost_usd_total[1m]))
OpenTelemetry Traces
span.setAttributes({ 'nlp.model': 't5-base', 'nlp.task': 'summarization' })
span.addEvent('tokenize', { ms: 3 })
span.addEvent('generate', { ms: 120 })
Rate Limits and Budgets
import { RateLimiterMemory } from 'rate-limiter-flexible'
const rl = new RateLimiterMemory({ points: 60, duration: 60 })
PII Detection / Redaction
export function redact(text: string){
const patterns = [/(\d{3}-\d{2}-\d{4})/g, /\b\d{16}\b/g, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g]
return patterns.reduce((acc,r)=>acc.replace(r,'[REDACTED]'), text)
}
Compliance SOPs
- Consent and purpose limitation documented
- Retention policy (raw vs derived text)
- PII redaction at ingress and before persistence
- Audit logs with hashed IDs
CI/CD Pipelines
name: nlp-ci
on: [push]
jobs:
  test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest -q
      - run: python export_onnx.py
      - run: docker build -t registry/nlp:$(git rev-parse --short HEAD) .
Cost Calculators
scenario,req_per_day,tokens_in,tokens_out,price_in,price_out,cost_usd_day
base,250000,500,100,0.000004,0.000012,?
Runbooks
Latency Spike
- Check model size and batching; switch to ONNX/TensorRT
- Reduce max tokens; cache hot prompts
Cost Spike
- Inspect route traffic; enforce budgets; smaller models
Extended FAQ (Part 2)
- Which model for NER? BERT base cased fine-tuned on CoNLL.
- GPU vs CPU? GPU for training and large-model inference; CPU for small models.
- How to reduce hallucinations? RAG with good retrieval and reranking.
- Improving summarization? Constrain length; domain fine-tuning.
- Long-context? Use Longformer/BigBird or chunking.
- JSON correctness? Schema validation + repair loop.
- Safe prompts? Explicit refusals; avoid secrets.
- Drift detection? Monitor distributions and confidence.
- Embeddings? Use sentence-transformers for search.
- Token budget? Trim inputs; compress; cache.
Data Pipelines at Scale (Cleaning, Filtering, Dedupe, Lang-ID)
import re, unicodedata
from lingua import Language, LanguageDetectorBuilder
detector = LanguageDetectorBuilder.from_all_languages().build()
def normalize_text(t: str) -> str:
    t = unicodedata.normalize('NFKC', t)
    t = re.sub(r"[\u200B-\u200D\uFEFF]", "", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t
def is_english(t: str) -> bool:
    lang = detector.detect_language_of(t)
    return lang is not None and lang.iso_code_639_1.name.lower() == 'en'
seen = set()
def dedupe_key(t: str) -> str:
    return re.sub(r"\W+", "", t.lower())[:128]
# Filtering pipeline (sketch; stream_jsonl is assumed to yield dicts with a 'text' field)
def filter_corpus(path='corpus.jsonl'):
    for rec in stream_jsonl(path):
        txt = normalize_text(rec['text'])
        if len(txt) < 32: continue
        if not is_english(txt): continue
        k = dedupe_key(txt)
        if k in seen: continue
        seen.add(k)
        yield {'text': txt}
SentencePiece Training Script (Full)
spm_train \
--input=clean_corpus.txt \
--model_prefix=spm_en_32k \
--vocab_size=32000 \
--model_type=bpe \
--character_coverage=0.9995 \
--input_sentence_size=20000000 \
--shuffle_input_sentence=true \
--num_threads=8
import sentencepiece as spm
sp = spm.SentencePieceProcessor(); sp.load('spm_en_32k.model')
print(sp.encode('NLP at scale', out_type=int))
Seq2Seq Recipes: Summarization (BART/T5)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset
raw = load_dataset('cnn_dailymail', '3.0.0')
model_id = 'facebook/bart-base'
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
max_in, max_out = 1024, 128
def prep(ex):
    x = tok(ex['article'], truncation=True, max_length=max_in)
    y = tok(text_target=ex['highlights'], truncation=True, max_length=max_out)  # as_target_tokenizer() is deprecated
    x['labels'] = y['input_ids']
    return x
ds = raw.map(prep, batched=True, remove_columns=raw['train'].column_names)
collate = DataCollatorForSeq2Seq(tok, model=model)
args = TrainingArguments('out/bart', per_device_train_batch_size=4, gradient_accumulation_steps=8, lr_scheduler_type='cosine', learning_rate=3e-5, num_train_epochs=3, eval_strategy='steps', save_steps=1000, fp16=True)
trainer = Trainer(model=model, args=args, train_dataset=ds['train'], eval_dataset=ds['validation'], data_collator=collate)
trainer.train()
Translation (MarianMT) and QA (T5) Sketches
from transformers import MarianMTModel, MarianTokenizer
src_tgt = 'Helsinki-NLP/opus-mt-en-de'; tok = MarianTokenizer.from_pretrained(src_tgt); mt = MarianMTModel.from_pretrained(src_tgt)
from transformers import T5ForConditionalGeneration
# input: question + context; output: answer
Pretraining Objectives (MLM/Span/UL2)
# MLM masking
import torch
mask_prob = 0.15
mask_token_id = tok.mask_token_id
x = batch['input_ids'].clone()
mask = torch.rand_like(x.float()) < mask_prob
x[mask] = mask_token_id
# UL2 variants: short infilling vs long span corruption (concept)
DPO/ORPO Fine-Tuning (Sketch)
# DPO: given (prompt, chosen, rejected)
# optimize log p(chosen) - log p(rejected) under reference KL
# ORPO: L_total = L_task + beta * L_preference; monitor divergence
PEFT Recipes (LoRA/IA3/DoRA)
from peft import LoraConfig, get_peft_model
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=['q_proj','v_proj'])
model = get_peft_model(model, cfg)
- IA3: multiplicative adapters on attention/FFN
- DoRA: decomposed LoRA for stability in some tasks
RAG End-to-End (BM25 + Vector + Reranking)
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
bm25 = BM25Okapi([d.split() for d in docs])
vec = SentenceTransformer('all-MiniLM-L6-v2')
def retrieve(q, k=40):
    cands = bm25.get_top_n(q.split(), docs, n=k)
    # hybrid sketch: re-order BM25 candidates by embedding cosine similarity
    emb_q = vec.encode([q], normalize_embeddings=True)[0]
    emb_c = vec.encode(cands, normalize_embeddings=True)
    scores = emb_c @ emb_q
    return [c for _, c in sorted(zip(scores, cands), key=lambda t: t[0], reverse=True)]
# rerank with cross-encoder
Structured Generation: JSON + Repair Loop
import json
from jsonschema import validate, ValidationError
schema = { 'type':'object','properties':{ 'title':{'type':'string'}, 'items':{'type':'array','items':{'type':'string'}} }, 'required':['title','items'], 'additionalProperties':False }
def gen_and_validate(prompt):
    out = llm(prompt)  # llm: any text-generation callable (placeholder)
    for _ in range(3):
        try:
            obj = json.loads(out)
            validate(obj, schema)
            return obj
        except Exception:
            out = llm(f"Fix JSON to match schema: {schema}. Input: {out}")
    raise ValueError('bad json')
FastAPI + gRPC Servers
from fastapi import FastAPI
from pydantic import BaseModel
class Req(BaseModel): text: str
app = FastAPI()
@app.post('/embed')
async def embed(r: Req):
    v = encoder.encode([r.text])[0].tolist()
    return {'vector': v}
# gRPC service definitions and server impl (proto + server) — omitted for brevity
Streaming + Batching
# micro-batching queue for high-throughput inference
queue = []
if len(queue) >= BATCH or waited_ms > MAX_WAIT: run_batch(queue)
Export and Validate (ONNX/TensorRT/TFLite)
# ONNX Runtime graph optimizer for transformer models (module path and flags vary by onnxruntime version)
python -m onnxruntime.transformers.optimizer --input model.onnx --output model_fp16.onnx --model_type bert --float16
trtexec --onnx=model_fp16.onnx --saveEngine=model.plan --fp16
# parity test
out_native = model(**enc).logits
out_onnx = ort_session.run(None, { 'input_ids': enc['input_ids'].numpy(), 'attention_mask': enc['attention_mask'].numpy() })[0]
KServe Transformer/Explainer (Code)
from kserve import Model, ModelServer
class NlpTransformer(Model):
    async def preprocess(self, payload): return payload
    async def postprocess(self, out): return out
ModelServer().start([NlpTransformer('nlp')])
A/B Testing and Routing
function variant(userId: string){ return hash(userId) % 2 ? 'A' : 'B' }
Dashboards (PromQL)
# latency p95
histogram_quantile(0.95, sum by (le) (rate(nlp_latency_seconds_bucket[5m])))
# cost/min
sum(rate(nlp_cost_usd_total[1m]))
OpenTelemetry Traces
span.setAttributes({ 'nlp.task': 'classification', 'nlp.model': 'roberta-base' })
span.addEvent('tokenize', { ms: 4 })
span.addEvent('infer', { ms: 28 })
Rate Limits and Budgets
import { RateLimiterMemory } from 'rate-limiter-flexible'
const limiter = new RateLimiterMemory({ points: 120, duration: 60 })
Privacy/PII and Compliance SOPs
- Redact PII at ingress; hash request IDs; keep minimal logs
- Consent and purpose limitation documented; regional routing
- Data deletion workflows (DSAR) supported and tested
CI/CD Pipelines
name: nlp-deploy
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest -q
      - run: python export_onnx.py
      - run: docker build -t registry/nlp:${{ github.sha }} .
      - run: docker push registry/nlp:${{ github.sha }}
      - run: helm upgrade --install nlp charts/nlp --set image.tag=${{ github.sha }} --wait
Runbooks
Latency Spike
- Reduce max tokens; switch to ONNX/TensorRT; verify batch sizes; warm caches
Cost Spike
- Enforce budgets; cache hot prompts; route smaller models
Quality Drop
- Re-evaluate on golden set; rollback; inspect retrieval quality (RAG)
Extended FAQ (Part 3)
- Long prompts? Compress inputs; retrieve only relevant context.
- Few-shot example selection? Diverse and representative; 2–5 shots.
- Beam search vs sampling? Beam for deterministic tasks; sampling for creative.
- Top-k vs top-p? Tune per task; typical p=0.9, k=40.
- LoRA targets? q_proj/v_proj minimal; add k/o for tougher tasks.
- Instruction tuning? Curate instructions; eval with held-out tasks.
- Reranker gains? Improves precision; reduces hallucinations.
- Token limits? Trim history; summarize; compress.
- Prompt injection? Refusals and sanitization.
- JSON validity? Schema + repair loop.
- Hybrid retrieval? BM25 + vectors; rerank.
- Cache busting? Include relevant vars in key.
- Can we batch? Yes: dynamic micro-batching.
- GPU vs CPU at inference? GPU for large; CPU for small models.
- Per-tenant budgets? Track and enforce.
- Lang-detection? Fasttext/lingua; route models.
- Model drift? Re-baseline metrics; A/B.
- CLI tools? Render prompts, eval suites, evidence packs.
- Canary duration? 24–72h depending on traffic.
- PII handling? Redact at ingress; DSAR supported.
- Tokenizer mismatch? Always pair model and tokenizer.
- Pack multiple docs? Be careful; pad/truncate.
- Long-context models? Use when needed; cost trade-offs.
- Eval cadence? Nightly and pre-release gates.
- Schema drift? Version APIs; migrate clients.
- Shadow testing? Run on live traffic; no user impact.
- Alert thresholds? Set conservatively; tune.
- Golden sets size? 100–300; refresh monthly.
- Plugins/tooling? Validate schemas; timeouts.
- Who owns prompts? Template owners with approvals.
- Offline mode? Cache; degrade gracefully.
- Secrets in prompts? Never; server retrieves.
- Compression? Summaries and elision.
- Latency budgets? p95 under target; measure TTFT.
- Costs trending up? Route smaller models; cache; trim tokens.
- Synthetic data? Label as synthetic; avoid overfitting.
- BLEU pitfalls? Use SacreBLEU; consider COMET.
- ROUGE pitfalls? Not truth; combine with human evals.
- Perplexity pitfalls? Not overall quality; task-specific metrics.
- Translation domain shift? Fine-tune on in-domain.
- Summarization hallucinations? Constrain to citations; RAG.
- QA quality? Context windows; retrieval quality.
- Multilingual? Use mT5/mBERT; locale routing.
- Legal compliance? Document; audits; logs.
- Observability privacy? Hash IDs; minimize fields.
- Token sprawl? Budget caps per route.
- Where to log? Structured logs; warehouse.
- Draining traffic? Blue/green; feature flags.
- Versioning prompts? Registry with diffs.
- Final acceptance? Gates passing; owner approvals; SLOs healthy.
Multilingual Pipelines (mT5 / mBERT)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = 'google/mt5-base'
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
def translate(text: str, src: str, tgt: str):
    # mT5 approach: prefix with task
    x = tok(f"translate {src} to {tgt}: {text}", return_tensors='pt')
    y = model.generate(**x, max_new_tokens=128)
    return tok.decode(y[0], skip_special_tokens=True)
# Language routing
from lingua import LanguageDetectorBuilder
langs = LanguageDetectorBuilder.from_all_languages().build()
def detect_lang(t: str):
    l = langs.detect_language_of(t)
    return l.iso_code_639_1.name.lower() if l else 'en'
def route_model(lang: str):
    return {
        'en': 'roberta-base',
        'de': 'deepset/gbert-base',
        'es': 'dccuchile/bert-base-spanish-wwm-cased'
    }.get(lang, 'xlm-roberta-base')
Domain Adaptation (Continued Pretraining)
from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
mdl = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
args = TrainingArguments('out/continued', per_device_train_batch_size=32, num_train_epochs=1, learning_rate=5e-5, fp16=True)
Trainer(model=mdl, args=args, train_dataset=domain_texts, data_collator=collator).train()
Safety and Fairness
# toxicity detection
from detoxify import Detoxify
tox = Detoxify('original')
score = tox.predict("text")['toxicity']
# bias metrics for classification
import numpy as np
def spd(y_hat, s):
    # statistical parity difference between groups s=1 and s=0
    return float(np.mean(y_hat[s==1]) - np.mean(y_hat[s==0]))
def eod(y_hat, y_true, s):
    # equal opportunity difference: TPR gap between groups (counts, not means, in the numerator)
    tpr1 = np.sum((y_hat==1) & (y_true==1) & (s==1)) / max(1, np.sum((y_true==1) & (s==1)))
    tpr0 = np.sum((y_hat==1) & (y_true==1) & (s==0)) / max(1, np.sum((y_true==1) & (s==0)))
    return float(tpr1 - tpr0)
// PII guard
export function guardPII(text: string){
const PII = [/(\d{3}-\d{2}-\d{4})/g, /\b\d{16}\b/g, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g]
return PII.some((r)=>r.test(text))
}
Dataset Creation and QA
import re
from blingfire import text_to_sentences
def quality_score(t: str) -> float:
    sents = text_to_sentences(t).split('\n')
    if len(sents) < 2: return 0.2
    if re.search(r"http(s)?://", t): return 0.6
    return min(1.0, 0.5 + 0.05*len(sents))
clean = [rec for rec in raw if quality_score(rec['text']) >= 0.6]
Retrieval Integration (Weaviate / Qdrant)
# Weaviate example
import weaviate
client = weaviate.Client("http://localhost:8080")
res = client.query.get("Doc", ["text"]).with_near_text({"concepts": ["nlp transformers"]}).with_limit(10).do()
# Qdrant example (vec is the SentenceTransformer encoder from the RAG section)
import qdrant_client
qc = qdrant_client.QdrantClient(host='localhost', port=6333)
hits = qc.search(collection_name='docs', query_vector=vec.encode(["nlp transformers"])[0].tolist(), limit=10)
Evaluation Harness (Classification + Seq2Seq)
from sklearn.metrics import accuracy_score, f1_score
def eval_cls(model, dl):
    y_true, y_pred = [], []
    for batch in dl:
        with torch.no_grad():
            out = model(**batch)
        y_true.extend(batch['labels'].cpu().numpy())
        y_pred.extend(out.logits.argmax(-1).cpu().numpy())
    return { 'acc': accuracy_score(y_true,y_pred), 'f1': f1_score(y_true,y_pred, average='weighted') }
# seq2seq eval: ROUGE/SacreBLEU via the evaluate library (evaluate.load('rouge'), evaluate.load('sacrebleu'))
Prompt Registry and Versioning
{
"id": "summary_en_v3",
"version": 3,
"prompt": "Summarize: {{text}}",
"constraints": { "max_tokens": 128 },
"owners": ["nlp-platform@company.com"],
"rollout": { "canary": 0.1 }
}
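A minimal rendering sketch for such records (assumes the registry is stored as a JSON list in prompts.json; the helper name is illustrative):
import json
def render_prompt(prompt_id: str, variables: dict, path: str = "prompts.json") -> str:
    # Look up the registry record and substitute {{name}} placeholders.
    with open(path) as f:
        records = {r["id"]: r for r in json.load(f)}
    text = records[prompt_id]["prompt"]
    for name, value in variables.items():
        text = text.replace("{{" + name + "}}", value)
    return text
# render_prompt("summary_en_v3", {"text": "Long article..."})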
Streaming Clients
// SSE
const src = new EventSource('/api/stream'); src.onmessage = (e) => append(e.data)
// WebSocket
const ws = new WebSocket('wss://api/ws'); ws.onmessage = (m) => render(m.data)
Autoscaling / Helm / Terraform
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: nlp }
spec:
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods: { metric: { name: queue_depth }, target: { type: AverageValue, averageValue: 10 } }
resource "aws_eks_node_group" "nlp" {
  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 3
  }
  instance_types = ["m6i.2xlarge"]
}
Dashboards / Alerts / Runbooks
# tokens/sec
sum(rate(nlp_tokens_total[1m]))
# error rate
sum(rate(nlp_errors_total[5m]))/sum(rate(nlp_requests_total[5m]))
groups:
  - name: nlp
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(nlp_latency_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels: { severity: page }
Runbook: HighLatency
- Check batching and model route; switch to ONNX/TensorRT; reduce max tokens
Extended FAQ (Part 4)
- How to route by language? Detect with lingua/fasttext; map to mBERT/mT5.
- Domain adaptation cost? Continue pretraining for 1–3 epochs; monitor MLM loss.
- Toxicity filters? Use Detoxify; set thresholds; human review on edge cases.
- Deduping large corpora? MinHash/SimHash; store hashes.
- Retrieval latency? Cache embeddings; ANN indexes; reduce k.
- Reranker cost? Batch pairs; small cross-encoders.
- JSON mode robust? Validate schema; repair; cap retries.
- SSE vs WebSocket? SSE simpler for one-way streams; WS for bi-directional.
- Autoscaling signals? Queue depth, p95 latency, error rate.
- Cost budgets? Per-tenant and per-route; alert on spikes.
- A/B sample size? Run until significance; cover weekday/weekend.
- Multilingual evaluation? Macro-F1 across languages.
- Prompt registry? Store prompts with versions and owners.
- RAG stale documents? Index refresh; refuse if insufficient context.
- Privacy audits? Evidence packs; DSAR handling.
- Model card updates? Per release; include risks.
- Localization pitfalls? Proper tokenization and dates.
- Transformers on mobile? Distill + quantize + TFLite/CoreML.
- GPU shortages? Smaller models; aggressive caching.
- When is it done? SLOs are healthy; costs and quality stable.
Advanced Serving Topologies
graph LR
U[User/App] --> G[Gateway]
G --> R[Model Router]
R --> C[Cache]
R --> M1[Small Model]
R --> M2[Medium Model]
R --> M3[Large Model]
M1 --> OBS[OTEL]
M2 --> OBS
M3 --> OBS
C --> G
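A minimal routing sketch matching the topology above: the router picks a model tier from input size and latency budget, and a cache sits in front of it (tier names, checkpoints, and thresholds are illustrative):
MODEL_TIERS = {"small": "distilbert-base-uncased", "medium": "roberta-base", "large": "microsoft/deberta-v3-large"}
_cache = {}
def route(text: str, latency_budget_ms: int = 200) -> str:
    # Rough token estimate; swap in the real tokenizer when routing in production.
    n_tokens = len(text.split())
    if latency_budget_ms < 50 or n_tokens < 64:
        return MODEL_TIERS["small"]
    return MODEL_TIERS["medium"] if n_tokens < 512 else MODEL_TIERS["large"]
def cached_generate(text: str, generate) -> str:
    # The cache sits between the gateway and the router, as in the diagram.
    if text not in _cache:
        _cache[text] = generate(text, model=route(text))
    return _cache[text]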
Request Schemas and Validators
import Ajv from 'ajv'
const ajv = new Ajv({ allErrors: true, strict: true })
const Req = {
type: 'object',
required: ['text'],
properties: { text: { type: 'string', minLength: 1 }, max_tokens: { type: 'integer', minimum: 1, maximum: 1024 }, temperature: { type: 'number', minimum: 0, maximum: 1 } },
additionalProperties: false
}
export const validateReq = ajv.compile(Req)
Batched Generation Workers
const queue: any[] = []
setInterval(async () => {
const batch = queue.splice(0, BATCH_SIZE)
if (batch.length === 0) return
const enc = tokenize(batch.map(b => b.text))
const out = await model.generate(enc, { max_tokens: 256 })
out.forEach((o, i) => batch[i].resolve(o))
}, 10)
export function enqueue(text: string){
return new Promise((resolve) => queue.push({ text, resolve }))
}
Triton Text Backend (Sketch)
# model.py
def initialize(args):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    global tok, mdl
    tok = AutoTokenizer.from_pretrained(args['model'])
    mdl = AutoModelForCausalLM.from_pretrained(args['model']).eval().cuda()
def execute(requests):
    responses = []
    for req in requests:
        inputs = tok(req['text'], return_tensors='pt').to('cuda')
        out = mdl.generate(**inputs, max_new_tokens=128)
        responses.append(tok.decode(out[0]))
    return responses
Langsmith / Helicone Hooks
import { Client as Langsmith } from 'langsmith'
const ls = new Langsmith({ apiKey: process.env.LS_KEY })
await ls.createRun({ name: 'nlp.generate', inputs: { text }, outputs: { out }, metadata: { model } })
await fetch('https://oai.hconeai.com/v1/chat/completions', { headers: { 'Helicone-Auth': `Bearer ${HELICONE}` }, body: JSON.stringify(payload), method: 'POST' })
Evaluation Suites (CLS / NER / QA / SUMM)
from datasets import load_dataset
from sklearn.metrics import f1_score, accuracy_score
cls = load_dataset('ag_news')
# preprocess... tokenization
# eval loop → compute acc/f1
# NER eval: token-level F1 on CoNLL
# QA eval: EM/F1 on SQuAD
# SUMM eval: ROUGE-1/2/L on CNN/DM
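A concrete QA example using the evaluate library's SQuAD metric (prediction/reference shapes follow evaluate's documented format; the id is illustrative):
import evaluate
squad = evaluate.load("squad")
predictions = [{"id": "qa-002", "prediction_text": "George Orwell"}]
references = [{"id": "qa-002", "answers": {"text": ["George Orwell"], "answer_start": [19]}}]
print(squad.compute(predictions=predictions, references=references))  # {'exact_match': 100.0, 'f1': 100.0}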
Golden Sets and Probes
suite: nlp_golden_v1
items:
  - id: cls-001
    input: "The stock surged today"
    expected_label: business
  - id: qa-002
    question: "Who wrote 1984?"
    context: "1984 is a novel by George Orwell."
    expected: "George Orwell"
Governance (Owners / Approvals / Rollbacks)
{
"template": "summary_en_v4",
"owner": "nlp-platform@company.com",
"approvers": ["security@company.com","product@company.com"],
"rollback": { "on": { "win_drop": 0.03, "latency_ms": 200 } }
}
Evidence Pack CLI
#!/usr/bin/env bash
OUT=evidence_$(date +%F).zip
mkdir -p evidence && cp -r eval dashboards policies prompts evidence/ || true
zip -r "$OUT" evidence
Dashboards (Grafana JSON)
{
"title": "NLP Ops",
"panels": [
{"type":"timeseries","title":"Latency p95","targets":[{"expr":"histogram_quantile(0.95, sum by (le) (rate(nlp_latency_seconds_bucket[5m])))"}]},
{"type":"stat","title":"Cost/min","targets":[{"expr":"sum(rate(nlp_cost_usd_total[1m]))"}]},
{"type":"table","title":"Tokens/sec by Route","targets":[{"expr":"sum by (route) (rate(nlp_tokens_total[1m]))"}]}
]
}
Alertmanager Rules
groups:
  - name: nlp
    rules:
      - alert: HighErrorRate
        expr: (sum(rate(nlp_errors_total[5m])) / sum(rate(nlp_requests_total[5m]))) > 0.02
        for: 10m
        labels: { severity: page }
      - alert: CostSpike
        expr: sum(rate(nlp_cost_usd_total[5m])) > 5
        for: 15m
        labels: { severity: ticket }
Runbooks and SOPs
HighErrorRate
- Inspect logs; recent deploys; schema mismatches; rollback
CostSpike
- Enforce budgets; route smaller models; cache hot prompts
LatencySpike
- Reduce max tokens; ONNX/TensorRT; adjust batching
Cost Forecasting
scenario,req_per_day,tokens_in,tokens_out,price_in,price_out,cost_usd_day
base,500000,400,80,0.000004,0.000012,?
peak,2000000,500,120,0.000004,0.000012,?
- cost = req_per_day * (tokens_in*price_in + tokens_out*price_out)
- add buffer (10–20%) for variability (see the calculator sketch below)
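The formula above as a small calculator (prices are per token in USD; the buffer is the 10–20% variability margin):
def daily_cost(req_per_day, tokens_in, tokens_out, price_in, price_out, buffer=0.15):
    # cost = requests/day * (input tokens * input price + output tokens * output price), plus buffer
    base = req_per_day * (tokens_in * price_in + tokens_out * price_out)
    return round(base * (1 + buffer), 2)
print(daily_cost(500_000, 400, 80, 0.000004, 0.000012))      # 'base' scenario: about 1472.0 USD/day
print(daily_cost(2_000_000, 500, 120, 0.000004, 0.000012))   # 'peak' scenario: about 7912.0 USD/day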
Security Hardening (mTLS, OPA)
# Istio PeerAuthentication
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata: { name: default, namespace: nlp }
spec: { mtls: { mode: STRICT } }
package nlp
deny["oversized"] { input.request.tokens_in > 4096 }
allow { count(deny) == 0 }
Extended FAQ (Part 5)
- Micro-batching interval? 5–20ms depending on traffic profile.
- Cache keys? Inputs + version + route; avoid PII.
- Canary guardrails? Win-rate, latency p95, error rate thresholds.
- Golden set upkeep? Monthly; add incident cases.
- Streaming stall? Chunked responses; client timeouts.
- Traces correlation? W3C TraceContext across services.
- Token explosion? Trim inputs; compress prompts; budgets.
- Data residency? Regional clusters; routing.
- DSAR workflow? Export/delete logs tied to hashed IDs.
- Model registry? Artifacts, owners, changelogs.
- Evidence packs? Dashboards, evals, policies, prompts.
- CLI ergonomics? Render, eval, evidence commands.
- Scaling bottlenecks? Embedding and reranking; batch and cache.
- Prompt regressions? CI gates; rollbacks.
- Guardrail latency? <20% overhead target.
- When to re-architect? When SLOs or costs drift persistently.