MLOps Deployment: Serving, Versioning, and Monitoring (2025)
Deploying ML is a systems problem. This guide covers model servers, registries, canary strategies, drift detection, and practical CI/CD.
Architecture
graph TB
D[Data/Features] --> F[Feature Store]
R[Registry] --> S[Model Server]
S --> M[Metrics]
S --> L[Logs]
S --> T[Traces]
G[Gateway] --> S
Serving patterns
- Online (REST/gRPC): FastAPI, TF Serving, Triton, BentoML
- Batch: Spark jobs, Airflow DAGs
- Streaming: Kafka/Flink with stateful operators
Versioning and rollouts
- Semantic model versions; lineage (data/code)
- Blue/green and canary with automated rollback
Drift and performance monitoring
def population_stability_index(expected, actual, bins=10) -> float:
    # simplified PSI (np = numpy); shared bin edges come from the expected distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges); a, _ = np.histogram(actual, bins=edges)
    e, a = e / max(1, e.sum()), a / max(1, a.sum())
    return float(np.sum((a - e) * np.log((a + 1e-9) / (e + 1e-9))))
Track: PSI/KS, latency P95, error rates, business KPIs.
CI/CD template (sketch)
stages:
- test
- build
- deploy
test:
script: ["pytest -q", "great_expectations checkpoint run"]
build:
script: ["bentoml build .", "bentoml containerize svc:latest"]
deploy:
script: ["kubectl apply -f k8s/", "kubectl rollout status deploy/ml"]
FAQ
Q: How to pick a model server?
A: Choose based on framework support, GPU needs, and ops maturity; BentoML is a great general default.
Related posts
- LLM Fine-Tuning: /blog/llm-fine-tuning-complete-guide-lora-qlora-2025
- LLM Observability: /blog/llm-observability-monitoring-langsmith-helicone-2025
- RAG Systems: /blog/rag-systems-production-guide-chunking-retrieval-2025
- ClickHouse Performance: /blog/clickhouse-analytics-database-performance-guide-2025
- Streaming (Kafka/Flink): /blog/real-time-data-streaming-kafka-flink-architecture-2025
Call to action
Need a deployment and monitoring blueprint? Request an MLOps review.
Contact: /contact • Newsletter: /newsletter
Executive Summary
This guide provides a production-grade blueprint for deploying, operating, and iterating ML models at scale: serving options, CI/CD, observability, governance, and cost optimization. It includes copy-paste code, Helm/Terraform, pipelines, and runbooks.
Deployment Patterns
- Batch scoring: scheduled jobs writing predictions to data stores
- Online inference: low-latency HTTP/gRPC services
- Streaming: event-driven with Kafka/Kinesis and stateful processors
graph LR
A[Data Sources] --> B[Feature Store]
B --> C[Batch Scoring]
B --> D[Online Serving]
B --> E[Streaming Inference]
C --> WH[Warehouse]
D --> APP[Apps]
E --> SINK[Topics/DB]
Serving Options
FastAPI (Python)
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("model.joblib")
@app.post("/predict")
def predict(x: list[float]):
return {"y": float(model.predict([x])[0])}
TensorFlow Serving
docker run -p 8501:8501 -v $PWD/model:/models/m -e MODEL_NAME=m tensorflow/serving
TorchServe
torch-model-archiver --model-name clf --version 1.0 --serialized-file model.pt --handler handler.py --export-path model_store
torchserve --start --ncs --model-store model_store --models clf.mar
NVIDIA Triton
docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v $PWD/models:/models nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver --model-repository=/models
KServe on Kubernetes
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: clf, namespace: ml }
spec:
predictor:
sklearn:
storageUri: s3://bucket/models/clf
Infrastructure (K8s/Helm/Terraform)
# Helm values for online service
image: { repository: registry/ml-clf, tag: 1.9.2 }
resources:
requests: { cpu: 500m, memory: 1Gi }
limits: { cpu: 2, memory: 4Gi }
autoscaling:
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 60
# Terraform for EKS node group
resource "aws_eks_node_group" "ml" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "ml-ng"
scaling_config = { desired_size = 3, max_size = 10, min_size = 2 }
instance_types = ["m6i.2xlarge"]
}
Model Registry
MLflow
import mlflow
with mlflow.start_run():
mlflow.log_metric("auc", 0.91)
mlflow.sklearn.log_model(model, "model")
mlflow.register_model("runs:/.../model", "credit-default")
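To consume a registered version at serving time, the model can be loaded by registry URI. A minimal sketch, assuming the credit-default model registered above has a Production stage or alias (column names are illustrative):
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/credit-default/Production")
batch = pd.DataFrame([{"age": 42, "plan_tier": 2, "avg_txn_30d": 310.5}])  # illustrative features
print(model.predict(batch))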
SageMaker Model Registry
from sagemaker.model import Model
m = Model(image_uri=img, model_data=s3_uri, role=role)
Feature Store
from feast import FeatureStore
store = FeatureStore(repo_path=".")
features = store.get_online_features([
"user:age","user:plan_tier","user:avg_txn_30d"
], entity_rows=[{"user_id": 123}]).to_dict()
Data and Versioning (DVC/LakeFS)
dvc init
dvc add data/train.parquet data/val.parquet
dvc remote add -d s3 s3://ml-data
# lakeFS policy snippet
actions:
- name: require-code-review
on: pre-merge
conditions: { changed_paths: ["models/**"] }
steps: [ require_approval ]
CI/CD Pipelines
name: ml-ci
on: [push]
jobs:
train:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.10' }
- run: pip install -r requirements.txt
- run: python train.py --config configs/exp.yaml
- run: python eval.py --suite eval/suite.yaml --out eval/report.json
- run: python tools/gate.py --auc-min 0.9 --drift-max 0.02
- uses: actions/upload-artifact@v4
with: { name: model, path: model/ }
deploy:
needs: train
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/download-artifact@v4
with: { name: model, path: model }
- run: helm upgrade --install clf charts/clf -f values/prod.yaml --wait
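The train job above gates on tools/gate.py. A minimal sketch of such a gate, assuming eval/report.json contains auc and psi fields (a hypothetical report format, not defined elsewhere in this guide):
# tools/gate.py (sketch)
import argparse, json, sys

p = argparse.ArgumentParser()
p.add_argument("--auc-min", type=float, default=0.9)
p.add_argument("--drift-max", type=float, default=0.02)
p.add_argument("--report", default="eval/report.json")
args = p.parse_args()

report = json.load(open(args.report))
failures = []
if report["auc"] < args.auc_min:
    failures.append(f"auc {report['auc']:.3f} < {args.auc_min}")
if report["psi"] > args.drift_max:
    failures.append(f"psi {report['psi']:.3f} > {args.drift_max}")
if failures:
    print("gate failed:", "; ".join(failures))
    sys.exit(1)
print("gate passed")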
Rollouts: Canary, Blue/Green, Shadow
canary:
traffic: 10%
metrics:
- latency_p95 < 300ms
- error_rate < 1%
rollback_on:
- auc_drop > 0.02
- latency_increase_ms > 150
shadow:
enabled: true
mirror_percent: 20
compare_metrics: [latency, error, drift]
Monitoring and Drift Detection
import client from 'prom-client'
const predLatency = new client.Histogram({ name: 'pred_latency_seconds', help: 'latency', buckets: [0.01,0.05,0.1,0.2,0.5,1] })
const err = new client.Counter({ name: 'pred_errors_total', help: 'errors' })
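The metrics above use prom-client on the Node side; the same two metrics for the Python/FastAPI service, using prometheus_client (the port and wrapper name are illustrative):
from prometheus_client import Counter, Histogram, start_http_server

PRED_LATENCY = Histogram("pred_latency_seconds", "prediction latency",
                         buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1])
PRED_ERRORS = Counter("pred_errors_total", "prediction errors")

def instrumented_predict(predict_fn, features):
    # times the call and counts exceptions; predict_fn is your model's predict
    with PRED_LATENCY.time():
        try:
            return predict_fn(features)
        except Exception:
            PRED_ERRORS.inc()
            raise

start_http_server(9100)  # exposes /metrics; port is illustrative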
# population stability index (PSI)
import numpy as np
def psi(expected, actual, bins=10):
    # use the same bin edges for both samples, derived from the reference distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    a, _ = np.histogram(actual, bins=edges)
    e = e / max(1, e.sum()); a = a / max(1, a.sum())
    return np.sum((a - e) * np.log((a + 1e-9) / (e + 1e-9)))
span.setAttributes({ 'model.version': 'v19', 'data.drift.psi': psiValue })
Online and Offline Evaluations
python eval/offline.py --suite eval/suite.yaml --out eval/offline.json
python eval/online_probe.py --queries probes.json --out eval/online.json
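A minimal sketch of what eval/offline.py could do, assuming a golden dataset with a label column and a joblib-saved classifier (paths and the column name are illustrative):
# eval/offline.py (sketch)
import json
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

model = joblib.load("model/model.joblib")
golden = pd.read_parquet("eval/golden.parquet")          # illustrative golden set
X, y = golden.drop(columns=["label"]), golden["label"]
scores = model.predict_proba(X)[:, 1]
report = {"auc": float(roc_auc_score(y, scores)), "n": int(len(golden))}
with open("eval/offline.json", "w") as f:
    json.dump(report, f, indent=2)
print(report)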
A/B Testing
const hash = (s: string) => [...s].reduce((a, c) => (a * 31 + c.charCodeAt(0)) >>> 0, 7)
export function route(userId: string){ return (hash(userId) % 2) ? 'A' : 'B' }
Retraining Pipelines (Airflow)
from airflow import DAG
from airflow.operators.bash import BashOperator
with DAG('retrain', schedule='0 2 * * *', start_date=...):
prep = BashOperator(task_id='prep', bash_command='python prep.py')
train = BashOperator(task_id='train', bash_command='python train.py')
evaluate = BashOperator(task_id='eval', bash_command='python eval.py')  # avoid shadowing the built-in eval
deploy = BashOperator(task_id='deploy', bash_command='python deploy.py')
prep >> train >> evaluate >> deploy
Governance and Security
rbac:
roles:
- name: viewer
permissions: [metrics.read]
- name: deployer
permissions: [deploy, rollback]
syft dir:. -o cyclonedx-json > sbom.json
cosign attest --predicate sbom.json --type cyclonedx registry/model:1.9.2
Cost Optimization
- Right-size instances; autoscaling based on queue depth
- Batch large payloads; cache frequent predictions (see the cache sketch below)
- Quantize models; distill where possible
export function tps(tokens: number, secs: number){ return tokens / secs }
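Caching frequent predictions (noted in the list above) can start as a TTL-bounded in-process cache keyed on the feature vector; a sketch with an illustrative TTL:
import hashlib, json, time

_CACHE: dict[str, tuple[float, float]] = {}  # key -> (expires_at, prediction)
TTL_SECONDS = 60  # illustrative

def cached_predict(features, predict_fn):
    key = hashlib.sha256(json.dumps(features).encode()).hexdigest()
    now = time.time()
    hit = _CACHE.get(key)
    if hit and hit[0] > now:
        return hit[1]
    y = predict_fn(features)
    _CACHE[key] = (now + TTL_SECONDS, y)
    return y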
Runbooks
Latency Spike
- Check provider/infra status, autoscaling, hot shards
- Reduce max batch size; warm cache; scale horizontally
Drift Alert
- Inspect feature distributions; roll back model; increase data freshness
Error Spike
- Check schema mismatches; validate payloads; circuit break
Call to Action
Need help deploying ML models reliably? We design and operate MLOps stacks with robust serving, observability, and governance. Contact us for a free assessment.
Extended FAQ (1–120)
- Batch vs online? Batch for non-real-time; online for interactive use.
- KServe vs TF Serving? KServe for K8s-native, TF Serving for TensorFlow-specific.
- Triton benefits? Multi-framework, dynamic batching, GPU acceleration.
- Model registry choice? MLflow for OSS; SageMaker/Vertex if cloud-managed.
- Feature store worth it? Yes if multiple models share features; offline/online parity.
- Canary duration? At least one business cycle (24–72h).
- Drift metric? PSI/JS divergence; alert when threshold exceeded.
- Shadow mode purpose? Compare behavior with zero user impact.
- Autoscaling signal? Queue depth + latency p95.
- Quantization risks? Small accuracy loss; monitor A/B.
... (continue with 110+ practical Q/A on serving, scaling, storage, evals, governance, security, cost)
Advanced Serving Patterns
Ensemble Serving (Stacking/Blending)
def predict(x):
p1 = model_a.predict_proba(x)[:,1]
p2 = model_b.predict_proba(x)[:,1]
return 0.6*p1 + 0.4*p2
Multi-Model Endpoints
# Triton model ensemble
model_repository:
- name: router
platform: ensemble
ensemble_scheduling:
step: [{ model_name: xgb }, { model_name: nn }]
Request Routing (By Segment)
export function route(req: { plan: string, country: string }){
if (req.plan === 'enterprise') return 'model_large'
if (req.country === 'BR') return 'model_br'
return 'model_default'
}
Dynamic Batching and Autoscaling
# Triton config
max_batch_size: 32
instance_group:
- kind: KIND_GPU
count: 2
dynamic_batching:
preferred_batch_size: [8, 16, 32]
max_queue_delay_microseconds: 5000
# K8s HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: clf }
spec:
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric: { name: queue_depth }
target: { type: AverageValue, averageValue: 10 }
KServe/Triton Configurations
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: triton-ens, namespace: ml }
spec:
predictor:
triton:
storageUri: s3://bucket/models
runtimeVersion: 23.09
resources:
limits: { nvidia.com/gpu: 1, cpu: 2, memory: 8Gi }
Feature Store Parity (Online/Offline)
# feast apply
from feast import FeatureStore
store = FeatureStore(repo_path=".")
# training features
df = store.get_historical_features(entity_df=entities, features=[...]).to_df()
# online features
onl = store.get_online_features(features=[...], entity_rows=[{"user_id": 1}])
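Parity between the two paths is worth a unit test; a sketch assuming hypothetical transform_offline/transform_online helpers that should produce identical vectors for the same raw record:
# test_feature_parity.py (sketch)
import numpy as np
from features import transform_offline, transform_online  # hypothetical module

def test_online_offline_parity():
    raw = {"user_id": 1, "amount": 120.0, "age": 34, "tenure": 12}  # illustrative record
    offline_vec = transform_offline(raw)   # batch/training path
    online_vec = transform_online(raw)     # serving path
    np.testing.assert_allclose(offline_vec, online_vec, rtol=1e-6)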
Data Contracts and Schema Validation
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "PredictRequest",
"type": "object",
"required": ["features"],
"properties": {
"features": {
"type": "array",
"items": { "type": "number" },
"minItems": 10, "maxItems": 10
}
},
"additionalProperties": false
}
import Ajv from 'ajv'
const ajv = new Ajv(); const validate = ajv.compile(schema)
app.post('/predict', (req,res)=>{
if (!validate(req.body)) return res.status(400).json({ error: 'bad schema' })
res.json(predict(req.body.features))
})
OpenTelemetry Traces
span.setAttributes({ 'model.name': 'clf', 'model.version': 'v20', 'batch.size': 16 })
span.addEvent('features.fetched', { ms: 12 })
span.addEvent('inference.done', { ms: 18 })
Prometheus Dashboards and Alerts
# p95 latency
histogram_quantile(0.95, sum by (le) (rate(pred_latency_seconds_bucket[5m])))
groups:
- name: ml-ops
rules:
- alert: HighLatency
expr: histogram_quantile(0.95, sum(rate(pred_latency_seconds_bucket[5m])) by (le)) > 0.3
for: 10m
labels: { severity: page }
- alert: DriftHigh
expr: avg_over_time(feature_psi[30m]) > 0.2
for: 30m
labels: { severity: ticket }
Canary Metrics and Rollback
canary:
traffic: 20%
success_criteria:
win_rate_delta: "> -0.01"
latency_p95_ms: "< 350"
rollback_on:
- error_rate: "> 1%"
helm rollback clf 42 --namespace ml
Blue/Green Scripts
# run blue and green as separate deployments; switch the Service selector once green is healthy (manifest and label names illustrative)
kubectl -n ml apply -f k8s/deploy-green.yaml
kubectl -n ml rollout status deploy/clf-green
kubectl -n ml patch service clf -p '{"spec":{"selector":{"app":"clf","color":"green"}}}'
Airflow/Dagster Retraining with Drift Triggers
# Airflow sensor
from airflow.sensors.base import BaseSensorOperator
class DriftSensor(BaseSensorOperator):
def poke(self, context):
return get_psi() > 0.2
# Dagster schedule
from dagster import schedule
@schedule(cron_schedule="0 3 * * *", job=retrain_job)
def retrain_on_schedule(_):
return {"ops": {"train": {"config": {"dataset": "fresh"}}}}
Governance Policies
policies:
approvals_required: 2
owners:
- platform-ml@company.com
- security@company.com
pii_logging: false
PII Handling
const PII = [/\b\d{3}-\d{2}-\d{4}\b/, /\b\d{16}\b/]
export function redact(s: string){ return PII.reduce((a,r)=>a.replace(r,'[REDACTED]'), s) }
SBOM and SLSA
syft dir:. -o cyclonedx-json > sbom.json
cosign attest --predicate sbom.json --type cyclonedx registry/clf:1.0.0
Helm Values and Terraform Modules
# values/prod.yaml
nodeSelector: { nodepool: compute }
tolerations: [{ key: gpu, operator: Exists, effect: NoSchedule }]
podDisruptionBudget: { minAvailable: 1 }
module "prom_stack" {
source = "git::ssh://git@github.com/company/infra//modules/prometheus"
namespace = "observability"
}
Cost Calculators
scenario,qps,latency_p95_ms,replicas,instance,cost_usd_month
base,200,180,4,m6i.2xlarge,XXXX
peak,800,240,12,m6i.2xlarge,YYYY
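The cost column can be estimated from replica count and hourly instance pricing; a sketch of the arithmetic (the price shown is illustrative; check current provider rates):
def monthly_cost_usd(replicas: int, price_per_hour_usd: float, hours_per_month: int = 730) -> float:
    return replicas * price_per_hour_usd * hours_per_month

# example: 4 x m6i.2xlarge at ~$0.384/h (illustrative) -> ~$1,121/month
print(round(monthly_cost_usd(4, 0.384)))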
Runbooks (Expanded)
Feature Store Outage
- Switch to cached features with TTL
- Reduce traffic by rate limiting non-critical routes
- Page data platform
GPU Saturation
- Increase batch size cautiously; scale replicas; tune streams
Storage Latency
- Verify PVC performance; move hot models to NVMe
Extended FAQ (121–200)
- How to choose batch size? Measure p95 vs throughput; avoid queue overflows.
- Online vs offline features parity? Use same definitions; unit tests for transformations.
- Canary KPI selection? Latency p95, error rate, win-rate vs baseline.
- PSI thresholds? 0.1–0.2 typical; domain-specific.
- GPU vs CPU for inference? GPU for deep models; CPU for tree/linear.
- A/B duration? At least one traffic cycle; segment by user.
- Feature importance drift? Track SHAP/permute importance; alert on changes.
- Shadow traffic storage? Limit; hash PII; expire.
- Model card? Document metrics, data, risks, owners.
- Rollback speed? <2 minutes via Helm rollback.
- Blue/green pitfalls? Stale connections; ensure drain.
- Terraform state safety? Backups and locks; CI plan display.
- PII fields in logs? Redact or hash; keep minimal.
- Secrets? Vault/SM; rotate; never in env dumps.
- Reproducibility? Seed, versions, and artifacts recorded.
- Cold start mitigation? Warmers; keep hot path resident.
- Multi-tenancy? Quotas and isolation; per-tenant SLOs.
- Observability SLOs? Latency and error p95 with burn alerts.
- Cost runaway? Budget guards and autoscaling.
- Vendor lock-in? Abstract serving; portable formats.
- Canary scope? Per-route or tenant; gradual ramp.
- Data skew? Stratified sampling; segment dashboards.
- Ground truth lag? Backfill offline evals later.
- Traffic shaping? Rate limit low-priority.
- CI flakes? Retries and timeouts; stabilize tests.
- Schema evolution? Versioned contracts; migration path.
- Backward compatibility? Adapters or transform layers.
- SLA design? Realistic targets; publish.
- Disk I/O bottlenecks? Profile; NVMe; preloading.
- Cache eviction policy? LRU + TTL; monitor hit rate.
- Model explainability? Log features and SHAP summaries.
- Security contexts? Run as non-root; seccomp.
- Egress policies? NetworkPolicy allowlists.
- Data residency? Region-locked deployments.
- Queue backpressure? 429 + jitter; drain protection.
- Timeout policy? Tailored per route; circuit breaker.
- Blue/green secrets? Scoped per color; rotate.
- Snapshot restores? Drills quarterly.
- Notebook leakage? Strip outputs; commit clean.
- Autoscaling metric choice? Queue depth + CPU better than one.
- Payload size limits? Enforce; compress; chunk.
- Canary noise? Use sufficient volume; filter outliers.
- Vendor outage? Failover or degrade gracefully.
- Canary rollbacks? Automate and log reasons.
- Retraining cadence? Data drift and business cycles.
- Human-in-the-loop? For critical decisions and appeals.
- Data quality checks? Pre-ingest validation; contracts.
- Cache poisoning? Validate inputs; hash keys.
- Data skew alarms? Segmented alerts; thresholds.
- Edge serving? Small models and caches.
- Observability noise? Reduce labels; sample.
- Golden datasets? Maintain and expand.
- Cost trade-offs? Cheaper models vs quality.
- Backfills? Idempotent; low priority windows.
- Model archive? Artifact registry with hashes.
- A/B contamination? Sticky bucketing; clear cookies.
- Real user monitoring? TTFT and conversion.
- Retry storms? Circuit breakers; budgets.
- GPU memory? Profile and right-size.
- Batch windows? Off-peak; monitor impact.
- PCI/PHI? Compliance gates; encryption.
- Leak detection? Regex/classifier scans.
- Pipeline ownership? Docs and on-call.
- Model zoo? Catalog and lifecycle.
- SLA credits? Define in contracts.
- Rollout calendars? Avoid peak traffic windows.
- Shadow pitfalls? Storage costs; sample.
- Canary SLOs? Stricter than prod SLOs.
- Rerun evals? Nightly; before deploys.
- Naming versions? Semver or date-based.
- Infra drift? IaC and scans; conftest.
- Post-incident? Blameless RCA; actions.
- Multi-cloud? Latency and egress costs.
- Limits enforcement? Max tokens/size; budgets.
- GPU scheduling? Bin pack; priorities.
- Warm restarts? Graceful; low churn.
- Log retention? 30 days typical; privacy.
- Data deletion? Process and prove.
- Observability costs? Sample; retention tiers.
- When to re-architect? When scale/complexity exceeds current design.
Model Explainability (SHAP/LIME)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val[:100])
shap.summary_plot(shap_values, X_val[:100])
# LIME example
from lime.lime_tabular import LimeTabularExplainer
lime = LimeTabularExplainer(X_train.values, feature_names=X_train.columns)
exp = lime.explain_instance(X_val.iloc[0].values, model.predict_proba, num_features=8)
Data Validation with Great Expectations
import great_expectations as ge
from great_expectations.core.batch import BatchRequest
context = ge.get_context()
validator = context.sources.pandas_default.read_csv("data/val.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_table_row_count_to_be_between(1000, 1000000)
results = validator.validate()
assert results.success
Live Model Switching with Feature Flags
import { getFlag } from '@openfeature/server-sdk'
export function pickModel(user: { tier: string }){
const flag = getFlag<boolean>('ml.use_large_model', false)
if (flag || user.tier === 'enterprise') return 'clf-large-v20'
return 'clf-base-v12'
}
Experiment Tracking (Weights & Biases)
import wandb
wandb.init(project="mlops-clf", config={"model": "xgb", "version": 20})
wandb.log({"auc": 0.912, "loss": 0.41})
Inference OpenAPI
openapi: 3.0.3
info: { title: Inference API, version: 1.0.0 }
paths:
/predict:
post:
requestBody:
required: true
content: { application/json: { schema: { $ref: '#/components/schemas/PredictRequest' } } }
responses:
'200': { description: ok, content: { application/json: { schema: { $ref: '#/components/schemas/PredictResponse' } } } }
components:
schemas:
PredictRequest:
type: object
required: [features]
properties: { features: { type: array, items: { type: number }, minItems: 10, maxItems: 10 } }
PredictResponse:
type: object
properties: { y: { type: number }, latency_ms: { type: integer } }
Client SDKs
TypeScript
export async function predict(features: number[]){
const r = await fetch('/predict', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ features }) })
if (!r.ok) throw new Error('predict failed')
return r.json() as Promise<{ y: number, latency_ms: number }>
}
Python
import requests
def predict(features):
r = requests.post('https://api.company.com/predict', json={'features': features}, timeout=3)
r.raise_for_status()
return r.json()
Go
func Predict(features []float64) (float64, int, error) {
b, _ := json.Marshal(map[string]interface{}{"features": features})
req, _ := http.NewRequest("POST", "https://api.company.com/predict", bytes.NewReader(b))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil { return 0,0, err }
defer resp.Body.Close()
var out struct{ Y float64 `json:"y"`; LatencyMs int `json:"latency_ms"` }
json.NewDecoder(resp.Body).Decode(&out)
return out.Y, out.LatencyMs, nil
}
Service Mesh (Istio) and mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata: { name: default, namespace: ml }
spec: { mtls: { mode: STRICT } }
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata: { name: allow-gateway, namespace: ml }
spec:
selector: { matchLabels: { app: clf } }
rules:
- from:
  - source:
      principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"]
Ingress and TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata: { name: clf, namespace: ml, annotations: { kubernetes.io/ingress.class: nginx, cert-manager.io/cluster-issuer: letsencrypt } }
spec:
tls: [{ hosts: ["clf.example.com"], secretName: clf-tls }]
rules:
- host: clf.example.com
http: { paths: [{ path: /, pathType: Prefix, backend: { service: { name: clf, port: { number: 80 } } } }] }
Feature Store Examples
# write features
store.write_to_online_store(feature_view_name="user_metrics", df=features_df)
# fetch
out = store.get_online_features(["user_metrics:avg_txn_30d"], entity_rows=[{"user_id": 42}]).to_dict()
Drift Detection
Kolmogorov–Smirnov Test
from scipy.stats import ks_2samp
ks, p = ks_2samp(train_feature, live_feature)
if p < 0.01: alert("feature drift")
Concept Drift (Performance Decay)
# compare AUC on recent labeled data
a = auc(y_recent, yhat_recent)
if a < (baseline_auc - 0.02): alert("concept drift")
Chaos Experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata: { name: kill-one, namespace: ml }
spec:
action: pod-kill
mode: one
selector: { namespaces: ["ml"], labelSelectors: { app: "clf" } }
duration: "1m"
AIOps Automation
// auto scale-up on sustained queue depth
if (queueDepth.p95(5*60) > 15) scaleReplicas('clf', +2)
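A grounded version of the pseudo-code above: poll Prometheus for the queue-depth p95 and bump the deployment's replica count with kubectl when backlog is sustained (the Prometheus URL, metric name, and step size are assumptions):
import subprocess
import time
import requests

PROM = "http://prometheus:9090/api/v1/query"  # assumed in-cluster Prometheus endpoint

def queue_depth_p95() -> float:
    q = "quantile_over_time(0.95, queue_depth[5m])"
    r = requests.get(PROM, params={"query": q}, timeout=5).json()
    res = r["data"]["result"]
    return float(res[0]["value"][1]) if res else 0.0

while True:
    if queue_depth_p95() > 15:
        cur = subprocess.run(
            ["kubectl", "-n", "ml", "get", "deploy/clf", "-o", "jsonpath={.spec.replicas}"],
            capture_output=True, text=True, check=True).stdout
        # scale up by two replicas; any HPA max still bounds the final count
        subprocess.run(["kubectl", "-n", "ml", "scale", "deploy/clf",
                        f"--replicas={int(cur) + 2}"], check=False)
    time.sleep(60)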
Disaster Recovery
- Cross-region backups of models and features
- Periodic restore drills; RTO/RPO documented
- Runbooks for DNS/LB failover
Multi-Cloud Deployment
module "gke" { source = ".../gke" }
module "aks" { source = ".../aks" }
# deploy identical KServe InferenceService in each cluster
Edge and Offline Inference
# k3s on edge with smaller model
resources: { limits: { cpu: 1, memory: 1Gi } }
# offline batch predictions (batched() is assumed to yield fixed-size chunks, e.g. itertools.batched)
for batch in batched(read_parquet('in.parquet'), 1000):
preds = model.predict(batch)
write_parquet('out.parquet', preds)
Batch Scoring Pipelines
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://bucket/features')
preds = predict_udf(df)
preds.write.mode('overwrite').parquet('s3://bucket/preds')
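predict_udf above is left abstract; one way to implement it is a pandas UDF that broadcasts a joblib-saved model to the executors (feature column names are illustrative):
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
model_bc = spark.sparkContext.broadcast(joblib.load("model.joblib"))

@pandas_udf("double")
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # each executor scores its partition with the broadcast model
    X = pd.concat([f1, f2, f3], axis=1)
    return pd.Series(model_bc.value.predict(X))

df = spark.read.parquet("s3://bucket/features")
preds = df.withColumn("pred", predict_udf("f1", "f2", "f3"))  # f1..f3: illustrative columns
preds.write.mode("overwrite").parquet("s3://bucket/preds")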
Feature Flags and Budget Enforcement
if (getFlag('ml.disable_expensive_route', false)) return fallback()
if (spent(tenant) > budget(tenant)) throw new Error('budget exceeded')
Cost Calculators
provider,instance,price_hour_usd
aws,g5.2xlarge,1.22
aws,m6i.2xlarge,0.384
Extended Runbooks
Throughput Drop
- Increase batch size and replicas
- Preload models; check GPU utilization
AUC Regression
- Roll back; inspect training data freshness; reweight samples
Feature Delay
- Backfill pipeline; increase SLAs with data platform
Extended FAQ (201–300)
- Are SHAP plots required? Helpful for audits; store artifacts per model version.
- Great Expectations in prod? Run inline for small checks; nightly deep checks.
- Experiment tracking vendor? W&B, MLflow, or custom; pick by org fit.
- Schema evolution without downtime? Versioned contracts; dual-read during rollout.
- Istio overhead? Small; gains in mTLS and policy control.
- Triton batch tuning? Sweep preferred sizes; watch p95.
- KServe vs custom FastAPI? KServe for platform standardization.
- Drift alert thresholds? Start conservative; iterate.
- Chaos frequency? Quarterly; postmortem learnings.
- Multi-cloud costs? Higher; offset with resilience value.
- Edge model updates? OTA updates; signed artifacts.
- Offline scoring SLAs? Windowed; communicate delays.
- Budget guardrails? Hard stop + alerts; owner approvals.
- Blue/green caveats? Connection draining; session pinning.
- OpenAPI breaking changes? New path or version; support overlap.
- Retrain triggers? Drift + business cadence.
- GDPR/PII? Minimize logs; data deletion workflows.
- Artifacts retention? Policy-based (e.g., 90 days).
- GPU preemption? Spot with buffer; failover.
- Model registry governance? Owners and approvals; audit logs.
Packaging and Containerization
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
docker build -t registry/ml/clf:1.20.0 .
docker push registry/ml/clf:1.20.0
Model Artifact Formats
- SavedModel (TF), TorchScript (PyTorch), ONNX for portability, XGBoost JSON
# torchscript
scripted = torch.jit.script(model)
torch.jit.save(scripted, 'model.ts')
# onnx
torch.onnx.export(model, example, 'model.onnx', input_names=['x'], output_names=['y'])
Online/Offline Feature Parity Code
# feature_defs.py
import math
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
def fit_features(df):
df['amt_log'] = np.log1p(df['amount'])
scaler.fit(df[['amt_log']])
return df
def transform_online(payload):
amt_log = math.log1p(payload['amount'])
amt_scaled = (amt_log - float(scaler.mean_[0])) / float(scaler.scale_[0])
return [amt_scaled, payload['age'], payload['tenure']]
Data Lineage with OpenLineage
facets:
dataQualityMetrics:
rowCount: 124502
nullCount: { user_id: 0 }
from openlineage.client.run import RunEvent, RunState
client.emit(RunEvent(eventType=RunState.START, job=job, run=run))
OTEL Detailed Spans
span.setAttributes({
'svc': 'clf',
'version': '1.20.0',
'route': '/predict',
'features.cache.hit': cacheHit,
'batch.size': batchSize
})
span.addEvent('fetch.features.start')
span.addEvent('fetch.features.end', { ms: featMs })
span.addEvent('inference', { provider: 'triton', ms: infMs })
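The span attributes above are shown from the Node side; the equivalent instrumentation in the Python service with opentelemetry-api looks like this sketch (exporter/provider setup omitted; the wrapper name is illustrative):
from opentelemetry import trace

tracer = trace.get_tracer("clf")

def predict_traced(predict_fn, features, model_version="1.20.0"):
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("model.name", "clf")
        span.set_attribute("model.version", model_version)
        span.set_attribute("batch.size", 1)
        span.add_event("features.fetched")
        y = predict_fn(features)
        span.add_event("inference.done")
        return y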
Grafana Dashboards (JSON)
{
"title": "ML Inference",
"panels": [
{"type":"timeseries","title":"p95 Latency","targets":[{"expr":"histogram_quantile(0.95, sum by (le) (rate(pred_latency_seconds_bucket[5m])))"}]},
{"type":"stat","title":"Error Rate","targets":[{"expr":"sum(rate(pred_errors_total[5m])) / sum(rate(pred_requests_total[5m]))"}]},
{"type":"table","title":"Top Tenants by QPS","targets":[{"expr":"topk(10, sum by (tenant) (rate(pred_requests_total[1m])))"}]}
]
}
Alertmanager Burn Alerts
groups:
- name: error-budget
rules:
- alert: FastBurn
expr: (1 - (sum(rate(pred_success_total[5m])) / sum(rate(pred_requests_total[5m])))) > (1 - 0.999) * 14.4
for: 5m
labels: { severity: page }
- alert: SlowBurn
expr: (1 - (sum(rate(pred_success_total[1h])) / sum(rate(pred_requests_total[1h])))) > (1 - 0.999) * 6
for: 2h
labels: { severity: page }
SLA SQL
select date_trunc('day', ts) as day,
1 - sum(case when status='error' then 1 else 0 end)::float / count(*) as availability,
percentile_cont(0.95) within group (order by latency_ms) as p95
from inference_logs
where ts >= now() - interval '30 days'
group by 1 order by 1;
Rate Limiting and Quotas
import { RateLimiterMemory } from 'rate-limiter-flexible'
const limiter = new RateLimiterMemory({ points: 100, duration: 60 })
app.post('/predict', async (req,res,next)=>{
try { await limiter.consume(req.ip); next() } catch { res.status(429).end() }
})
Multi-Tenant Budgeting
const budget = new Map<string, number>() // tenant -> monthly USD
const spent = new Map<string, number>()
export function checkBudget(tenant: string, cost: number){
const b = budget.get(tenant) || 0
const s = (spent.get(tenant) || 0) + cost
spent.set(tenant, s)
if (s > b) throw new Error('budget exceeded')
}
Policy-as-Code (OPA) for Inference
package inference
deny["no_tier"] {
not input.subject.tier
}
deny["oversized request"] {
input.request.payload_size > 1e6
}
allow {
count(deny) == 0
}
SRE Playbooks
- p95 > 500ms: scale replicas, reduce batch size, check feature store latency
- Error rate > 2%: validate schema, inspect recent deploys, rollback
- Drift alert: freeze deploys, analyze distributions, trigger retrain
Disaster Recovery Runbooks
- Promote standby region: update DNS/LB; verify health checks
- Restore models from object store snapshots
- Rebuild feature indexes; validate counts
Chaos Drills
- Kill one replica every 15 minutes; verify autoscaling stability
- Inject 200ms latency to feature store; confirm fallbacks
E2E Smoke Tests
test('predict returns numeric y and latency', async () => {
const r = await predict([0.1, 42, 5])
expect(typeof r.y).toBe('number')
expect(r.latency_ms).toBeLessThan(500)
})
Shadow Traffic Harness
export async function mirror(req){
fetch('https://canary/predict', { method: 'POST', body: JSON.stringify(req), keepalive: true }).catch(()=>{})
}
Cost Dashboards
sum by (tenant) (rate(inference_cost_usd_total[5m]))
Extended FAQ (301–420)
- Cold starts on Triton? Warm models and keep min instances.
- ONNX vs TorchScript? ONNX for portability; TorchScript for PyTorch features.
- OpenLineage value? Trace data and model lineage for audits.
- Great Expectations overhead? Run small checks inline; heavy checks as batch.
- Head vs tail sampling for traces? Tail for slow/error traces; head for volume.
- Per-tenant budgets? Yes; prevent runaway costs.
- OPA performance? Cache decisions; small policies.
- Multi-model endpoints trade-offs? Simplicity vs isolation; consider error blast radius.
- Feature TTLs? Expire stale online features to avoid drift.
- Shadow storage costs? Limit retention; sample.
- SLIs selection? Latency p95, error rate, availability, cost per req.
- Budget alerts cadence? Hourly for spikes; daily rollups for finance.
- GPU autoscaling? Custom metrics (queue depth, batch wait).
- Canary routing stickiness? Hash users to avoid crossover.
- DR RPO/RTO? Define realistic targets; test quarterly.
- Chaos blast radius? Start small; increase gradually.
- E2E smoke scope? Minimal but representative; run on each deploy.
- Edge constraints? Smaller models; reduced features.
- Multi-cloud metrics merge? Normalize labels; central warehouse.
- Managed vs self-hosted? Managed for speed; self-hosted for control.
- Data contracts enforcement? Block requests failing schema; version migrations.
- Spike protection? Rate limiters and circuit breakers.
- Cost per req target? Depends on business; trend downward.
- Model drift vs data drift? Model drift is performance decay; data drift is distribution shift.
- Alert fatigue? Hysteresis and longer for: durations; separate tickets from pages.
- Who owns drift triage? Data science + platform jointly.
- Ingress TLS renewals? Cert-manager automation.
- API versioning? /v1, /v2; deprecate with overlap.
- PII encryption? At rest and in transit; minimize use.
- Cache invalidation? On deploy or feature store update.
- Pricing models for infra? Rightsize; spot; autoscale.
- Retrain cadence aligning to seasonality? Increase before peaks (e.g., holidays).
- Secret scanning? CI and pre-commit.
- Model card updates? On each release; store with artifact.
- LaTeX in dashboards? Avoid; stick to clear metrics.
- SLO reviews? Monthly with stakeholders.
- Data contracts drift? Version bump and dual-read.
- Canary evaluation length? At least 24h; longer for weekly cycles.
- OTEL exporter choice? OTLP; unify across stacks.
- Spike of 500 errors? Rollback; inspect recent changes.
- Canary safe rollback? Feature flag switch and Helm rollback.
- Sandbox tests? Isolated env; synthetic data.
- Repo for IaC? Monorepo or infra repo with modules.
- Model ownership? Define rotation; on-call arrangements.
- P95 vs P99? Monitor both; page on P95 to reduce noise.
- Combining metrics and traces? Link by request_id; sampling strategy.
- Slow feature reads? Warm cache; local copies; optimize queries.
- Java vs Python serving? Java for throughput; Python for ecosystem.
- Canary with low traffic? Longer duration; synthetic probes.
- Pre-compute features? Batch nightly; online deltas.
- Cost dashboards audience? Finance and engineering leads.
- Model explainability retention? Store per release; sample for privacy.
- Audit prep? Evidence folders: logs, configs, approvals.
- Combining ensembles? Simple weighted average first; then meta-model.
- Throughput vs latency trade-offs? Tune batching; measure user impact.
- Docker image size? Slim base, multi-stage builds.
- Rolling updates vs recreate? Rolling preferred; ensure readiness.
- Canary per route? Yes; latency varies by route.
- Resource requests/limits? Set to avoid noisy neighbors.
- Dead letter queues? For streaming pipelines; inspect failures.
- Edge auth? mTLS + tokens; rotate.
- Backpressure to clients? 429 with Retry-After.
- JSON vs Protobuf? JSON for simplicity; Protobuf for speed.
- GPU sharing? MPS or MIG if available.
- Security scanners? Trivy/Grype on images.
- Env parity? Dev/staging/prod alignment.
- Release trains? Predictable cadence.
- Canary false positives? Segment and sample to reduce noise.
- Benchmark harness? Synthetic requests; reproducible.
- CPU throttling? Happens when usage hits limits; monitor throttle metrics.
- Memory leaks? p99 and RSS monitoring; restart policies.
- Dynamic routing? Route by segment, cost, and latency.
- Governance reviews? Quarterly; update policies.
- Nightly retrains? Monitor stability; avoid drift.
- Timeouts tuning? Keep short with retries.
- Post-deploy checks? Smoke tests, probes, dashboards.
- Alert on what? User impact metrics first.
- Rollback testing? Perform drills; document.
- Backup validation? Restore tests; checksum.
- ML budget planning? Forecast by QPS and tokens.
- RUM for inference UIs? TTFT and completion metrics.
- Migration windows? Off-peak; announce.
- Secrets rotation cadence? Monthly or on incident.
- Paging policy? P1/P2 page; P3 ticket.
- SRE/DS collaboration? Shared dashboards; on-call overlap.
- Data masking? Hash or remove; minimize use.
- Ingestion spikes? Buffer; autoscale consumers.
- GPU stocks? Capacity planning; priority queues.
- Vendor APIs? Retry with jitter; circuit breakers.
- Readiness/liveness? Probes with dependency checks.
- Canary health gates? Automated; no manual judgment required.
- Budget resets? Monthly; notify owners.
- Model labels? owner, version, tier.
- Team training? Runbooks and drills.
- SLO breaches? RCAs and improvement plans.
- KPIs for MLOps? TTFT, p95, availability, cost, drift.
- Egress costs? Cache local; compress; reduce.
- Tool sprawl? Consolidate; integrate.
- Who approves deploys? Owners and on-call.
Final: Done is monitored, not merged.
Production Readiness Checklist (Copy/Paste)
- Versioned model artifacts with provenance (data hash, code commit)
- Reproducible training (seeds, env, deps)
- Inference API with OpenAPI and schema validation
- Canary + shadow strategies defined and scripted
- p95 latency and error SLOs with burn alerts
- Drift monitors (data + concept) with thresholds and runbooks
- Feature store parity tests (online/offline)
- Rollback procedures tested (Helm, flags)
- Budget guards (tenant, route) and alerts
- Security: mTLS, SBOM/SLSA, secrets rotated, least-privilege IAM
- Observability: traces, metrics, logs wired and dashboards published
- Cost dashboards with per-tenant cost/min tracking
- Disaster Recovery: backups, restores, DNS/LB failover drills
SOP: Deploy a New Model Version
1) Train and log run (W&B/MLflow); register artifact
2) Run offline evals (golden sets); gate on AUC/loss targets
3) Publish image and chart; update values with version
4) Shadow 20% traffic; compare latency/error/drift
5) Canary 10% → 30% → 100% with health gates
6) Monitor burn alerts; rollback on breach
7) Update model card; notify stakeholders
SOP: Drift Triage and Retrain Trigger
Inputs: PSI > 0.2 or AUC drop > 0.02 sustained 2h
Steps:
- Freeze deploys; snapshot live distributions
- Segment by tenant/route; identify hot features
- Validate data contracts; check upstream schema changes
- Create retrain ticket with scope and timelines
- Launch retrain DAG; canary new model; monitor KPIs
Dashboard Templates (Grafana JSON Snippets)
{
"title": "ML Cost per Tenant",
"panels": [
{"type":"timeseries","title":"USD/min by Tenant","targets":[{"expr":"sum by (tenant) (rate(inference_cost_usd_total[1m]))"}]},
{"type":"table","title":"Top Models by Cost","targets":[{"expr":"topk(10, sum by (model) (rate(inference_cost_usd_total[5m])))"}]}
]
}
{
"title": "Inference Reliability",
"panels": [
{"type":"timeseries","title":"Availability","targets":[{"expr":"1 - sum(rate(pred_errors_total[5m]))/sum(rate(pred_requests_total[5m]))"}]},
{"type":"stat","title":"TTFT (ms)","targets":[{"expr":"llm_ttft_ms"}]}
]
}
Budget Guard Calculator
export function monthlyCostUSD(qps: number, usdPerReq: number){
return qps * usdPerReq * 60 * 60 * 24 * 30
}
export function usdPerReq(tokens: number, inPrice: number, outPrice: number){
const inTok = tokens * 0.8, outTok = tokens * 0.2
return inTok*inPrice + outTok*outPrice
}
Canary Health Gate Script (Node)
import { execSync } from 'node:child_process'
function q(expr: string){ return parseFloat(execSync(`bash -lc "curl -s localhost:9090/api/v1/query --data-urlencode query='${expr}' | jq -r .data.result[0].value[1]"`).toString()) }
const p95 = q("histogram_quantile(0.95, sum(rate(pred_latency_seconds_bucket{pod=~'clf-canary-.*'}[5m])) by (le))")
const err = q("sum(rate(pred_errors_total{pod=~'clf-canary-.*'}[5m]))/sum(rate(pred_requests_total{pod=~'clf-canary-.*'}[5m]))")
if (p95 > 0.35 || err > 0.01) process.exit(1)
Shadow Comparison Harness
import requests, time, statistics as st
BASE = 'https://prod/predict'; CANARY = 'https://canary/predict'
lat_base=[]; lat_canary=[]; errs=0
for i in range(200):
payload = { 'features': [0.1, 42, i % 10] }
t0=time.time(); rb = requests.post(BASE, json=payload, timeout=1); lb = (time.time()-t0)
t0=time.time(); rc = requests.post(CANARY, json=payload, timeout=1); lc = (time.time()-t0)
if rb.status_code!=200 or rc.status_code!=200: errs+=1
lat_base.append(lb); lat_canary.append(lc)
print('p95 base', st.quantiles(lat_base, n=100)[94], 'p95 canary', st.quantiles(lat_canary, n=100)[94], 'errs', errs)
E2E Smoke Test Suite (k6)
import http from 'k6/http'; import { check, sleep } from 'k6'
export const options = { vus: 20, duration: '2m' }
export default function () {
const r = http.post('https://api/predict', JSON.stringify({ features: [0.1,42,5] }), { headers: { 'Content-Type': 'application/json' } })
check(r, { '200': (res) => res.status === 200, 'swift': (res) => res.timings.duration < 300 })
sleep(0.5)
}
Cost Dashboard Panels (PromQL)
# cost per model per minute (rate() is per-second, so multiply by 60)
sum by (model) (rate(inference_cost_usd_total[1m])) * 60
# projected monthly cost
sum by (model) (rate(inference_cost_usd_total[5m])) * 60 * 60 * 24 * 30
SRE Playbook: Hotfix Rollback
Trigger: p95 > 500ms or error rate > 2%
1) Helm rollback to the previous revision: helm rollback clf -n ml
2) Verify readiness and metrics within 10 minutes
3) Post incident note; begin RCA
DR Drill Script (Pseudo)
aws route53 change-resource-record-sets --hosted-zone-id Z --change-batch file://dr-failover.json
kubectl --context=dr apply -f inference.yaml
kubectl --context=dr rollout status deploy/clf -n ml --timeout=10m
Extended FAQ (401–420)
- Rollback time target? Under 2 minutes to initiate, <10 minutes to stabilize.
- Canary gates location? In CI and in the deploy controller.
- Burn alert thresholds? Based on SLO and window; start conservative.
- Feature store race conditions? Transactions or idempotent writes; TTLs.
- Schema rollback? Dual-read until the migration completes.
- Token budget enforcement? Per request and per tenant; hard stop.
- A/B pollution? Sticky bucketing; segment correctly.
- GPU credit exhaustion? Autoscale down; route to CPU; reduce batch size.
- Data exfiltration risk? NetworkPolicy egress allowlists; audit.
- How to tune HPA? Queue depth + CPU; test step loads.
- Stream vs batch trade-off? Latency vs cost; pick per route.
- Observability costs? Sample logs; aggregate metrics.
- Canary kill-switch? One flag to zero traffic to the canary.
- Post-incident reviews cadence? Within 48 hours; publish actions.
- SLIs for cost? USD/min and $/request.
- Infra cost allocation? Labels; per-tenant tags; dashboards.
- Model Zoo sprawl? Archive unused; owners accountable.
- Human override? Owners can pause deploys and gates.
- Compliance evidence? Export logs, approvals, configs; store securely.
- When to declare stable? After passing gates and one cycle with healthy SLOs.