MLOps Deployment: Serving, Versioning, and Monitoring (2025)
Deploying ML is a systems problem. This guide covers model servers, registries, canary strategies, drift detection, and practical CI/CD.
Architecture
graph TB
D[Data/Features] --> F[Feature Store]
R[Registry] --> S[Model Server]
S --> M[Metrics]
S --> L[Logs]
S --> T[Traces]
G[Gateway] --> S
Serving patterns
- Online (REST/gRPC): FastAPI, TF Serving, Triton, BentoML
- Batch: Spark jobs, Airflow DAGs
- Streaming: Kafka/Flink with stateful operators
Versioning and rollouts
- Semantic model versions; lineage (data/code)
- Blue/green and canary with automated rollback
Drift and performance monitoring
def population_stability_index(expected, actual, bins=10) -> float:
    # simplified PSI (np = numpy); shared bin edges come from the expected distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges); a, _ = np.histogram(actual, bins=edges)
    e, a = e / max(1, e.sum()), a / max(1, a.sum())
    return float(np.sum((a - e) * np.log((a + 1e-9) / (e + 1e-9))))
Track: PSI/KS, latency P95, error rates, business KPIs.
CI/CD template (sketch)
stages:
- test
- build
- deploy
test:
script: ["pytest -q", "great_expectations checkpoint run"]
build:
script: ["bentoml build .", "bentoml containerize svc:latest"]
deploy:
script: ["kubectl apply -f k8s/", "kubectl rollout status deploy/ml"]
FAQ
Q: How to pick a model server?
A: Choose based on framework support, GPU needs, and ops maturity; BentoML is a great general default.
Related posts
- LLM Fine-Tuning: /blog/llm-fine-tuning-complete-guide-lora-qlora-2025
- LLM Observability: /blog/llm-observability-monitoring-langsmith-helicone-2025
- RAG Systems: /blog/rag-systems-production-guide-chunking-retrieval-2025
- ClickHouse Performance: /blog/clickhouse-analytics-database-performance-guide-2025
- Streaming (Kafka/Flink): /blog/real-time-data-streaming-kafka-flink-architecture-2025
Call to action
Need a deployment and monitoring blueprint? Request an MLOps review.
Contact: /contact • Newsletter: /newsletter
Executive Summary
This guide provides a production-grade blueprint for deploying, operating, and iterating ML models at scale: serving options, CI/CD, observability, governance, and cost optimization. It includes copy-paste code, Helm/Terraform, pipelines, and runbooks.
Deployment Patterns
- Batch scoring: scheduled jobs writing predictions to data stores
- Online inference: low-latency HTTP/gRPC services
- Streaming: event-driven with Kafka/Kinesis and stateful processors
graph LR
A[Data Sources] --> B[Feature Store]
B --> C[Batch Scoring]
B --> D[Online Serving]
B --> E[Streaming Inference]
C --> WH[Warehouse]
D --> APP[Apps]
E --> SINK[Topics/DB]
Serving Options
FastAPI (Python)
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("model.joblib")
@app.post("/predict")
def predict(x: list[float]):
return {"y": float(model.predict([x])[0])}
TensorFlow Serving
docker run -p 8501:8501 -v $PWD/model:/models/m -e MODEL_NAME=m tensorflow/serving
TorchServe
torch-model-archiver --model-name clf --version 1.0 --serialized-file model.pt --handler handler.py --export-path model_store
torchserve --start --ncs --model-store model_store --models clf.mar
NVIDIA Triton
docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v $PWD/models:/models nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver --model-repository=/models
KServe on Kubernetes
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: clf, namespace: ml }
spec:
predictor:
sklearn:
storageUri: s3://bucket/models/clf
Infrastructure (K8s/Helm/Terraform)
# Helm values for online service
image: { repository: registry/ml-clf, tag: 1.9.2 }
resources:
requests: { cpu: 500m, memory: 1Gi }
limits: { cpu: 2, memory: 4Gi }
autoscaling:
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 60
# Terraform for EKS node group
resource "aws_eks_node_group" "ml" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "ml-ng"
scaling_config = { desired_size = 3, max_size = 10, min_size = 2 }
instance_types = ["m6i.2xlarge"]
}
Model Registry
MLflow
import mlflow
with mlflow.start_run():
mlflow.log_metric("auc", 0.91)
mlflow.sklearn.log_model(model, "model")
mlflow.register_model("runs:/.../model", "credit-default")
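To consume a registered version at serving time, the model can be loaded by registry URI. A minimal sketch, assuming the credit-default model registered above has a Production stage or alias (column names are illustrative):
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/credit-default/Production")
batch = pd.DataFrame([{"age": 42, "plan_tier": 2, "avg_txn_30d": 310.5}])  # illustrative features
print(model.predict(batch))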
SageMaker Model Registry
from sagemaker.model import Model
m = Model(image_uri=img, model_data=s3_uri, role=role)
Feature Store
from feast import FeatureStore
store = FeatureStore(repo_path=".")
features = store.get_online_features([
"user:age","user:plan_tier","user:avg_txn_30d"
], entity_rows=[{"user_id": 123}]).to_dict()
Data and Versioning (DVC/LakeFS)
dvc init
dvc add data/train.parquet data/val.parquet
dvc remote add -d s3 s3://ml-data
# lakeFS policy snippet
actions:
- name: require-code-review
on: pre-merge
conditions: { changed_paths: ["models/**"] }
steps: [ require_approval ]
CI/CD Pipelines
name: ml-ci
on: [push]
jobs:
train:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.10' }
- run: pip install -r requirements.txt
- run: python train.py --config configs/exp.yaml
- run: python eval.py --suite eval/suite.yaml --out eval/report.json
- run: python tools/gate.py --auc-min 0.9 --drift-max 0.02
- uses: actions/upload-artifact@v4
with: { name: model, path: model/ }
deploy:
needs: train
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/download-artifact@v4
with: { name: model, path: model }
- run: helm upgrade --install clf charts/clf -f values/prod.yaml --wait
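The train job above gates on tools/gate.py. A minimal sketch of such a gate, assuming eval/report.json contains auc and psi fields (a hypothetical report format, not defined elsewhere in this guide):
# tools/gate.py (sketch)
import argparse, json, sys

p = argparse.ArgumentParser()
p.add_argument("--auc-min", type=float, default=0.9)
p.add_argument("--drift-max", type=float, default=0.02)
p.add_argument("--report", default="eval/report.json")
args = p.parse_args()

report = json.load(open(args.report))
failures = []
if report["auc"] < args.auc_min:
    failures.append(f"auc {report['auc']:.3f} < {args.auc_min}")
if report["psi"] > args.drift_max:
    failures.append(f"psi {report['psi']:.3f} > {args.drift_max}")
if failures:
    print("gate failed:", "; ".join(failures))
    sys.exit(1)
print("gate passed")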
Rollouts: Canary, Blue/Green, Shadow
canary:
traffic: 10%
metrics:
- latency_p95 < 300ms
- error_rate < 1%
rollback_on:
- auc_drop > 0.02
- latency_increase_ms > 150
shadow:
enabled: true
mirror_percent: 20
compare_metrics: [latency, error, drift]
Monitoring and Drift Detection
import client from 'prom-client'
const predLatency = new client.Histogram({ name: 'pred_latency_seconds', help: 'latency', buckets: [0.01,0.05,0.1,0.2,0.5,1] })
const err = new client.Counter({ name: 'pred_errors_total', help: 'errors' })
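The metrics above use prom-client on the Node side; the same two metrics for the Python/FastAPI service, using prometheus_client (the port and wrapper name are illustrative):
from prometheus_client import Counter, Histogram, start_http_server

PRED_LATENCY = Histogram("pred_latency_seconds", "prediction latency",
                         buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1])
PRED_ERRORS = Counter("pred_errors_total", "prediction errors")

def instrumented_predict(predict_fn, features):
    # times the call and counts exceptions; predict_fn is your model's predict
    with PRED_LATENCY.time():
        try:
            return predict_fn(features)
        except Exception:
            PRED_ERRORS.inc()
            raise

start_http_server(9100)  # exposes /metrics; port is illustrative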
# population stability index (PSI)
import numpy as np
def psi(expected, actual, bins=10):
    # use the same bin edges for both samples, derived from the reference distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    a, _ = np.histogram(actual, bins=edges)
    e = e / max(1, e.sum()); a = a / max(1, a.sum())
    return np.sum((a - e) * np.log((a + 1e-9) / (e + 1e-9)))
span.setAttributes({ 'model.version': 'v19', 'data.drift.psi': psiValue })
Online and Offline Evaluations
python eval/offline.py --suite eval/suite.yaml --out eval/offline.json
python eval/online_probe.py --queries probes.json --out eval/online.json
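A minimal sketch of what eval/offline.py could do, assuming a golden dataset with a label column and a joblib-saved classifier (paths and the column name are illustrative):
# eval/offline.py (sketch)
import json
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

model = joblib.load("model/model.joblib")
golden = pd.read_parquet("eval/golden.parquet")          # illustrative golden set
X, y = golden.drop(columns=["label"]), golden["label"]
scores = model.predict_proba(X)[:, 1]
report = {"auc": float(roc_auc_score(y, scores)), "n": int(len(golden))}
with open("eval/offline.json", "w") as f:
    json.dump(report, f, indent=2)
print(report)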
A/B Testing
const hash = (s: string) => [...s].reduce((a, c) => (a * 31 + c.charCodeAt(0)) >>> 0, 7)
export function route(userId: string){ return (hash(userId) % 2) ? 'A' : 'B' }
Retraining Pipelines (Airflow)
from airflow import DAG
from airflow.operators.bash import BashOperator
with DAG('retrain', schedule='0 2 * * *', start_date=...):
prep = BashOperator(task_id='prep', bash_command='python prep.py')
train = BashOperator(task_id='train', bash_command='python train.py')
evaluate = BashOperator(task_id='eval', bash_command='python eval.py')  # avoid shadowing the built-in eval
deploy = BashOperator(task_id='deploy', bash_command='python deploy.py')
prep >> train >> evaluate >> deploy
Governance and Security
rbac:
roles:
- name: viewer
permissions: [metrics.read]
- name: deployer
permissions: [deploy, rollback]
syft dir:. -o cyclonedx-json > sbom.json
cosign attest --predicate sbom.json --type cyclonedx registry/model:1.9.2
Cost Optimization
- Right-size instances; autoscaling based on queue depth
- Batch large payloads; cache frequent predictions (see the cache sketch below)
- Quantize models; distill where possible
export function tps(tokens: number, secs: number){ return tokens / secs }
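Caching frequent predictions (noted in the list above) can start as a TTL-bounded in-process cache keyed on the feature vector; a sketch with an illustrative TTL:
import hashlib, json, time

_CACHE: dict[str, tuple[float, float]] = {}  # key -> (expires_at, prediction)
TTL_SECONDS = 60  # illustrative

def cached_predict(features, predict_fn):
    key = hashlib.sha256(json.dumps(features).encode()).hexdigest()
    now = time.time()
    hit = _CACHE.get(key)
    if hit and hit[0] > now:
        return hit[1]
    y = predict_fn(features)
    _CACHE[key] = (now + TTL_SECONDS, y)
    return y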
Runbooks
Latency Spike
- Check provider/infra status, autoscaling, hot shards
- Reduce max batch size; warm cache; scale horizontally
Drift Alert
- Inspect feature distributions; roll back model; increase data freshness
Error Spike
- Check schema mismatches; validate payloads; circuit break
Call to Action
Need help deploying ML models reliably? We design and operate MLOps stacks with robust serving, observability, and governance. Contact us for a free assessment.
Extended FAQ (1–120)
- Batch vs online? Batch for non-real-time; online for interactive use.
- KServe vs TF Serving? KServe for K8s-native, TF Serving for TensorFlow-specific.
- Triton benefits? Multi-framework, dynamic batching, GPU acceleration.
- Model registry choice? MLflow for OSS; SageMaker/Vertex if cloud-managed.
- Feature store worth it? Yes if multiple models share features; offline/online parity.
- Canary duration? At least one business cycle (24–72h).
- Drift metric? PSI/JS divergence; alert when threshold exceeded.
- Shadow mode purpose? Compare behavior with zero user impact.
- Autoscaling signal? Queue depth + latency p95.
- Quantization risks? Small accuracy loss; monitor A/B.
... (continue with 110+ practical Q/A on serving, scaling, storage, evals, governance, security, cost)
Advanced Serving Patterns
Ensemble Serving (Stacking/Blending)
def predict(x):
p1 = model_a.predict_proba(x)[:,1]
p2 = model_b.predict_proba(x)[:,1]
return 0.6*p1 + 0.4*p2
Multi-Model Endpoints
# Triton model ensemble
model_repository:
- name: router
platform: ensemble
ensemble_scheduling:
step: [{ model_name: xgb }, { model_name: nn }]
Request Routing (By Segment)
export function route(req: { plan: string, country: string }){
if (req.plan === 'enterprise') return 'model_large'
if (req.country === 'BR') return 'model_br'
return 'model_default'
}
Dynamic Batching and Autoscaling
# Triton config
max_batch_size: 32
instance_group:
- kind: KIND_GPU
count: 2
dynamic_batching:
preferred_batch_size: [8, 16, 32]
max_queue_delay_microseconds: 5000
# K8s HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: clf }
spec:
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric: { name: queue_depth }
target: { type: AverageValue, averageValue: 10 }
KServe/Triton Configurations
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: triton-ens, namespace: ml }
spec:
predictor:
triton:
storageUri: s3://bucket/models
runtimeVersion: 23.09
resources:
limits: { nvidia.com/gpu: 1, cpu: 2, memory: 8Gi }
Feature Store Parity (Online/Offline)
# feast apply
from feast import FeatureStore
store = FeatureStore(repo_path=".")
# training features
df = store.get_historical_features(entity_df=entities, features=[...]).to_df()
# online features
onl = store.get_online_features(features=[...], entity_rows=[{"user_id": 1}])
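Parity between the two paths is worth a unit test; a sketch assuming hypothetical transform_offline/transform_online helpers that should produce identical vectors for the same raw record:
# test_feature_parity.py (sketch)
import numpy as np
from features import transform_offline, transform_online  # hypothetical module

def test_online_offline_parity():
    raw = {"user_id": 1, "amount": 120.0, "age": 34, "tenure": 12}  # illustrative record
    offline_vec = transform_offline(raw)   # batch/training path
    online_vec = transform_online(raw)     # serving path
    np.testing.assert_allclose(offline_vec, online_vec, rtol=1e-6)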
Data Contracts and Schema Validation
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "PredictRequest",
"type": "object",
"required": ["features"],
"properties": {
"features": {
"type": "array",
"items": { "type": "number" },
"minItems": 10, "maxItems": 10
}
},
"additionalProperties": false
}
import Ajv from 'ajv'
const ajv = new Ajv(); const validate = ajv.compile(schema)
app.post('/predict', (req,res)=>{
if (!validate(req.body)) return res.status(400).json({ error: 'bad schema' })
res.json(predict(req.body.features))
})
OpenTelemetry Traces
span.setAttributes({ 'model.name': 'clf', 'model.version': 'v20', 'batch.size': 16 })
span.addEvent('features.fetched', { ms: 12 })
span.addEvent('inference.done', { ms: 18 })
Prometheus Dashboards and Alerts
# p95 latency
histogram_quantile(0.95, sum by (le) (rate(pred_latency_seconds_bucket[5m])))
groups:
- name: ml-ops
rules:
- alert: HighLatency
expr: histogram_quantile(0.95, sum(rate(pred_latency_seconds_bucket[5m])) by (le)) > 0.3
for: 10m
labels: { severity: page }
- alert: DriftHigh
expr: avg_over_time(feature_psi[30m]) > 0.2
for: 30m
labels: { severity: ticket }
Canary Metrics and Rollback
canary:
traffic: 20%
success_criteria:
win_rate_delta: "> -0.01"
latency_p95_ms: "< 350"
rollback_on:
- error_rate: "> 1%"
helm rollback clf 42 --namespace ml
Blue/Green Scripts
# run blue and green as separate deployments; switch the Service selector once green is healthy (manifest and label names illustrative)
kubectl -n ml apply -f k8s/deploy-green.yaml
kubectl -n ml rollout status deploy/clf-green
kubectl -n ml patch service clf -p '{"spec":{"selector":{"app":"clf","color":"green"}}}'
Airflow/Dagster Retraining with Drift Triggers
# Airflow sensor
from airflow.sensors.base import BaseSensorOperator
class DriftSensor(BaseSensorOperator):
def poke(self, context):
return get_psi() > 0.2
# Dagster schedule
from dagster import schedule
@schedule(cron_schedule="0 3 * * *", job=retrain_job)
def retrain_on_schedule(_):
return {"ops": {"train": {"config": {"dataset": "fresh"}}}}
Governance Policies
policies:
approvals_required: 2
owners:
- platform-ml@company.com
- security@company.com
pii_logging: false
PII Handling
const PII = [/\b\d{3}-\d{2}-\d{4}\b/, /\b\d{16}\b/]
export function redact(s: string){ return PII.reduce((a,r)=>a.replace(r,'[REDACTED]'), s) }
SBOM and SLSA
syft dir:. -o cyclonedx-json > sbom.json
cosign attest --predicate sbom.json --type cyclonedx registry/clf:1.0.0
Helm Values and Terraform Modules
# values/prod.yaml
nodeSelector: { nodepool: compute }
tolerations: [{ key: gpu, operator: Exists, effect: NoSchedule }]
podDisruptionBudget: { minAvailable: 1 }
module "prom_stack" {
source = "git::ssh://git@github.com/company/infra//modules/prometheus"
namespace = "observability"
}
Cost Calculators
scenario,qps,latency_p95_ms,replicas,instance,cost_usd_month
base,200,180,4,m6i.2xlarge,XXXX
peak,800,240,12,m6i.2xlarge,YYYY
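The cost column can be estimated from replica count and hourly instance pricing; a sketch of the arithmetic (the price shown is illustrative; check current provider rates):
def monthly_cost_usd(replicas: int, price_per_hour_usd: float, hours_per_month: int = 730) -> float:
    return replicas * price_per_hour_usd * hours_per_month

# example: 4 x m6i.2xlarge at ~$0.384/h (illustrative) -> ~$1,121/month
print(round(monthly_cost_usd(4, 0.384)))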
Runbooks (Expanded)
Feature Store Outage
- Switch to cached features with TTL
- Reduce traffic by rate limiting non-critical routes
- Page data platform
GPU Saturation
- Increase batch size cautiously; scale replicas; tune streams
Storage Latency
- Verify PVC performance; move hot models to NVMe
Extended FAQ (121–200)
- How to choose batch size? Measure p95 vs throughput; avoid queue overflows.
- Online vs offline features parity? Use same definitions; unit tests for transformations.
- Canary KPI selection? Latency p95, error rate, win-rate vs baseline.
- PSI thresholds? 0.1–0.2 typical; domain-specific.
- GPU vs CPU for inference? GPU for deep models; CPU for tree/linear.
- A/B duration? At least one traffic cycle; segment by user.
- Feature importance drift? Track SHAP/permute importance; alert on changes.
- Shadow traffic storage? Limit; hash PII; expire.
- Model card? Document metrics, data, risks, owners.
- Rollback speed? <2 minutes via Helm rollback.
- Blue/green pitfalls? Stale connections; ensure drain.
- Terraform state safety? Backups and locks; CI plan display.
- PII fields in logs? Redact or hash; keep minimal.
- Secrets? Vault/SM; rotate; never in env dumps.
- Reproducibility? Seed, versions, and artifacts recorded.
- Cold start mitigation? Warmers; keep hot path resident.
- Multi-tenancy? Quotas and isolation; per-tenant SLOs.
- Observability SLOs? Latency and error p95 with burn alerts.
- Cost runaway? Budget guards and autoscaling.
- Vendor lock-in? Abstract serving; portable formats.
- Canary scope? Per-route or tenant; gradual ramp.
- Data skew? Stratified sampling; segment dashboards.
- Ground truth lag? Backfill offline evals later.
- Traffic shaping? Rate limit low-priority.
- CI flakes? Retries and timeouts; stabilize tests.
- Schema evolution? Versioned contracts; migration path.
- Backward compatibility? Adapters or transform layers.
- SLA design? Realistic targets; publish.
- Disk I/O bottlenecks? Profile; NVMe; preloading.
- Cache eviction policy? LRU + TTL; monitor hit rate.
- Model explainability? Log features and SHAP summaries.
- Security contexts? Run as non-root; seccomp.
- Egress policies? NetworkPolicy allowlists.
- Data residency? Region-locked deployments.
- Queue backpressure? 429 + jitter; drain protection.
- Timeout policy? Tailored per route; circuit breaker.
- Blue/green secrets? Scoped per color; rotate.
- Snapshot restores? Drills quarterly.
- Notebook leakage? Strip outputs; commit clean.
- Autoscaling metric choice? Queue depth + CPU better than one.
- Payload size limits? Enforce; compress; chunk.
- Canary noise? Use sufficient volume; filter outliers.
- Vendor outage? Failover or degrade gracefully.
- Canary rollbacks? Automate and log reasons.
- Retraining cadence? Data drift and business cycles.
- Human-in-the-loop? For critical decisions and appeals.
- Data quality checks? Pre-ingest validation; contracts.
- Cache poisoning? Validate inputs; hash keys.
- Data skew alarms? Segmented alerts; thresholds.
- Edge serving? Small models and caches.
- Observability noise? Reduce labels; sample.
- Golden datasets? Maintain and expand.
- Cost trade-offs? Cheaper models vs quality.
- Backfills? Idempotent; low priority windows.
- Model archive? Artifact registry with hashes.
- A/B contamination? Sticky bucketing; clear cookies.
- Real user monitoring? TTFT and conversion.
- Retry storms? Circuit breakers; budgets.
- GPU memory? Profile and right-size.
- Batch windows? Off-peak; monitor impact.
- PCI/PHI? Compliance gates; encryption.
- Leak detection? Regex/classifier scans.
- Pipeline ownership? Docs and on-call.
- Model zoo? Catalog and lifecycle.
- SLA credits? Define in contracts.
- Rollout calendars? Avoid peak traffic windows.
- Shadow pitfalls? Storage costs; sample.
- Canary SLOs? Stricter than prod SLOs.
- Rerun evals? Nightly; before deploys.
- Naming versions? Semver or date-based.
- Infra drift? IaC and scans; conftest.
- Post-incident? Blameless RCA; actions.
- Multi-cloud? Latency and egress costs.
- Limits enforcement? Max tokens/size; budgets.
- GPU scheduling? Bin pack; priorities.
- Warm restarts? Graceful; low churn.
- Log retention? 30 days typical; privacy.
- Data deletion? Process and prove.
- Observability costs? Sample; retention tiers.
- When to re-architect? When scale/complexity exceeds current design.
Model Explainability (SHAP/LIME)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val[:100])
shap.summary_plot(shap_values, X_val[:100])
# LIME example
from lime.lime_tabular import LimeTabularExplainer
lime = LimeTabularExplainer(X_train.values, feature_names=X_train.columns)
exp = lime.explain_instance(X_val.iloc[0].values, model.predict_proba, num_features=8)
Data Validation with Great Expectations
import great_expectations as ge
from great_expectations.core.batch import BatchRequest
context = ge.get_context()
validator = context.sources.pandas_default.read_csv("data/val.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_table_row_count_to_be_between(1000, 1000000)
results = validator.validate()
assert results.success
Live Model Switching with Feature Flags
import { getFlag } from '@openfeature/server-sdk'
export function pickModel(user: { tier: string }){
const flag = getFlag<boolean>('ml.use_large_model', false)
if (flag || user.tier === 'enterprise') return 'clf-large-v20'
return 'clf-base-v12'
}
Experiment Tracking (Weights & Biases)
import wandb
wandb.init(project="mlops-clf", config={"model": "xgb", "version": 20})
wandb.log({"auc": 0.912, "loss": 0.41})
Inference OpenAPI
openapi: 3.0.3
info: { title: Inference API, version: 1.0.0 }
paths:
/predict:
post:
requestBody:
required: true
content: { application/json: { schema: { $ref: '#/components/schemas/PredictRequest' } } }
responses:
'200': { description: ok, content: { application/json: { schema: { $ref: '#/components/schemas/PredictResponse' } } } }
components:
schemas:
PredictRequest:
type: object
required: [features]
properties: { features: { type: array, items: { type: number }, minItems: 10, maxItems: 10 } }
PredictResponse:
type: object
properties: { y: { type: number }, latency_ms: { type: integer } }
Client SDKs
TypeScript
export async function predict(features: number[]){
const r = await fetch('/predict', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ features }) })
if (!r.ok) throw new Error('predict failed')
return r.json() as Promise<{ y: number, latency_ms: number }>
}
Python
import requests
def predict(features):
r = requests.post('https://api.company.com/predict', json={'features': features}, timeout=3)
r.raise_for_status()
return r.json()
Go
func Predict(features []float64) (float64, int, error) {
b, _ := json.Marshal(map[string]interface{}{"features": features})
req, _ := http.NewRequest("POST", "https://api.company.com/predict", bytes.NewReader(b))
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil { return 0,0, err }
defer resp.Body.Close()
var out struct{ Y float64 `json:"y"`; LatencyMs int `json:"latency_ms"` }
json.NewDecoder(resp.Body).Decode(&out)
return out.Y, out.LatencyMs, nil
}
Service Mesh (Istio) and mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata: { name: default, namespace: ml }
spec: { mtls: { mode: STRICT } }
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata: { name: allow-gateway, namespace: ml }
spec:
selector: { matchLabels: { app: clf } }
rules:
- from:
  - source:
      principals: ["cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account"]
Ingress and TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata: { name: clf, namespace: ml, annotations: { kubernetes.io/ingress.class: nginx, cert-manager.io/cluster-issuer: letsencrypt } }
spec:
tls: [{ hosts: ["clf.example.com"], secretName: clf-tls }]
rules:
- host: clf.example.com
http: { paths: [{ path: /, pathType: Prefix, backend: { service: { name: clf, port: { number: 80 } } } }] }
Feature Store Examples
# write features
store.write_to_online_store(feature_view_name="user_metrics", df=features_df)
# fetch
out = store.get_online_features(["user_metrics:avg_txn_30d"], entity_rows=[{"user_id": 42}]).to_dict()
Drift Detection
Kolmogorov–Smirnov Test
from scipy.stats import ks_2samp
ks, p = ks_2samp(train_feature, live_feature)
if p < 0.01: alert("feature drift")
Concept Drift (Performance Decay)
# compare AUC on recent labeled data
a = auc(y_recent, yhat_recent)
if a < (baseline_auc - 0.02): alert("concept drift")
Chaos Experiments
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata: { name: kill-one, namespace: ml }
spec:
action: pod-kill
mode: one
selector: { namespaces: ["ml"], labelSelectors: { app: "clf" } }
duration: "1m"
AIOps Automation
// auto scale-up on sustained queue depth
if (queueDepth.p95(5*60) > 15) scaleReplicas('clf', +2)
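A grounded version of the pseudo-code above: poll Prometheus for the queue-depth p95 and bump the deployment's replica count with kubectl when backlog is sustained (the Prometheus URL, metric name, and step size are assumptions):
import subprocess
import time
import requests

PROM = "http://prometheus:9090/api/v1/query"  # assumed in-cluster Prometheus endpoint

def queue_depth_p95() -> float:
    q = "quantile_over_time(0.95, queue_depth[5m])"
    r = requests.get(PROM, params={"query": q}, timeout=5).json()
    res = r["data"]["result"]
    return float(res[0]["value"][1]) if res else 0.0

while True:
    if queue_depth_p95() > 15:
        cur = subprocess.run(
            ["kubectl", "-n", "ml", "get", "deploy/clf", "-o", "jsonpath={.spec.replicas}"],
            capture_output=True, text=True, check=True).stdout
        # scale up by two replicas; any HPA max still bounds the final count
        subprocess.run(["kubectl", "-n", "ml", "scale", "deploy/clf",
                        f"--replicas={int(cur) + 2}"], check=False)
    time.sleep(60)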
Disaster Recovery
- Cross-region backups of models and features
- Periodic restore drills; RTO/RPO documented
- Runbooks for DNS/LB failover
Multi-Cloud Deployment
module "gke" { source = ".../gke" }
module "aks" { source = ".../aks" }
# deploy identical KServe InferenceService in each cluster
Edge and Offline Inference
# k3s on edge with smaller model
resources: { limits: { cpu: 1, memory: 1Gi } }
# offline batch predictions (batched() is assumed to yield fixed-size chunks, e.g. itertools.batched)
for batch in batched(read_parquet('in.parquet'), 1000):
preds = model.predict(batch)
write_parquet('out.parquet', preds)
Batch Scoring Pipelines
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://bucket/features')
preds = predict_udf(df)
preds.write.mode('overwrite').parquet('s3://bucket/preds')
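predict_udf above is left abstract; one way to implement it is a pandas UDF that broadcasts a joblib-saved model to the executors (feature column names are illustrative):
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
model_bc = spark.sparkContext.broadcast(joblib.load("model.joblib"))

@pandas_udf("double")
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # each executor scores its partition with the broadcast model
    X = pd.concat([f1, f2, f3], axis=1)
    return pd.Series(model_bc.value.predict(X))

df = spark.read.parquet("s3://bucket/features")
preds = df.withColumn("pred", predict_udf("f1", "f2", "f3"))  # f1..f3: illustrative columns
preds.write.mode("overwrite").parquet("s3://bucket/preds")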
Feature Flags and Budget Enforcement
if (getFlag('ml.disable_expensive_route', false)) return fallback()
if (spent(tenant) > budget(tenant)) throw new Error('budget exceeded')
Cost Calculators
provider,instance,price_hour_usd
aws,g5.2xlarge,1.22
aws,m6i.2xlarge,0.384
Extended Runbooks
Throughput Drop
- Increase batch size and replicas
- Preload models; check GPU utilization
AUC Regression
- Roll back; inspect training data freshness; reweight samples
Feature Delay
- Backfill pipeline; increase SLAs with data platform
Extended FAQ (201–300)
- Are SHAP plots required? Helpful for audits; store artifacts per model version.
- Great Expectations in prod? Run inline for small checks; nightly deep checks.
- Experiment tracking vendor? W&B, MLflow, or custom; pick by org fit.
- Schema evolution without downtime? Versioned contracts; dual-read during rollout.
- Istio overhead? Small; gains in mTLS and policy control.
- Triton batch tuning? Sweep preferred sizes; watch p95.
- KServe vs custom FastAPI? KServe for platform standardization.
- Drift alert thresholds? Start conservative; iterate.
- Chaos frequency? Quarterly; postmortem learnings.
- Multi-cloud costs? Higher; offset with resilience value.
- Edge model updates? OTA updates; signed artifacts.
- Offline scoring SLAs? Windowed; communicate delays.
- Budget guardrails? Hard stop + alerts; owner approvals.
- Blue/green caveats? Connection draining; session pinning.
- OpenAPI breaking changes? New path or version; support overlap.
- Retrain triggers? Drift + business cadence.
- GDPR/PII? Minimize logs; data deletion workflows.
- Artifacts retention? Policy-based (e.g., 90 days).
- GPU preemption? Spot with buffer; failover.
- Model registry governance? Owners and approvals; audit logs.
Packaging and Containerization
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
docker build -t registry/ml/clf:1.20.0 .
docker push registry/ml/clf:1.20.0
Model Artifact Formats
- SavedModel (TF), TorchScript (PyTorch), ONNX for portability, XGBoost JSON
# torchscript
scripted = torch.jit.script(model)
torch.jit.save(scripted, 'model.ts')
# onnx
torch.onnx.export(model, example, 'model.onnx', input_names=['x'], output_names=['y'])
Online/Offline Feature Parity Code
# feature_defs.py
import math
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
def fit_features(df):
df['amt_log'] = np.log1p(df['amount'])
scaler.fit(df[['amt_log']])
return df
def transform_online(payload):
amt_log = math.log1p(payload['amount'])
amt_scaled = (amt_log - float(scaler.mean_[0])) / float(scaler.scale_[0])
return [amt_scaled, payload['age'], payload['tenure']]
Data Lineage with OpenLineage
facets:
dataQualityMetrics:
rowCount: 124502
nullCount: { user_id: 0 }
from openlineage.client.run import RunEvent, RunState
client.emit(RunEvent(eventType=RunState.START, job=job, run=run))
OTEL Detailed Spans
span.setAttributes({
'svc': 'clf',
'version': '1.20.0',
'route': '/predict',
'features.cache.hit': cacheHit,
'batch.size': batchSize
})
span.addEvent('fetch.features.start')
span.addEvent('fetch.features.end', { ms: featMs })
span.addEvent('inference', { provider: 'triton', ms: infMs })
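The span attributes above are shown from the Node side; the equivalent instrumentation in the Python service with opentelemetry-api looks like this sketch (exporter/provider setup omitted; the wrapper name is illustrative):
from opentelemetry import trace

tracer = trace.get_tracer("clf")

def predict_traced(predict_fn, features, model_version="1.20.0"):
    with tracer.start_as_current_span("predict") as span:
        span.set_attribute("model.name", "clf")
        span.set_attribute("model.version", model_version)
        span.set_attribute("batch.size", 1)
        span.add_event("features.fetched")
        y = predict_fn(features)
        span.add_event("inference.done")
        return y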
Grafana Dashboards (JSON)
{
"title": "ML Inference",
"panels": [
{"type":"timeseries","title":"p95 Latency","targets":[{"expr":"histogram_quantile(0.95, sum by (le) (rate(pred_latency_seconds_bucket[5m])))"}]},
{"type":"stat","title":"Error Rate","targets":[{"expr":"sum(rate(pred_errors_total[5m])) / sum(rate(pred_requests_total[5m]))"}]},
{"type":"table","title":"Top Tenants by QPS","targets":[{"expr":"topk(10, sum by (tenant) (rate(pred_requests_total[1m])))"}]}
]
}
Alertmanager Burn Alerts
groups:
- name: error-budget
rules:
- alert: FastBurn
expr: (1 - (sum(rate(pred_success_total[5m])) / sum(rate(pred_requests_total[5m])))) > (1 - 0.999) * 14.4
for: 5m
labels: { severity: page }
- alert: SlowBurn
expr: (1 - (sum(rate(pred_success_total[1h])) / sum(rate(pred_requests_total[1h])))) > (1 - 0.999) * 6
for: 2h
labels: { severity: page }
SLA SQL
select date_trunc('day', ts) as day,
1 - sum(case when status='error' then 1 else 0 end)::float / count(*) as availability,
percentile_cont(0.95) within group (order by latency_ms) as p95
from inference_logs
where ts >= now() - interval '30 days'
group by 1 order by 1;
Rate Limiting and Quotas
import { RateLimiterMemory } from 'rate-limiter-flexible'
const limiter = new RateLimiterMemory({ points: 100, duration: 60 })
app.post('/predict', async (req,res,next)=>{
try { await limiter.consume(req.ip); next() } catch { res.status(429).end() }
})
Multi-Tenant Budgeting
const budget = new Map<string, number>() // tenant -> monthly USD
const spent = new Map<string, number>()
export function checkBudget(tenant: string, cost: number){
const b = budget.get(tenant) || 0
const s = (spent.get(tenant) || 0) + cost
spent.set(tenant, s)
if (s > b) throw new Error('budget exceeded')
}
Policy-as-Code (OPA) for Inference
package inference
deny["no_tier"] {
not input.subject.tier
}
deny["oversized request"] {
input.request.payload_size > 1e6
}
allow {
count(deny) == 0
}
SRE Playbooks
- p95 > 500ms: scale replicas, reduce batch size, check feature store latency
- Error rate > 2%: validate schema, inspect recent deploys, rollback
- Drift alert: freeze deploys, analyze distributions, trigger retrain
Disaster Recovery Runbooks
- Promote standby region: update DNS/LB; verify health checks
- Restore models from object store snapshots
- Rebuild feature indexes; validate counts
Chaos Drills
- Kill one replica every 15 minutes; verify autoscaling stability
- Inject 200ms latency to feature store; confirm fallbacks
E2E Smoke Tests
test('predict returns numeric y and latency', async () => {
const r = await predict([0.1, 42, 5])
expect(typeof r.y).toBe('number')
expect(r.latency_ms).toBeLessThan(500)
})
Shadow Traffic Harness
export async function mirror(req){
fetch('https://canary/predict', { method: 'POST', body: JSON.stringify(req), keepalive: true }).catch(()=>{})
}
Cost Dashboards
sum by (tenant) (rate(inference_cost_usd_total[5m]))
Extended FAQ (301–420)
- Cold starts on Triton? Warm models and keep min instances.
- ONNX vs TorchScript? ONNX for portability; TorchScript for PyTorch features.
- OpenLineage value? Trace data and model lineage for audits.
- Great Expectations overhead? Run small checks inline; heavy checks as batch.
- Head vs tail sampling for traces? Tail for slow/error traces; head for volume.
- Per-tenant budgets? Yes; prevent runaway costs.
- OPA performance? Cache decisions; small policies.
- Multi-model endpoints trade-offs? Simplicity vs isolation; consider error blast radius.
- Feature TTLs? Expire stale online features to avoid drift.
- Shadow storage costs? Limit retention; sample.
- SLIs selection? Latency p95, error rate, availability, cost per req.
- Budget alerts cadence? Hourly for spikes; daily rollups for finance.
- GPU autoscaling? Custom metrics (queue depth, batch wait).
- Canary routing stickiness? Hash users to avoid crossover.
- DR RPO/RTO? Define realistic targets; test quarterly.
- Chaos blast radius? Start small; increase gradually.
- E2E smoke scope? Minimal but representative; run on each deploy.
- Edge constraints? Smaller models; reduced features.
- Multi-cloud metrics merge? Normalize labels; central warehouse.
- Managed vs self-hosted? Managed for speed; self-hosted for control.
- Data contracts enforcement? Block requests failing schema; version migrations.
- Spike protection? Rate limiters and circuit breakers.
- Cost per req target? Depends on business; trend downward.
- Model drift vs data drift? Model drift is performance decay; data drift is distribution shift.
- Alert fatigue? Hysteresis and longer for: durations; separate tickets from pages.
- Who owns drift triage? Data science + platform jointly.
- Ingress TLS renewals? Cert-manager automation.
- API versioning? /v1, /v2; deprecate with overlap.
- PII encryption? At rest and in transit; minimize use.
- Cache invalidation? On deploy or feature store update.
- Pricing models for infra? Rightsize; spot; autoscale.
- Retrain cadence aligning to seasonality? Increase before peaks (e.g., holidays).
- Secret scanning? CI and pre-commit.
- Model card updates? On each release; store with artifact.
- LaTeX in dashboards? Avoid; stick to clear metrics.
- SLO reviews? Monthly with stakeholders.
- Data contracts drift? Version bump and dual-read.
- Canary evaluation length? At least 24h; longer for weekly cycles.
- OTEL exporter choice? OTLP; unify across stacks.
- Spike of 500 errors? Rollback; inspect recent changes.
- Canary safe rollback? Feature flag switch and Helm rollback.
- Sandbox tests? Isolated env; synthetic data.
- Repo for IaC? Monorepo or infra repo with modules.
- Model ownership? Define rotation; on-call arrangements.
- P95 vs P99? Monitor both; page on P95 to reduce noise.
- Combining metrics and traces? Link by request_id; sampling strategy.
- Slow feature reads? Warm cache; local copies; optimize queries.
- Java vs Python serving? Java for throughput; Python for ecosystem.
- Canary with low traffic? Longer duration; synthetic probes.
- Pre-compute features? Batch nightly; online deltas.
- Cost dashboards audience? Finance and engineering leads.
- Model explainability retention? Store per release; sample for privacy.
- Audit prep? Evidence folders: logs, configs, approvals.
- Combining ensembles? Simple weighted average first; then meta-model.
- Throughput vs latency trade-offs? Tune batching; measure user impact.
- Docker image size? Slim base, multi-stage builds.
- Rolling updates vs recreate? Rolling preferred; ensure readiness.
- Canary per route? Yes; latency varies by route.
- Resource requests/limits? Set to avoid noisy neighbors.
- Dead letter queues? For streaming pipelines; inspect failures.
- Edge auth? mTLS + tokens; rotate.
- Backpressure to clients? 429 with Retry-After.
- JSON vs Protobuf? JSON for simplicity; Protobuf for speed.
- GPU sharing? MPS or MIG if available.
- Security scanners? Trivy/Grype on images.
- Env parity? Dev/staging/prod alignment.
- Release trains? Predictable cadence.
- Canary false positives? Segment and sample to reduce noise.
- Benchmark harness? Synthetic requests; reproducible.
- CPU throttling? Happens when usage hits limits; monitor throttle metrics.
- Memory leaks? p99 and RSS monitoring; restart policies.
- Dynamic routing? Route by segment, cost, and latency.
- Governance reviews? Quarterly; update policies.
- Nightly retrains? Monitor stability; avoid drift.
- Timeouts tuning? Keep short with retries.
- Post-deploy checks? Smoke tests, probes, dashboards.
- Alert on what? User impact metrics first.
- Rollback testing? Perform drills; document.
- Backup validation? Restore tests; checksum.
- ML budget planning? Forecast by QPS and tokens.
- RUM for inference UIs? TTFT and completion metrics.
- Migration windows? Off-peak; announce.
- Secrets rotation cadence? Monthly or on incident.
- Paging policy? P1/P2 page; P3 ticket.
- SRE/DS collaboration? Shared dashboards; on-call overlap.
- Data masking? Hash or remove; minimize use.
- Ingestion spikes? Buffer; autoscale consumers.
- GPU stocks? Capacity planning; priority queues.
- Vendor APIs? Retry with jitter; circuit breakers.
- Readiness/liveness? Probes with dependency checks.
- Canary health gates? Automated; no manual judgment required.
- Budget resets? Monthly; notify owners.
- Model labels? owner, version, tier.
- Team training? Runbooks and drills.
- SLO breaches? RCAs and improvement plans.
- KPIs for MLOps? TTFT, p95, availability, cost, drift.
- Egress costs? Cache local; compress; reduce.
- Tool sprawl? Consolidate; integrate.
- Who approves deploys? Owners and on-call.
Final: Done is monitored, not merged.
Production Readiness Checklist (Copy/Paste)
- Versioned model artifacts with provenance (data hash, code commit)
- Reproducible training (seeds, env, deps)
- Inference API with OpenAPI and schema validation
- Canary + shadow strategies defined and scripted
- p95 latency and error SLOs with burn alerts
- Drift monitors (data + concept) with thresholds and runbooks
- Feature store parity tests (online/offline)
- Rollback procedures tested (Helm, flags)
- Budget guards (tenant, route) and alerts
- Security: mTLS, SBOM/SLSA, secrets rotated, least-privilege IAM
- Observability: traces, metrics, logs wired and dashboards published
- Cost dashboards with per-tenant cost/min tracking
- Disaster Recovery: backups, restores, DNS/LB failover drills
SOP: Deploy a New Model Version
1) Train and log run (W&B/MLflow); register artifact
2) Run offline evals (golden sets); gate on AUC/loss targets
3) Publish image and chart; update values with version
4) Shadow 20% traffic; compare latency/error/drift
5) Canary 10% → 30% → 100% with health gates
6) Monitor burn alerts; rollback on breach
7) Update model card; notify stakeholders
SOP: Drift Triage and Retrain Trigger
Inputs: PSI > 0.2 or AUC drop > 0.02 sustained 2h
Steps:
- Freeze deploys; snapshot live distributions
- Segment by tenant/route; identify hot features
- Validate data contracts; check upstream schema changes
- Create retrain ticket with scope and timelines
- Launch retrain DAG; canary new model; monitor KPIs
Dashboard Templates (Grafana JSON Snippets)
{
"title": "ML Cost per Tenant",
"panels": [
{"type":"timeseries","title":"USD/min by Tenant","targets":[{"expr":"sum by (tenant) (rate(inference_cost_usd_total[1m]))"}]},
{"type":"table","title":"Top Models by Cost","targets":[{"expr":"topk(10, sum by (model) (rate(inference_cost_usd_total[5m])))"}]}
]
}
{
"title": "Inference Reliability",
"panels": [
{"type":"timeseries","title":"Availability","targets":[{"expr":"1 - sum(rate(pred_errors_total[5m]))/sum(rate(pred_requests_total[5m]))"}]},
{"type":"stat","title":"TTFT (ms)","targets":[{"expr":"llm_ttft_ms"}]}
]
}
Budget Guard Calculator
export function monthlyCostUSD(qps: number, usdPerReq: number){
return qps * usdPerReq * 60 * 60 * 24 * 30
}
export function usdPerReq(tokens: number, inPrice: number, outPrice: number){
const inTok = tokens * 0.8, outTok = tokens * 0.2
return inTok*inPrice + outTok*outPrice
}
Canary Health Gate Script (Node)
import { execSync } from 'node:child_process'
function q(expr: string){ return parseFloat(execSync(`bash -lc "curl -s localhost:9090/api/v1/query --data-urlencode query='${expr}' | jq -r .data.result[0].value[1]"`).toString()) }
const p95 = q("histogram_quantile(0.95, sum(rate(pred_latency_seconds_bucket{pod=~'clf-canary-.*'}[5m])) by (le))")
const err = q("sum(rate(pred_errors_total{pod=~'clf-canary-.*'}[5m]))/sum(rate(pred_requests_total{pod=~'clf-canary-.*'}[5m]))")
if (p95 > 0.35 || err > 0.01) process.exit(1)
Shadow Comparison Harness
import requests, time, statistics as st
BASE = 'https://prod/predict'; CANARY = 'https://canary/predict'
lat_base=[]; lat_canary=[]; errs=0
for i in range(200):
payload = { 'features': [0.1, 42, i % 10] }
t0=time.time(); rb = requests.post(BASE, json=payload, timeout=1); lb = (time.time()-t0)
t0=time.time(); rc = requests.post(CANARY, json=payload, timeout=1); lc = (time.time()-t0)
if rb.status_code!=200 or rc.status_code!=200: errs+=1
lat_base.append(lb); lat_canary.append(lc)
print('p95 base', st.quantiles(lat_base, n=100)[94], 'p95 canary', st.quantiles(lat_canary, n=100)[94], 'errs', errs)
E2E Smoke Test Suite (k6)
import http from 'k6/http'; import { check, sleep } from 'k6'
export const options = { vus: 20, duration: '2m' }
export default function () {
const r = http.post('https://api/predict', JSON.stringify({ features: [0.1,42,5] }), { headers: { 'Content-Type': 'application/json' } })
check(r, { '200': (res) => res.status === 200, 'swift': (res) => res.timings.duration < 300 })
sleep(0.5)
}
Cost Dashboard Panels (PromQL)
# cost per model per minute (rate() is per-second, so multiply by 60)
sum by (model) (rate(inference_cost_usd_total[1m])) * 60
# projected monthly cost
sum by (model) (rate(inference_cost_usd_total[5m])) * 60 * 60 * 24 * 30
SRE Playbook: Hotfix Rollback
Trigger: p95 > 500ms or error rate > 2%
1) Helm rollback to the previous revision: helm rollback clf -n ml
2) Verify readiness and metrics within 10 minutes
3) Post incident note; begin RCA
DR Drill Script (Pseudo)
aws route53 change-resource-record-sets --hosted-zone-id Z --change-batch file://dr-failover.json
kubectl --context=dr apply -f inference.yaml
kubectl --context=dr rollout status deploy/clf -n ml --timeout=10m
Extended FAQ (401–420)
- Rollback time target? Under 2 minutes to initiate, <10 minutes to stabilize.
- Canary gates location? In CI and in the deploy controller.
- Burn alert thresholds? Based on SLO and window; start conservative.
- Feature store race conditions? Transactions or idempotent writes; TTLs.
- Schema rollback? Dual-read until the migration completes.
- Token budget enforcement? Per request and per tenant; hard stop.
- A/B pollution? Sticky bucketing; segment correctly.
- GPU credit exhaustion? Autoscale down; route to CPU; reduce batch size.
- Data exfiltration risk? NetworkPolicy egress allowlists; audit.
- How to tune HPA? Queue depth + CPU; test step loads.
- Stream vs batch trade-off? Latency vs cost; pick per route.
- Observability costs? Sample logs; aggregate metrics.
- Canary kill-switch? One flag to zero traffic to the canary.
- Post-incident reviews cadence? Within 48 hours; publish actions.
- SLIs for cost? USD/min and $/request.
- Infra cost allocation? Labels; per-tenant tags; dashboards.
- Model Zoo sprawl? Archive unused; owners accountable.
- Human override? Owners can pause deploys and gates.
- Compliance evidence? Export logs, approvals, configs; store securely.
- When to declare stable? After passing gates and one cycle with healthy SLOs.