Computer Vision Applications with OpenCV and TensorFlow (2025)
Computer vision remains foundational for automation in 2025—from document workflows and retail analytics to robotics and healthcare. This guide covers practical CV tasks, modern architectures, deployment patterns, and production considerations using OpenCV and TensorFlow.
Executive summary
- Use classical CV (OpenCV) for low-latency, deterministic pre/post-processing and simple tasks
- Use deep learning (TensorFlow) for detection/segmentation/recognition; combine with classical CV for robustness
- Focus on reproducible pipelines, drift monitoring, and hardware-aware optimization (CPU/GPU/Edge TPU)
Common tasks and solutions
Object detection (retail loss prevention, shelf analytics)
- Models: EfficientDet, YOLOv8/YOLO-NAS (via TF/TFLite or ONNX)
- Tips: anchor-free variants, small backbones for edge devices, mixed precision
# TensorFlow inference (simplified)
import tensorflow as tf
model = tf.saved_model.load("./detector")
@tf.function
def infer(img):
    # SavedModel signatures are called directly; no training flag at inference
    return model(img)
def preprocess(frame):
    x = tf.image.resize(frame, (640, 640)) / 255.0
    return tf.expand_dims(x, 0)
Semantic/instance segmentation (manufacturing defect detection)
- Models: DeepLabV3+, Mask R-CNN; export to TFLite/EdgeTPU when possible
- Metrics: mIoU, PQ; monitor per-class performance
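For reference, a minimal mIoU computation over integer class masks (predictions and ground truth of identical shape) can be sketched as:
# mIoU sketch: pred and gt are integer class masks of the same shape
import numpy as np
def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0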
OCR and document understanding (back-office automation)
- Pipeline: OpenCV binarization → text detection (DB/EAST) → recognition (CRNN/Transformer)
- Use layout models (LayoutLMv3/DocTr) for forms/tables; validate with business rules
# OpenCV pre-processing for OCR
import cv2
def preprocess_for_ocr(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    thr = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 11, 2)
    den = cv2.fastNlMeansDenoising(thr)
    return den
Tracking (people/vehicle)
- Detector + tracker (DeepSORT/ByteTrack); re-identification for handoff across cameras
- Respect privacy: anonymize faces/plates; sampling and on-device processing
Embeddings and retrieval
- Visual search: CLIP/ViT embeddings → vector DB (Qdrant/Pinecone)
- Multi-modal RAG: combine vision embeddings with text for product support or inventory search
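As a rough sketch of the embedding step, the Hugging Face CLIP checkpoint below is one option; the model name and in-memory cosine search are illustrative, and a vector DB (Qdrant/Pinecone) would store the same L2-normalized vectors:
# Hedged sketch: CLIP image embeddings + in-memory cosine search
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def embed(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()  # normalize for cosine
index = embed([Image.open(p) for p in catalog_paths])  # catalog_paths: your image list (assumption)
query = embed([Image.open("query.jpg")])
scores = index @ query.T                                # cosine similarity
top5 = np.argsort(-scores[:, 0])[:5]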
Deployment patterns
- Edge: TFLite/EdgeTPU, quantization, half-precision, fused ops
- Cloud: GPU autoscaling, batching (Triton), A/B for models
- Hybrid: run pre/post on edge, heavy model in cloud; cache results
MLOps considerations
- Dataset versioning; synthetic data augmentation
- Drift monitoring (brightness, blur, class frequency); periodic re-labeling
- Cost: pre-filter frames with classical CV; gate DL inference by motion
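A minimal motion gate with OpenCV frame differencing might look like this (thresholds are illustrative and should be tuned per camera):
# Motion gate sketch: only run the detector when enough pixels changed
import cv2
def should_infer(prev_gray, frame_bgr, min_changed_ratio=0.01):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, gray)
    changed = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)[1]
    ratio = cv2.countNonZero(changed) / changed.size
    return ratio > min_changed_ratio, gray  # return gray for the next call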
Security and privacy
- On-device redaction (blur/anonymize) before upload
- Access controls for video data; retention policies; audit logs
FAQ
Q: When is OpenCV alone sufficient?
A: When tasks are geometric or threshold-based (barcode, simple alignment, morphology). Use DL when variability is high.
Executive Summary
This production-focused guide covers end-to-end Computer Vision (CV) systems in 2025: OpenCV pipelines, modern deep models (detection, segmentation, tracking, OCR), dataset tooling, training/evaluation, and deployment to edge and cloud with monitoring, cost control, and governance.
Image I/O and Preprocessing (OpenCV)
import cv2
img = cv2.imread('image.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
resized = cv2.resize(img, (640, 640))
blur = cv2.GaussianBlur(resized, (5,5), 0)
norm = (blur/255.0).astype('float32')
Augmentation
import numpy as np
def rand_flip(img):
    return cv2.flip(img, 1) if np.random.rand() < 0.5 else img
Classical Methods
Edges and Contours
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
Morphology
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
dilated = cv2.dilate(edges, kernel, iterations=1)
Feature Extraction (ORB)
orb = cv2.ORB_create(nfeatures=1000)
kp, des = orb.detectAndCompute(img, None)
Object Detection (YOLO/SSD)
# Ultralytics YOLOv8 (example)
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model('image.jpg')
boxes = results[0].boxes.xyxy.cpu().numpy()
SSD in TensorFlow (sketch)
import tensorflow as tf
inputs = tf.keras.Input(shape=(300,300,3))
# ... SSD backbone and heads ...
model = tf.keras.Model(inputs, outputs)
Segmentation (U-Net/DeepLab)
import tensorflow as tf
def unet(input_shape=(256,256,3), num_classes=1):
    inputs = tf.keras.Input(input_shape)
    c1 = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')(inputs)
    c1 = tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same')(c1)
    p1 = tf.keras.layers.MaxPool2D()(c1)
    # ... more blocks ...
    outputs = tf.keras.layers.Conv2D(num_classes, 1, activation='sigmoid')(c1)
    return tf.keras.Model(inputs, outputs)
Tracking (KCF/CSRT/ByteTrack)
tracker = cv2.legacy.TrackerCSRT_create()
tracker.init(img, (x, y, w, h))
ok, box = tracker.update(next_frame)
# ByteTrack (pseudo)
# 1) run detector → detections
# 2) match with motion model → tracks
OCR (Tesseract/TrOCR)
tesseract image.jpg out --oem 1 --psm 6
# TrOCR via transformers
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
proc = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')
TensorFlow/Keras CNNs
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Input((224,224,3)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
PyTorch Models
import torch, torch.nn as nn
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.fc = nn.Linear(64, 10)
    def forward(self, x):
        x = self.net(x).view(x.size(0), -1)
        return self.fc(x)
Training Loop
model = SmallCNN(); opt = torch.optim.Adam(model.parameters(), 1e-3)
for x, y in loader:
    yhat = model(x)
    loss = nn.CrossEntropyLoss()(yhat, y)
    opt.zero_grad(); loss.backward(); opt.step()
Datasets and Data Loaders
from torch.utils.data import Dataset, DataLoader
class Images(Dataset):
    def __init__(self, paths): self.paths = paths
    def __len__(self): return len(self.paths)
    def __getitem__(self, i):
        img = cv2.imread(self.paths[i]); img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (224, 224)); img = img.transpose(2, 0, 1) / 255.0
        return torch.tensor(img, dtype=torch.float32), 0
loader = DataLoader(Images(paths), batch_size=32, shuffle=True)
Evaluation Metrics (IoU, mAP)
import numpy as np
def iou(boxA, boxB):
    xA, yA = max(boxA[0], boxB[0]), max(boxA[1], boxB[1])
    xB, yB = min(boxA[2], boxB[2]), min(boxA[3], boxB[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    a = (boxA[2]-boxA[0]) * (boxA[3]-boxA[1]); b = (boxB[2]-boxB[0]) * (boxB[3]-boxB[1])
    return inter / (a + b - inter + 1e-9)
# mAP sketch: compute precision-recall per class and average AP
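To make that sketch concrete, a hedged all-point AP for a single class could look like the following; scores, is_tp, and num_gt come from your own detection-to-ground-truth matching, and COCO-style mAP additionally averages over classes and IoU thresholds:
import numpy as np
def average_precision(scores, is_tp, num_gt):
    order = np.argsort(scores)[::-1]          # sort detections by confidence
    tps = np.array(is_tp, dtype=bool)[order]
    tp, fp = np.cumsum(tps), np.cumsum(~tps)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # pad, enforce monotonically decreasing precision, integrate over recall
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))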
Deployment: ONNX/TensorRT/TFLite
import onnxruntime as ort
sess = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider','CPUExecutionProvider'])
outputs = sess.run(None, { 'input': input_array })
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tl = converter.convert()
KServe/Triton Serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: vision, namespace: ml }
spec:
  predictor:
    triton:
      storageUri: s3://bucket/models/vision
      runtimeVersion: "23.09"
      resources: { limits: { nvidia.com/gpu: 1 } }
REST/gRPC APIs
from fastapi import FastAPI, UploadFile
import numpy as np
import cv2
app = FastAPI()
@app.post('/predict')
async def predict(file: UploadFile):
    img = cv2.imdecode(np.frombuffer(await file.read(), np.uint8), cv2.IMREAD_COLOR)
    # preprocess → run model → postprocess
    return {'boxes': [], 'masks': []}
Monitoring (Prometheus/OTEL)
import client from 'prom-client'
const latency = new client.Histogram({ name: 'cv_latency_seconds', help: 'latency', buckets: [0.01,0.05,0.1,0.2,0.5,1] })
span.setAttributes({ 'model': 'yolov8n', 'res': '640x640', 'ttft_ms': 42 })
Dashboards and Alerts
histogram_quantile(0.95, sum(rate(cv_latency_seconds_bucket[5m])) by (le))
groups:
  - name: cv
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(cv_latency_seconds_bucket[5m])) by (le)) > 0.3
        for: 10m
        labels: { severity: page }
Security and Privacy
# Face blurring
faces = detector.detectMultiScale(gray, 1.3, 5)
for (x, y, w, h) in faces:
    roi = img[y:y+h, x:x+w]
    img[y:y+h, x:x+w] = cv2.GaussianBlur(roi, (51, 51), 0)
MLOps for CV (Airflow/Dagster)
# Airflow DAG for data refresh → train → eval → deploy
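A minimal Airflow sketch of that DAG might look like this; the Python callables are placeholders for your own pipeline steps:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def refresh_data(): ...
def train(): ...
def evaluate(): ...
def deploy(): ...
with DAG(dag_id="cv_retrain", start_date=datetime(2025, 1, 1),
         schedule_interval="@weekly", catchup=False) as dag:
    t1 = PythonOperator(task_id="refresh_data", python_callable=refresh_data)
    t2 = PythonOperator(task_id="train", python_callable=train)
    t3 = PythonOperator(task_id="evaluate", python_callable=evaluate)
    t4 = PythonOperator(task_id="deploy", python_callable=deploy)
    t1 >> t2 >> t3 >> t4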
Call to Action
Need help building and deploying CV systems? We design pipelines, train models, and ship production-ready services with monitoring and privacy.
Extended FAQ (1–120)
- How to improve detection speed? Smaller models, lower input resolution, TensorRT.
- Best augmentation for detection? Random crop/flip, mosaic, color jitter.
- How to annotate fast? Use tools (CVAT, Label Studio); keyboard shortcuts.
- IoU threshold for mAP? Common: 0.5 and 0.5:0.95.
- When to use segmentation? Precise shape/area; instance vs semantic.
- Trackers choice? CSRT for accuracy; KCF for speed; ByteTrack for SOTA with detections.
- Batch size tuning? Max without OOM; watch p95.
- Edge device constraints? Quantize; prune; smaller inputs.
- How to handle blur/low light? Denoise, gamma correction, fine-tune on similar data.
- Camera calibration? Use chessboard patterns; compute intrinsics.
More practical questions on datasets, training, evaluation, deployment, monitoring, and privacy are covered in the extended FAQ sections below.
Datasets and Labeling
# COCO format directory structure
train/
  images/
  annotations/instances_train.json
val/
  images/
  annotations/instances_val.json
# Label Studio quickstart
docker run -it -p 8080:8080 heartexlabs/label-studio:latest
# CVAT (server)
docker compose up -d
Augmentation with Albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2
train_tfms = A.Compose([
    A.RandomResizedCrop(640, 640, scale=(0.6, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(0.2, 0.2, 0.2, 0.1, p=0.5),
    A.MotionBlur(p=0.2),
    A.Normalize(),
    ToTensorV2()
], bbox_params=A.BboxParams(format='yolo', label_fields=['labels']))
tf.data Pipelines
import tensorflow as tf
def parse(example):
    img = tf.io.read_file(example['image_path'])
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (640, 640))
    img = tf.cast(img, tf.float32) / 255.0
    return img, example['labels']
ds = tf.data.Dataset.from_tensor_slices(records).map(parse, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(tf.data.AUTOTUNE)
PyTorch Lightning Training (Detection)
import pytorch_lightning as pl
class DetModule(pl.LightningModule):
    def __init__(self, model):
        super().__init__(); self.model = model
    def training_step(self, batch, _):
        x, y = batch; out = self.model(x); loss = out['loss']
        self.log('train/loss', loss); return loss
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), 1e-4)
Segmentation Training (Lightning)
class SegModule(pl.LightningModule):
    def __init__(self, net): super().__init__(); self.net = net
    def training_step(self, batch, _):
        x, y = batch; yhat = self.net(x); loss = dice_bce_loss(yhat, y)
        self.log('train/loss', loss); return loss
mAP Computation over Dataset
# COCO mAP via pycocotools
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
cocoGt = COCO('instances_val.json')
cocoDt = cocoGt.loadRes('results.json')
eval = COCOeval(cocoGt, cocoDt, 'bbox'); eval.evaluate(); eval.accumulate(); eval.summarize()
Benchmarking Scripts
import time
import numpy as np
lat = []
for img in batch_images:
    t0 = time.time(); model(img); lat.append(time.time() - t0)
print('p95', np.percentile(lat, 95))
Video Ingestion (GStreamer)
gst-launch-1.0 filesrc location=input.mp4 ! decodebin ! videoconvert ! videoscale ! video/x-raw,width=640,height=640 ! appsink
Multi-Object Tracking: SORT/DeepSORT/ByteTrack
# SORT skeleton
tracks = []
for frame in frames:
    dets = detect(frame)
    tracks = associate(tracks, dets)  # IOU matching + Kalman predict/update
# DeepSORT uses appearance embeddings for robust matching
# ByteTrack — keep low-score dets for association; improves recall
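To make the associate() step concrete, a hedged greedy IoU association is sketched below; it assumes tracks and detections are dicts with a 'box' key in xyxy format, and omits Kalman predict/update and track lifecycle handling:
def iou_xyxy(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (area + 1e-9)
def greedy_associate(tracks, dets, iou_thr=0.3):
    matches, unmatched = [], list(range(len(dets)))
    for ti, trk in enumerate(tracks):
        best, best_iou = None, iou_thr
        for di in unmatched:
            v = iou_xyxy(trk["box"], dets[di]["box"])
            if v > best_iou:
                best, best_iou = di, v
        if best is not None:
            matches.append((ti, best)); unmatched.remove(best)
    return matches, unmatched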
Export and Quantization
# ONNX dynamic quantization (post-training)
from onnxruntime.quantization import quantize_dynamic
quantize_dynamic('model.onnx', 'model.int8.onnx', optimize_model=True)
# Post-Training Quantization (PTQ) for TFLite
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_ds
int8 = converter.convert()
TensorRT Pipelines
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 --workspace=4096 --shapes=input:1x3x640x640
Advanced KServe: Explainer and Transformer
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: vision, namespace: ml }
spec:
  predictor:
    triton: { storageUri: s3://models/vision }
  explainer:
    alibi: { type: AnchorImages }
  transformer:
    containers:
      - image: registry/vision-preprocess:1.0
Airflow/Dagster Pipelines
# Airflow DAG: ingest → augment → train → eval → deploy
# Dagster job: retrain on data drift
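A minimal Dagster job sketch for the retrain path could look like this; the op bodies are placeholders for your own steps:
from dagster import job, op
@op
def ingest():
    return "dataset-ref"
@op
def train(dataset):
    return "model-ref"
@op
def evaluate(model):
    return {"map": 0.42}
@op
def deploy(metrics):
    pass
@job
def retrain_on_drift():
    deploy(evaluate(train(ingest())))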
Helm Values / Terraform Infra
# values.yaml (gpu deployment)
resources:
  limits: { nvidia.com/gpu: 1, cpu: 2, memory: 8Gi }
  requests: { cpu: 1, memory: 4Gi }
nodeSelector: { gpu: "true" }
resource "aws_eks_node_group" "gpu" {
  instance_types = ["g5.xlarge"]
  scaling_config {
    desired_size = 2
    max_size     = 6
    min_size     = 1
  }
}
Grafana Dashboards / PromQL
# p95 latency
histogram_quantile(0.95, sum(rate(cv_latency_seconds_bucket[5m])) by (le))
# FPS by model
sum by (model) (rate(frames_processed_total[1m]))
Alerting and Runbooks
groups:
  - name: cv-ops
    rules:
      - alert: FrameDrop
        expr: rate(frames_processed_total[1m]) < 10
        for: 10m
        labels: { severity: ticket }
Runbook: FrameDrop
- Check input pipeline (GStreamer)
- Verify GPU utilization and batch size
- Restart transformer pod; warm cache
Extended FAQ (121–260)
- mAP vs F1? Use mAP for detection; F1 for simple classification.
- Label imbalance? Class-balanced sampling; loss weighting.
- Long-tailed classes? Focal loss; re-sampling; fine-tuning.
- Video vs image performance? Batch across frames; reuse pre-processing.
- NMS tuning? IoU threshold and score threshold sweeps.
- Small object detection? Higher input res; anchor tuning; specialized models.
- Panoptic segmentation? Combine instance + semantic outputs.
- Tracking drift? Periodic re-detection; appearance features.
- OCR accuracy? Binarization, deskewing, language models.
- GPU memory OOM? Smaller batch; FP16; gradient checkpointing.
- Dataset versioning? DVC/LakeFS; record hashes.
- Synthetic data? Good for rare classes; label clearly.
- Augmentations too strong? Ablation study; dial back.
- Calibration? Temperature scaling per class.
- Edge camera streams? RTSP; pre-process on device.
- Privacy laws? Blur faces/plates; consent.
- Model zoo sprawl? Registry and owners.
- FPS targets? Define by route; test p95.
- Profiling? Nsight Systems, PyTorch profiler.
- Canary deploys? Shadow first; then small %.
- Data drift detection? KS test on features; alert.
- Retraining cadence? Monthly or on drift.
- Transform latency? Optimize I/O and color conversions.
- Normalize color spaces? Consistent BGR/RGB handling.
- Metadata in outputs? Include confidence and class ids.
- Batch inference? Yes, on GPU for throughput.
- PTQ vs QAT? QAT better accuracy; PTQ faster to ship.
- TensorRT DLA? Use if available for offloading.
- Mixed precision? FP16; validate accuracy.
- Post-processing CPU bound? Vectorize; C++/Rust kernels.
Pose Estimation (MediaPipe/OpenPose)
# MediaPipe Pose
import cv2, mediapipe as mp
mp_pose = mp.solutions.pose
with mp_pose.Pose(static_image_mode=False, model_complexity=1, enable_segmentation=False) as pose:
    cap = cv2.VideoCapture(0)
    while True:
        ok, f = cap.read()
        if not ok: break
        f_rgb = cv2.cvtColor(f, cv2.COLOR_BGR2RGB)
        res = pose.process(f_rgb)
        if res.pose_landmarks:
            for lm in res.pose_landmarks.landmark:
                x, y = int(lm.x * f.shape[1]), int(lm.y * f.shape[0])
                cv2.circle(f, (x, y), 2, (0, 255, 0), -1)
        cv2.imshow('pose', f)
        if cv2.waitKey(1) == 27: break
Keypoint Detection (HRNet)
# pseudo: HRNet inference for keypoints
def infer_keypoints(img):
    tensor = preprocess(img)
    heatmaps = hrnet(tensor)
    kpts = decode_peaks(heatmaps)
    return kpts
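One possible decode_peaks, assuming heatmaps arrive as a NumPy array shaped (num_keypoints, H, W), is a simple per-channel argmax scaled back to the network input resolution:
import numpy as np
def decode_peaks(heatmaps, input_size=(256, 192)):
    K, H, W = heatmaps.shape
    kpts = []
    for k in range(K):
        idx = np.argmax(heatmaps[k])
        y, x = divmod(idx, W)
        conf = float(heatmaps[k, y, x])
        # scale heatmap coordinates back to the network input resolution (H, W)
        kpts.append((x * input_size[1] / W, y * input_size[0] / H, conf))
    return kpts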
Re-Identification (ReID)
# extract embeddings and compare via cosine similarity
emb1 = reid_model(crop1); emb2 = reid_model(crop2)
sim = (emb1 @ emb2.T) / (np.linalg.norm(emb1)*np.linalg.norm(emb2))
Multi-Camera Tracking
# synchronize timestamps, project to common plane, fuse tracks by reID + geometry
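A hedged sketch of the fusion step: per-camera homographies (assumed pre-computed from calibration) map track footpoints onto a shared ground plane, and tracks are merged when they are close in that plane and similar in appearance:
import cv2
import numpy as np
def to_ground(points_xy, H_cam):
    # H_cam: 3x3 homography from image footpoints to the shared ground plane
    pts = np.asarray(points_xy, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H_cam).reshape(-1, 2)
def fuse(track_a, track_b, H_a, H_b, max_dist_m=1.0, min_reid_sim=0.5):
    ga = to_ground([track_a["foot_xy"]], H_a)[0]
    gb = to_ground([track_b["foot_xy"]], H_b)[0]
    dist = np.linalg.norm(ga - gb)
    sim = float(track_a["emb"] @ track_b["emb"])  # assumes L2-normalized reID embeddings
    return dist < max_dist_m and sim > min_reid_sim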
3D Vision: Stereo Depth, SfM, COLMAP
# COLMAP pipeline
echo "feature_extractor" && colmap feature_extractor --database_path db.db --image_path images
colmap exhaustive_matcher --database_path db.db
mkdir -p sparse && colmap mapper --database_path db.db --image_path images --output_path sparse
colmap image_undistorter --image_path images --input_path sparse/0 --output_path dense --output_type COLMAP
colmap patch_match_stereo --workspace_path dense --PatchMatchStereo.geom_consistency true
colmap stereo_fusion --workspace_path dense --output_path dense/fused.ply
Camera Calibration and Rectification
import cv2, numpy as np
objp = np.zeros((6*7,3), np.float32); objp[:,:2] = np.mgrid[0:7,0:6].T.reshape(-1,2)
objpoints, imgpoints = [], []
for fname in images:
    img = cv2.imread(fname); gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, corners = cv2.findChessboardCorners(gray, (7,6), None)
    if ret:
        objpoints.append(objp); imgpoints.append(corners)
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)
# rectify stereo
R1,R2,P1,P2,Q,roi1,roi2 = cv2.stereoRectify(K1,D1,K2,D2,image_size,R,T)
Geometric Vision (PnP/Essential)
# PnP
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
# Essential matrix
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
Video Pipelines (FFmpeg/GStreamer)
ffmpeg -i input.mp4 -vf scale=640:-1 -r 30 -c:v libx264 -preset fast -crf 22 out.mp4
gst-launch-1.0 rtspsrc location=rtsp://camera ! rtph264depay ! avdec_h264 ! videoconvert ! appsink
Edge Deployment: NVIDIA Jetson
# Dockerfile.jetson
FROM nvcr.io/nvidia/l4t-ml:r35.3.1-py3
RUN python3 -m pip install --no-cache-dir onnxruntime-gpu==1.17.0 opencv-python-headless
COPY app.py /app/app.py
CMD ["python3","/app/app.py"]
# TensorRT conversion via trtexec
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 --workspace=4096
Mobile: TFLite/NNAPI/CoreML
# Android NNAPI delegate
import tensorflow as tf
interpreter = tf.lite.Interpreter(model_path='model.tflite', experimental_delegates=[tf.lite.experimental.load_delegate('libnnapi_delegate.so')])
// iOS CoreML
let model = try! MyVisionModel(configuration: MLModelConfiguration())
Web: onnxruntime-web/WebGL
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script>
(async () => {
  const session = await ort.InferenceSession.create('/model.onnx', { executionProviders: ['webgl'] })
  const input = new ort.Tensor('float32', new Float32Array(3*224*224), [1,3,224,224])
  const out = await session.run({ input })
  console.log(out)
})()
</script>
Synthetic Data (Blender)
# blender_python.py (run with blender --python blender_python.py)
import bpy, random
# Load scene, randomize lights/materials/camera; render dataset
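A minimal bpy sketch under those assumptions (an already-assembled scene with the default 'Camera' object; the camera ranges and output paths are illustrative):
# Randomize the camera and render N frames from an existing scene
import bpy, random
cam = bpy.data.objects["Camera"]
for i in range(100):
    cam.location = (random.uniform(-2, 2), random.uniform(-4, -2), random.uniform(1, 2))
    bpy.context.scene.render.filepath = f"/tmp/render_{i:04d}.png"
    bpy.ops.render.render(write_still=True)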
Dataset QA/Curation
import json
bad = []
for ln in open('labels.jsonl'):
    o = json.loads(ln)
    if any(c for c in o['boxes'] if c['x2'] <= c['x1'] or c['y2'] <= c['y1']):
        bad.append(o['id'])
print('invalid boxes', len(bad))
Hyperparameter Sweeps
for lr in [1e-4, 2e-4, 5e-4]:
    for bs in [16, 32]:
        run = train(lr=lr, batch_size=bs)
        log({'lr': lr, 'bs': bs, 'mAP': run.map})
Distributed Training (DDP)
python -m torch.distributed.run --nproc_per_node=4 train.py --epochs 50 --batch 16
# train.py (DDP init)
import torch.distributed as dist
dist.init_process_group(backend='nccl')
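A hedged continuation of the DDP setup, showing per-process device selection, model wrapping, and a distributed sampler (SmallCNN is reused from above; dataset stands in for your own Dataset):
import os
import torch
local_rank = int(os.environ["LOCAL_RANK"])      # set by torch.distributed.run
torch.cuda.set_device(local_rank)
model = SmallCNN().cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, sampler=sampler)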
Mixed Precision
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    with torch.cuda.amp.autocast():
        yhat = model(x); loss = criterion(yhat, y)
    scaler.scale(loss).backward(); scaler.step(opt); scaler.update(); opt.zero_grad()
Detectron2/MMDetection
# Detectron2 training
from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
cfg = get_cfg(); cfg.merge_from_file('configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml')
cfg.DATASETS.TRAIN = ('my_train',); cfg.DATASETS.TEST = ('my_val',)
trainer = DefaultTrainer(cfg); trainer.resume_or_load(resume=False); trainer.train()
# MMDetection config snippet
model = dict(type='FasterRCNN', backbone=dict(type='ResNet', depth=50))
DeepLabv3+ / Mask R-CNN Training
# DeepLabv3+ in PyTorch
import torchvision
model = torchvision.models.segmentation.deeplabv3_resnet50(num_classes=21)
KServe Transformer/Explainer
# transformer.py
from kserve import Model, ModelServer
class PrePost(Model):
    def __init__(self, name: str): super().__init__(name)
    async def preprocess(self, payload: dict):
        # decode base64 images
        return payload
    async def postprocess(self, infer_output: dict):
        # NMS, thresholding, formatting
        return infer_output
ModelServer().start(models=[PrePost('vision')])
# explainer (alibi anchor image)
gRPC Client
import grpc
# create stub, send image bytes, receive detections
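A hedged client sketch; vision_pb2, vision_pb2_grpc, and the Detect RPC are hypothetical stubs generated from your own .proto, and the endpoint is illustrative:
import grpc
import vision_pb2, vision_pb2_grpc  # hypothetical generated modules
channel = grpc.insecure_channel("inference:8001")
stub = vision_pb2_grpc.DetectorStub(channel)
with open("image.jpg", "rb") as f:
    resp = stub.Detect(vision_pb2.DetectRequest(image=f.read()), timeout=1.0)
print(resp.detections)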
Observability (OTEL Spans/Metrics)
span.addEvent('preprocess', { ms: 5 })
span.addEvent('inference', { ms: 18, provider: 'triton' })
span.addEvent('postprocess', { ms: 7 })
const fps = new client.Gauge({ name: 'cv_fps', help: 'frames/sec', labelNames: ['model'] })
Cost Models (GPU Hours)
model,resolution,batch,throughput_fps,gpu,usd_per_hour,usd_per_million_frames
yolov8n,640,1,45,RTXA5000,2.30,51.1
SRE Runbooks (Expanded)
Latency Spike
- Verify input stream and decode
- Check GPU utilization and thermals
- Reduce resolution; switch to FP16 engine
Accuracy Drop
- Inspect data drift; re-run eval; rollback engine
Extended FAQ (261–420)
- Batch across cameras? Yes: micro-batching per tick; beware re-ordering.
- Memory fragmentation? Pre-allocate buffers; reuse tensors.
- JPEG decode cost? NVJPEG or hardware decoders.
- Camera sync? PTP/NTP and timestamp alignment.
- Tracker ID switches? Use reID embeddings and smoothing.
- PTZ cameras? Re-detect on big motions; re-calibrate.
- Fog/rain? Train on adverse conditions; dehaze filters.
- Thermal cameras? Different preprocessing; normalize ranges.
- On-device privacy? Blur before transmit; store hashes only.
- Long videos? Chunk processing; checkpoint.
- Multi-stream on Jetson? DeepStream pipelines.
- Quantization pitfalls? Calibrate with representative data.
- Segmentation holes? Morphology close; CRF postprocess.
- OCR multilingual? Load language packs; fallback models.
- Edge storage? Circular buffers; retention policies.
- Model drift alerts? IoU drop; false positive rates.
- Web streaming? WebRTC for low latency.
- CDN for models? Versioned artifacts; cache.
- Security hardening? Non-root, read-only FS, egress allowlists.
- Offline inference? Queue results; sync later.
- Class imbalance? Focal loss; weighted sampling.
- Latency budget? Split preprocess/infer/post; profile.
- CPU fallback? Yes, for small models; warn users.
- Mixed resolutions? Resize consistently; pad letterbox.
- DDP pitfalls? Sync BN; gradient accumulation.
- Heatmaps visualization? Color maps; overlays.
- Fine-tuning schedule? Lower LR; freeze backbone initially.
- Export sanity? Run parity tests ONNX vs native.
- Runtime choices? ORT/TensorRT/OpenVINO; benchmark.
- Camera dropouts? Timeouts and reconnection logic.
- GStreamer caps? Match formats to avoid conversions.
- Time to first frame? Warm engines; cache.
- Annotation drift? Periodic QA; annotator training.
- Bounding box encoding? Consistent formats; convert utilities.
- Model registry? Tags, owners, changelogs.
- Canary gates? mAP >= baseline-1%, latency p95 < 350ms.
- False positives? Hard negative mining.
- False negatives? Augment; adjust thresholds.
- Confusion classes? Merge or separate with more data.
- When done? Stable SLOs; incidents trending down.
Face Recognition (ArcFace/FaceNet) Embeddings
# Face embedding extraction (FaceNet-like)
import cv2
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
# facenet: a pretrained face-embedding network assumed to be loaded elsewhere
pre = transforms.Compose([transforms.Resize((160,160)), transforms.ToTensor(), transforms.Normalize([0.5]*3, [0.5]*3)])
def embed_face(img_bgr):
    img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    ten = pre(Image.fromarray(img)).unsqueeze(0).cuda()
    with torch.no_grad():
        vec = facenet(ten)
    return torch.nn.functional.normalize(vec, dim=1).cpu().numpy()
# Verify identity by cosine similarity
sim = (e1 @ e2.T) / (np.linalg.norm(e1) * np.linalg.norm(e2))
if sim > 0.6: print('match')
Attribute Classification (Age/Gender/Helmet/PPE)
# multi-head classifier
class AttrNet(nn.Module):
    def __init__(self, backbone):
        super().__init__(); self.backbone = backbone
        self.head_gender = nn.Linear(512, 2)
        self.head_helmet = nn.Linear(512, 2)
    def forward(self, x):
        f = self.backbone(x)
        return {'gender': self.head_gender(f), 'helmet': self.head_helmet(f)}
Industrial Inspection Pipelines (Defect Detection)
# classical: background subtraction + morphology + contour area thresholds
# deep: segmentation of defects (U-Net), report percentages per ROI
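A classical sketch of that pipeline, differencing against a golden reference image (thresholds are illustrative and should be tuned per part/ROI):
import cv2
def find_defects(img_bgr, reference_bgr, min_area=50):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, ref)
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]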
Anomaly Detection with Autoencoders
class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3,16,3,2,1), nn.ReLU(), nn.Conv2d(16,32,3,2,1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32,16,4,2,1), nn.ReLU(), nn.ConvTranspose2d(16,3,4,2,1), nn.Sigmoid())
    def forward(self, x):
        z = self.enc(x); return self.dec(z)
# score = reconstruction error
with torch.no_grad():
    y = ae(x)
    score = ((x - y)**2).mean(dim=[1,2,3])
Active Learning Loops
# uncertainty sampling: pick low-confidence detections for manual labeling
uncertain = [sample for sample in pool if max(model.predict_proba(sample)) < 0.6]
label_queue.extend(uncertain[:100])
Loop: infer → select uncertain → label → retrain → eval → deploy
Semi-Supervised / Self-Training
# pseudo-labels with confidence threshold
y_hat = model(x_unlabeled)
conf = y_hat.max(1).values
mask = conf > 0.9
train_ds.extend(list(zip(x_unlabeled[mask], y_hat[mask].argmax(1))))
Online Evaluation and Probes
// synthetic probes to verify pipeline health
cron.schedule('*/10 * * * *', async () => {
  const img = await fetchSample()
  const t0 = Date.now(); const out = await predict(img)
  metrics.observe('cv_latency_seconds', (Date.now() - t0) / 1000)
  metrics.inc('cv_success_total')
})
CI/CD for CV Models
name: cv-ci
on: [push]
jobs:
  train-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.10' }
      - run: pip install -r requirements.txt
      - run: python tools/export_onnx.py --weights runs/best.pt --out model.onnx
      - run: python tools/eval_map.py --ann val.json --pred results.json --gate map>=0.40
      - uses: actions/upload-artifact@v4
        with: { name: model, path: model.onnx }
  deploy:
    needs: train-eval
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with: { name: model, path: model }
      - run: helm upgrade --install vision charts/vision -f values/prod.yaml --wait
Edge Fleet Management (Jetson OTA)
# Mender/OTA example (concept)
mender-artifact write module-image -T docker -n vision-1.2.0 -t jetson-xavier -o vision.mender -f vision.tar
- Tag devices by site and model version
- Roll out in waves; auto-rollback on failure
Privacy and Compliance SOPs (CV)
- Blur faces and license plates by default in public contexts
- Store only derived features (hashes) when possible
- Retention: raw frames < 24h unless legally required
- Access controls and immutable audit trails
ROI and Cost Modeling
scenario,cameras,resolution,fps,gpu_nodes,cost_usd_month
warehouse,120,1280x720,15,4,8200
retail,40,1920x1080,30,3,6100
- Optimize by lowering resolution/fps where acceptable
- Batch inference and share GPUs across streams
Extended Runbooks
Tracker Instability
- Increase detection frequency; adjust IOU thresholds
- Enable reID embeddings; smooth trajectories
OCR Errors
- Improve binarization/deskew; language models; whitelist fonts
Extended FAQ (421–520)
- Calibration for detectors? Temperature scaling on logits; per class.
- Drastic lighting changes? Auto exposure and training on varied conditions.
- Rolling shutter artifacts? Use global shutter cameras if critical.
- Label noise? Consensus labeling; noise-robust losses.
- Cross-domain transfer? Fine-tune on small target set; domain adaptation.
- Multi-label classification? Sigmoid outputs; threshold per class.
- Coordinate systems? Normalize to image size; document.
- Serialization format? COCO JSON/TFRecord for scalability.
- Post-processing speed? Vectorize NMS; CUDA kernels if needed.
- GPU watchdog timeouts? Split batches; check long kernels.
- Gaps in video? Interpolate; flag missing for SRE.
- Snow/rain occlusion? Augment and use deweathering nets.
- Edge vs cloud trade-offs? Latency/privacy vs scale/flexibility.
- Night vision? IR lights; specialized models.
- Pose skeleton smoothing? Temporal filters; Kalman.
- Upscaling? ESRGAN; cost vs benefit.
- 3D reconstruction scale? Need known baseline or scale constraints.
- Camera vignetting? Calibrate and correct.
- Rolling code updates? Feature flags and staged rollouts.
- Dataset sprawl? Registry with hashes and owners.
- Target FPS KPI? Define per route; validate.
- Long-term storage? Compressed metadata and event clips.
- Blur performance? GPU filters and ROI-only blurs.
- Object sizes? Multi-scale anchors; FPN backbones.
- Drone footage? High-motion augmentations; stabilization.
- Water reflections? Polarizers; training data variety.
- Latency telemetry? TTFT and per-stage metrics.
- Hash-based tracking? Visual hashing; caution on collisions.
- License compliance? Track models/datasets licenses.
- Final check? Healthy SLOs, costs in bounds, privacy enforced.