GitOps with Argo CD and Flux: Deployment Strategies (2025)
GitOps operationalizes Kubernetes changes via pull requests, automation, and reconciliation. This guide focuses on real repo layouts and deployment patterns.
Executive summary
- Separate app and infra repos; environment overlays with Kustomize/Helm
- Use progressive delivery (canary/blue‑green) and health checks
- Implement multi‑tenant RBAC and SSO; audit everything
Repo layouts
apps/
  service-a/
    base/
    overlays/
      dev/
      staging/
      prod/
clusters/
  prod-us-east/
  staging-eu/
Argo CD patterns
- App of Apps; auto‑sync with PR gates; image updater; health checks
Flux patterns
- GitRepository + Kustomization; image automation; alerts
Progressive delivery
- Argo Rollouts; canaries with metrics; abort on SLO breach
Ops playbook
- PR merges trigger sync; rollbacks via git revert; freeze windows
FAQ
Q: Single or multiple repos?
A: Separate app config from cluster config; use multiple repos for clear ownership and security boundaries.
1) Why GitOps
- Declarative infra and apps
- Auditable change history
- Automated convergence with continuous reconciliation
2) Core Tools
- Argo CD: pull-based deploys, app-of-apps, health/sync policies
- Flux: source-controller, kustomize-controller, helm-controller, image-automation
3) Repo Layouts
- app repos (code) vs env repos (manifests)
- monorepo with apps/ and clusters/; or multi-repo per team
4) Kustomize Base/Overlays
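# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # `patches` supersedes `patchesStrategicMerge`, which is deprecated in Kustomize v5
  - path: replica-patch.yaml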
5) Helm with Argo/Flux
# Argo CD Application sourcing a Helm chart (spec fragment)
spec:
  source:
    repoURL: https://charts.example.com
    chart: api
    targetRevision: 1.2.3
    helm:
      values: |
        replicaCount: 3
6) App of Apps (Argo CD)
# root Application creates child Applications per namespace or domain
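As a rough illustration of the pattern, a root Application can point at a directory that contains only child Application manifests. This is a minimal sketch: the repository URL, path, and names below are placeholders chosen for the example, not values from a real setup.

```yaml
# Root Application (app-of-apps). All names, URLs, and paths are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/env.git
    targetRevision: main
    path: clusters/prod-us-east/apps   # each file in this directory is a child Application
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd                  # child Applications are created in the Argo CD namespace
  syncPolicy:
    automated:
      prune: true                      # removing a child manifest removes the child Application
      selfHeal: true
```

Because pruning the root can cascade to every child, it is worth restricting who can edit this one manifest.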
7) Drift Detection and Sync Policies
- automated sync; self-heal; prune on
- ignore fields (last-applied, annotations) via resource customizations
8) Sync Waves and Hooks
# wave 0: crds; wave 1: controllers; wave 2: workloads
# hooks: PreSync db migrations; PostSync smoke checks
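In Argo CD, waves and hooks are expressed as annotations on ordinary manifests. The sketch below assumes a hypothetical `web` namespace and a hypothetical migration image and command; only the annotation keys are Argo CD's own.

```yaml
# Wave 0: applied before anything with a higher wave number.
# A Namespace is used here only because it is a complete manifest; CRDs take the same annotation.
apiVersion: v1
kind: Namespace
metadata:
  name: web
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# PreSync hook: runs before the main sync and is deleted once it succeeds.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  namespace: web
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/org/web-migrations:1.2.3   # hypothetical image
          command: ["/app/migrate", "up"]           # hypothetical command
```

A PostSync smoke check follows the same shape with `argocd.argoproj.io/hook: PostSync`.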
9) Image Automation (Flux)
# ImageRepository + ImagePolicy + ImageUpdateAutomation
10) Progressive Delivery
- Argo Rollouts canary/blue-green; SMI/Linkerd/Istio; analysis templates
- Flux with Flagger for canaries; metric-based promotion
11) RBAC and Tenancy
- ArgoCD projects; per-namespace access; repo and cluster whitelists
- Flux per-tenant Git sources and Kustomizations; namespace isolation
12) Secrets Management
- SOPS (age) + KMS/Key Vault; sealed-secrets; external-secrets operator
- Policy to forbid plaintext secrets; CI check
13) Policy as Code
- OPA Gatekeeper/Kyverno: block privileged, enforce labels, limits
- ArgoCD: denylist via resource exclusions; admission policies
14) Compliance and Audit
- Signed commits/tags; verify provenance; SBOM for manifests
- Git history = audit; PR templates with risk/rollback
15) Disaster Recovery
- Bootstrap scripts for control plane + Argo/Flux; backup etcd/secrets
- Cluster API or infra-as-code to recreate clusters; restore Git state
16) Observability
- Dashboards: sync status, drift rate, rollout success, error budgets
- Alerts: sync failures, degraded health, analysis failures
17) Git Strategy
- trunk-based with PRs; env branches (dev/stage/prod); promotion via PR merges
- cherry-pick vs rebase policies; protected branches
18) Environments and Promotion
- Overlays per env; version pins; promotion bot PRs with diffs
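One way to keep promotion diffs small is to pin the image in each overlay with the Kustomize `images` transformer, so a promotion PR changes a single line. The image name and tag below are illustrative.

```yaml
# overlays/prod/kustomization.yaml (illustrative values)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: ghcr.io/org/web   # image name as written in the base manifests
    newTag: "1.10.4"        # the only line a promotion PR needs to touch
```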
19) Monorepo vs Multirepo
- Monorepo: easier refactors, shared tooling; requires ownership discipline
- Multirepo: isolation, clearer boundaries; higher overhead
20) Example Templates
# ArgoCD Application template per service/environment
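A per-service, per-environment template might look like the sketch below. The project, repository, and namespace names are assumptions for illustration; the retry values are arbitrary starting points rather than recommendations.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-a-prod
  namespace: argocd
spec:
  project: team-a                       # hypothetical AppProject
  source:
    repoURL: https://git.example.com/platform/apps.git
    targetRevision: main
    path: apps/service-a/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: service-a
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3
      backoff:
        duration: 10s
        factor: 2
        maxDuration: 3m
    syncOptions:
      - CreateNamespace=true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas                # e.g. when an HPA owns the replica count
```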
21) Troubleshooting
- Out-of-sync loop: check RBAC, webhooks, finalizers
- Hooks stuck: inspect events; controller logs; prune policies
- Image automation not updating: tag filter, policy range, permissions
22) Mega FAQ (1–400)
Q: Pull vs push?
A: Pull improves security and audit; controllers converge continuously.
Q: Secrets in Git?
A: Encrypt with SOPS or manage out-of-band via ExternalSecrets.
Q: Multi-cluster?
A: ArgoCD ApplicationSet or Flux Fleet patterns; per-cluster overlays.
Q: Progressive delivery?
A: Argo Rollouts or Flagger; metric guards; automatic rollback.
...
Published: 2025-10-28. Author: Elysiate.
Appendix A — Argo CD Fundamentals
- Applications, Projects, Repositories, Cluster credentials
- Health/sync policies: Automated with prune + selfHeal; retry backoff
- Resource customizations: ignore diff for volatile fields
Appendix B — Flux Fundamentals
- SourceController: GitRepository/HelmRepository/Bucket
- KustomizeController: Kustomization with interval and dependsOn
- HelmController: HelmRelease with values, rollback, test hooks
- Image Automation: ImageRepository, ImagePolicy, ImageUpdateAutomation
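To make the source and kustomize controllers concrete, here is a minimal pairing of a GitRepository with a Kustomization that applies one path from it. The URL, names, and path are placeholders.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: env
  namespace: flux-system
spec:
  interval: 1m                 # how often to poll the repository
  url: https://git.example.com/platform/env.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m                 # how often to re-apply and correct drift
  sourceRef:
    kind: GitRepository
    name: env
  path: ./clusters/prod-us-east
  prune: true                  # delete objects that were removed from Git
  wait: true                   # report Ready only once applied objects are healthy
  timeout: 2m
```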
Appendix C — App-of-Apps Patterns
- Single root owning env folders; team roots per domain; cluster roots for infra
- Pros: centralized visibility; Cons: blast radius if misconfigured
- Guard with Projects/Namespaces and restrictive RBAC
Appendix D — Multi-Cluster and Fleet
- ArgoCD ApplicationSet: Cluster generator; per-cluster overlays
- Flux Fleet: Git sources per cluster; Kustomizations scoped by selectors
- Bootstrap via GitOps: install the controllers first, then sync everything else declaratively
Appendix E — Progressive Delivery (Rollouts/Flagger)
# Argo Rollouts Canary example (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 60 }   # seconds
        - analysis:
            templates:
              - templateName: web-metrics   # analysis steps reference templates by templateName
        - setWeight: 50
        - pause: { duration: 120 }
---
# Flagger Canary with Prometheus
apiVersion: flagger.app/v1beta1
kind: Canary
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: web }
  analysis:
    interval: 1m
    threshold: 5
    metrics:
      - name: error-rate
        templateRef: { name: error-rate }
Appendix F — Secrets via SOPS/age
- Encrypt in Git using age; team keys rotated quarterly
- CI checks: reject plaintext Secret manifests
- Decrypt at controller with KMS/KeyVault; limit decrypt to namespaces
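A sketch of how the two halves fit together when using age with Flux: a `.sops.yaml` rule that encrypts only the Secret payload fields, and a Kustomization that decrypts in-cluster. The age recipient, paths, and Secret name are placeholders, and the two documents below live in different places (repo root and cluster config respectively).

```yaml
# .sops.yaml at the repo root: encrypt only data/stringData in files under secrets/.
creation_rules:
  - path_regex: .*/secrets/.*\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1examplepublickeyplaceholder   # placeholder; use a real age public key
---
# Flux Kustomization that decrypts with a key stored in the cluster.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-secrets
  namespace: flux-system
spec:
  interval: 10m
  sourceRef: { kind: GitRepository, name: env }   # hypothetical source name
  path: ./platform/secrets/team-a
  prune: true
  decryption:
    provider: sops
    secretRef:
      name: sops-age   # Secret holding the age private key
```

Scoping one decryption key per tenant Kustomization is one way to limit which namespaces a given key can unlock.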
Appendix G — Policy as Code (OPA/Kyverno)
- Block privileged pods, hostPath, :latest images, missing labels/limits
- Require ingress TLS; enforce network policies and Pod Security Standards (the replacement for the removed PodSecurityPolicy)
- Admission audit → warn → enforce rollout plan
Appendix H — Compliance and Evidence
- PR templates: risk, rollback, owner, testing evidence, change type
- Signed commits/tags; provenance attestations for images/manifests
- Weekly export of Argo/Flux events; store in evidence bucket (WORM)
Appendix I — Git Strategy and Promotion
- env folders: dev/stage/prod with Kustomize overlays
- Promotion via PR: bump image tag/version; bot opens PR with diffs
- Cherry-pick emergency fixes; protect prod branch; codeowners required
Appendix J — Observability Dashboards
- Argo: OutOfSync apps, SyncError rate, Health status, Sync duration p95
- Flux: Reconcile duration/interval, Kustomization/HelmRelease errors, drift
- Rollouts/Flagger: canary step metrics, success/failure ratios
Appendix K — Alerts and SLOs
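- alert: ArgoSyncFailures
  # argocd_app_sync_total labels the outcome as `phase`
  expr: sum(rate(argocd_app_sync_total{phase=~"Error|Failed"}[5m])) > 0
  for: 10m
- alert: FluxKustomizationDegraded
  # assumes kube-state-metrics is configured to export gotk_resource_info (Flux 2.1+)
  expr: sum(gotk_resource_info{customresource_kind="Kustomization", ready="False"}) > 0
  for: 10m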
Appendix L — Runbooks
- App OutOfSync: check repo SHA, webhook triggers, controller logs, RBAC
- Rollout Stuck: inspect analysis run, metrics provider, pause/resume
- Secrets Fail: SOPS key access; re-encrypt; check age recipients
Appendix M — Troubleshooting Matrix
Symptom: Reconcile loops
- Check drift ignore settings; webhook spam; finalizers blocking
Symptom: Canary aborts constantly
- Tight thresholds or noisy metrics; widen/denoise; use longer windows
Symptom: Image not updating
- Verify ImageRepository tags; policy range; commit permissions
Templates Library
# ApplicationSet cluster generator (sketch)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - clusters: { selector: { matchLabels: { env: prod } } }
  template:
    spec:
      # quote templated values inside flow mappings so the braces parse as text
      source: { repoURL: ..., path: 'apps/{{name}}/overlays/prod' }
---
# Flux Kustomization with dependsOn
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec:
  interval: 1m
  dependsOn: [{ name: crds }]
Extended FAQ (401–900)
Q: Why pull-based GitOps?
A: Better security (no inbound), auditability, and reconciliation guarantees.
Q: Single or multiple Argo instances?
A: Per cluster or region; avoid giant singletons; isolate fault domains.
Q: How to manage CRDs?
A: Wave 0, cluster-scoped; pin versions; reconcile before workloads.
Q: Flux vs Argo?
A: Both solid; pick based on team familiarity and features (AppSet vs ImageAutomation).
Q: Keep Helm and Kustomize together?
A: Yes: render with Helm, then overlay with Kustomize for environment differences.
...
23) Multi-Cluster Topologies (Advanced)
- Hub-and-spoke: central management cluster with Argo/Flux controlling spokes
- Per-tenant clusters: isolation by customer; shared platform baseline
- Regional active/active: identical overlays with regional endpoints and data residency
- Staging mirrors prod topology at reduced scale; promotion simulates real flows
24) ApplicationSet Generators — Recipes
# 1) Cluster generator with label selectors
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod
  template:
    metadata:
      name: 'app-{{name}}'
    spec:
      destination:
        server: '{{server}}'
        namespace: apps
      source:
        repoURL: https://git.example.com/platform/apps.git
        path: 'apps/{{name}}/overlays/prod'
---
# 2) List generator for per-tenant apps
spec:
  generators:
    - list:
        elements:
          - { name: tenant-a }
          - { name: tenant-b }
---
# 3) Git generator for directories
spec:
  generators:
    - git:
        repoURL: https://git.example.com/platform/fleet.git
        revision: main
        directories:
          - path: clusters/**
25) Flux Image Automation — Patterns
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: web
spec:
  image: ghcr.io/org/web
  interval: 1m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: web-stable
spec:
  imageRepositoryRef: { name: web }
  policy:
    semver: { range: '^1.10.x' }
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: update-web
spec:
  interval: 5m   # required field; the value here is illustrative
  sourceRef: { kind: GitRepository, name: env }
  git:
    checkout: { ref: { branch: main } }
    commit:
      author: { name: bot, email: bot@org }
      messageTemplate: 'chore: bump web to {{range .Updated.Images}}{{print .NewTag}}{{end}}'
    push: { branch: main }
  update:
    path: ./clusters/prod
    strategy: Setters
26) Helm and Kustomize — Layering Anti-Patterns
- Overriding chart templates with Kustomize patches for logic → fork or contribute upstream
- Massive values.yaml per env → split into partials and compose
- Copy-pasting rendered manifests → lose upgrade path; prefer overlays
27) Kustomize Components and Generators
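# components/tolerations/kustomization.yaml
# A component declares its own kind and carries only patches; overlays pull it in via `components:`
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: tolerations.yaml
---
# generators: configMaps from files for env (in an overlay's kustomization.yaml)
configMapGenerator:
  - name: web-config
    behavior: create
    files:
      - cfg/app.toml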
28) RBAC Models and Examples
# ArgoCD Project limiting repos and destinations
apiVersion: argoproj.io/v1alpha1
kind: AppProject
spec:
  destinations: [{ server: 'https://kubernetes.default.svc', namespace: 'team-a-*' }]
  sourceRepos: ['https://git.example.com/team-a/*']
  # allows every cluster-scoped kind; narrow this list for tenant projects
  clusterResourceWhitelist: [{ group: '*', kind: '*' }]
  namespaceResourceBlacklist: [{ group: '', kind: 'Secret' }]
---
# Flux multi-tenant: namespace-scoped Kustomizations
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata: { namespace: team-a }
spec:
  serviceAccountName: team-a-deployer
  prune: true
29) SOPS Key Rotation Runbook
- Add new age recipient; re-encrypt all files with both old+new
- Deploy; verify decrypt in clusters; rotate controller keys
- Remove old recipient; re-encrypt; commit signed; audit evidence
30) External Secrets and Vault
# ExternalSecret consuming from Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
spec:
  refreshInterval: 1h
  secretStoreRef: { name: vault, kind: ClusterSecretStore }
  target: { name: db-creds }
  data:
    - secretKey: username
      remoteRef: { key: kv/data/prod/db, property: username }
31) Policy Packs (Kyverno/OPA)
# Kyverno: require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match: { resources: { kinds: ["Deployment"] } }
      validate:
        message: 'limits required'
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: '?*'
                        cpu: '?*'
32) Progressive Delivery — Multi-Metric Analysis
# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
spec:
  metrics:
    - name: error-rate
      provider:
        prometheus:
          address: http://prom:9090
          query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      # Prometheus results arrive as a vector, so index the first sample
      successCondition: result[0] < 0.01
      failureCondition: result[0] > 0.03
    - name: latency-p95
      provider:
        prometheus:
          address: http://prom:9090
          query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      successCondition: result[0] < 0.300
33) DR and Bootstrap Scripts (Outline)
# 1) Create cluster; 2) Install Argo/Flux; 3) Configure repos; 4) Sync baseline
kubectl create ns argocd && kubectl apply -n argocd -f install.yaml
# SSH key auth needs an SSH-style repo URL rather than https://
argocd repo add git@git.example.com:platform/env.git --ssh-private-key-path ~/.ssh/gitops
34) Observability — Prometheus Rules
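- alert: GitOpsSyncLag
  # fires when an app has reported OutOfSync continuously for 10 minutes
  expr: argocd_app_info{sync_status="OutOfSync"} == 1
  for: 10m
- alert: FluxReconcileError
  # controller-runtime counter exposed by the Flux controllers
  expr: sum(rate(controller_runtime_reconcile_total{controller="gitrepository", result="error"}[5m])) > 0
  for: 10m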
35) Grafana Dashboards (Sketch JSON)
{
  "title": "GitOps Overview",
  "panels": [
    {"type":"stat","title":"OutOfSync","targets":[{"expr":"sum(argocd_app_info{sync_status!='Synced'})"}]},
    {"type":"graph","title":"Reconcile Errors","targets":[{"expr":"sum(rate(controller_runtime_reconcile_total{controller='kustomization',result='error'}[5m]))"}]}
  ]
}
36) Git Workflows — PR Templates
- Change type: infra/app
- Risk: low/medium/high
- Rollback: command or PR reference
- Testing: links to staging run and metrics
- Owner: team/contact; Approvals: security if secrets/policy
37) Promotion Bots — Flows
- Image bump detected → PR to dev; auto-merge on green
- Promote dev→stage via PR; run smoke and canary; tag release
- Promote stage→prod via approval; attach evidence bundle
38) Monorepo Acceleration
- Use Nx/Turborepo to detect affected envs/apps; run partial reconciles in preview
- Cache renders (helm template/kustomize build) for PR previews
39) Supply Chain Security
- Cosign sign images; verify in cluster; policy blocks unsigned
- SLSA provenance for build artifacts; attest manifest digests
- SBOM diff on PR; alert on new critical CVEs
40) Compliance Mapping
- SOC2 CC8: change management via PRs; approvals and evidence links
- HIPAA: PHI segmentation, access reviews, encrypted secrets with KMS
- PCI: network policies, image provenance, vulnerability gating
41) Cost and Performance Tuning
- Reconcile intervals tuned per component (1–5m); long for static infra
- Batch rollouts; avoid thrash; limit concurrent syncs
- Use slim overlays and common bases to reduce render time
42) Troubleshooting Cookbook
- App shows Synced but pods old → rollout blocked; check HPA/PodDisruptionBudget
- Kustomize build differs locally vs cluster → env vars or generator differences
- Helm test failing gate → inspect hooks, namespace perms, resource quotas
43) Migration Guides
- From kubectl apply to GitOps: freeze direct access; import manifests; set baseline
- From Jenkins pipelines to Git PR flow: remove push deploys; use image automation
44) Extended FAQ (901–1400)
Q: Should controllers run in management cluster or per cluster?
A: Per cluster preferred; management blast radius is risky.
Q: Store CRDs in env repo or infra repo?
A: Infra repo; reconcile first (wave 0) then apps.
Q: How to deal with secrets at scale?
A: SOPS + KMS, or ExternalSecrets pointing to vault; no plaintext.
Q: Argo vs Flux drift ignoring?
A: Argo CD uses ignoreDifferences and resource customizations; Flux relies on per-resource annotations such as kustomize.toolkit.fluxcd.io/ssa: IfNotPresent or kustomize.toolkit.fluxcd.io/reconcile: disabled, plus health checks.
Q: Blue/green vs canary?
A: Canary for gradual risk; blue/green for instant switch; both with metrics.
Q: GitOps for databases?
A: Schema migrations as hooks; approvals required; backups first.
Appendix N — Environment Modeling
- dev: fast reconcile, permissive policies, preview namespaces per PR
- stage: mirrors prod topology; canaries mandatory; secrets scoped
- prod: enforced policies; change windows; promotion via PR only
Appendix O — App Ownership and SLOs
- Each Application/Kustomization has an owner; on-call rotation
- SLOs: sync error rate < 0.1%, mean sync duration < 60s, drift MTTR < 30m
- Burn rate policies: freeze risky changes when breached
Appendix P — Sync Waves and Ordering
- Wave 0: CRDs, CNIs, CSI drivers
- Wave 1: controllers/operators
- Wave 2: shared infra (DB operators, ingress controllers)
- Wave 3: application namespaces, config, secrets
- Wave 4: application workloads and autoscalers
Appendix Q — Hooks Library
# PreSync: schema migrations
# Sync: deploy workload
# PostSync: smoke checks and warmup probes
Appendix R — Template Repo Structure (Monorepo)
clusters/
  prod-us-east/
    kustomization.yaml
    apps/
    infra/
  stage/
  dev/
apps/
  web/
    base/
    overlays/
  api/
    base/
    overlays/
platform/
  policies/
  secrets/
Appendix S — Git Security
- Require signed commits/tags; protected branches; CODEOWNERS
- Enforce PR checks: policy validation, diff size, sensitive file changes
Appendix T — Helm Best Practices
- Pin chart versions; use values schemas; avoid logic in values
- Use postRenderers or Kustomize overlays for env differences
Appendix U — Kustomize Best Practices
- Keep base minimal and reusable; overlays add only diffs
- Prefer strategic merge patches; use JSON6902 for precise edits
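The two patch styles can sit side by side in one overlay. In this sketch, `replica-patch.yaml`, the `web` Deployment, and the image reference are illustrative names.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # Strategic merge: a partial manifest merged into the matching object
  - path: replica-patch.yaml
  # JSON6902: one precise operation on one field
  - target:
      kind: Deployment
      name: web
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/image
        value: ghcr.io/org/web:1.10.4
```

The JSON6902 form fails loudly if the path does not exist, which is useful when the base changes shape underneath an overlay.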
Appendix V — Disaster Scenarios
- Controller namespace deleted
- Repo compromised (force-push rewrite)
- Secrets decryption fails after rotation
- Metrics backend down during canary
Appendix W — Evidence Bundles
- Include PR links, approvals, rollout metrics, screenshots, trace links
- Store in immutable bucket with retention policies
Appendix X — Cost Controls
- Reduce reconcile frequency for static stacks
- Batch rollouts; limit concurrent syncs
- Use lightweight health checks to avoid resync storms
Appendix Y — Platform Versioning
- Tag platform baselines; app repos reference platform version
- Backport fixes with patch tags; document upgrade notes
Appendix Z — Change Windows and Freeze
- Freeze periods enforced for prod; exceptions require CAB approval
- Bots disabled during freeze; image automation paused
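Both tools can express a freeze declaratively. The fragments below are a sketch only: the project name, schedule, and automation name are assumptions, and other required fields of each object are omitted for brevity.

```yaml
# Argo CD: deny syncs for every application in the project during the window (fragment).
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: prod
  namespace: argocd
spec:
  syncWindows:
    - kind: deny
      schedule: "0 18 * * 5"   # cron: Fridays at 18:00
      duration: 60h
      applications: ["*"]
      manualSync: false        # block manual syncs as well
---
# Flux: pause image automation by setting suspend on the existing object (fragment).
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: update-web
  namespace: flux-system
spec:
  suspend: true
```

Committing the freeze through a PR keeps the exception process in the same audit trail as every other change.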
Extended Runbooks
Incident: Sync Storm
- Symptom: Continuous reconcile events, controller CPU high
- Actions: Increase interval; check webhook loops; throttle; fix resource drift
Incident: Rollout Abort
- Symptom: Canary fails on latency
- Actions: Validate metrics queries; raise threshold slightly; fix upstream
Incident: Secret Decrypt Failures
- Symptom: Pods crashing on secret mount
- Actions: Verify SOPS keys; re-encrypt; rotate controller age key
Observability Examples (PromQL)
# Out-of-sync ratio
sum(argocd_app_info{sync_status!="Synced"}) / count(argocd_app_info)
# Canary failure rate (rollouts_step_error_total is a placeholder name; substitute the counter your Rollouts controller exports)
sum(rate(rollouts_step_error_total[5m]))
# Flux reconcile latency p95
histogram_quantile(0.95, sum(rate(gotk_reconcile_duration_seconds_bucket{kind="Kustomization"}[5m])) by (le))
Migration Guides
- Helmfile to GitOps: convert releases → HelmReleases; centralize values
- Kubectl to Argo: import manifests; define Applications; turn on auto-sync
- Jenkins push to Flux: remove kubeconfig writes; add ImageAutomation
FAQ 1401–1500
Q: How often should we reconcile?
A: Dynamic apps 1–2m; static infra 5–15m.
Q: Can we pause automation?
A: Yes: disable auto-sync or suspend the Kustomization/HelmRelease.
Q: How do we handle manual hotfixes?
A: Document and revert in Git; controllers will converge to Git state.
Q: Should CRDs live in app or platform repo?
A: Platform; reconcile before workloads.
Q: How to enforce no :latest images?
A: Policy pack blocks; CI linters; review bot.
...
FAQ 1501–1600
Q: Is App-of-Apps safe?
A: Yes with RBAC, Projects, and scoped destinations.
Q: GitOps for stateful sets?
A: Yes; ensure PVC policies and backups; use hooks cautiously.
Q: Rollback strategy?
A: Revert Git; Rollouts/Flagger can auto-rollback on failure.
Q: Multi-tenant secrets?
A: Namespaces + ExternalSecrets; separate stores per tenant.
...
FAQ 1601–1700
Q: Can we template clusters?
A: Yes with ApplicationSet generators and overlays per cluster.
Q: Secret scanning?
A: Pre-commit and CI; reject plaintext; SOPS policy gate.
Q: Evidence retention?
A: At least one year; WORM storage; index by change ID.
...
FAQ 1701–1800
Q: Multi-arch images?
A: Use manifest lists; pin digests; verify at admission.
Q: Promotion safety?
A: Require approvals; include analysis dashboards; attach evidence.
...
FAQ 1801–2000
Q: How to measure GitOps ROI?
A: Lead time reduction, change failure rate, MTTR, deployment frequency.
Q: Can we mix push and pull?
A: Prefer pull; push only for bootstrap with minimal permissions.
Q: How to avoid repo sprawl?
A: Clear ownership; archiving policy; consolidate where boundaries align.
Final: declare, observe, converge, repeat.
Appendix AA — Argo CD Resource Customizations
# Resource customizations are keys in the argocd-cm ConfigMap, not a standalone kind
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.apps_Deployment: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.availableReplicas ~= nil and obj.status.replicas ~= nil then
        if obj.status.availableReplicas == obj.status.replicas then
          hs.status = "Healthy"
          hs.message = "All replicas available"
          return hs
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for replicas"
    return hs
  resource.customizations.ignoreDifferences.batch_CronJob: |
    jsonPointers:
      - /spec/startingDeadlineSeconds
Appendix AB — Application Health and Sync Policies
- Automated sync: prune + selfHeal
- Retry: limit, backoff, jitter
- Ignore: lastApplied, dynamic annotations, status fields
- Health checks: custom per CRDs (Rollouts, Flagger, Custom Operators)
Appendix AC — ApplicationSet Advanced Generators
# Matrix generator (cluster × app)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - matrix:
        generators:
          - clusters: { selector: { matchLabels: { env: prod } } }
          - list:
              elements:
                - { app: web }
                - { app: api }
  template:
    spec:
      destination: { server: '{{server}}', namespace: '{{app}}' }
      source: { path: 'apps/{{app}}/overlays/prod', repoURL: 'git@github.com:org/env.git' }
Appendix AD — Flux Dependencies and Ordering
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata: { name: workloads }
spec:
  dependsOn:
    - name: crds
    - name: controllers
  interval: 2m
  path: ./apps/workloads
Appendix AE — HelmRelease Patterns
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  chart: { spec: { chart: web, version: 1.2.3, sourceRef: { kind: HelmRepository, name: org-charts } } }
  install: { remediation: { retries: 3 } }
  upgrade: { remediation: { retries: 2 }, cleanupOnFail: true }
  values:
    replicaCount: 3
    resources: { limits: { cpu: 500m, memory: 512Mi } }
Appendix AF — GitOps Notifications
# Argo CD notifications (sketch)
apiVersion: v1
kind: ConfigMap
metadata: { name: argocd-notifications-cm }
data:
  service.slack: |
    token: $slack-token
  template.app-sync-status: |
    message: |
      {{.app.metadata.name}} {{.app.status.sync.status}}
Appendix AG — Security Posture
- mTLS in cluster, network policies enforced
- Admission policies block privileged/hostNetwork/capabilities
- Signed manifests and images (Cosign); verify in-cluster
- Least privilege for controllers; read-only repo deploy keys
Appendix AH — Promotion Policies and Evidence
- Dev → Stage: automated on green builds
- Stage → Prod: requires approvals + passing canary analysis
- Evidence bundle: PR links, metrics screenshots, trace IDs, change ticket
Appendix AI — Tenant Onboarding SOP
- Create namespace and RBAC; bind Git source permissions
- Bootstrap base overlays; set quotas; attach policies
- Validate with dry-run sync; then enable automated sync
Appendix AJ — Runbook: App Stuck Progressing
1) Check health detail (custom health scripts)
2) Inspect rollout/flagger analysis runs
3) Verify image tag and digest match desired
4) Look for quota and PDB constraints
5) If safe, roll back by reverting Git commit
Appendix AK — Preview Environments
- Per-PR namespaces with temporary DNS; auto-create via webhook
- TTL controllers clean up inactive previews; cost caps
- Policies restrict external egress and secrets exposure
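With Argo CD, per-PR namespaces can be driven by the ApplicationSet pull request generator instead of a custom webhook handler. This sketch assumes a GitHub repository `org/web`, a `previews` project, a `deploy/overlays/preview` path, and a Secret named `github-token`; all of those are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: web
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 300          # poll interval for open PRs
  template:
    metadata:
      name: 'web-pr-{{number}}'
    spec:
      project: previews
      source:
        repoURL: https://github.com/org/web.git
        targetRevision: '{{head_sha}}'    # deploy exactly the PR's head commit
        path: deploy/overlays/preview
      destination:
        server: https://kubernetes.default.svc
        namespace: 'web-pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
        syncOptions:
          - CreateNamespace=true
```

When the PR closes, the generator stops emitting its entry and the corresponding Application is removed.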
Appendix AL — Platform Baselines
- Baseline set: ingress, cert-manager, external-dns, monitoring, logging
- Versioned baseline tags (platform-vX.Y); apps reference baseline version
Appendix AM — Image Provenance and Verification
- Cosign sign on CI; attest SBOM and provenance (SLSA)
- In-cluster verify via policy; pin digest over tag where feasible
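One possible shape for in-cluster verification is a Kyverno `verifyImages` rule. This is a sketch under assumptions: the registry pattern is illustrative, the public key is a placeholder, and field names should be checked against the Kyverno version in use.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "ghcr.io/org/*"
          mutateDigest: true     # rewrite tags to digests once verified
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign public key goes here>
                      -----END PUBLIC KEY-----
```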
Appendix AN — Change Risk Classification
- Low: config only; Medium: rollout with canary; High: CRDs or policies
- High changes require CAB and change window; attach rollback plan
Appendix AO — Multi-Region Roll Patterns
- Wave by region: stage → prod‑eu → prod‑us; soak between waves
- Route 53/GSLB canary: shift 1%, 10%, 50%, 100%
Appendix AP — Namespace and Quota Strategy
- Default request/limit ranges; Pod Security Admission (PSA) levels; NetworkPolicies baseline
- Team quotas; burst approvals via PR with owner sign‑off
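A tenant namespace baseline along these lines could be committed as three small manifests. The team name and all numeric values are illustrative.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted   # Pod Security Admission level
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # applied when a container sets no request
      default: { cpu: 500m, memory: 512Mi }          # applied when a container sets no limit
```

A burst approval then becomes a reviewable diff on the ResourceQuota.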
Appendix AQ — Helm Value Schemas
# values.schema.json (excerpt)
{
"$schema": "http://json-schema.org/draft-07/schema#",
"properties": { "replicaCount": { "type": "integer", "minimum": 1 } }
}
Appendix AR — Incident Templates
- Summary, timeline, impact, remediation, follow‑ups, artifacts links
- Label linked PRs and Rollouts; attach charts
Appendix AS — Secrets Rotation Policy
- Rotate app secrets quarterly; roll keys with overlap; monitor error spikes
- Emergency rotation runbook with evidence requirements
Appendix AT — Admission Control Packs
- Enforce: limits, probes, non‑root, readOnlyRootFS, drop CAPs
- Ban: hostNetwork, hostPID, :latest, privilege escalation
Appendix AU — Canary Analysis Templates Library
- error‑rate, p95 latency, saturation, SLO burn‑rate (4/1h), business KPIs
Appendix AV — Rollback Playbooks
- Git revert; controller sync; validate metrics; communicate; evidence bundle
Appendix AW — Tenant Self‑Service
- Templates to add services; quota requests; policy exceptions
- Guard rails: bots reject unsafe merges automatically
Appendix AX — Sandbox and Chaos
- Sandbox cluster for policy testing; chaos experiments gated by labels
Appendix AY — Argo Notifications and Webhooks
- Slack, PagerDuty; templated messages with app, commit, author
Appendix AZ — Flux Alerts
# Provider + Alerts mapping to Slack/PagerDuty
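A minimal Provider and Alert pair for Slack might look like this. It assumes a Flux release that serves the `v1beta3` notification API, and the channel and Secret names are placeholders.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: gitops-alerts
  secretRef:
    name: slack-webhook   # Secret with an `address` key holding the webhook URL
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: reconcile-errors
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error    # forward only error events
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```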
Appendix BA — Repo Hygiene
- Lint manifests; detect drift; forbid large diffs; auto‑format kustomize
Appendix BB — Performance Tuning
- Controller concurrency; cache sizes; reconcile backoff; event coalescing
Appendix BC — Blue/Green Patterns
- Two services and selectors; instant switch; PostSwitch smoke; scale down old
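With Argo Rollouts, the pattern maps onto the `blueGreen` strategy. In this sketch the Service names, image, and the `smoke` AnalysisTemplate are assumed to exist and are named only for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: ghcr.io/org/web:1.10.4
  strategy:
    blueGreen:
      activeService: web-active        # receives production traffic
      previewService: web-preview      # points at the new ReplicaSet before the switch
      autoPromotionEnabled: false      # require an explicit promote
      postPromotionAnalysis:
        templates:
          - templateName: smoke        # hypothetical AnalysisTemplate run after the switch
      scaleDownDelaySeconds: 300       # keep the old ReplicaSet briefly for fast rollback
```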
Appendix BD — Traffic Shadowing
- Mirror live traffic to new version; compare metrics; no user impact
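If the mesh is Istio, shadowing can be declared on the VirtualService. The sketch assumes a `web` service with `stable` and `canary` subsets already defined in a DestinationRule.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web
spec:
  hosts: ["web"]
  http:
    - route:
        - destination: { host: web, subset: stable }
          weight: 100              # all user-facing responses come from stable
      mirror:
        host: web
        subset: canary             # copies of requests are sent here; responses are discarded
      mirrorPercentage:
        value: 100.0
```

Mirrored requests still hit downstream dependencies, so write paths usually need to be stubbed or excluded.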
Appendix BE — GitOps for Data Jobs
- Schedules and DAGs declared; promotion gated by data quality checks
Appendix BF — Governance Dashboards
- App counts, out‑of‑sync %, failed reconciles, mean time to converge
Policy Pack Examples (Kyverno)
# Disallow :latest images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
spec:
  validationFailureAction: Enforce
  rules:
    - name: disallow-latest
      # match Pods only; Kyverno auto-generates the equivalent rules for Deployments, StatefulSets, etc.
      match: { resources: { kinds: ["Pod"] } }
      validate:
        message: 'no :latest tags'
        pattern:
          spec:
            containers:
              - image: "!*:latest"
Operations: Common Errors and Fixes
- Permission denied pulling repo → wrong deploy key or host key mismatch
- Health check unknown for CRD → add custom health script
- Prune deleting PVCs unexpectedly → exclude kinds or set retention policies
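For the PVC case, both controllers honor a per-object opt-out annotation. The claim name and size below are illustrative; in practice only the annotation for the controller in use is needed.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data
  annotations:
    argocd.argoproj.io/sync-options: Prune=false    # Argo CD: never prune this object
    kustomize.toolkit.fluxcd.io/prune: disabled     # Flux: skip during garbage collection
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```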
Examples Library
# Ingress with canary annotation (NGINX)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
Mega FAQ (2001–2400)
Q: Why did Argo show Synced but service not updated?
A: Check rollout strategy and pause; health may be progressing.
Q: Flux changed file but pods didn’t restart?
A: Pods do not restart on their own when a ConfigMap changes; use configMapGenerator (hashed names) or bump a pod-template annotation to trigger a rollout.
Q: Is app‑of‑apps safe for prod?
A: Yes with scoped Projects and careful RBAC; test changes in stage first.
Q: How to pin digests?
A: Use image policy to select tag, then resolve to digest in overlays.
Q: Should bots auto‑merge?
A: For dev only; stage/prod require approvals and evidence.
Q: Git history rewrite broke sync?
A: Re‑add repo at new commit; consider signed tags for stability.
Q: Canary noisy metrics?
A: Increase lookback, add smoothing, combine multiple metrics.
Q: Block ingress without TLS?
A: Policy pack + admission controller; CI lint.
Q: Multi‑cluster secrets?
A: Per‑cluster stores; no cross‑region secret reuse; rotate.
Final: declare intent, verify via metrics, converge automatically.
Appendix BG — Platform Upgrade Playbook
- Roll platform baseline version; canary one cluster; monitor GitOps KPIs
- Tag release; promote region by region; rollback via tag revert
Appendix BH — Evidence Automation
- CI collects dashboards, Rollouts/Flagger results, PR metadata
- Bundle to artifact store; link in PR; immutable retention
Appendix BI — Git Hygiene and Linting
- kustomize build validation; helm template dry-runs; policy test suite
Appendix BJ — Preview Cost Controls
- TTL for namespaces; budget alerts; auto teardown on PR close
Appendix BK — Emergency Break Glass
- Temporarily suspend auto-sync; manual patch allowed; audit and revert to Git
Appendix BL — Multi-Cloud Adapters
- Abstract ingress/DNS/storage differences via overlays; test parity
Appendix BM — Observability Queries (More)
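# Drift rate per hour, approximated by sync operations (argocd_app_sync_total carries a `phase` label, not a sync status)
sum(increase(argocd_app_sync_total{phase="Succeeded"}[1h]))
# Canary success ratio (rollouts_step_* are placeholder names; substitute the counters your Rollouts controller exports)
sum(increase(rollouts_step_success_total[15m])) / sum(increase(rollouts_step_total[15m]))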
Appendix BN — DR Drills Scheduler
- Quarterly region failover; evidence bundle auto-generated; RTO/RPO tracked
Appendix BO — Policy Test Suite
- Unit tests for OPA/Kyverno rules; golden fixtures; CI gates
Appendix BP — Controller Sizing
- Scale replicas; set resource requests; anti-affinity; HPA on queue length
Appendix BQ — Repo Mirrors and Failover
- Read-only mirrors; fallback in controllers; signed tags for consistency
Appendix BR — Incident Labels and Searchability
- Label PRs and commits with incident IDs; dashboards filter by incident
Appendix BS — Canary Business Metrics
- Conversion rate, error funnel, add-to-cart; gate promotion by business KPIs too
Appendix BT — Data Residency
- Separate overlays per region; ensure secrets and endpoints regional
Appendix BU — Compliance Evidence Map
- Map every control to GitOps evidence: PRs, policies, metrics, runbooks
Appendix BV — Secrets Scanning
- Secret scanners on PR; block patterns; allowlist with expiry
Appendix BW — Rollout UX
- Status page integration; customer messaging during canary or maintenance
Appendix BX — SLOs and Error Budgets
- Sync error SLO; rollout failure SLO; burn alerts; freeze policy
Appendix BY — Training and Onboarding
- Hands-on labs: create app, add policy, run canary; shadow on-call
Appendix BZ — Archival Strategy
- Archive stale apps; retain evidence; cleanup secrets and DNS
Appendix CA — Templates: App + Policy + Secrets
# Composite templates for new services with policies and secrets wired
Appendix CB — Monorepo Tooling
- Affected graph; precomputed diffs; preview environments via bots
Appendix CC — Rollback Automation
- Bot opens revert PR with attached evidence; requires approval
Appendix CD — Network Policies Library
# Default deny; allow namespace and required egress (DNS, metrics, tracing)
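A minimal sketch of the default-deny baseline described above, using the standard `networking.k8s.io/v1` API. The namespace `team-a` is an assumed example, and the DNS rule assumes cluster DNS runs in `kube-system`; metrics and tracing egress would need further rules for your collectors.

```yaml
# 1) Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a            # assumed tenant namespace
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# 2) Re-allow traffic inside the namespace plus DNS lookups.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-and-dns
  namespace: team-a
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}      # any pod in the same namespace
  egress:
    - to:
        - podSelector: {}
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumes DNS lives here
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```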
Appendix CE — Ingress Library
# Common annotations; TLS; canary and shadow patterns
Appendix CF — Storage Classes and PVC Policies
# Templates for RWX/RWO; retention and snapshot policies
Appendix CG — Backup/Restore
- Velero templates; schedule and retention; restore rehearsals
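As an illustration of a schedule-plus-retention template, the sketch below uses Velero's `Schedule` resource. The name, cron expression, namespace list, and 30-day TTL are assumptions to adapt.

```yaml
# Nightly backup of the apps namespace, retained for 30 days (values are assumptions).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: apps-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"        # cron: every day at 02:00
  template:
    includedNamespaces:
      - apps
    snapshotVolumes: true
    ttl: 720h0m0s              # backups expire after 30 days
```

A restore rehearsal can then be run against a scratch cluster or namespace on a regular cadence, so that retention settings are proven by an actual restore.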
Appendix CH — Autoscaling Policies
# HPA templates; PDBs; disruption budgets coordinated with rollouts
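A minimal sketch of an HPA paired with a PodDisruptionBudget, assuming a Deployment named `web` labelled `app: web` in namespace `apps` (all hypothetical). Keeping `minAvailable` below the HPA's `minReplicas` leaves room for node drains while a rollout is in progress.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
  namespace: apps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
  namespace: apps
spec:
  minAvailable: 2                  # stays below minReplicas so drains can proceed
  selector:
    matchLabels:
      app: web
```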
Appendix CI — Thundering Herd Protection
- Stagger sync intervals; random jitter; limit concurrent rollouts
Appendix CJ — Canary Trace Integration
- Trace new version spans; compare latencies and error tags vs baseline
Appendix CK — Secrets Redaction Tests
- Ensure logs, events, and notifications redact secrets
Appendix CL — SLA Dashboards
- Time to converge; success rates; policy violations; secrets rotation status
Mega FAQ (2401–3000)
Q: How to prevent config drift by humans?
A: Lock down kubectl; read-only access; Git as the only change path.
Q: Can we sync from multiple repos?
A: Yes; whitelist per Project; manage credentials per source.
Q: How to handle CRD upgrades?
A: Staged roll; validate CR status; backup CRs; wave 0 only.
Q: Is digest pinning mandatory?
A: Strongly recommended in prod; policy can enforce.
Q: Promotion without canary?
A: Allowed for low-risk changes by policy; document rationale.
Final: ship with confidence—declare, verify, and converge.
Appendix DA — Release Trains
- Batch changes into trains; predictable schedules; reduce coordination overhead
Appendix DB — Feature Flags and GitOps
- Flags as config; promote flag states through envs; audit in Git
Appendix DC — Secret Leasing
- Short-lived tokens; auto-rotation; failure fallbacks
Appendix DD — Evidence Indexing
- Index evidence by change ID; search across PRs, metrics, incidents
Appendix DE — Canary on Business KPIs
- Gate promotions on conversion or error funnel, not just tech metrics
Appendix DF — Rollout UX and Communication
- Status banners; SSO prompts for re-login; maintenance windows notices
Appendix DG — Repo Mirrors Security
- Sign mirrors; alert on divergence; periodic consistency checks
Appendix DH — Offline Clusters
- Pull via mirror artifact bundles; sync from internal registries
Appendix DI — Dry-Run Validation Gates
- kustomize build + kubeconform; helm template + kubeconform; policy checks
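The gates above could be wired into CI roughly as follows. This is a sketch written as a GitHub Actions workflow, which is an assumption (the source does not name a CI system); it also assumes `kustomize`, `helm`, and `kubeconform` are already on the runner's PATH, and the chart path `charts/web` is hypothetical.

```yaml
# .github/workflows/validate.yaml (hypothetical path and job names)
name: validate-manifests
on:
  pull_request:
    paths: ["apps/**", "clusters/**"]
jobs:
  render-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Render and validate Kustomize overlays
        run: |
          set -euo pipefail
          # Build every overlay and validate the rendered output from stdin.
          for dir in apps/*/overlays/*/; do
            echo "validating ${dir}"
            kustomize build "${dir}" | kubeconform -strict -summary -ignore-missing-schemas
          done
      - name: Render and validate Helm chart
        run: |
          set -euo pipefail
          helm template web charts/web | kubeconform -strict -summary -ignore-missing-schemas
```

`-ignore-missing-schemas` skips custom resources that have no published schema; policy checks would run as an additional step after rendering.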
Appendix DJ — SRE Onboarding
- Labs: break/fix GitOps; policy authoring; rollout troubleshooting
Appendix DK — CAB Workflow
- Proposals, risk classification, test plan, rollback, approvals, evidence
Appendix DL — Quarterly Audits
- Review policies, SLOs, DR drills, secrets rotation, platform versions
Additional Templates
# Canary ingress template, HPA template, PDB template, NetworkPolicy template
Mega FAQ (3001–3400)
Q: How to minimize noisy diffs?
A: Sort keys; stable generators; avoid timestamps.
Q: Can GitBots break audits?
A: Not if bot commits are signed and attributed; include context in commit messages.
Q: How to prevent runaway reconciles?
A: Backoff, jitter, limit concurrency; dedupe events.
Q: Who owns policy packs?
A: Platform with security sign-off; app teams propose exceptions via PRs.
Final: operational discipline + strong automation = safe velocity.
Appendix DM — Artifact Registries
- HA registries; retention; garbage collection; provenance checks
Appendix DN — Preview Secrets
- Ephemeral secrets from vault with TTL; revoke on PR close
Appendix DO — Template Catalog
- Reusable service templates; policy-included; well-documented inputs
Appendix DP — Change Budgets
- Limit changes/week per service; consolidate small flips; reduce noise
Appendix DQ — SLO Burn Policy
- Freeze risky changes when burn >2x; de-risk via canary or extra tests
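A sketch of the burn-rate trigger as a Prometheus rule file. It assumes the sync-error SLO of 0.1% (error budget 0.001) and uses Argo CD's `argocd_app_sync_total` counter, whose outcome label is `phase`; the alert name and the `action: freeze` label are hypothetical conventions a bot could act on.

```yaml
groups:
  - name: gitops-slo
    rules:
      - alert: GitOpsSyncErrorBudgetBurn
        # burn rate = observed error ratio / error budget; fire above 2x
        expr: |
          (
            sum(rate(argocd_app_sync_total{phase=~"Error|Failed"}[1h]))
            /
            sum(rate(argocd_app_sync_total[1h]))
          ) / 0.001 > 2
        for: 15m
        labels:
          severity: page
          action: freeze        # hypothetical label consumed by a freeze bot
        annotations:
          summary: "Sync error budget burning at more than 2x; freeze risky changes"
```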
Mega FAQ (3401–3600)
Q: Should we allow manual rollouts?
A: Only under break glass with audit and revert to Git quickly.
Q: How to catch policy regressions?
A: Policy unit tests + golden fixtures in CI.
Q: Images churn too fast; PR noise?
A: Batch bumps; digest pinning; frequency limits.
Q: Canary unstable in low traffic?
A: Use longer windows, aggregate metrics, schedule during higher load.
Final: design for safety, measure relentlessly, and automate recovery.
Appendix E — Progressive Delivery (Rollouts/Flagger)
# Argo Rollouts Canary example (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}      # seconds
        - analysis:
            templates:
              - templateName: web-metrics   # AnalysisTemplate referenced by name
        - setWeight: 50
        - pause: {duration: 120}
Appendix F — Secrets via SOPS/age
- Encrypt in Git using age; team keys rotated quarterly
- CI checks: reject plaintext Secret manifests
- Decrypt at controller with KMS/KeyVault; limit decrypt to namespaces
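A sketch of how the pieces above fit together when Flux does the decryption. It shows two separate files in one block: a `.sops.yaml` rule that encrypts only `data`/`stringData`, and a Flux Kustomization with `spec.decryption`. The age recipient is a placeholder, and the path, names, and the `sops-age` Secret are assumptions; when a cloud KMS is used, the `secretRef` is typically replaced by workload identity on the controller.

```yaml
# --- file: .sops.yaml (repo root) ---
creation_rules:
  - path_regex: clusters/.*/secrets/.*\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1examplepublickeyreplaceme     # placeholder recipient, not a real key
---
# --- file: clusters/prod/secrets-kustomization.yaml (hypothetical path) ---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: secrets
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/prod/secrets
  prune: true
  sourceRef:
    kind: GitRepository
    name: env
  decryption:
    provider: sops
    secretRef:
      name: sops-age        # Secret holding the age private key
```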
Appendix G — Policy as Code (OPA/Kyverno)
- Block privileged pods, hostPath, :latest images, missing labels/limits
- Require ingress TLS; enforce network policies and Pod Security Standards (the PodSecurityPolicy replacement)
- Admission audit → warn → enforce rollout plan
Appendix H — Compliance and Evidence
- PR templates: risk, rollback, owner, testing evidence, change type
- Signed commits/tags; provenance attestations for images/manifests
- Weekly export of Argo/Flux events; store in evidence bucket (WORM)
Appendix I — Git Strategy and Promotion
- env folders: dev/stage/prod with Kustomize overlays
- Promotion via PR: bump image tag/version; bot opens PR with diffs
- Cherry-pick emergency fixes; protect prod branch; codeowners required
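For illustration, a promotion PR in this model usually touches one field in the target overlay. The sketch below assumes the `apps/<service>/overlays/<env>` layout and the image `ghcr.io/org/web`; the tag value is hypothetical.

```yaml
# apps/web/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: ghcr.io/org/web
    newTag: "1.10.3"      # the only line a promotion PR changes (hypothetical version)
```

Because the diff is a single line, reviewers and CODEOWNERS can approve quickly, and rolling back is a revert of that commit.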
Appendix J — Observability Dashboards
- Argo: OutOfSync apps, SyncError rate, Health status, Sync duration p95
- Flux: Reconcile duration/interval, Kustomization/HelmRelease errors, drift
- Rollouts/Flagger: canary step metrics, success/failure ratios
Appendix K — Alerts and SLOs
- alert: ArgoSyncFailures
  # argocd_app_sync_total reports the sync outcome in the `phase` label
  expr: sum(rate(argocd_app_sync_total{phase=~"Error|Failed"}[5m])) > 0
  for: 10m
- alert: FluxKustomizationDegraded
  expr: sum(kustomization_info{ready="false"}) > 0
  for: 10m
Appendix L — Runbooks
- App OutOfSync: check repo SHA, webhook triggers, controller logs, RBAC
- Rollout Stuck: inspect analysis run, metrics provider, pause/resume
- Secrets Fail: SOPS key access; re-encrypt; check age recipients
Appendix M — Troubleshooting Matrix
Symptom: Reconcile loops
- Check drift ignore settings; webhook spam; finalizers blocking
Symptom: Canary aborts constantly
- Tight thresholds or noisy metrics; widen/denoise; use longer windows
Symptom: Image not updating
- Verify ImageRepository tags; policy range; commit permissions
Templates Library
# ApplicationSet cluster generator (sketch)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod
  template:
    spec:
      source:
        repoURL: ...
        path: 'apps/{{name}}/overlays/prod'
---
# Flux Kustomization with dependsOn
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
spec:
  interval: 1m
  dependsOn:
    - name: crds
Extended FAQ (401–900)
Q: Why pull-based GitOps?
A: Better security (no inbound), auditability, and reconciliation guarantees.
Q: Single or multiple Argo instances?
A: Per cluster or region; avoid giant singletons; isolate fault domains.
Q: How to manage CRDs?
A: Wave 0, cluster-scoped; pin versions; reconcile before workloads.
Q: Flux vs Argo?
A: Both solid; pick based on team familiarity and features (AppSet vs ImageAutomation).
Q: Keep Helm and Kustomize together?
A: Yes. Use Helm to render, then overlay with Kustomize for env diffs.
...
23) Multi-Cluster Topologies (Advanced)
- Hub-and-spoke: central management cluster with Argo/Flux controlling spokes
- Per-tenant clusters: isolation by customer; shared platform baseline
- Regional active/active: identical overlays with regional endpoints and data residency
- Staging mirrors prod topology at reduced scale; promotion simulates real flows
24) ApplicationSet Generators — Recipes
# 1) Cluster generator with label selectors
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod
  template:
    metadata:
      name: 'app-{{name}}'
    spec:
      destination:
        server: '{{server}}'
        namespace: apps
      source:
        repoURL: https://git.example.com/platform/apps.git
        path: 'apps/{{name}}/overlays/prod'
# 2) List generator for per-tenant apps
spec:
  generators:
    - list:
        elements:
          - { name: tenant-a }
          - { name: tenant-b }
# 3) Git generator for directories
spec:
  generators:
    - git:
        repoURL: https://git.example.com/platform/fleet.git
        revision: main
        directories:
          - path: clusters/**
25) Flux Image Automation — Patterns
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: web
spec:
  image: ghcr.io/org/web
  interval: 1m
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: web-stable
spec:
  imageRepositoryRef: { name: web }
  policy:
    semver: { range: '^1.10.x' }
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: update-web
spec:
  interval: 5m   # required field; the value here is an assumption
  sourceRef: { kind: GitRepository, name: env }
  git:
    checkout: { ref: { branch: main } }
    commit:
      author: { name: bot, email: bot@org }
      # .Updated.Images lists the image references that were changed
      messageTemplate: 'chore: bump web to {{range .Updated.Images}}{{.}} {{end}}'
    push: { branch: main }
  update:
    path: ./clusters/prod
    strategy: Setters
26) Helm and Kustomize — Layering Anti-Patterns
- Overriding chart templates with Kustomize patches for logic → fork or contribute upstream
- Massive values.yaml per env → split into partials and compose
- Copy-pasting rendered manifests → lose upgrade path; prefer overlays
27) Kustomize Components and Generators
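# components/tolerations/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
patches:
  - path: tolerations.yaml
# overlay kustomization.yaml (assumed to sit two levels below the repo's base/ and components/)
resources:
  - ../../base
components:
  - ../../components/tolerations
# generators: configMaps from files for env
configMapGenerator:
  - name: web-config
    behavior: create
    files:
      - cfg/app.toml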
28) RBAC Models and Examples
# Argo CD Project limiting repos and destinations
apiVersion: argoproj.io/v1alpha1
kind: AppProject
spec:
  destinations:
    - server: https://kubernetes.default.svc
      namespace: 'team-a-*'
  sourceRepos:
    - 'https://git.example.com/team-a/*'
  clusterResourceWhitelist:
    - { group: '*', kind: '*' }   # permits every cluster-scoped kind; narrow this for tenant projects
  namespaceResourceBlacklist:
    - { group: '', kind: 'Secret' }
---
# Flux multi-tenant: namespace-scoped Kustomizations
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  namespace: team-a
spec:
  serviceAccountName: team-a-deployer
  prune: true
29) SOPS Key Rotation Runbook
- Add new age recipient; re-encrypt all files with both old+new
- Deploy; verify decrypt in clusters; rotate controller keys
- Remove old recipient; re-encrypt; commit signed; audit evidence
30) External Secrets and Vault
# ExternalSecret consuming from Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
spec:
  refreshInterval: 1h
  secretStoreRef: { name: vault, kind: ClusterSecretStore }
  target: { name: db-creds }
  data:
    - secretKey: username
      remoteRef: { key: kv/data/prod/db, property: username }
31) Policy Packs (Kyverno/OPA)
# Kyverno: require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match: { resources: { kinds: ["Deployment"] } }
      validate:
        message: 'limits required'
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      limits:
                        memory: '?*'
                        cpu: '?*'
32) Progressive Delivery — Multi-Metric Analysis
# Argo Rollouts AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
spec:
  metrics:
    - name: error-rate
      # The Prometheus provider returns a vector, so conditions index the first sample
      successCondition: result[0] < 0.01
      failureCondition: result[0] > 0.03
      provider:
        prometheus:
          address: http://prom:9090
          query: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
    - name: latency-p95
      successCondition: result[0] < 0.300
      provider:
        prometheus:
          address: http://prom:9090
          query: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
33) DR and Bootstrap Scripts (Outline)
# 1) Create cluster; 2) Install Argo/Flux; 3) Configure repos; 4) Sync baseline
kubectl create ns argocd && kubectl apply -n argocd -f install.yaml
argocd repo add git@git.example.com:platform/env.git --ssh-private-key-path ~/.ssh/gitops
34) Observability — Prometheus Rules
- alert: GitOpsSyncLag
  expr: max(time() - argocd_app_sync_timestamp_seconds) by (app) > 600
  for: 10m
- alert: FluxReconcileError
  # controller-runtime counter exposed by the Flux controllers
  expr: sum(rate(controller_runtime_reconcile_total{controller="gitrepository",result="error"}[5m])) > 0
  for: 10m
35) Grafana Dashboards (Sketch JSON)
{
  "title": "GitOps Overview",
  "panels": [
    {"type":"stat","title":"OutOfSync","targets":[{"expr":"sum(argocd_app_info{sync_status!='Synced'})"}]},
    {"type":"graph","title":"Reconcile Errors","targets":[{"expr":"sum(rate(controller_runtime_reconcile_total{controller='kustomization',result='error'}[5m]))"}]}
  ]
}
36) Git Workflows — PR Templates
- Change type: infra/app
- Risk: low/medium/high
- Rollback: command or PR reference
- Testing: links to staging run and metrics
- Owner: team/contact; Approvals: security if secrets/policy
37) Promotion Bots — Flows
- Image bump detected → PR to dev; auto-merge on green
- Promote dev→stage via PR; run smoke and canary; tag release
- Promote stage→prod via approval; attach evidence bundle
38) Monorepo Acceleration
- Use Nx/Turborepo to detect affected envs/apps; run partial reconciles in preview
- Cache renders (helm template/kustomize build) for PR previews
39) Supply Chain Security
- Cosign sign images; verify in cluster; policy blocks unsigned
- SLSA provenance for build artifacts; attest manifest digests
- SBOM diff on PR; alert on new critical CVEs
40) Compliance Mapping
- SOC2 CC8: change management via PRs; approvals and evidence links
- HIPAA: PHI segmentation, access reviews, encrypted secrets with KMS
- PCI: network policies, image provenance, vulnerability gating
41) Cost and Performance Tuning
- Reconcile intervals tuned per component (1–5m); long for static infra
- Batch rollouts; avoid thrash; limit concurrent syncs
- Use slim overlays and common bases to reduce render time
42) Troubleshooting Cookbook
- App shows Synced but pods old → rollout blocked; check HPA/PodDisruptionBudget
- Kustomize build differs locally vs cluster → env vars or generator differences
- Helm test failing gate → inspect hooks, namespace perms, resource quotas
43) Migration Guides
- From kubectl apply to GitOps: freeze direct access; import manifests; set baseline
- From Jenkins pipelines to Git PR flow: remove push deploys; use image automation
44) Extended FAQ (901–1400)
Q: Should controllers run in management cluster or per cluster?
A: Per cluster preferred; management blast radius is risky.
Q: Store CRDs in env repo or infra repo?
A: Infra repo; reconcile first (wave 0) then apps.
Q: How to deal with secrets at scale?
A: SOPS + KMS, or ExternalSecrets pointing to vault; no plaintext.
Q: Argo vs Flux drift ignoring?
A: Argo uses ignoreDifferences and resource customizations; Flux uses per-object annotations (kustomize.toolkit.fluxcd.io/ssa, kustomize.toolkit.fluxcd.io/reconcile) plus health checks.
Q: Blue/green vs canary?
A: Canary for gradual risk; blue/green for instant switch; both with metrics.
Q: GitOps for databases?
A: Schema migrations as hooks; approvals required; backups first.
Appendix N — Environment Modeling
- dev: fast reconcile, permissive policies, preview namespaces per PR
- stage: mirrors prod topology; canaries mandatory; secrets scoped
- prod: enforced policies; change windows; promotion via PR only
Appendix O — App Ownership and SLOs
- Each Application/Kustomization has an owner; on-call rotation
- SLOs: sync error rate < 0.1%, mean sync duration < 60s, drift MTTR < 30m
- Burn rate policies: freeze risky changes when breached
Appendix P — Sync Waves and Ordering
- Wave 0: CRDs, CNIs, CSI drivers
- Wave 1: controllers/operators
- Wave 2: shared infra (DB operators, ingress controllers)
- Wave 3: application namespaces, config, secrets
- Wave 4: application workloads and autoscalers
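In Argo CD the wave is set per resource with the `argocd.argoproj.io/sync-wave` annotation, and lower waves are applied and must become healthy first. A minimal sketch for waves 3 and 4, with hypothetical names and image tag (Flux expresses the same ordering with `dependsOn` between Kustomizations):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: apps
  annotations:
    argocd.argoproj.io/sync-wave: "3"   # namespaces, config, secrets
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: apps
  annotations:
    argocd.argoproj.io/sync-wave: "4"   # workloads wait for wave 3
spec:
  replicas: 2
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: ghcr.io/org/web:1.10.3   # hypothetical tag
```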
Appendix Q — Hooks Library
# PreSync: schema migrations
# Sync: deploy workload
# PostSync: smoke checks and warmup probes
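A sketch of the PreSync entry as an Argo CD resource hook. The Job name, image, and arguments are hypothetical; the two annotations are the Argo CD hook and hook-deletion mechanisms. A PostSync smoke check has the same shape with `hook: PostSync`.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: web-db-migrate
  namespace: apps
  annotations:
    argocd.argoproj.io/hook: PreSync                        # run before manifests are applied
    argocd.argoproj.io/hook-delete-policy: HookSucceeded    # clean up after success
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/org/web-migrations:1.10.3   # hypothetical image
          args: ["migrate", "up"]                    # hypothetical command
```

If the Job fails, the sync stops before the workload is updated, which is the behaviour a schema migration gate needs.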
Appendix R — Template Repo Structure (Monorepo)
clusters/
  prod-us-east/
    kustomization.yaml
    apps/
    infra/
  stage/
  dev/
apps/
  web/
    base/
    overlays/
  api/
    base/
    overlays/
platform/
  policies/
  secrets/
Appendix S — Git Security
- Require signed commits/tags; protected branches; CODEOWNERS
- Enforce PR checks: policy validation, diff size, sensitive file changes
Appendix T — Helm Best Practices
- Pin chart versions; use values schemas; avoid logic in values
- Use postRenderers or Kustomize overlays for env differences
Appendix U — Kustomize Best Practices
- Keep base minimal and reusable; overlays add only diffs
- Prefer strategic merge patches; use JSON6902 for precise edits
Appendix V — Disaster Scenarios
- Controller namespace deleted
- Repo compromised (force-push rewrite)
- Secrets decryption fails after rotation
- Metrics backend down during canary
Appendix W — Evidence Bundles
- Include PR links, approvals, rollout metrics, screenshots, trace links
- Store in immutable bucket with retention policies
Appendix X — Cost Controls
- Reduce reconcile frequency for static stacks
- Batch rollouts; limit concurrent syncs
- Use lightweight health checks to avoid resync storms
Appendix Y — Platform Versioning
- Tag platform baselines; app repos reference platform version
- Backport fixes with patch tags; document upgrade notes
Appendix Z — Change Windows and Freeze
- Freeze periods enforced for prod; exceptions require CAB approval
- Bots disabled during freeze; image automation paused
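With Argo CD, a recurring freeze can be declared as a deny sync window on the project. The sketch below assumes a project named `prod` and a weekend freeze from Friday 18:00 to Monday 08:00 UTC; the schedule is an example, not a recommendation. On the Flux side, setting `spec.suspend: true` on the ImageUpdateAutomation pauses image bumps for the same period.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: prod
  namespace: argocd
spec:
  syncWindows:
    - kind: deny
      schedule: "0 18 * * 5"    # cron: Fridays at 18:00
      duration: 62h             # ends Monday 08:00
      timeZone: "UTC"
      applications: ["*"]
      manualSync: true          # approved exceptions can still sync manually
```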
Extended Runbooks
Incident: Sync Storm
- Symptom: Continuous reconcile events, controller CPU high
- Actions: Increase interval; check webhook loops; throttle; fix resource drift
Incident: Rollout Abort
- Symptom: Canary fails on latency
- Actions: Validate metrics queries; raise threshold slightly; fix upstream
Incident: Secret Decrypt Failures
- Symptom: Pods crashing on secret mount
- Actions: Verify SOPS keys; re-encrypt; rotate controller age key
Observability Examples (PromQL)
# Out-of-sync ratio
sum(argocd_app_info{sync_status!="Synced"}) / count(argocd_app_info)
# Canary failure rate
sum(rate(rollouts_step_error_total[5m]))
# Flux reconcile latency p95
histogram_quantile(0.95, sum(rate(gotk_reconcile_duration_seconds_bucket{kind="Kustomization"}[5m])) by (le))
Migration Guides
- Helmfile to GitOps: convert releases → HelmReleases; centralize values
- Kubectl to Argo: import manifests; define Applications; turn on auto-sync
- Jenkins push to Flux: remove kubeconfig writes; add ImageAutomation
FAQ 1401–1500
Q: How often should we reconcile?
A: Dynamic apps 1–2m; static infra 5–15m.
Q: Can we pause automation?
A: Yes: disable auto-sync or suspend the Kustomization/HelmRelease.
Q: How do we handle manual hotfixes?
A: Document and revert in Git; controllers will converge to Git state.
Q: Should CRDs live in app or platform repo?
A: Platform; reconcile before workloads.
Q: How to enforce no :latest images?
A: Policy pack blocks; CI linters; review bot.
...
FAQ 1501–1600
Q: Is App-of-Apps safe?
A: Yes with RBAC, Projects, and scoped destinations.
Q: GitOps for stateful sets?
A: Yes; ensure PVC policies and backups; use hooks cautiously.
Q: Rollback strategy?
A: Revert Git; Rollouts/Flagger can auto-rollback on failure.
Q: Multi-tenant secrets?
A: Namespaces + ExternalSecrets; separate stores per tenant.
...
FAQ 1601–1700
Q: Can we template clusters?
A: Yes with ApplicationSet generators and overlays per cluster.
Q: Secret scanning?
A: Pre-commit and CI; reject plaintext; SOPS policy gate.
Q: Evidence retention?
A: At least one year; WORM storage; index by change ID.
...
FAQ 1701–1800
Q: Multi-arch images?
A: Use manifest lists; pin digests; verify at admission.
Q: Promotion safety?
A: Require approvals; include analysis dashboards; attach evidence.
...
FAQ 1801–2000
Q: How to measure GitOps ROI?
A: Lead time reduction, change failure rate, MTTR, deployment frequency.
Q: Can we mix push and pull?
A: Prefer pull; push only for bootstrap with minimal permissions.
Q: How to avoid repo sprawl?
A: Clear ownership; archiving policy; consolidate where boundaries align.
Final: declare, observe, converge—repeat.
Appendix AA — Argo CD Resource Customizations
# Resource customizations live in the argocd-cm ConfigMap, keyed by <group>_<kind>
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.apps_Deployment: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.availableReplicas ~= nil and obj.status.replicas ~= nil then
        if obj.status.availableReplicas == obj.status.replicas then
          hs.status = "Healthy"
          hs.message = "All replicas available"
          return hs
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for replicas"
    return hs
  resource.customizations.ignoreDifferences.batch_CronJob: |
    jsonPointers:
      - /spec/startingDeadlineSeconds
Appendix AB — Application Health and Sync Policies
- Automated sync: prune + selfHeal
- Retry: limit, backoff, jitter
- Ignore: lastApplied, dynamic annotations, status fields
- Health checks: custom per CRDs (Rollouts, Flagger, Custom Operators)
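A sketch of how these policies look on a single Argo CD Application. The repo URL reuses the example host from earlier sections; the app name, paths, retry values, and the ignored field are assumptions. The Application retry block exposes `limit` and exponential `backoff` (duration, factor, maxDuration); jitter is not a field here.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/apps.git
    targetRevision: main
    path: apps/web/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: apps
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual changes in the cluster
    retry:
      limit: 5
      backoff:
        duration: 10s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas   # assumed to be owned by an HPA
```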
Appendix AC — ApplicationSet Advanced Generators
# Matrix generator (cluster × app)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  env: prod
          - list:
              elements:
                - { app: web }
                - { app: api }
  template:
    spec:
      destination:
        server: '{{server}}'
        namespace: '{{app}}'
      source:
        repoURL: git@github.com:org/env.git
        path: 'apps/{{app}}/overlays/prod'
Appendix AD — Flux Dependencies and Ordering
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: workloads
spec:
  dependsOn:
    - name: crds
    - name: controllers
  interval: 2m
  path: ./apps/workloads
Appendix AE — HelmRelease Patterns
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  interval: 5m   # required by the HelmRelease API; the value is an assumption
  chart:
    spec:
      chart: web
      version: '1.2.3'
      sourceRef: { kind: HelmRepository, name: org-charts }
  install:
    remediation: { retries: 3 }
  upgrade:
    remediation: { retries: 2 }
    cleanupOnFail: true
  values:
    replicaCount: 3
    resources: { limits: { cpu: 500m, memory: 512Mi } }
Appendix AF — GitOps Notifications
# Argo CD notifications (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
  service.slack: |
    token: $slack-token
  template.app-sync-status: |
    message: |
      {{.app.metadata.name}} {{.app.status.sync.status}}
Appendix AG — Security Posture
- mTLS in cluster, network policies enforced
- Admission policies block privileged/hostNetwork/capabilities
- Signed manifests and images (Cosign); verify in-cluster
- Least privilege for controllers; read-only repo deploy keys
Appendix AH — Promotion Policies and Evidence
- Dev → Stage: automated on green builds
- Stage → Prod: requires approvals + passing canary analysis
- Evidence bundle: PR links, metrics screenshots, trace IDs, change ticket
Appendix AI — Tenant Onboarding SOP
- Create namespace and RBAC; bind Git source permissions
- Bootstrap base overlays; set quotas; attach policies
- Validate with dry-run sync; then enable automated sync
Appendix AJ — Runbook: App Stuck Progressing
1) Check health detail (custom health scripts)
2) Inspect rollout/flagger analysis runs
3) Verify image tag and digest match desired
4) Look for quota and PDB constraints
5) If safe, roll back by reverting Git commit
Appendix AK — Preview Environments
- Per-PR namespaces with temporary DNS; auto-create via webhook
- TTL controllers clean up inactive previews; cost caps
- Policies restrict external egress and secrets exposure
Appendix AL — Platform Baselines
- Baseline set: ingress, cert-manager, external-dns, monitoring, logging
- Versioned baseline tags (platform-vX.Y); apps reference baseline version
Appendix AM — Image Provenance and Verification
- Cosign sign on CI; attest SBOM and provenance (SLSA)
- In-cluster verify via policy; pin digest over tag where feasible
Appendix AN — Change Risk Classification
- Low: config only; Medium: rollout with canary; High: CRDs or policies
- High changes require CAB and change window; attach rollback plan
Appendix AO — Multi-Region Roll Patterns
- Wave by region: stage → prod‑eu → prod‑us; soak between waves
- Route 53/GSLB canary: shift 1%, 10%, 50%, 100%
Appendix AP — Namespace and Quota Strategy
- Default request/limit ranges; Pod Security Admission (PSA) levels; NetworkPolicies baseline
- Team quotas; burst approvals via PR with owner sign‑off
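A minimal sketch of a tenant namespace baseline using core Kubernetes resources. The namespace name and every number are assumptions; a burst request would be a PR that edits the `ResourceQuota` values.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted   # Pod Security Admission level
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # applied when a container omits requests
      default: { cpu: 500m, memory: 512Mi }          # applied when a container omits limits
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```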
Appendix AQ — Helm Value Schemas
# values.schema.json (excerpt)
{
"$schema": "http://json-schema.org/draft-07/schema#",
"properties": { "replicaCount": { "type": "integer", "minimum": 1 } }
}
Appendix AR — Incident Templates
- Summary, timeline, impact, remediation, follow‑ups, artifacts links
- Label linked PRs and Rollouts; attach charts
Appendix AS — Secrets Rotation Policy
- Rotate app secrets quarterly; roll keys with overlap; monitor error spikes
- Emergency rotation runbook with evidence requirements
Appendix AT — Admission Control Packs
- Enforce: limits, probes, non‑root, readOnlyRootFS, drop CAPs
- Ban: hostNetwork, hostPID, :latest, privilege escalation
Appendix AU — Canary Analysis Templates Library
- error‑rate, p95 latency, saturation, SLO burn‑rate (4/1h), business KPIs
Appendix AV — Rollback Playbooks
- Git revert; controller sync; validate metrics; communicate; evidence bundle
Appendix AW — Tenant Self‑Service
- Templates to add services; quota requests; policy exceptions
- Guard rails: bots reject unsafe merges automatically
Appendix AX — Sandbox and Chaos
- Sandbox cluster for policy testing; chaos experiments gated by labels
Appendix AY — Argo Notifications and Webhooks
- Slack, PagerDuty; templated messages with app, commit, author
Appendix AZ — Flux Alerts
# Provider + Alerts mapping to Slack/PagerDuty
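A sketch of the Provider and Alert pair for Slack using Flux's notification API. The channel, object names, and the `slack-webhook` Secret are assumptions; the Secret is expected to carry the webhook URL under an `address` key.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: gitops-alerts        # assumed channel
  secretRef:
    name: slack-webhook         # Secret with an `address` key (webhook URL)
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: reconcile-errors
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error          # only forward failures
  eventSources:
    - kind: Kustomization
      name: "*"
    - kind: HelmRelease
      name: "*"
```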
Appendix BA — Repo Hygiene
- Lint manifests; detect drift; forbid large diffs; auto‑format kustomize
Appendix BB — Performance Tuning
- Controller concurrency; cache sizes; reconcile backoff; event coalescing
Appendix BC — Blue/Green Patterns
- Two services and selectors; instant switch; PostSwitch smoke; scale down old
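One way to express this pattern is the Argo Rollouts `blueGreen` strategy, sketched below. The Service names, the `web-smoke` AnalysisTemplate, the image tag, and the delay are assumptions; both Services must already exist and select `app: web`.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
  namespace: apps
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: ghcr.io/org/web:1.10.3   # hypothetical tag
  strategy:
    blueGreen:
      activeService: web-active           # receives live traffic
      previewService: web-preview         # points at the new version before the switch
      autoPromotionEnabled: false         # require an explicit promote
      postPromotionAnalysis:
        templates:
          - templateName: web-smoke       # smoke checks after the switch
      scaleDownDelaySeconds: 300          # keep the old ReplicaSet briefly for fast rollback
```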
Appendix BD — Traffic Shadowing
- Mirror live traffic to new version; compare metrics; no user impact
Appendix BE — GitOps for Data Jobs
- Schedules and DAGs declared; promotion gated by data quality checks
Appendix BF — Governance Dashboards
- App counts, out‑of‑sync %, failed reconciles, mean time to converge
Policy Pack Examples (Kyverno)
# Disallow :latest images
apiVersion: kyverno.io/v1
kind: ClusterPolicy
spec:
  validationFailureAction: Enforce
  rules:
    - name: disallow-latest
      # Match Pods only; Kyverno auto-generates equivalent rules for Deployments, StatefulSets, and other pod controllers
      match: { resources: { kinds: ["Pod"] } }
      validate:
        message: 'no :latest tags'
        pattern:
          spec:
            containers:
              - image: "!*:latest"
Operations: Common Errors and Fixes
- Permission denied pulling repo → wrong deploy key or host key mismatch
- Health check unknown for CRD → add custom health script
- Prune deleting PVCs unexpectedly → exclude kinds or set retention policies
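For the pruning case, both controllers support a per-object opt-out annotation. The sketch below puts both on one PVC so the same manifest is protected under either tool; the claim name and size are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data
  namespace: apps
  annotations:
    argocd.argoproj.io/sync-options: Prune=false     # Argo CD: never prune this object
    kustomize.toolkit.fluxcd.io/prune: disabled      # Flux: exclude from garbage collection
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
```

With these in place, removing the PVC from Git leaves it in the cluster until someone deletes it deliberately.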
Examples Library
# Ingress with canary annotation (NGINX)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
Mega FAQ (2001–2400)
Q: Why did Argo show Synced but service not updated?
A: Check rollout strategy and pause; health may be progressing.
Q: Flux changed file but pods didn’t restart?
A: ConfigMap changes alone do not restart pods; use a configMapGenerator name hash or bump a checksum annotation on the pod template to trigger a rollout.
Q: Is app‑of‑apps safe for prod?
A: Yes with scoped Projects and careful RBAC; test changes in stage first.
Q: How to pin digests?
A: Use image policy to select tag, then resolve to digest in overlays.
Q: Should bots auto‑merge?
A: For dev only; stage/prod require approvals and evidence.
Q: Git history rewrite broke sync?
A: Re‑add repo at new commit; consider signed tags for stability.
Q: Canary noisy metrics?
A: Increase lookback, add smoothing, combine multiple metrics.
Q: Block ingress without TLS?
A: Policy pack + admission controller; CI lint.
Q: Multi‑cluster secrets?
A: Per‑cluster stores; no cross‑region secret reuse; rotate.
Final: declare intent, verify via metrics, converge automatically.
Appendix BG — Platform Upgrade Playbook
- Roll platform baseline version; canary one cluster; monitor GitOps KPIs
- Tag release; promote region by region; rollback via tag revert
Appendix BH — Evidence Automation
- CI collects dashboards, Rollouts/Flagger results, PR metadata
- Bundle to artifact store; link in PR; immutable retention
Appendix BI — Git Hygiene and Linting
- kustomize build validation; helm template dry-runs; policy test suite
Appendix BJ — Preview Cost Controls
- TTL for namespaces; budge alerts; auto teardown on PR close
Appendix BK — Emergency Break Glass
- Temporarily suspend auto-sync; manual patch allowed; audit and revert to Git
Appendix BL — Multi-Cloud Adapters
- Abstract ingress/DNS/storage differences via overlays; test parity
Appendix BM — Observability Queries (More)
# Drift rate per hour
sum(increase(argocd_app_sync_total{status="OutOfSync"}[1h]))
# Canary success ratio
sum(increase(rollouts_step_success_total[15m])) / sum(increase(rollouts_step_total[15m]))
Appendix BN — DR Drills Scheduler
- Quarterly region failover; evidence bundle auto-generated; RTO/RPO tracked
Appendix BO — Policy Test Suite
- Unit tests for OPA/Kyverno rules; golden fixtures; CI gates
Appendix BP — Controller Sizing
- Scale replicas; set resource requests; anti-affinity; HPA on queue length
Appendix BQ — Repo Mirrors and Failover
- Read-only mirrors; fallback in controllers; signed tags for consistency
Appendix BR — Incident Labels and Searchability
- Label PRs and commits with incident IDs; dashboards filter by incident
Appendix BS — Canary Business Metrics
- Convert rate, error funnel, add-to-cart; gate promotion by business KPIs too
Appendix BT — Data Residency
- Separate overlays per region; ensure secrets and endpoints regional
Appendix BU — Compliance Evidence Map
- Map every control to GitOps evidence: PRs, policies, metrics, runbooks
Appendix BV — Secrets Scanning
- Secret scanners on PR; block patterns; allowlist with expiry
Appendix BW — Rollout UX
- Status page integration; customer messaging during canary or maintenance
Appendix BX — SLOs and Error Budgets
- Sync error SLO; rollout failure SLO; burn alerts; freeze policy
Appendix BY — Training and Onboarding
- Hands-on labs: create app, add policy, run canary; shadow on-call
Appendix BZ — Archival Strategy
- Archive stale apps; retain evidence; cleanup secrets and DNS
Appendix CA — Templates: App + Policy + Secrets
# Composite templates for new services with policies and secrets wired
Appendix CB — Monorepo Tooling
- Affected graph; precomputed diffs; preview environments via bots
Appendix CC — Rollback Automation
- Bot opens revert PR with attached evidence; requires approval
Appendix CD — Network Policies Library
# Default deny; allow namespace and required egress (DNS, metrics, tracing)
Appendix CE — Ingress Library
# Common annotations; TLS; canary and shadow patterns
Appendix CF — Storage Classes and PVC Policies
# Templates for RWX/RWO; retention and snapshot policies
Appendix CG — Backup/Restore
- Velero templates; schedule and retention; restore rehearsals
Appendix CH — Autoscaling Policies
# HPA templates; PDBs; disruption budgets coordinated with rollouts
Appendix CI — Thundering Herd Protection
- Stagger sync intervals; random jitter; limit concurrent rollouts
Appendix CJ — Canary Trace Integration
- Trace new version spans; compare latencies and error tags vs baseline
Appendix CK — Secrets Redaction Tests
- Ensure logs, events, and notifications redact secrets
Appendix CL — SLA Dashboards
- Time to converge; success rates; policy violations; secrets rotation status
Mega FAQ (2401–3000)
Q: How do we prevent config drift caused by humans?
A: Lock down kubectl; grant read-only access; make Git the only change path.
Q: Can we sync from multiple repos?
A: Yes; allowlist source repos per Project and manage credentials per source.
Q: How should we handle CRD upgrades?
A: Stage the rollout; validate CR status; back up CRs; keep CRDs in wave 0 only.
Q: Is digest pinning mandatory?
A: Strongly recommended in prod; policy can enforce it.
Q: Can we promote without a canary?
A: Allowed for low-risk changes by policy; document the rationale.
Final: ship with confidence—declare, verify, and converge.
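For the multi-repo question above, an Argo CD AppProject is where the per-Project source list lives. The sketch below uses hypothetical repository URLs and a hypothetical `team-a` tenant; credentials for each repository are registered separately.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-a
  namespace: argocd
spec:
  description: Team A applications
  # Only these repositories may be used as sources by Applications in this Project.
  sourceRepos:
    - https://git.example.com/team-a/apps.git
    - https://git.example.com/platform/shared-charts.git
  # Applications may only deploy into matching namespaces on the local cluster.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: team-a-*
```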
Appendix DA — Release Trains
- Batch changes into trains; predictable schedules; reduce coordination overhead
Appendix DB — Feature Flags and GitOps
- Flags as config; promote flag states through envs; audit in Git
Appendix DC — Secret Leasing
- Short-lived tokens; auto-rotation; failure fallbacks
Appendix DD — Evidence Indexing
- Index evidence by change ID; search across PRs, metrics, incidents
Appendix DE — Canary on Business KPIs
- Gate promotions on conversion or error funnel, not just tech metrics
Appendix DF — Rollout UX and Communication
- Status banners; SSO prompts for re-login; maintenance-window notices
Appendix DG — Repo Mirrors Security
- Sign mirrors; alert on divergence; periodic consistency checks
Appendix DH — Offline Clusters
- Pull via mirror artifact bundles; sync from internal registries
Appendix DI — Dry-Run Validation Gates
- kustomize build + kubeconform; helm template + kubeconform; policy checks
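These gates could be wired into a pull-request check along the following lines. The sketch assumes GitHub Actions as the CI system, that `kustomize`, `helm`, and `kubeconform` are already on the runner's PATH, and a hypothetical `charts/` directory; policy checks would be added as a further step with whatever policy tool you use.

```yaml
name: validate-manifests
on:
  pull_request:
    paths:
      - "apps/**"
      - "clusters/**"
      - "charts/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Render and validate Kustomize overlays
        run: |
          set -euo pipefail
          for overlay in apps/*/overlays/*/; do
            echo "Validating ${overlay}"
            kustomize build "${overlay}" \
              | kubeconform -strict -summary -ignore-missing-schemas
          done
      - name: Render and validate Helm charts
        run: |
          set -euo pipefail
          for chart in charts/*/; do
            echo "Validating ${chart}"
            helm template release-check "${chart}" \
              | kubeconform -strict -summary -ignore-missing-schemas
          done
```

`-ignore-missing-schemas` keeps the check from failing on custom resources without a published schema; dropping it is stricter but requires supplying CRD schemas.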
Appendix DJ — SRE Onboarding
- Labs: break/fix GitOps; policy authoring; rollout troubleshooting
Appendix DK — CAB Workflow
- Proposals, risk classification, test plan, rollback, approvals, evidence
Appendix DL — Quarterly Audits
- Review policies, SLOs, DR drills, secrets rotation, platform versions
Additional Templates
# Canary ingress template, HPA template, PDB template, NetworkPolicy template
Mega FAQ (3001–3400)
Q: How do we minimize noisy diffs?
A: Sort keys; use stable generators; avoid timestamps.
Q: Can Git bots break audits?
A: Not if their commits are signed and attributed; include context in commit messages.
Q: How do we prevent runaway reconciles?
A: Use backoff and jitter; limit concurrency; dedupe events.
Q: Who owns policy packs?
A: The platform team, with security sign-off; app teams propose exceptions via PRs.
Final: operational discipline + strong automation = safe velocity.
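For the runaway-reconcile question above, Argo CD lets each Application bound its sync retries with exponential backoff. The repository URL, project, path, and namespace below are hypothetical, and the retry values are examples.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-a-prod
  namespace: argocd
spec:
  project: team-a
  source:
    repoURL: https://git.example.com/team-a/apps.git
    targetRevision: main
    path: apps/service-a/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a-service-a
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5                      # stop retrying after five failed attempts
      backoff:
        duration: 10s               # first retry delay
        factor: 2                   # double the delay on each attempt
        maxDuration: 5m             # never wait longer than this between attempts
```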
Appendix DM — Artifact Registries
- HA registries; retention; garbage collection; provenance checks
Appendix DN — Preview Secrets
- Ephemeral secrets from vault with TTL; revoke on PR close
Appendix DO — Template Catalog
- Reusable service templates; policy-included; well-documented inputs
Appendix DP — Change Budgets
- Limit changes/week per service; consolidate small flips; reduce noise
Appendix DQ — SLO Burn Policy
- Freeze risky changes when the error-budget burn rate exceeds 2x; de-risk via canary or extra tests
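The 2x threshold can be turned into an alert that drives the freeze. The sketch assumes the Prometheus Operator (`PrometheusRule` CRD), a 99.9% availability SLO, and a hypothetical `http_requests_total` metric with `job` and `code` labels. Burn rate here is the observed error ratio divided by the error budget (1 - SLO).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-a-slo-burn
  namespace: monitoring
spec:
  groups:
    - name: slo-burn
      rules:
        - alert: ErrorBudgetBurnAbove2x
          expr: |
            (
              sum(rate(http_requests_total{job="service-a", code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{job="service-a"}[1h]))
            ) / (1 - 0.999) > 2
          for: 15m
          labels:
            severity: page
            action: freeze-risky-changes   # hypothetical label for freeze automation
          annotations:
            summary: "service-a is burning error budget faster than 2x; freeze risky changes"
```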
Mega FAQ (3401–3600)
Q: Should we allow manual rollouts?
A: Only under break-glass procedures with an audit trail; reconcile the change back into Git quickly.
Q: How do we catch policy regressions?
A: Policy unit tests plus golden fixtures in CI.
Q: Image updates churn too fast and create PR noise. What helps?
A: Batch bumps; pin digests; apply frequency limits.
Q: What if the canary is unstable under low traffic?
A: Use longer windows, aggregate metrics, and schedule during higher load.
Final: design for safety, measure relentlessly, and automate recovery.
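Digest pinning, mentioned in the FAQ above, can be enforced by admission policy. The sketch below assumes Kyverno as the policy engine and hypothetical `prod-*` namespaces. Newer Kyverno releases may prefer a rule-level failure action over the spec-level field shown here, so check the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce  # reject non-compliant pods instead of only auditing
  background: true
  rules:
    - name: check-digest
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - "prod-*"
      validate:
        message: "Images must be pinned by digest (image@sha256:...)."
        pattern:
          spec:
            containers:
              - image: "*@*"        # every container image must contain a digest reference
```

This rule covers regular containers only; init and ephemeral containers would need matching patterns of their own.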
Closing Notes
GitOps scales when declarations are simple, policies are enforced, and observability makes drift and rollout health obvious. Treat repos, controllers, and policies as your production control plane.