Environment Parity and Configuration Drift Detection
The Failure
The inventory service worked in staging but crashed in production with OOM kills. The staging overlay set resources.limits.memory: 2Gi. The production overlay set resources.limits.memory: 512Mi. Someone had reduced production memory limits six months ago to save costs and never updated staging to match. The team tested against 2Gi in staging and deployed into 512Mi in production.
Environment parity means that the things that matter are the same. Resource limits matter. Base images matter. Environment variable names matter. The payment processor URL is allowed to differ. The memory limit is not, unless the overlay scales it deliberately and in proportion.
The Mechanism
Parity Rules
| Category | Must Be Identical | Allowed to Differ |
|---|---|---|
| Base image | Yes | No |
| Resource requests | Proportions only | Absolute values scale with environment |
| Resource limits | Proportions only | Absolute values scale with environment |
| Environment variables | Names must match | Values differ (URLs, endpoints) |
| Replica count | No | Yes (1 in dev, 2 in staging, 3+ in prod) |
| Ingress rules | Structure must match | Hostnames differ |
| Health check config | Yes | No |
| Service mesh config | Yes | No |
Kustomize Overlay Strategy
The base contains everything that is identical. Overlays contain only the differences. If an overlay is adding resources that do not exist in the base, something is wrong.
# apps/checkout-service/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          # Pin a version tag or digest in real use; :latest can resolve to
          # different images in different environments
          image: ghcr.io/acme/checkout-service:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
# apps/checkout-service/overlays/production/patch-replicas.yaml
# HARDENED: Only replica count and resource scaling differ
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: checkout
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
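The rule that overlays only patch what the base defines can be checked mechanically. The sketch below is one possible approach, not part of the original tooling. It assumes mikefarah yq v4 and the apps/<service>/base and apps/<service>/overlays/<env> layout used in this section; the function names resource_ids, overlay_only, and check_overlay are hypothetical.

```bash
#!/bin/bash
# Sketch only: flag resources an overlay renders that the base does not define.
# Assumes mikefarah yq v4 and the directory layout used in this section.

# Read rendered manifests on stdin, print one sorted "Kind/name" per resource.
resource_ids() {
  yq '.kind + "/" + .metadata.name' - | sed '/^---$/d' | sort -u
}

# Print IDs present in the overlay list ($2) but absent from the base list ($1).
# Both files must be sorted, as produced by resource_ids.
overlay_only() {
  comm -13 "$1" "$2"
}

# Hypothetical entry point: check_overlay <service> <env>
check_overlay() {
  local service env tmp extra
  service=$1
  env=$2
  tmp=$(mktemp -d)
  kustomize build "apps/$service/base" | resource_ids > "$tmp/base"
  kustomize build "apps/$service/overlays/$env" | resource_ids > "$tmp/overlay"
  extra=$(overlay_only "$tmp/base" "$tmp/overlay")
  rm -rf "$tmp"
  if [ -n "$extra" ]; then
    echo "WARNING: $env overlay adds resources missing from base:" >&2
    echo "$extra" >&2
    return 1
  fi
}
```

Calling check_overlay checkout-service production would list any extra resources on stderr and return non-zero. Overlays that use namePrefix, nameSuffix, or generators with hash suffixes rename resources, so the comparison would need adjusting for them.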
The Implementation
Drift Detection with ArgoCD
ArgoCD compares the desired state (Git) with the live state (cluster). When they diverge, ArgoCD reports the diff. With selfHeal: true, ArgoCD reverts manual changes automatically.
# HARDENED: ArgoCD app with drift detection and self-heal
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service-production
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ci-alerts
    notifications.argoproj.io/subscribe.on-health-degraded.slack: ci-alerts
spec:
  # project is a required field; "default" is an assumption, use your AppProject
  project: default
  source:
    repoURL: https://github.com/acme/ecommerce-infra.git
    targetRevision: main
    path: apps/checkout-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
Parity Validation Script
Run this in CI when the infra repo changes to catch parity violations:
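#!/bin/bash
# HARDENED: Validate environment parity
# Assumes kustomize and mikefarah yq v4 are on PATH
set -euo pipefail

SERVICES=("checkout-service" "catalog-service" "inventory-service" "payments-service")
ENVS=("dev" "staging" "production")
# kustomize build emits every resource; only the Deployment carries probes and env vars
CONTAINER='select(.kind == "Deployment") | .spec.template.spec.containers[0]'
WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT
failed=0

for service in "${SERVICES[@]}"; do
  echo "=== Checking $service ==="
  probes_ref=""
  names_ref=""

  for env in "${ENVS[@]}"; do
    # Build each overlay so the comparison runs against rendered manifests
    rendered="$WORKDIR/${service}-${env}.yaml"
    kustomize build "apps/$service/overlays/$env" > "$rendered"

    # Health checks must be identical across environments
    probes=$(yq "$CONTAINER | (.readinessProbe, .livenessProbe)" "$rendered" \
      | md5sum | cut -d' ' -f1)
    # Environment variable NAMES must be identical; values may differ
    names=$({ yq "$CONTAINER | .env[].name" "$rendered" 2>/dev/null || true; } \
      | sort | md5sum | cut -d' ' -f1)
    echo "  $env probes hash: $probes"
    echo "  $env env var names hash: $names"

    # The first environment sets the reference; any later mismatch is a violation
    : "${probes_ref:=$probes}" "${names_ref:=$names}"
    if [[ "$probes" != "$probes_ref" || "$names" != "$names_ref" ]]; then
      echo "  ERROR: $service in $env differs from ${ENVS[0]}" >&2
      failed=1
    fi
  done
done

# A non-zero exit status is what blocks the PR in CI
exit "$failed"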
Audit Trail for Manual Changes
When ArgoCD self-heals a drift, log it:
# ArgoCD notification template for drift detection
# A trigger.* entry and a matching subscribe annotation on the Application
# must reference drift-detected before this template is sent
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  template.drift-detected: |
    message: |
      ⚠️ Configuration drift detected and reverted
      Application: {{.app.metadata.name}}
      Namespace: {{.app.spec.destination.namespace}}
      Time: {{.app.status.operationState.finishedAt}}
      Sync Result: {{.app.status.operationState.phase}}
The Gate
The parity validation script runs as a CI check on every PR to the infra repo. If health check definitions differ between environments, the PR is blocked. If environment variable names differ, the PR is blocked.
Resource limit ratios are checked but not strictly enforced, because production may legitimately need more resources than dev. The script warns if the ratio between environments is greater than 4x (e.g., dev has 256Mi but production has 2Gi), which suggests that someone forgot to update an environment.
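The script listed earlier does not show the ratio warning, so the following is a minimal sketch of how such a check might look. It handles only integer quantities with Ki, Mi, or Gi suffixes, and the function names to_kib and ratio_exceeds are hypothetical.

```bash
#!/bin/bash
# Sketch only: warn when a memory limit differs by more than 4x between environments.

# Convert a Kubernetes memory quantity with a Ki/Mi/Gi suffix to KiB.
# Other forms (M, G, plain bytes, decimals such as 1.5Gi) are not handled here.
to_kib() {
  case "$1" in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *) echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Succeed when the larger quantity is more than max_ratio (default 4) times the smaller.
# Usage: ratio_exceeds <quantity-a> <quantity-b> [max_ratio]
ratio_exceeds() {
  local a b max lo hi
  a=$(to_kib "$1") || return 2
  b=$(to_kib "$2") || return 2
  max=${3:-4}
  if [ "$a" -lt "$b" ]; then lo=$a; hi=$b; else lo=$b; hi=$a; fi
  [ "$hi" -gt $(( lo * max )) ]
}

# Example: the dev and production memory limits from the text above
if ratio_exceeds 256Mi 2Gi; then
  echo "WARNING: memory limits differ by more than 4x between environments"
fi
```

Because the check only warns, the caller would print the message and carry on rather than fail the build. A ratio of exactly 4x does not trigger the warning, which matches the "greater than 4x" wording.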
The Recovery
ArgoCD self-heals a legitimate manual change: Someone scaled up production replicas during an incident. ArgoCD reverted it. The correct response: make the change in Git (update the overlay), push, and let ArgoCD sync. For emergency scaling, use ArgoCD’s “disable auto-sync” temporarily, make the manual change, then update Git and re-enable auto-sync.
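The emergency-scaling sequence can be sketched as three commands. This is an illustration under assumptions, not a verified runbook: the application, namespace, and deployment names are taken from the manifests in this section, and the helper names run, pause_autosync, emergency_scale, and resume_autosync are hypothetical. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

```bash
#!/bin/bash
# Sketch only: pause auto-sync, scale by hand, then restore auto-sync.
# Prints the commands unless DRY_RUN=0 is set.

APP="${APP:-checkout-service-production}"
NAMESPACE="${NAMESPACE:-production}"
DEPLOYMENT="${DEPLOYMENT:-checkout-service}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Stop Argo CD from reverting the manual change.
pause_autosync() {
  run argocd app set "$APP" --sync-policy none
}

# 2. Scale by hand during the incident. Usage: emergency_scale <replicas>
emergency_scale() {
  run kubectl -n "$NAMESPACE" scale "deployment/$DEPLOYMENT" --replicas="$1"
}

# 3. Run only after the new replica count is committed to the production overlay.
resume_autosync() {
  run argocd app set "$APP" --sync-policy automated --self-heal --auto-prune
}
```

The order matters: if auto-sync is re-enabled before Git reflects the new replica count, self-heal reverts the change again.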
Overlay grows too large: If a production overlay has more than 10 patches, the base is too generic. Move production-specific resources into the base and use dev/staging overlays to scale down instead.
New environment variable added to one environment but not others: The parity script catches this. Add the variable to the base and override values per environment using Kustomize configMapGenerator or secretGenerator.
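A hash mismatch shows that the names differ but not which variable is missing. A small helper along these lines could name it. This is a sketch: missing_env_names is a hypothetical function, and each input file is assumed to hold one variable name per line, such as the output of the yq query in the parity script.

```bash
#!/bin/bash
# Sketch only: list env var names present in a reference environment
# but missing from another.

# Usage: missing_env_names <reference-names-file> <other-names-file>
missing_env_names() {
  local ref_sorted other_sorted
  ref_sorted=$(mktemp)
  other_sorted=$(mktemp)
  sort -u "$1" > "$ref_sorted"
  sort -u "$2" > "$other_sorted"
  # comm -23 keeps lines that appear only in the first (reference) file
  comm -23 "$ref_sorted" "$other_sorted"
  rm -f "$ref_sorted" "$other_sorted"
}
```

Running it in both directions covers variables added to either side, which makes the CI failure message actionable before the variable is moved into the base.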