Environment Parity and Configuration Drift Detection

The Failure

The inventory service worked in staging but crashed in production with OOM kills. The staging overlay set resources.limits.memory: 2Gi. The production overlay set resources.limits.memory: 512Mi. Someone had reduced production memory limits six months ago to save costs and never updated staging to match. The team tested against 2Gi in staging and deployed into 512Mi in production.

Environment parity means that the things that matter are the same. Resource limits matter. Base images matter. Environment variable names matter. The payment processor URL is allowed to differ. A memory limit that silently diverges between staging and production is not.

The Mechanism

Parity Rules

Category	Must Be Identical	Allowed to Differ
Base image	Yes	No
Resource requests	Ratio to limits must match	Absolute values scale with environment
Resource limits	Yes (same ratios)	Absolute values scale with environment
Environment variables	Names must match	Values differ (URLs, endpoints)
Replica count	No	Yes (1 in dev, 2 in staging, 3+ in prod)
Ingress rules	Structure must match	Hostnames differ
Health check config	Yes	No
Service mesh config	Yes	No

Kustomize Overlay Strategy

The base contains everything that is identical. Overlays contain only the differences. If an overlay is adding resources that do not exist in the base, something is wrong.
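
The "overlay adds resources" rule can be checked mechanically. The following is a minimal sketch, not a definitive implementation. It assumes you have already written one Kind/name identifier per line into two text files, one for the rendered base and one for the rendered overlay. That file format and the function name overlay_only_resources are hypothetical and do not come from Kustomize itself.

```shell
#!/bin/bash
# Sketch: list resources that an overlay adds on top of the base.
# Inputs are plain-text files with one "Kind/name" identifier per line
# (a hypothetical format; extract it from your rendered manifests).

# Print identifiers present in the overlay list but missing from the base list.
overlay_only_resources() {
  local base_sorted overlay_sorted
  base_sorted=$(mktemp)
  overlay_sorted=$(mktemp)
  # comm requires sorted input; -u also removes duplicate identifiers.
  sort -u "$1" > "$base_sorted"
  sort -u "$2" > "$overlay_sorted"
  # -13 suppresses lines unique to the base and lines common to both,
  # leaving only the lines unique to the overlay.
  comm -13 "$base_sorted" "$overlay_sorted"
  rm -f "$base_sorted" "$overlay_sorted"
}
```

An empty result means the overlay only modifies what the base defines. A non-empty result is a prompt to ask whether the extra resource belongs in the base.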

# apps/checkout-service/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
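          # NOTE: pin an immutable tag or digest per release; :latest can resolve
          # to different images per environment and defeats base-image parity.
          image: ghcr.io/acme/checkout-service:latest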
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20

# apps/checkout-service/overlays/production/patch-replicas.yaml
# HARDENED: Only replica count and resource scaling differ
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: checkout
          resources:
            requests:
              cpu: 200m  # keeps the base's 1:5 CPU request-to-limit ratio
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi

The Implementation

Drift Detection with ArgoCD

ArgoCD compares the desired state (Git) with the live state (cluster). When they diverge, ArgoCD reports the diff. With selfHeal: true, ArgoCD reverts manual changes automatically.

# HARDENED: ArgoCD app with drift detection and self-heal
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service-production
  namespace: argocd
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ci-alerts
    notifications.argoproj.io/subscribe.on-health-degraded.slack: ci-alerts
spec:
  project: default  # required field; "default" is Argo CD's built-in project
  source:
    repoURL: https://github.com/acme/ecommerce-infra.git
    targetRevision: main
    path: apps/checkout-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Parity Validation Script

Run this in CI when the infra repo changes to catch parity violations:

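#!/bin/bash
# HARDENED: Validate environment parity
# Assumes kustomize and yq v4 (mikefarah) are on PATH.
set -euo pipefail

SERVICES=("checkout-service" "catalog-service" "inventory-service" "payments-service")
ENVS=("dev" "staging" "production")

WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT
FAILED=0

# Hash one property of the rendered Deployment in every environment and
# record a failure if any environment differs from the first one.
# $4 is a normalizer: "cat" keeps output as-is, "sort" ignores ordering.
check_parity() {
  local service=$1 label=$2 expr=$3 normalize=$4 reference="" hash env
  for env in "${ENVS[@]}"; do
    hash=$(yq "select(.kind == \"Deployment\") | $expr" "$WORKDIR/${service}-${env}.yaml" \
      | "$normalize" | md5sum | cut -d' ' -f1)
    echo "  $env $label hash: $hash"
    if [[ -z "$reference" ]]; then
      reference=$hash
    elif [[ "$hash" != "$reference" ]]; then
      echo "  ERROR: $service $label differ between ${ENVS[0]} and $env" >&2
      FAILED=1
    fi
  done
}

for service in "${SERVICES[@]}"; do
  echo "=== Checking $service ==="

  # Build each overlay so the checks run against fully rendered manifests
  for env in "${ENVS[@]}"; do
    kustomize build "apps/$service/overlays/$env" > "$WORKDIR/${service}-${env}.yaml"
  done

  # Health checks must be identical across environments
  check_parity "$service" "probes" \
    '.spec.template.spec.containers[0] | (.readinessProbe, .livenessProbe)' cat

  # Environment variable NAMES must be identical (values may differ)
  check_parity "$service" "env var names" \
    '(.spec.template.spec.containers[0].env // []) | .[].name' sort
done

# A non-zero exit status blocks the PR
exit "$FAILED"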

Audit Trail for Manual Changes

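When ArgoCD self-heals a drift, log it. The template below only formats the message; it still needs a matching trigger and a subscription on the Application before any notification is sent: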

# ArgoCD notification template for drift detection
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  template.drift-detected: |
    message: |
      ⚠️ Configuration drift detected and reverted
      Application: {{.app.metadata.name}}
      Namespace: {{.app.spec.destination.namespace}}
      Time: {{.app.status.operationState.finishedAt}}
      Sync Result: {{.app.status.operationState.phase}}

The Gate

The parity validation script runs as a CI check on every PR to the infra repo. If health check definitions differ between environments, the PR is blocked. If environment variable names differ, the PR is blocked.

Resource limit ratios are checked but not strictly enforced, because production may legitimately need more resources than dev. The check is advisory: it warns when the ratio between environments is greater than 4x (e.g., dev has 256Mi but production has 2Gi), which suggests someone forgot to update an environment.
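
The 4x warning can be sketched as a pair of shell functions. This is an illustration under stated assumptions, not the script the team runs: the function names to_kib and check_memory_ratio and the MAX_RATIO variable are hypothetical, and only whole-number quantities with the Ki, Mi, and Gi suffixes are handled (a value such as 1.5Gi is rejected by the shell's integer arithmetic).

```shell
#!/bin/bash
# Sketch: advisory check for memory limits that diverge too far between environments.

MAX_RATIO=4

# Convert a Kubernetes memory quantity with a binary suffix to KiB.
# Only Ki, Mi and Gi are supported in this sketch; anything else returns 1.
to_kib() {
  case "$1" in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *) echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print a WARNING line when the larger limit is more than MAX_RATIO times
# the smaller one, and an OK line otherwise. Argument order does not matter.
check_memory_ratio() {
  local small large swap
  small=$(to_kib "$1") || return 2
  large=$(to_kib "$2") || return 2
  if [ "$small" -gt "$large" ]; then
    swap=$small; small=$large; large=$swap
  fi
  if [ "$large" -gt $(( small * MAX_RATIO )) ]; then
    echo "WARNING: memory limits differ by more than ${MAX_RATIO}x ($1 vs $2)"
  else
    echo "OK: memory limits within ${MAX_RATIO}x ($1 vs $2)"
  fi
}
```

For example, check_memory_ratio 256Mi 2Gi prints a WARNING line because 2Gi is eight times 256Mi, while check_memory_ratio 512Mi 2Gi prints an OK line because the ratio is exactly 4x and the rule only fires above it.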

The Recovery

ArgoCD self-heals a legitimate manual change: Someone scaled up production replicas during an incident, and ArgoCD reverted it. The correct response is to make the change in Git (update the overlay), push, and let ArgoCD sync. For emergency scaling, temporarily disable auto-sync on the application, make the manual change, then update Git and re-enable auto-sync.
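
The emergency-scaling sequence above might look like the following sketch. It assumes the argocd and kubectl CLIs are installed and authenticated. The helper names run, emergency_scale, and restore_auto_sync are hypothetical, and commands are only printed unless DRY_RUN=0 is set, so the sequence can be reviewed before it touches a cluster.

```shell
#!/bin/bash
# Sketch: pause auto-sync, scale manually, and restore auto-sync afterwards.
# Commands are printed rather than executed unless DRY_RUN=0 is exported.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: emergency_scale <argocd-app> <namespace> <deployment> <replicas>
emergency_scale() {
  local app="$1" namespace="$2" deployment="$3" replicas="$4"
  # 1. Pause automated sync so self-heal does not revert the manual change.
  run argocd app set "$app" --sync-policy none
  # 2. Apply the manual change directly to the cluster.
  run kubectl -n "$namespace" scale "deployment/$deployment" --replicas="$replicas"
  echo "Next: commit replicas: $replicas to the overlay in Git, then run restore_auto_sync $app"
}

# Usage: restore_auto_sync <argocd-app>
# Re-enables automated sync with pruning and self-heal, matching the
# syncPolicy of the Application manifest.
restore_auto_sync() {
  run argocd app set "$1" --sync-policy automated --self-heal --auto-prune
}
```

Calling emergency_scale checkout-service-production production checkout-service 6 in dry-run mode prints the two commands it would run. Re-enabling auto-sync before the overlay is updated in Git would cause ArgoCD to revert the replica count again, which is why the Git commit comes first.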

Overlay grows too large: If a production overlay has more than 10 patches, the base is too generic. Move production-specific resources into the base and use dev/staging overlays to scale down instead.

New environment variable added to one environment but not others: The parity script catches this. Add the variable to the base and override values per environment using Kustomize configMapGenerator or secretGenerator.