ship it and sleep

The Four Pipeline Properties: Reproducibility, Observability, Gateability, Recoverability

Chapter 2 of 66

Reproducibility: Same Commit, Same Artifact

The Failure

Tuesday morning. The checkout service passed all tests on Monday, was deployed to staging, and the QA team signed off. On Tuesday, the team promotes the same commit to production. The pipeline rebuilds the image. This time, a transitive dependency resolves to a compromised version published 14 hours ago. The production image contains code that was not in the staging image, despite building from the same commit SHA.

This is not hypothetical. The event-stream incident in 2018, the ua-parser-js compromise in 2021, and the colors.js sabotage in 2022 all exploited the gap between “same source code” and “same artifact.” A reproducible pipeline closes that gap.

The Mechanism

Reproducibility means that a given commit SHA produces a byte-identical artifact regardless of when or where the build runs. Three things break this:

Unpinned dependencies. A package.json range such as "lodash": "^4.17.21" accepts any 4.x release at or above 4.17.21, so without an enforced lock file it resolves to whichever matching version is newest at build time. Pin to exact versions, commit the lock file, and install with npm ci so that direct and transitive versions stay fixed.

Mutable base images. FROM node:20 resolves to a different image every time the tag is republished. Pin to the digest: FROM node:20@sha256:abc123....

Build-time side effects. Fetching a configuration file from S3 during the build, downloading a binary from a URL without checksum verification, or using $(date) in a label. Each introduces variation.
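The base-image item can be checked mechanically. As a minimal sketch, the shell function below lists the FROM lines in a Dockerfile that carry no @sha256: digest. The name unpinned_from_lines is hypothetical, and the check is deliberately naive: it would also flag FROM scratch and references to earlier build stages.

```shell
#!/bin/sh
# List FROM lines that are not pinned to a digest.
# Hypothetical helper; naive on purpose: it also flags "FROM scratch"
# and "FROM <earlier-stage>" lines, which have no digest to pin.
unpinned_from_lines() {
  # grep exits 1 when nothing matches, so "|| true" keeps the function
  # from reporting failure when every FROM line is pinned.
  grep -iE '^[[:space:]]*FROM[[:space:]]' "$1" | grep -v '@sha256:' || true
}
```

In a pipeline step, a test such as [ -z "$(unpinned_from_lines Dockerfile)" ] would fail the job as soon as a mutable tag is introduced.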

The Implementation

# FRAGILE: Unpinned dependencies and mutable base image
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install
      - run: docker build -t acme/checkout-service:latest .

# FRAGILE: Mutable base image, no lock file enforcement
FROM node:20-slim
COPY package.json .
RUN npm install
COPY . .
CMD ["node", "server.js"]
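# HARDENED: Pinned dependencies, digest-based base image, lock file verification
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write # Required to push to ghcr.io with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v4

      - name: Verify lock file is current
        run: |
          # npm ci fails if package-lock.json is out of sync with package.json
          npm ci --ignore-scripts
          # Belt and braces: fail if anything modified the committed lock file
          git diff --exit-code package-lock.json

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build with pinned base image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/checkout-service:${{ github.sha }}

# HARDENED: Digest-pinned base image, npm ci enforces lock file
FROM node:20-slim@sha256:a1b2c3d4e5f6...
COPY package.json package-lock.json ./
RUN npm ci --ignore-scripts
COPY . .
CMD ["node", "server.js"]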

The Gate

The gate for reproducibility is the lock file verification step. If package-lock.json is out of sync with package.json, npm ci fails and the build stops. The developer must run npm install locally, commit the updated lock file, and push again. This is annoying the first time and invisible after that.

For the base image, the gate is the digest pin in the Dockerfile. If the digest does not match any available image, the build fails. The team updates the digest on a scheduled cadence (weekly or after security patches), not on every build.
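Neither gate proves the property itself. One way to spot-check it is to build the same commit twice and compare hashes of the outputs. The sketch below assumes each build has already been exported to a file (how that export happens is left open) and that either sha256sum or shasum is on the PATH. The names file_sha256 and same_artifact are hypothetical.

```shell
#!/bin/sh
# Print the SHA-256 hex digest of a file.
# Assumes sha256sum (GNU coreutils) or shasum (macOS, Perl) is installed.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Exit 0 when both files hash to the same digest, non-zero otherwise.
same_artifact() {
  [ "$(file_sha256 "$1")" = "$(file_sha256 "$2")" ]
}
```

A mismatch does not say which of the three causes is responsible, only that the build is not yet reproducible.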

The Recovery

When a compromised dependency slips past the lock file (because the lock file itself was updated to include it), the recovery is to revert the lock file change, rebuild from the last known good lock file, and redeploy. Chapter 4 covers how to detect compromised dependencies before they enter the lock file using SBOM generation and dependency scanning.

Observability: Pipelines That Tell You What Happened

The Failure

The checkout service pipeline takes 18 minutes. Six weeks ago it took 9 minutes. Nobody noticed because nobody is watching pipeline duration as a metric. The CI provider dashboard shows individual run times, but there is no trend line, no alert threshold, and no breakdown by stage.

The cause turns out to be a Docker layer cache miss. A refactoring moved the COPY . . instruction above the RUN npm ci instruction, invalidating the dependency cache on every build. Nine minutes of unnecessary npm ci runs on every push for six weeks. The cost in developer wait time and CI runner minutes is real and invisible without observability.
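This ordering mistake is easy to lint for. The sketch below is a hypothetical check, not part of the pipeline described here. It succeeds when a Dockerfile copies the whole build context before it installs dependencies, which is the pattern that defeats the layer cache. It only recognizes the literal forms COPY . . and RUN npm ci or RUN npm install.

```shell
#!/bin/sh
# Exit 0 if "COPY . ." appears before the first npm install step,
# non-zero otherwise. Hypothetical helper for illustration.
copies_context_before_install() {
  awk '
    # Remember the line number of the first whole-context copy.
    toupper($1) == "COPY" && $2 == "." && $3 == "." && !copy { copy = NR }
    # Remember the line number of the first dependency install.
    toupper($1) == "RUN" && /npm (ci|install)/ && !install { install = NR }
    END { exit !(copy && install && copy < install) }
  ' "$1"
}
```

Run against the hardened Dockerfile from the reproducibility section, the function returns non-zero, because COPY . . comes after RUN npm ci there.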

The Mechanism

Pipeline observability is structured metadata emitted at every stage. In a GitHub Actions pipeline, two mechanisms cover most needs:

GITHUB_STEP_SUMMARY writes Markdown to the workflow run summary page. Use it for human-readable build reports: image tags, test counts, scan results, deployment targets.

Custom metrics exported to Prometheus (or Datadog, or whatever the team uses) via a push gateway or API call. Use these for trend analysis: build duration by stage, image size over time, test count by suite, cache hit rate.

The first is free and immediate. The second requires a metrics endpoint but enables dashboards and alerting.

The Implementation

# HARDENED: Build metadata emission at every stage
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write # Required to push to ghcr.io with GITHUB_TOKEN
    outputs:
      image-tag: ${{ github.sha }}
      build-duration: ${{ steps.duration.outputs.seconds }}
    steps:
      - uses: actions/checkout@v4
      - name: Record start time
        id: start
        run: echo "time=$(date +%s)" >> "$GITHUB_OUTPUT"

      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/checkout-service:${{ github.sha }}

      - name: Calculate duration
        id: duration
        if: always() # Record the duration of failed builds too
        run: |
          start=${{ steps.start.outputs.time }}
          end=$(date +%s)
          echo "seconds=$((end - start))" >> "$GITHUB_OUTPUT"

      - name: Emit build summary
        run: |
          echo "## Checkout Service Build" >> $GITHUB_STEP_SUMMARY
          echo "| Metric | Value |" >> $GITHUB_STEP_SUMMARY
          echo "|--------|-------|" >> $GITHUB_STEP_SUMMARY
          echo "| Duration | ${{ steps.duration.outputs.seconds }}s |" >> $GITHUB_STEP_SUMMARY
          echo "| Image | ghcr.io/acme/checkout-service:${{ github.sha }} |" >> $GITHUB_STEP_SUMMARY
          echo "| Commit | ${{ github.sha }} |" >> $GITHUB_STEP_SUMMARY
          echo "| Triggered by | ${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY

      - name: Push metrics to Prometheus
        if: always()
        run: |
          cat <<EOF | curl --data-binary @- \
            ${{ secrets.PROMETHEUS_PUSHGATEWAY }}/metrics/job/ci/service/checkout
          ci_build_duration_seconds{branch="${{ github.ref_name }}"} ${{ steps.duration.outputs.seconds }}
          ci_build_status{branch="${{ github.ref_name }}"} ${{ job.status == 'success' && '1' || '0' }}
          EOF

The Gate

Observability does not gate the pipeline. It enables gating in other stages by providing the data. When the Locust performance gate in Chapter 17 blocks a deployment, the observability layer records the failure reason, the metric values, and the threshold that was exceeded. Without observability, the gate fires but nobody knows why.

The Recovery

When a pipeline slows down, observability provides the diagnostic path: which stage increased, by how much, and when the increase started. The recovery is fixing the root cause (in the Docker layer cache example, moving COPY . . back below RUN npm ci). Without observability, the recovery is “the pipeline is slow, I do not know why, I will ignore it.”
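The "which stage increased, and by how much" question needs very little tooling once per-stage durations are recorded. A minimal sketch, assuming two plain-text files with one "stage seconds" pair per line (a format invented here for illustration) and a hypothetical helper named flag_slow_stages:

```shell
#!/bin/sh
# Usage: flag_slow_stages BASELINE CURRENT [RATIO]
# Prints every stage whose current duration exceeds RATIO times its
# baseline duration. RATIO defaults to 1.5 (an arbitrary choice).
flag_slow_stages() {
  awk -v ratio="${3:-1.5}" '
    # First file: remember the baseline duration of each stage.
    NR == FNR { base[$1] = $2; next }
    # Second file: report stages that grew past the allowed ratio.
    ($1 in base) && base[$1] > 0 && $2 > base[$1] * ratio {
      printf "%s: %ss -> %ss\n", $1, base[$1], $2
    }
  ' "$1" "$2"
}
```

Fed a baseline from six weeks ago and the durations from today's run, the output points at the stage that regressed instead of leaving the team with a single slower total.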

Gateability: Hard Stops, Not Polite Warnings

The Failure

The payments service has a security scanning step. It runs Trivy against the built image. Last month, Trivy found a critical CVE in the base image’s OpenSSL library. The scan emitted a warning in the logs. The pipeline continued. The image was pushed. ArgoCD synced it to staging. The QA team did not check scan results because scan results were not in their workflow. The image was promoted to production. The CVE sat in production for 11 days until a scheduled audit caught it.

The scan ran. The vulnerability was found. The pipeline did not stop.

The Mechanism

A gate is a job dependency declared with needs: in GitHub Actions. If the security scan job fails, every downstream job that depends on it is skipped. The image may already sit in the registry under its commit SHA, but it is never deployed. The infra repo is never updated. There is no path from “critical CVE found” to “image in production.”

The failure mode is continue-on-error: true. Teams add this when a scan is noisy (too many false positives) or slow (adds 4 minutes to the pipeline). Both are real problems. The solution is tuning the scan thresholds and optimizing the scan, not disabling the gate.
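Because the failure mode is a single line of YAML, it can be caught by a textual check. The sketch below is one hedged way to do that; assert_no_soft_gates is a hypothetical name. It matches on text alone, so it cannot tell a deliberate, justified use from an accidental one, and a team adopting it would probably want an allow-list.

```shell
#!/bin/sh
# Usage: assert_no_soft_gates WORKFLOW_DIR
# Prints offending lines and exits 1 if any file under the directory
# sets continue-on-error: true; exits 0 otherwise.
assert_no_soft_gates() {
  if matches=$(grep -rnE '^[[:space:]]*(-[[:space:]]*)?continue-on-error:[[:space:]]*true' "$1"); then
    printf '%s\n' "$matches"
    return 1
  fi
  return 0
}
```

Run against .github/workflows in a required job, it turns "someone quietly disabled the gate" into a failed check on the pull request that did it.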

The Implementation

# FRAGILE: Security scan that warns but does not block
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t acme/payments-service:${{ github.sha }} .
      - run: docker push acme/payments-service:${{ github.sha }}

  scan:
    runs-on: ubuntu-latest
    needs: [build]
    continue-on-error: true # This line makes the gate worthless
    steps:
      - name: Trivy scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: acme/payments-service:${{ github.sha }}
          severity: CRITICAL,HIGH

  deploy:
    runs-on: ubuntu-latest
    needs: [build, scan] # Proceeds even when scan fails
    steps:
      - run: echo "Deploying to staging..."

# HARDENED: Security scan as a hard gate
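jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write # Required to push to ghcr.io with GITHUB_TOKEN
    outputs:
      image-tag: ${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/payments-service:${{ github.sha }}

  scan:
    runs-on: ubuntu-latest
    needs: [build]
    permissions:
      packages: read # Needed to pull the image when the package is private
    # No continue-on-error. Failure stops the pipeline.
    steps:
      - name: Trivy scan
        # @master is a mutable ref; pin to a release tag or commit SHA in real use
        uses: aquasecurity/trivy-action@master
        env:
          # Registry credentials for Trivy; only needed for private images
          TRIVY_USERNAME: ${{ github.actor }}
          TRIVY_PASSWORD: ${{ secrets.GITHUB_TOKEN }}
        with:
          image-ref: ghcr.io/acme/payments-service:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1
          format: table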
  deploy:
    runs-on: ubuntu-latest
    needs: [build, scan] # Only runs if scan succeeds
    steps:
      - run: echo "Promoting to staging..."

The Gate

The gate itself is the exit-code: 1 parameter in the Trivy action combined with the absence of continue-on-error. When Trivy finds a CRITICAL or HIGH vulnerability, the step exits with code 1, the job fails, and the deploy job is skipped because its needs dependency was not satisfied.

The Recovery

When a legitimate vulnerability blocks the pipeline, the developer has two options: update the base image to one that patches the CVE (preferred), or add the CVE to an ignore list with a documented justification and an expiration date (acceptable for low-risk findings while waiting for an upstream fix). Chapter 16 covers threshold tuning and false positive management.
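The "documented justification and an expiration date" rule can itself be enforced. The sketch below assumes an ignore file with one CVE ID per line, optionally followed by exp:YYYY-MM-DD, with # comment lines carrying the justification. That layout is modeled on Trivy's .trivyignore file and should be treated as an assumption to confirm against the Trivy version in use. The name entries_without_expiry is hypothetical, and the CVE IDs in any example input are placeholders.

```shell
#!/bin/sh
# Usage: entries_without_expiry IGNORE_FILE
# Prints the ID of every ignore entry that has no exp:YYYY-MM-DD suffix.
entries_without_expiry() {
  awk '
    # Skip comment lines and blank lines.
    /^[ \t]*(#|$)/ { next }
    # Report entries that lack an expiry date.
    !/[ \t]exp:[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { print $1 }
  ' "$1"
}
```

A pipeline step that fails when this function prints anything keeps permanent, undated exceptions out of the ignore list.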

Recoverability: The 3am Revert

The Failure

Saturday, 3am. The checkout service deployed a new version that introduced a deadlock in the inventory reservation call. The error rate climbs from 0.1% to 12% over 15 minutes. The on-call engineer is paged. They open their laptop. They need to roll back.

Without a recovery path in the pipeline, rollback means: find the last good commit, check out the branch, push a revert commit, wait for the CI pipeline to build a new image (12 minutes), wait for tests to pass (4 minutes), wait for the infra repo update (2 minutes), wait for ArgoCD to sync (3 minutes). Total: 21 minutes if everything goes right. At 3am, with a foggy head and rising error rates, “everything goes right” is optimistic.

The Mechanism

GitOps provides an immediate recovery path. The infra repo contains the image tag for every service in every environment. Rolling back the checkout service in production means reverting the commit that updated the image tag. ArgoCD detects the change and syncs the previous image. No CI pipeline runs. No tests. No build. The revert takes 30 seconds. With a webhook configured, the sync takes about 60 seconds; on the default poll interval, ArgoCD can take up to 3 minutes to notice the change.

Argo Rollouts provides an automated recovery path. During a canary deployment, Argo Rollouts monitors error rates and latency. If the canary exceeds the failure threshold, Rollouts aborts the rollout and reverts to the previous revision. The on-call engineer wakes up to a notification that says “canary rolled back due to error rate 12% exceeding threshold 5%” instead of a page that says “checkout is down.”
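The abort decision reduces to comparing an observed rate against a limit. The sketch below illustrates only that arithmetic, using the 12% and 5% figures from the notification above. Argo Rollouts expresses the check declaratively in its analysis configuration rather than in a script, and exceeds_threshold is a hypothetical name.

```shell
#!/bin/sh
# Usage: exceeds_threshold ERRORS TOTAL LIMIT_PERCENT
# Exit 0 when ERRORS / TOTAL, as a percentage, is above LIMIT_PERCENT.
# Exit non-zero otherwise, including when TOTAL is zero.
exceeds_threshold() {
  awk -v e="$1" -v t="$2" -v limit="$3" \
    'BEGIN { exit !(t > 0 && (e / t) * 100 > limit) }'
}

# Example: 12 failed requests out of 100 against a 5% limit.
if exceeds_threshold 12 100 5; then
  echo "abort canary"
fi
```

The zero-traffic case is treated as "do not abort" here; a real analysis would need an explicit policy for canaries that receive no requests.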

The Implementation

# HARDENED: Infra repo commit structure that enables instant rollback
# File: ecommerce-infra/overlays/production/checkout/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production

resources:
  - ../../../base/checkout

images:
  - name: ghcr.io/acme/checkout-service
    newTag: "a1b2c3d" # Abbreviated here; must match the tag CI pushed (the full commit SHA). Revertible via git revert

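Rolling back is a revert and a push:

# In the ecommerce-infra repo
# Revert the most recent commit that touched the production checkout overlay
git revert --no-edit "$(git log -1 --format=%H -- overlays/production/checkout)"
git push
# ArgoCD detects the change within 3 minutes (default poll interval)
# or immediately if a webhook is configured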

The Gate

Recoverability is not a gate in the traditional sense. It is a property of the delivery architecture. The gate that enables recoverability is the separation between CI (app repo) and CD (infra repo). If the pipeline directly applied manifests to the cluster with kubectl apply, rollback would require re-running the pipeline with the old image. With GitOps, rollback is a git operation.

The Recovery

The recovery is the property itself. When something goes wrong in production:

  1. Revert the last commit on the infra repo’s production overlay.
  2. Push. ArgoCD syncs the revert.
  3. Investigate the root cause using the build metadata from the observability layer.
  4. Fix, rebuild, re-test, and re-promote through the normal pipeline.

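Steps 1 and 2 take under 2 minutes when ArgoCD is notified by webhook; on the default 3-minute poll interval, allow a few minutes more. Steps 3 and 4 happen after production is stable. The on-call engineer’s first job is recovery, not diagnosis. GitOps makes that possible.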