ship it and sleep

Job Dependencies and the Critical Path

Chapter 5 of 66

The Failure

The inventory service pipeline has seven jobs: one build job and six check jobs. The team parallelized aggressively after reading about pipeline optimization, so each check runs in its own job: unit tests, integration tests, API contract tests, linting, type checking, and security scanning. Every check job depends only on the build job. Maximum parallelism.

The pipeline takes 9 minutes. The build takes 3 minutes. Each parallel job runs for between 20 seconds (type checking) and 4 minutes (integration tests). But each job also takes 45 seconds to provision a runner and download the image artifact. Six parallel jobs times 45 seconds of overhead is 4.5 minutes of runner time spent on setup alone.

The linting job takes 30 seconds to run and 45 seconds to start. It would have been faster as a step in the build job.
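
The numbers above can be laid out as a short worked calculation. It uses only the figures quoted in this section and assumes the 45-second startup applies equally to each of the six parallel jobs. The last line is a lower bound on wall-clock time; the text does not itemize the gap between it and the observed 9 minutes.

```latex
\begin{align*}
\text{setup overhead} &= 6 \times 45\,\text{s} = 270\,\text{s} = 4.5\,\text{min of runner time} \\
\text{lint as its own job} &= 45\,\text{s} + 30\,\text{s} = 75\,\text{s} \quad (60\%\ \text{of it startup}) \\
\text{lint as a build step} &= 30\,\text{s} \\
\text{critical path} &\ge \underbrace{3\,\text{min}}_{\text{build}} + \underbrace{45\,\text{s}}_{\text{startup}} + \underbrace{4\,\text{min}}_{\text{integration tests}} = 7\,\text{min}\ 45\,\text{s}
\end{align*}
```

The critical path runs through the integration tests, so no amount of splitting the shorter jobs shortens the pipeline. It only adds startup cost.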

The Mechanism

On GitHub-hosted runners, every job starts on a fresh virtual machine. The runner must be provisioned (queued, assigned, booted), the repository must be checked out, and any artifacts from upstream jobs must be downloaded. This overhead is typically 30-90 seconds depending on runner availability and artifact size.

Splitting work into separate jobs is valuable when:

  • The tasks can run in parallel and their combined duration exceeds the overhead
  • The tasks need different runner types (e.g., Linux vs macOS for cross-platform testing)
  • The tasks should gate independently (a scan failure should not block test results from being visible)

Splitting is counterproductive when:

  • The task duration is less than the runner provisioning overhead
  • The tasks share expensive setup (database migrations, dependency installation) that would need to be repeated in each job

The decision rule: if the task takes less than 2 minutes and does not need a different runner or independent gating, keep it as a step in an existing job.
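
As a contrast to the short tasks the rule keeps inline, here is a minimal sketch of a split that does earn its overhead because each variant needs a different runner type. The job name `cross-platform-test` and the `test:unit` script are assumptions for illustration, and the sketch assumes a committed `package-lock.json` so that `npm ci` works.

```yaml
# Sketch only: one job definition, two runner types.
# Each matrix variant pays the provisioning overhead, which is justified
# here because the work cannot share a single runner.
name: cross-platform-sketch
on: [push]

jobs:
  cross-platform-test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:unit
```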

The Implementation

# FRAGILE: Over-parallelized pipeline with excessive runner overhead
name: ci
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/inventory-service:${{ github.sha }}

  lint:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint # 30 seconds of work, 45 seconds of startup

  typecheck:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run typecheck # 20 seconds of work, 45 seconds of startup

  unit-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:unit # 90 seconds

  integration-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d --wait
      - run: npm run test:integration # 4 minutes

  contract-test:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:contract # 2 minutes

  scan:
    needs: [build]
    runs-on: ubuntu-latest
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/acme/inventory-service:${{ github.sha }}
          exit-code: 1

# HARDENED: Right-sized parallelism, fast checks in build job
name: ci
on: [push, pull_request]

env:
  IMAGE: ghcr.io/acme/inventory-service

jobs:
  build-and-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
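      - uses: actions/checkout@v4

      # Install dependencies so the lint and typecheck scripts can run
      # (assumes a committed package-lock.json)
      - name: Install dependencies
        run: npm ci

      # Fast checks run before the build, fail fast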
      - name: Lint
        run: npm run lint

      - name: Type check
        run: npm run typecheck

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ env.IMAGE }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  test:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    strategy:
      fail-fast: false
      matrix:
        suite: [unit, integration, contract]
    steps:
      - uses: actions/checkout@v4

      - name: Start dependencies
        if: matrix.suite == 'integration'
        run: docker compose -f docker-compose.test.yml up -d --wait

      - name: Run ${{ matrix.suite }} tests
        run: |
          docker run --rm \
            ${{ matrix.suite == 'integration' && '--network=host' || '' }} \
            ${{ env.IMAGE }}@${{ needs.build-and-check.outputs.image-digest }} \
            ./run-${{ matrix.suite }}-tests.sh

      - name: Stop dependencies
        if: matrix.suite == 'integration' && always()
        run: docker compose -f docker-compose.test.yml down

  scan:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    steps:
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE }}@${{ needs.build-and-check.outputs.image-digest }}
          exit-code: 1
          severity: CRITICAL,HIGH

  promote:
    runs-on: ubuntu-latest
    needs: [test, scan]
    if: github.ref == 'refs/heads/main'
    steps:
      - run: echo "All gates passed. Ready for infra repo update."

The restructured pipeline merges lint and typecheck into the build job (they take 50 seconds combined and share the same checkout). The three test suites run as a matrix strategy, which provisions three runners but shares the job definition. The scan runs in parallel with all tests.

The Gate

The promote job depends on both test (all matrix variants) and scan. A matrix strategy with fail-fast: false ensures all test suites run to completion even if one fails. This means the developer sees all failures at once instead of fixing one, re-running, and discovering the next.

When fail-fast is true (the default), the first matrix failure cancels every in-progress and queued job in the same matrix. Use fail-fast: false for test suites where seeing all failures is more valuable than saving runner minutes.
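
If a single pass/fail signal for the whole gate is useful (for example, as the one check a branch rule requires), a summary job is one possible approach. The sketch below is an assumption, not part of the pipeline above: the job name `gate` is hypothetical, and it relies on the `needs.<job>.result` context, which reports `failure` for a matrix job when any variant fails.

```yaml
jobs:
  # ... build-and-check, test, and scan as defined in the hardened workflow ...

  gate:
    runs-on: ubuntu-latest
    needs: [test, scan]
    # Run even when an upstream job failed, so the gate reports a result
    if: always()
    steps:
      - name: Fail unless every upstream job succeeded
        if: needs.test.result != 'success' || needs.scan.result != 'success'
        run: |
          echo "test: ${{ needs.test.result }}, scan: ${{ needs.scan.result }}"
          exit 1
```

This costs one extra runner startup, so it only makes sense when something downstream consumes that single result.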

The Recovery

When a specific matrix variant fails consistently (e.g., integration tests are flaky), the temptation is to add continue-on-error: true to that variant. Do not. Fix the flaky test. If it cannot be fixed immediately, move it to a separate job that promote does not depend on, and give that job a name that states its status, such as integration-test-flaky. The flaky job emits a warning but does not block promotion. The team tracks the flaky test and fixes it on a defined timeline. Chapter 18 covers flaky test detection and management.
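
A minimal sketch of that quarantine pattern follows, under stated assumptions: `run-flaky-tests.sh` is a hypothetical script holding only the quarantined tests, and the job reuses the `IMAGE` variable and `image-digest` output from the hardened workflow. The job still goes red when the tests fail; it is non-blocking only because promote does not list it in `needs`.

```yaml
jobs:
  # ... build-and-check, test, and scan as defined in the hardened workflow ...

  integration-test-flaky:
    runs-on: ubuntu-latest
    needs: [build-and-check]
    steps:
      - uses: actions/checkout@v4

      - name: Start dependencies
        run: docker compose -f docker-compose.test.yml up -d --wait

      - name: Run quarantined tests
        run: |
          docker run --rm --network=host \
            ${{ env.IMAGE }}@${{ needs.build-and-check.outputs.image-digest }} \
            ./run-flaky-tests.sh

      # Workflow command that adds a warning annotation to the run summary
      - name: Surface the failure as a warning
        if: failure()
        run: |
          echo "::warning title=Quarantined test failed::Tracked for repair, does not block promotion"

  promote:
    runs-on: ubuntu-latest
    needs: [test, scan] # integration-test-flaky is deliberately absent
    if: github.ref == 'refs/heads/main'
    steps:
      - run: echo "All gates passed. Ready for infra repo update."
```

Unlike continue-on-error: true, this keeps the failure visible as a failed job rather than reporting a green check.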