Knative Serving Configuration, Scale-to-Zero, and Cold Start Budget

The Failure

The team deployed the report generation service on Knative with default settings. The service ran in a Java 21 Spring Boot container that took 12 seconds to start. With the default scale-to-zero settings (a 30-second scale-to-zero-grace-period and no pod retention period), the last pod was terminated shortly after traffic stopped. A vendor would request a report, wait 15 seconds for the cold start, and then request another report 45 seconds later, triggering another cold start. The service scaled to zero between every single request.

The fix: raise scale-to-zero-pod-retention-period so that it covers the expected gap between requests, and optimize the container for faster startup.

The Mechanism

Autoscaler Parameters

Parameter	Default	Description
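`scale-to-zero-grace-period`	30s	Global: upper bound on how long the last pod is kept, once the autoscaler decides to scale to zero, while traffic is rerouted through the activator
`scale-to-zero-pod-retention-period`	0s	Global or per-revision: minimum time to keep the last pod alive after the autoscaler decides to scale to zero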
`target`	100	Concurrent requests per pod before scaling up
`target-utilization-percentage`	70%	Scale up when pod reaches this % of target
`min-scale`	0	Minimum replicas (0 enables scale-to-zero)
`max-scale`	0 (unlimited)	Maximum replicas
`initial-scale`	1	Replicas on first deployment
`scale-down-delay`	0s	Delay before scaling down after load decreases
`metric`	concurrency	Metric type: concurrency or rps
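
To make the interaction between `target` and `target-utilization-percentage` concrete, here is a simplified sketch of the concurrency-based scaling arithmetic. It is not the autoscaler's actual implementation: it assumes the desired pod count is the observed total concurrency divided by the effective per-pod target (target × utilization), clamped to min-scale and max-scale. The real autoscaler also averages metrics over a window and has a panic mode, both of which are ignored here. The function name `desired_pods` and its parameters are hypothetical.

```python
import math


def desired_pods(total_concurrency: float,
                 target: float = 100.0,
                 utilization_pct: float = 70.0,
                 min_scale: int = 0,
                 max_scale: int = 0) -> int:
    """Simplified replica estimate for the concurrency metric.

    max_scale == 0 means "no upper limit", mirroring the table above.
    """
    if target <= 0 or not 0 < utilization_pct <= 100:
        raise ValueError("target must be > 0 and utilization_pct in (0, 100]")
    # Pods are asked to run at a fraction of the target, leaving headroom.
    effective_target = target * utilization_pct / 100.0
    pods = math.ceil(total_concurrency / effective_target)
    pods = max(pods, min_scale)
    if max_scale > 0:
        pods = min(pods, max_scale)
    return pods


if __name__ == "__main__":
    # With the defaults, each pod is sized for 100 * 70% = 70 concurrent requests.
    for load in (0, 140, 141):
        print(load, "concurrent requests ->", desired_pods(load), "pods")
    # A max-scale of 5 caps the result regardless of load.
    print(desired_pods(500, max_scale=5))
```

Under these assumptions, lowering `target` or the utilization percentage makes the service scale out earlier, at the cost of running more pods for the same load.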

Container Startup Optimization

The cold start budget has four components:

Cold Start = Image Pull + Container Init + App Startup + Readiness Probe

Each can be optimized independently:

Component	Optimization
Image pull	Pre-pull images (DaemonSet), use small base images, image caching
Container init	Avoid init containers, minimize volume mounts
App startup	AOT compilation, lazy loading, fast frameworks (Quarkus, Go)
Readiness probe	Short `initialDelaySeconds`, fast health endpoint
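
The cold start formula above can be tracked as a small data structure so that each component is measured separately and the largest one is optimized first. This is a minimal sketch: the class name `ColdStart`, its field names, and the sample timings are illustrative assumptions. Only the 12-second app startup and the 15-second total come from the incident described earlier; the split of the remaining 3 seconds is invented for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ColdStart:
    """Per-component cold start timings, in seconds."""
    image_pull: float
    container_init: float
    app_startup: float
    readiness_probe: float

    def total(self) -> float:
        """Cold Start = Image Pull + Container Init + App Startup + Readiness Probe."""
        return sum(asdict(self).values())

    def slowest(self) -> str:
        """Name of the component that contributes the most time."""
        parts = asdict(self)
        return max(parts, key=parts.get)


if __name__ == "__main__":
    # Illustrative numbers only; replace them with measured values.
    sample = ColdStart(image_pull=2.0, container_init=0.5,
                       app_startup=12.0, readiness_probe=0.5)
    print(f"total: {sample.total():.1f}s, optimize first: {sample.slowest()}")
```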

The Implementation

Service Profile Configurations

# Profile: Low-traffic API (product import, report generation)
# HARDENED: Optimized for infrequent traffic with acceptable cold start
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: product-import
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "5"
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "15m"
        autoscaling.knative.dev/scale-down-delay: "5m"
    spec:
      containerConcurrency: 10
      timeoutSeconds: 300
      containers:
        - image: ghcr.io/acme/product-import:abc123

# Profile: Scheduled batch job (nightly reports)
# HARDENED: Scale-to-zero quickly, accept cold start
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: report-generator
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/target: "1"
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "2m"
    spec:
      containerConcurrency: 1
      timeoutSeconds: 600
      containers:
        - image: ghcr.io/acme/report-generator:abc123

Measuring Cold Start

# Measure cold start: ensure service is at zero, then time first request
# Step 1: Verify scale-to-zero
kubectl get pods -n production -l serving.knative.dev/service=product-import
# No pods should be listed

# Step 2: Time the first request
time curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
  https://product-import.production.example.com/health

# Step 3: Check pod startup events
kubectl get events -n production --sort-by='.lastTimestamp' \
  --field-selector reason=Started | tail -5

Multi-Stage Build Optimization

# HARDENED: Multi-stage build for minimal cold start
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /server .

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /server /server
EXPOSE 8080
ENTRYPOINT ["/server"]

Revision Management

# List revisions for a Knative Service
kubectl get revisions -n production -l serving.knative.dev/service=product-import

# Pin traffic to a specific revision (rollback)
kubectl patch ksvc product-import -n production --type merge -p '
spec:
  traffic:
    - revisionName: product-import-00005
      percent: 100
'

# Split traffic between revisions (canary)
kubectl patch ksvc product-import -n production --type merge -p '
spec:
  traffic:
    - revisionName: product-import-00005
      percent: 90
    - revisionName: product-import-00006
      percent: 10
'
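
Because Knative rejects a traffic block whose percentages do not add up to 100, it can help to build and validate the patch before handing it to kubectl. The following is a minimal sketch under that assumption; the helper name `traffic_patch` is hypothetical. It emits the same structure as the patches above, as JSON, which `kubectl patch --type merge -p` also accepts.

```python
import json


def traffic_patch(split: dict[str, int]) -> str:
    """Return a JSON merge patch for spec.traffic.

    split maps revision name -> percent of traffic.
    """
    if not split:
        raise ValueError("at least one revision is required")
    if any(pct < 0 or pct > 100 for pct in split.values()):
        raise ValueError("each percent must be between 0 and 100")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")
    traffic = [{"revisionName": name, "percent": pct}
               for name, pct in split.items()]
    return json.dumps({"spec": {"traffic": traffic}})


if __name__ == "__main__":
    # Same 90/10 canary split as the kubectl example above.
    print(traffic_patch({"product-import-00005": 90,
                         "product-import-00006": 10}))
```

The printed string can be passed as the `-p` argument of the `kubectl patch ksvc` commands shown above.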

The Gate

Cold start budget is the gate. Define a maximum acceptable cold start time for each service profile. If the cold start still exceeds the budget after container optimization, the service should not scale to zero: set min-scale: 1.

Service Profile	Cold Start Budget	Action if Exceeded
Background processor	30s	Acceptable, no change
Internal API	5s	Optimize container or min-scale: 1
User-facing API	2s	min-scale: 1 or use standard Deployment
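
The gate can be expressed as a lookup against the table above, for example as a check in a deployment pipeline. This is only a sketch: the budgets and actions are copied from the table, while the function name `gate`, the lower-cased profile keys, and the wording of the "within budget" result are assumptions made for illustration.

```python
# Budgets (seconds) and fallback actions, copied from the table above.
BUDGETS = {
    "background processor": (30.0, "acceptable, no change"),
    "internal api": (5.0, "optimize container or set min-scale: 1"),
    "user-facing api": (2.0, "set min-scale: 1 or use a standard Deployment"),
}


def gate(profile: str, measured_cold_start_s: float) -> str:
    """Return the recommended action for a measured cold start time."""
    key = profile.lower()
    if key not in BUDGETS:
        raise KeyError(f"unknown service profile: {profile!r}")
    budget, action_if_exceeded = BUDGETS[key]
    if measured_cold_start_s <= budget:
        return "within budget: scale-to-zero allowed"
    return action_if_exceeded


if __name__ == "__main__":
    # The 15-second cold start from the incident, checked against each profile.
    for name in BUDGETS:
        print(f"{name}: {gate(name, 15.0)}")
```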

The Recovery

Service keeps scaling to zero between requests: Increase scale-to-zero-pod-retention-period. Set it to 2-3x the expected gap between requests.
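
The 2-3x rule can be turned into a small calculation over observed gaps between requests. This sketch assumes the "expected gap" is the median of the observed gaps, which is a choice made here and not something the rule prescribes. The function name `retention_period` is hypothetical, and the result is formatted as a seconds-based duration string (for example "135s") on the assumption that the annotation accepts Go-style durations such as the "15m" and "2m" values used earlier.

```python
import math
import statistics


def retention_period(gaps_seconds: list[float], factor: float = 3.0) -> str:
    """Suggest a scale-to-zero-pod-retention-period value.

    gaps_seconds: observed idle gaps between consecutive requests.
    factor: safety multiplier; the text above suggests 2 to 3.
    """
    if not gaps_seconds:
        raise ValueError("need at least one observed gap")
    if factor < 1:
        raise ValueError("factor below 1 would not cover the expected gap")
    expected_gap = statistics.median(gaps_seconds)
    # Round up so the pod is never retired slightly too early.
    return f"{math.ceil(expected_gap * factor)}s"


if __name__ == "__main__":
    # The 45-second gap from the incident, with the upper end of the 2-3x rule.
    print(retention_period([45.0]))
```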

Cold start exceeds budget after optimization: Set min-scale: 1. The service keeps one warm pod at all times. You lose scale-to-zero but gain consistent latency.

Old revisions consume resources: Knative keeps old revisions. Limit how many are retained through the garbage-collection settings in the Knative global config (the config-gc ConfigMap, for example max-non-active-revisions). Alternatively, clean up with kubectl delete revision.