Job and Scheduling Solutions

Solution: Exercise 3 — Parallel Job with Completions

This exercise requires a Job that runs 6 total completions with 3 Pods executing in parallel.

Step 1: Write the Job Manifest

cat > parallel-job.yaml << 'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-processor
spec:
  completions: 6
  parallelism: 3
  backoffLimit: 4
  template:
    metadata:
      labels:
        job: batch-processor
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c"]
          args:
            - |
              echo "Processing batch item on $(hostname) at $(date)"
              sleep 5
              echo "Done"
EOF

Key fields and their effects:

completions: 6: The Job requires 6 successful Pod completions before it is considered done.
parallelism: 3: Kubernetes runs up to 3 Pods simultaneously. When one completes, a new one starts until all 6 completions are reached.
backoffLimit: 4: If a Pod fails, the Job controller creates a replacement. It retries up to 4 times; once that limit is used up, the Job is marked as failed with the reason BackoffLimitExceeded.
restartPolicy: Never: Failed containers are not restarted within the same Pod. Instead, the Job controller creates a new Pod (up to the backoff limit). Jobs accept only Never or OnFailure for this field; Always is not allowed.
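
As a rough sanity check on how completions and parallelism interact, the minimum number of "waves" of Pods is the ceiling of completions divided by parallelism. The sketch below is plain shell arithmetic, not anything Kubernetes provides; the function names job_waves and job_min_seconds are hypothetical, and the estimate ignores image pulls, scheduling, and Pod startup time.

```shell
# job_waves COMPLETIONS PARALLELISM
# Prints the minimum number of waves: ceil(completions / parallelism).
job_waves() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# job_min_seconds COMPLETIONS PARALLELISM SECONDS_PER_POD
# Prints a lower bound on the Job's run time, ignoring startup overhead.
job_min_seconds() {
  echo $(( $(job_waves "$1" "$2") * $3 ))
}

job_waves 6 3          # prints 2
job_min_seconds 6 3 5  # prints 10
```

For this exercise that gives 2 waves and a lower bound of 10 seconds; the real duration will be somewhat longer because each Pod also needs time to start.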

Step 2: Apply and Monitor

kubectl apply -f parallel-job.yaml

Watch the Pods as they execute:

kubectl get pods -l job=batch-processor -w

Expected progression (intermediate Pending and ContainerCreating rows omitted; Pod name suffixes and ages will differ):

NAME                    READY   STATUS      RESTARTS   AGE
batch-processor-abc12   1/1     Running     0          2s
batch-processor-def34   1/1     Running     0          2s
batch-processor-ghi56   1/1     Running     0          2s
batch-processor-abc12   0/1     Completed   0          7s
batch-processor-jkl78   1/1     Running     0          1s
batch-processor-def34   0/1     Completed   0          7s
batch-processor-mno90   1/1     Running     0          1s
batch-processor-ghi56   0/1     Completed   0          7s
batch-processor-pqr12   1/1     Running     0          1s

Three Pods start simultaneously. As each completes, a new one launches until all 6 have run.

Step 3: Verify Completion

kubectl get job batch-processor

Expected output:

NAME              COMPLETIONS   DURATION   AGE
batch-processor   6/6           14s        20s

The 6/6 confirms all completions succeeded. Check individual Pod logs:

kubectl logs -l job=batch-processor --prefix

Each Pod should show its processing message and "Done". The --prefix flag prepends the source Pod and container name to each log line, making it clear which Pod produced which output.
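
For a scripted check, the same verification can be done without reading the table by eye. The following is a minimal sketch that wraps two standard kubectl calls in helper functions; the names job_succeeded and wait_for_job are hypothetical, and the 120s default timeout is an arbitrary assumption.

```shell
# job_succeeded JOB_NAME
# Prints the number of successfully finished Pods recorded in the Job status.
job_succeeded() {
  kubectl get job "$1" -o jsonpath='{.status.succeeded}'
}

# wait_for_job JOB_NAME [TIMEOUT]
# Blocks until the Job reports the Complete condition or the timeout expires.
wait_for_job() {
  kubectl wait --for=condition=complete "job/$1" --timeout="${2:-120s}"
}

# Usage against a live cluster:
#   wait_for_job batch-processor 60s && job_succeeded batch-processor
```

If the Job finished cleanly, job_succeeded should print the same count that appears on the left of the COMPLETIONS column.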

Troubleshooting

Symptom	Cause	Fix
Completions stuck at `3/6`	Pod failures exhausted `backoffLimit`	Inspect the failed Pods: `kubectl logs <name>` for container output, `kubectl describe pod <name>` for events
Only 1 Pod runs at a time	`parallelism` not set or set to 1	Verify `parallelism: 3` is in the Job spec
Job shows `BackoffLimitExceeded`	Command exits non-zero	Fix the container command and recreate the Job
Pods remain after Job completes	Expected behavior	Jobs retain completed Pods for log inspection; use `ttlSecondsAfterFinished` to auto-clean
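
When several Pods have failed, collecting their logs one at a time is tedious. The sketch below assumes the job=batch-processor label from the manifest in this exercise and uses the status.phase field selector to find failed Pods; the function names failed_pods and failed_pod_logs are hypothetical.

```shell
# failed_pods JOB_LABEL_VALUE
# Prints the Failed Pods carrying the label job=<value>, one per line (pod/<name>).
failed_pods() {
  kubectl get pods -l "job=$1" --field-selector=status.phase=Failed -o name
}

# failed_pod_logs JOB_LABEL_VALUE
# Prints a header and the container logs for each failed Pod.
failed_pod_logs() {
  for pod in $(failed_pods "$1"); do
    echo "== $pod"
    kubectl logs "$pod"
  done
}

# Usage against a live cluster:
#   failed_pod_logs batch-processor
```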

Cleanup

kubectl delete job batch-processor

Solution: Exercise 4 — nodeSelector for Pod Placement

This exercise requires labeling a worker node and creating a Pod that schedules exclusively on that node using nodeSelector.

Step 1: List Nodes and Their Roles

kubectl get nodes --show-labels

In a multi-node Kind cluster with the default name, the nodes are typically kind-control-plane, kind-worker, and kind-worker2. Identify a worker node to target.

Step 2: Label the Target Node

kubectl label node kind-worker disk=ssd

Expected output:

node/kind-worker labeled

Verify the label:

kubectl get node kind-worker --show-labels | grep disk=ssd

Step 3: Create the Pod with nodeSelector

cat > nodeselector-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ssd-pod
  labels:
    app: ssd-workload
spec:
  nodeSelector:
    disk: ssd
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
EOF

The nodeSelector field tells the scheduler to place this Pod only on nodes that have the label disk=ssd. If no node matches, the Pod stays Pending until a matching node appears; the scheduler never relaxes a nodeSelector constraint.
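
Before applying a manifest, it can help to preview which nodes would satisfy the selector. The sketch below uses kubectl's standard label selector flag; the function names nodes_matching and selector_has_match are hypothetical. A label match is necessary but not sufficient, since taints or resource limits can still keep the Pod off a matching node.

```shell
# nodes_matching LABEL_SELECTOR
# Prints the nodes whose labels satisfy the selector, one per line (node/<name>).
nodes_matching() {
  kubectl get nodes -l "$1" -o name
}

# selector_has_match LABEL_SELECTOR
# Succeeds (exit 0) if at least one node matches; fails otherwise.
selector_has_match() {
  [ -n "$(nodes_matching "$1")" ]
}

# Usage against a live cluster:
#   selector_has_match disk=ssd || echo "no node carries disk=ssd"
```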

kubectl apply -f nodeselector-pod.yaml

Step 4: Verify Placement

kubectl get pod ssd-pod -o wide

Expected output:

NAME      READY   STATUS    RESTARTS   AGE   IP           NODE          NOMINATED NODE   READINESS GATES
ssd-pod   1/1     Running   0          5s    10.244.1.3   kind-worker   <none>           <none>

The NODE column must show kind-worker — the node you labeled. If the Pod landed on a different node, the nodeSelector was not applied correctly.

Step 5: Test the Constraint

Create a second Pod targeting a label that no node has:

kubectl run no-match --image=nginx:1.25 --dry-run=client -o yaml | \
  kubectl patch --local -f - -p '{"spec":{"nodeSelector":{"disk":"nvme"}}}' --type merge -o yaml | \
  kubectl apply -f -

Alternatively, write the manifest directly (use one approach or the other, not both):

cat > no-match-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: no-match
spec:
  nodeSelector:
    disk: nvme
  containers:
    - name: app
      image: nginx:1.25
EOF
kubectl apply -f no-match-pod.yaml

Check the Pod status:

kubectl get pod no-match

Expected output:

NAME       READY   STATUS    RESTARTS   AGE
no-match   0/1     Pending   0          10s

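The Pod remains Pending because no node carries disk=nvme. Run kubectl describe pod no-match and look at the Events section. You should see a message similar to the following (the exact wording and per-node breakdown depend on the Kubernetes version and on any taints in the cluster):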

Warning  FailedScheduling  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.
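
To pull out only the scheduling failures instead of scanning the full describe output, the events can be filtered by field selector. This is a minimal sketch; the function name scheduling_failures is hypothetical, and it assumes the Pod lives in the current namespace.

```shell
# scheduling_failures POD_NAME
# Prints the message of each FailedScheduling event recorded for the Pod.
scheduling_failures() {
  kubectl get events \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1,reason=FailedScheduling" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}

# Usage against a live cluster:
#   scheduling_failures no-match
```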

Cleanup

kubectl delete pod ssd-pod no-match
kubectl label node kind-worker disk-

The trailing - removes the disk label from the node.

Solution: Exercise 5 — Taints and Tolerations

This exercise requires tainting a node so that no regular Pod can schedule on it, then creating a Pod with a toleration that allows it past the taint.

Step 1: Taint a Worker Node

kubectl taint nodes kind-worker2 dedicated=special:NoSchedule

Expected output:

node/kind-worker2 tainted

This taint has three parts:

Key: dedicated
Value: special
Effect: NoSchedule — Pods without a matching toleration will not be placed on this node. Existing Pods are unaffected (use NoExecute to evict running Pods).
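
The key=value:effect layout can be taken apart with ordinary shell parameter expansion, which makes the three parts easy to see. This is an illustrative sketch only; parse_taint is a hypothetical helper, not a kubectl feature. It also handles taints written without a value (key:effect).

```shell
# parse_taint 'key=value:effect'
# Prints the three parts of a taint string. Runs in a subshell so that
# the working variables do not leak into the caller's environment.
parse_taint() (
  taint=$1
  effect=${taint##*:}        # text after the last ':'
  key_value=${taint%:*}      # text before the last ':'
  key=${key_value%%=*}       # text before the first '='
  case "$key_value" in
    *=*) value=${key_value#*=} ;;  # text after the first '='
    *)   value="" ;;               # taint has no value
  esac
  echo "key=$key value=$value effect=$effect"
)

parse_taint 'dedicated=special:NoSchedule'
# prints: key=dedicated value=special effect=NoSchedule
```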

Verify the taint:

kubectl describe node kind-worker2 | grep Taints

Expected output:

Taints:             dedicated=special:NoSchedule
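
To compare taints across every node at once, kubectl's custom-columns output can list them in a single table. A minimal sketch follows; the function name node_taints and the column headings are arbitrary choices.

```shell
# node_taints
# Prints one row per node with the keys, values, and effects of its taints.
# Nodes without taints show <none> in those columns.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,KEY:.spec.taints[*].key,VALUE:.spec.taints[*].value,EFFECT:.spec.taints[*].effect'
}

# Usage against a live cluster:
#   node_taints
```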

Step 2: Prove the Taint Works

Create a Pod without any toleration:

kubectl run taint-test --image=nginx:1.25

If kind-worker2 is the only worker node, this Pod will stay Pending, because the control-plane node in a multi-node Kind cluster carries its own NoSchedule taint. If other worker nodes exist, the Pod schedules on one of them, avoiding kind-worker2:

kubectl get pod taint-test -o wide

The NODE column should not show kind-worker2.

Step 3: Create a Pod with the Matching Toleration

cat > toleration-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: special-pod
  labels:
    app: special-workload
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "special"
      effect: "NoSchedule"
  nodeSelector:
    kubernetes.io/hostname: kind-worker2
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
EOF

Two fields work together here:

tolerations: Allows the Pod to schedule on nodes with the dedicated=special:NoSchedule taint. The toleration does not force the Pod onto the tainted node — it permits it.
nodeSelector: Forces the Pod onto kind-worker2 specifically. Without this, the Pod could schedule on any node (tainted or not), since tolerations are permissive, not directive.

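With operator: "Equal", the toleration matches a taint only when the key and value are identical and the effect, if the toleration specifies one, is the same. The alternative, operator: "Exists", matches any taint with the specified key regardless of value; in that case the toleration must omit the value field. With either operator, leaving effect out of the toleration makes it match every effect for that key.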
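
The matching rules can be written out as a small predicate. The sketch below is a simplified model of how a single toleration is compared with a single taint, not the scheduler's actual code; the function name tolerates and its positional argument order are hypothetical.

```shell
# tolerates T_KEY T_OPERATOR T_VALUE T_EFFECT TAINT_KEY TAINT_VALUE TAINT_EFFECT
# Succeeds (exit 0) if the toleration (first four arguments) matches the taint.
tolerates() {
  # An empty toleration effect matches every taint effect.
  [ -z "$4" ] || [ "$4" = "$7" ] || return 1
  # An empty toleration key (only meaningful with Exists) matches every key.
  [ -z "$1" ] || [ "$1" = "$5" ] || return 1
  case "$2" in
    Exists)   return 0 ;;           # value is ignored
    Equal|"") [ "$3" = "$6" ] ;;    # Equal is the default operator
    *)        return 1 ;;
  esac
}

# The toleration from this exercise against the taint on kind-worker2:
tolerates dedicated Equal special NoSchedule dedicated special NoSchedule \
  && echo "tolerated"
# prints: tolerated
```

Changing the toleration's effect to NoExecute, or its value to anything other than special, makes the function fail, which mirrors the mismatch cases that leave a Pod Pending.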

kubectl apply -f toleration-pod.yaml

Step 4: Verify Placement

kubectl get pod special-pod -o wide

Expected output:

NAME          READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
special-pod   1/1     Running   0          4s    10.244.2.5   kind-worker2   <none>           <none>

The Pod runs on kind-worker2 because it tolerates the taint and the nodeSelector directs it there.

Step 5: Verify the Taint Still Blocks Other Pods

cat > no-toleration.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: blocked-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: kind-worker2
  containers:
    - name: app
      image: nginx:1.25
EOF
kubectl apply -f no-toleration.yaml

kubectl get pod blocked-pod

Expected output:

NAME          READY   STATUS    RESTARTS   AGE
blocked-pod   0/1     Pending   0          5s

The scheduler cannot place this Pod on kind-worker2 (tainted, no toleration) and cannot place it on any other node (nodeSelector restricts to kind-worker2). The Pod is stuck in Pending.

Troubleshooting

Symptom	Cause	Fix
Tolerating Pod still `Pending`	Toleration key/value/effect mismatch	Compare `kubectl describe node` taint with Pod tolerations field-by-field
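Pod schedules on tainted node without toleration	Taint was not applied, or the Pod was already running when the `NoSchedule` taint was added	Re-run `kubectl describe node` and verify the Taints line; `NoSchedule` does not evict existing Pods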
`NoExecute` evicts running Pods	Wrong effect chosen	Use `NoSchedule` to block new Pods only; `NoExecute` evicts existing ones
Toleration uses `Exists` but Pod still blocked	Effect mismatch	`Exists` ignores value but still requires a matching effect; omit `effect` from the toleration to match all effects

Cleanup

kubectl delete pod taint-test special-pod blocked-pod
kubectl taint nodes kind-worker2 dedicated=special:NoSchedule-

The trailing - removes the taint. Verify:

kubectl describe node kind-worker2 | grep Taints

Expected output:

Taints:             <none>