Design a Task Scheduler

Distributed task schedulers power background job processing, cron-like periodic tasks, delayed notifications, and retry logic across every large-scale system. This chapter designs a scheduler that handles 100M+ tasks per day with sub-second accuracy, at-least-once execution guarantees, and horizontal scaling — the kind of system behind AWS Step Functions, Airflow, or a custom job queue.

Requirements

Functional Requirements

One-Time Tasks: Schedule a task to execute at a specific future timestamp.
Recurring Tasks: Support cron-style schedules (e.g., “every Monday at 9 AM”).
Delayed Tasks: Execute a task after a specified delay (e.g., “30 minutes from now”).
Task Priorities: Higher-priority tasks execute before lower-priority tasks when resources are contended.
Task Cancellation: Cancel a scheduled task before execution.
Task Status: Query the current state of any task (pending, running, completed, failed, cancelled).
Retry on Failure: Automatically retry failed tasks with configurable policies.
Dead Letter Queue: Route tasks that exceed max retries to a DLQ for manual inspection.

Non-Functional Requirements

Metric	Target
Execution guarantee	At-least-once
Scheduling accuracy	< 1 second from target time
Throughput	100M+ tasks/day (~1,200 tasks/second avg)
Peak throughput	10x average during batch submissions
Horizontal scaling	Add workers without downtime
Availability	99.99% — no single point of failure
Task payload size	Up to 64KB per task

Capacity Estimation

Task volume: 100M tasks/day = ~1,200 tasks/second average, ~12,000 tasks/second at peak.
Task storage: 100M tasks × 1KB avg (metadata + payload) = 100GB/day. With 30-day retention: 3TB.
Active tasks in memory: At any moment, ~5M tasks are pending within the next hour. At 1KB each: ~5GB fits comfortably in memory across a cluster.
Worker capacity: If each task takes 100ms average, one worker handles 10 tasks/second. For 1,200 TPS: 120 workers minimum, 1,200 workers at peak. Virtual threads reduce this — one JVM can handle thousands of concurrent tasks.

High-Level Design

  ┌──────────┐       ┌──────────────┐       ┌───────────────┐
  │  Client  │──────▶│   Task API   │──────▶│  Task Store   │
  │  (SDK)   │       │  (REST/gRPC) │       │  (PostgreSQL) │
  └──────────┘       └──────┬───────┘       └───────┬───────┘
                            │                       │
                     ┌──────▼───────┐        ┌──────▼───────┐
                     │  Scheduler   │◀───────│  Timer       │
                     │  (Leader)    │        │  (Time Wheel)│
                     └──────┬───────┘        └──────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
        ┌─────▼────┐  ┌────▼─────┐  ┌────▼─────┐
        │ Worker 1 │  │ Worker 2 │  │ Worker N │
        │(Virtual  │  │(Virtual  │  │(Virtual  │
        │ Threads) │  │ Threads) │  │ Threads) │
        └─────┬────┘  └────┬─────┘  └────┬─────┘
              │             │             │
              └─────────────┼─────────────┘
                            │
                     ┌──────▼───────┐
                     │ Result Store │
                     │  (PostgreSQL)│
                     └──────────────┘

Task API: Accepts schedule/cancel/query requests. Validates payloads and writes to the Task Store.
Task Store: Persistent storage for all task metadata, schedules, and state transitions.
Scheduler (Leader): The elected leader scans for due tasks and dispatches them to workers. Only one scheduler is active at a time via leader election.
Timer (Time Wheel): An in-memory data structure that triggers task dispatch with O(1) efficiency.
Worker Pool: Stateless workers that execute task payloads. Scale horizontally. Use Java 25 virtual threads for high concurrency.
Result Store: Records execution outcomes, retry counts, and error details.

Deep Dive

Task Model

The task model captures scheduling intent through a sealed interface hierarchy:

public sealed interface Schedule permits OneTime, Recurring, Delayed {}

public record OneTime(Instant executeAt) implements Schedule {}

public record Recurring(String cronExpression, Instant nextExecution) implements Schedule {}

public record Delayed(Duration delay, Instant createdAt) implements Schedule {
    public Instant executeAt() {
        return createdAt.plus(delay);
    }
}

public record Task(
    String id,
    String taskType,
    byte[] payload,
    Schedule schedule,
    Priority priority,
    TaskState state,
    int retryCount,
    int maxRetries,
    RetryPolicy retryPolicy,
    Instant createdAt,
    Instant updatedAt
) {}

public enum Priority { CRITICAL, HIGH, NORMAL, LOW }

public enum TaskState { PENDING, SCHEDULED, RUNNING, COMPLETED, FAILED, CANCELLED, DEAD_LETTERED }

Pattern matching on the Schedule type determines when a task should fire:

static Instant resolveExecutionTime(Schedule schedule) {
    return switch (schedule) {
        case OneTime ot    -> ot.executeAt();
        case Recurring r   -> r.nextExecution();
        case Delayed d     -> d.executeAt();
    };
}

Priority Dispatch: The scheduler maintains a PriorityQueue<Task> ordered by (priority, executionTime). Critical tasks preempt normal tasks even if their execution time is slightly later:

Comparator<Task> taskOrder = Comparator
    .comparing(Task::priority)
    .thenComparing(t -> resolveExecutionTime(t.schedule()));

Scheduling Algorithms

Three approaches exist, each with different performance characteristics:

1. Database Polling (Naive): A loop queries SELECT * FROM tasks WHERE execute_at <= NOW() AND state = 'PENDING' ORDER BY priority, execute_at LIMIT 100. This is O(n) per scan against the task table, creates database load, and introduces latency equal to the polling interval (typically 1–5 seconds). Acceptable for low-volume systems (< 10K tasks/day) but unacceptable at scale.

2. Priority Queue (Min-Heap): Load upcoming tasks into an in-memory min-heap keyed by execution time. Insert is O(log n), extract-min is O(log n). The scheduler sleeps until the earliest task is due, then extracts and dispatches it. This eliminates polling but requires all pending tasks to fit in memory. For 5M near-term tasks at 1KB each, that is ~5GB — feasible for a dedicated scheduler node.

3. Hierarchical Time Wheel: The time wheel achieves O(1) insert and O(1) tick processing, making it the preferred choice at scale.

Time Wheel Structure: A circular buffer of slots, where each slot holds a bucket (linked list) of tasks. The wheel advances one slot per tick (e.g., every 100ms). Tasks are inserted into the slot corresponding to their execution time modulo the wheel size.

Wheel (8 slots, 1-second tick):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
└─┬─┴───┴─┬─┴───┴───┴─┬─┴───┴───┘
  │       │           │
  ▼       ▼           ▼
 [A,B]   [C]         [D,E,F]
  ▲
  │ current pointer

For tasks beyond one rotation, a hierarchical wheel uses multiple levels — seconds, minutes, hours — similar to a clock. When the seconds wheel completes a rotation, one slot from the minutes wheel cascades down.

public class TimeWheel {
    private final int wheelSize;
    private final Duration tickDuration;
    private final List<List<Task>>[] wheels; // [seconds, minutes, hours]
    private int currentTick = 0;

    @SuppressWarnings("unchecked")
    public TimeWheel(int wheelSize, Duration tickDuration) {
        this.wheelSize = wheelSize;
        this.tickDuration = tickDuration;
        this.wheels = new List[3]; // 3 levels
        for (int level = 0; level < 3; level++) {
            wheels[level] = new ArrayList<>(wheelSize);
            for (int i = 0; i < wheelSize; i++) {
                wheels[level].add(new ArrayList<>());
            }
        }
    }

    public void addTask(Task task) {
        long ticksUntilExecution = Duration.between(
            Instant.now(), resolveExecutionTime(task.schedule())
        ).dividedBy(tickDuration);

        if (ticksUntilExecution < wheelSize) {
            int slot = (int) ((currentTick + ticksUntilExecution) % wheelSize);
            wheels[0].get(slot).add(task);
        } else if (ticksUntilExecution < wheelSize * wheelSize) {
            int slot = (int) ((ticksUntilExecution / wheelSize) % wheelSize);
            wheels[1].get(slot).add(task);
        } else {
            int slot = (int) ((ticksUntilExecution / (wheelSize * wheelSize)) % wheelSize);
            wheels[2].get(slot).add(task);
        }
    }

    public List<Task> tick() {
        List<Task> dueTasks = wheels[0].get(currentTick);
        wheels[0].set(currentTick, new ArrayList<>());

        currentTick = (currentTick + 1) % wheelSize;

        // Cascade from higher wheels on full rotation
        if (currentTick == 0) {
            cascadeWheel(1);
        }

        return dueTasks;
    }

    private void cascadeWheel(int level) {
        if (level >= wheels.length) return;
        int slot = currentTick; // simplified — production code tracks per-level pointers
        List<Task> tasks = wheels[level].get(slot);
        wheels[level].set(slot, new ArrayList<>());
        for (Task task : tasks) {
            addTask(task); // Re-insert into lower wheel
        }
    }
}

Algorithm	Insert	Dispatch	Memory	Use Case
DB Polling	O(1) write	O(n) scan	Disk	< 10K tasks/day
Min-Heap	O(log n)	O(log n)	RAM	< 1M active tasks
Time Wheel	O(1)	O(1) amortized	RAM	100M+ tasks/day

Distributed Task Execution

Consistent Hashing for Task Partitioning: Tasks are assigned to workers using consistent hashing on the task ID. Each worker owns a range of the hash ring. When a worker joins or leaves, only tasks in the affected range are redistributed — minimizing rebalancing overhead.

Worker Heartbeats: Workers send heartbeats to the scheduler every 5 seconds. If three consecutive heartbeats are missed (15 seconds), the scheduler marks the worker as dead and reassigns its tasks to neighboring workers on the hash ring.

Task Stealing: When a worker’s queue is empty and other workers are overloaded, idle workers “steal” tasks from the tail of busy workers’ queues. This balances load dynamically without centralized coordination.

Virtual Threads for Workers: Java 25 virtual threads allow one virtual thread per task without exhausting OS threads:

public class TaskWorker {
    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    private final TaskHandler handler;
    private final ResultStore resultStore;

    public void executeTask(Task task) {
        executor.submit(() -> {
            try {
                var result = handler.handle(task.taskType(), task.payload());
                resultStore.recordSuccess(task.id(), result);
            } catch (Exception e) {
                resultStore.recordFailure(task.id(), e.getMessage(), task.retryCount());
            }
        });
    }
}

A single JVM running virtual threads can handle tens of thousands of concurrent task executions, since virtual threads are scheduled on a small pool of platform threads and consume minimal memory (~1KB stack per thread vs. ~1MB for platform threads).

Retry Mechanism

Failed tasks need intelligent retry strategies to avoid overwhelming downstream systems.

public sealed interface RetryPolicy permits FixedDelay, ExponentialBackoff, ExponentialWithJitter {}

public record FixedDelay(Duration delay) implements RetryPolicy {}

public record ExponentialBackoff(Duration baseDelay, double multiplier, Duration maxDelay)
    implements RetryPolicy {}

public record ExponentialWithJitter(Duration baseDelay, double multiplier, Duration maxDelay)
    implements RetryPolicy {}

public class RetryCalculator {
    private static final ThreadLocalRandom random = ThreadLocalRandom.current();

    public static Duration calculateNextDelay(RetryPolicy policy, int attemptNumber) {
        return switch (policy) {
            case FixedDelay fd -> fd.delay();

            case ExponentialBackoff eb -> {
                long delayMs = (long) (eb.baseDelay().toMillis() *
                    Math.pow(eb.multiplier(), attemptNumber));
                yield Duration.ofMillis(Math.min(delayMs, eb.maxDelay().toMillis()));
            }

            case ExponentialWithJitter ej -> {
                long delayMs = (long) (ej.baseDelay().toMillis() *
                    Math.pow(ej.multiplier(), attemptNumber));
                long capped = Math.min(delayMs, ej.maxDelay().toMillis());
                long jittered = ThreadLocalRandom.current().nextLong(capped / 2, capped);
                yield Duration.ofMillis(jittered);
            }
        };
    }
}

Why jitter matters: Without jitter, all retries for tasks that failed at the same time fire simultaneously, causing a retry storm. Jitter spreads retries uniformly across the delay window.

Dead Letter Queue: After maxRetries attempts, the task moves to a DLQ — a separate database table or Kafka topic. Operations teams monitor the DLQ and can manually replay, modify, or discard dead-lettered tasks.

Idempotent Execution Requirement: Since at-least-once delivery means a task could execute more than once (e.g., worker crashes after execution but before acknowledgment), every task handler must be idempotent. The result store records completed task IDs; handlers check this before performing side effects.

High Availability

Leader Election: Only one scheduler instance is active at a time to avoid duplicate dispatching. Leader election uses a consensus protocol (Raft or ZAB via ZooKeeper/etcd). The elected leader runs the time wheel and dispatches tasks. Standby nodes replicate state from the task store and are ready to take over within seconds.

Active-Passive Failover: The passive scheduler continuously loads upcoming tasks from the database. On leader failure, the new leader rebuilds its time wheel from the task store. Tasks that were in-flight on the failed leader are detected via heartbeat timeout and re-dispatched.

Task Deduplication: During failover, a task might be dispatched twice — once by the old leader (before it realized it lost leadership) and once by the new leader. To prevent this:

Each dispatch carries a fencing token — a monotonically increasing epoch number from the leader election.
Workers reject dispatches with stale fencing tokens.
The result store uses task ID + attempt number as a unique key, discarding duplicate results.

Fencing Token Example: Leader A has epoch 5. It dispatches task T1. Network partition occurs — Leader B is elected with epoch 6. Leader A recovers and tries to dispatch T1 again with epoch 5. The worker checks: if (epoch < currentEpoch) reject(). Task T1 executes exactly once.

Cron Expression Parser

Recurring tasks use cron expressions. A simplified parser calculates the next execution time:

public record CronSchedule(
    int minute,    // 0-59 or -1 for wildcard
    int hour,      // 0-23 or -1 for wildcard
    int dayOfMonth,// 1-31 or -1 for wildcard
    int month,     // 1-12 or -1 for wildcard
    int dayOfWeek  // 0-6 (Sun-Sat) or -1 for wildcard
) {
    public static CronSchedule parse(String expression) {
        String[] parts = expression.split("\\s+");
        if (parts.length != 5) throw new IllegalArgumentException("Invalid cron: " + expression);
        return new CronSchedule(
            parseField(parts[0], 0, 59),
            parseField(parts[1], 0, 23),
            parseField(parts[2], 1, 31),
            parseField(parts[3], 1, 12),
            parseField(parts[4], 0, 6)
        );
    }

    private static int parseField(String field, int min, int max) {
        if ("*".equals(field)) return -1;
        int value = Integer.parseInt(field);
        if (value < min || value > max) {
            throw new IllegalArgumentException("Value " + value + " out of range [" + min + "," + max + "]");
        }
        return value;
    }

    public Instant nextExecution(Instant from) {
        ZonedDateTime zdt = from.atZone(ZoneOffset.UTC).plusMinutes(1)
            .withSecond(0).withNano(0);

        for (int i = 0; i < 525960; i++) { // Search up to 1 year
            boolean matches =
                (minute == -1 || zdt.getMinute() == minute) &&
                (hour == -1 || zdt.getHour() == hour) &&
                (dayOfMonth == -1 || zdt.getDayOfMonth() == dayOfMonth) &&
                (month == -1 || zdt.getMonthValue() == month) &&
                (dayOfWeek == -1 || zdt.getDayOfWeek().getValue() % 7 == dayOfWeek);

            if (matches) return zdt.toInstant();
            zdt = zdt.plusMinutes(1);
        }
        throw new IllegalStateException("No matching time found within one year");
    }
}

This brute-force approach iterates minute-by-minute. A production implementation would skip non-matching months, days, and hours to reach the answer in O(1) calendar arithmetic, but the linear scan illustrates the logic clearly for interview purposes.

Bottlenecks & Scaling

Bottleneck	Solution
DB polling bottleneck	Replace polling with the time wheel. The scheduler loads upcoming tasks into the wheel from the database in batches (e.g., tasks due in the next 5 minutes). The wheel triggers dispatch with O(1) ticks — no repeated queries.
Hot scheduler node	Shard by task partition. Run multiple scheduler instances, each responsible for a hash range. Each shard has its own time wheel and leader election. Tasks are routed to the correct shard by hashing the task ID.
Worker failure	Heartbeat timeout (15s) triggers task reassignment. The scheduler re-inserts unacknowledged tasks into the wheel with their retry policy applied. Workers must be stateless — all task state lives in the database.
Clock skew across nodes	Use NTP synchronization on all scheduler and worker nodes. Accept a tolerance window (e.g., ±500ms). The scheduling accuracy guarantee (< 1s) accounts for this tolerance.
Batch submission spike	Rate-limit task submissions at the API level. Buffer incoming tasks in a Kafka topic if the scheduler’s time wheel insertion rate is saturated. Backpressure signals the client SDK to slow down.
Large task payloads	Store payloads > 64KB in object storage (S3). The task record stores a reference URL instead of the payload itself. Workers fetch the payload on execution.

Interviewer Tips

Clarify execution guarantees first: “At-least-once” vs. “exactly-once” changes the entire design. At-least-once is standard for task schedulers; exactly-once requires idempotent handlers and deduplication — mention this explicitly.
Time wheel is the differentiator: Most candidates describe database polling or a priority queue. Explaining the hierarchical time wheel with O(1) insert/tick demonstrates deep knowledge.
Virtual threads are a natural fit: Mention that Java 25 virtual threads eliminate the need for complex thread pool sizing. One virtual thread per task scales to millions of concurrent tasks per JVM.
Leader election is expected: Interviewers will ask “What happens if the scheduler crashes?” Have the Raft/ZooKeeper-based failover answer ready, including fencing tokens for zombie prevention.
Draw the state machine: Sketch the task state transitions (PENDING → SCHEDULED → RUNNING → COMPLETED/FAILED) on the whiteboard. This shows you think about edge cases like double-execution and stuck tasks.
Dead letter queue shows production maturity: Proactively discussing what happens after all retries are exhausted signals real-world experience.