Design a Task Scheduler
SummaryCovers priority queue-based scheduling, time-wheel algorithm for efficient...
Covers priority queue-based scheduling, time-wheel algorithm for efficient...
Covers priority queue-based scheduling, time-wheel algorithm for efficient timer management, distributed task assignment with consistent hashing, retry strategies with exponential backoff, and leader election for scheduler high availability.
Design a Task Scheduler
Distributed task schedulers power background job processing, cron-like periodic tasks, delayed notifications, and retry logic across every large-scale system. This chapter designs a scheduler that handles 100M+ tasks per day with sub-second accuracy, at-least-once execution guarantees, and horizontal scaling — the kind of system behind AWS Step Functions, Airflow, or a custom job queue.
Requirements
Functional Requirements
- One-Time Tasks: Schedule a task to execute at a specific future timestamp.
- Recurring Tasks: Support cron-style schedules (e.g., “every Monday at 9 AM”).
- Delayed Tasks: Execute a task after a specified delay (e.g., “30 minutes from now”).
- Task Priorities: Higher-priority tasks execute before lower-priority tasks when resources are contended.
- Task Cancellation: Cancel a scheduled task before execution.
- Task Status: Query the current state of any task (pending, running, completed, failed, cancelled).
- Retry on Failure: Automatically retry failed tasks with configurable policies.
- Dead Letter Queue: Route tasks that exceed max retries to a DLQ for manual inspection.
Non-Functional Requirements
| Metric | Target |
|---|---|
| Execution guarantee | At-least-once |
| Scheduling accuracy | < 1 second from target time |
| Throughput | 100M+ tasks/day (~1,200 tasks/second avg) |
| Peak throughput | 10x average during batch submissions |
| Horizontal scaling | Add workers without downtime |
| Availability | 99.99% — no single point of failure |
| Task payload size | Up to 64KB per task |
Capacity Estimation
- Task volume: 100M tasks/day = ~1,200 tasks/second average, ~12,000 tasks/second at peak.
- Task storage: 100M tasks × 1KB avg (metadata + payload) = 100GB/day. With 30-day retention: 3TB.
- Active tasks in memory: At any moment, ~5M tasks are pending within the next hour. At 1KB each: ~5GB fits comfortably in memory across a cluster.
- Worker capacity: If each task takes 100ms average, one worker handles 10 tasks/second. For 1,200 TPS: 120 workers minimum, 1,200 workers at peak. Virtual threads reduce this — one JVM can handle thousands of concurrent tasks.
High-Level Design
┌──────────┐ ┌──────────────┐ ┌───────────────┐
│ Client │──────▶│ Task API │──────▶│ Task Store │
│ (SDK) │ │ (REST/gRPC) │ │ (PostgreSQL) │
└──────────┘ └──────┬───────┘ └───────┬───────┘
│ │
┌──────▼───────┐ ┌──────▼───────┐
│ Scheduler │◀───────│ Timer │
│ (Leader) │ │ (Time Wheel)│
└──────┬───────┘ └──────────────┘
│
┌─────────────┼─────────────┐
│ │ │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│(Virtual │ │(Virtual │ │(Virtual │
│ Threads) │ │ Threads) │ │ Threads) │
└─────┬────┘ └────┬─────┘ └────┬─────┘
│ │ │
└─────────────┼─────────────┘
│
┌──────▼───────┐
│ Result Store │
│ (PostgreSQL)│
└──────────────┘
- Task API: Accepts schedule/cancel/query requests. Validates payloads and writes to the Task Store.
- Task Store: Persistent storage for all task metadata, schedules, and state transitions.
- Scheduler (Leader): The elected leader scans for due tasks and dispatches them to workers. Only one scheduler is active at a time via leader election.
- Timer (Time Wheel): An in-memory data structure that triggers task dispatch with O(1) efficiency.
- Worker Pool: Stateless workers that execute task payloads. Scale horizontally. Use Java 25 virtual threads for high concurrency.
- Result Store: Records execution outcomes, retry counts, and error details.
Deep Dive
Task Model
The task model captures scheduling intent through a sealed interface hierarchy:
public sealed interface Schedule permits OneTime, Recurring, Delayed {}
public record OneTime(Instant executeAt) implements Schedule {}
public record Recurring(String cronExpression, Instant nextExecution) implements Schedule {}
public record Delayed(Duration delay, Instant createdAt) implements Schedule {
public Instant executeAt() {
return createdAt.plus(delay);
}
}
public record Task(
String id,
String taskType,
byte[] payload,
Schedule schedule,
Priority priority,
TaskState state,
int retryCount,
int maxRetries,
RetryPolicy retryPolicy,
Instant createdAt,
Instant updatedAt
) {}
public enum Priority { CRITICAL, HIGH, NORMAL, LOW }
public enum TaskState { PENDING, SCHEDULED, RUNNING, COMPLETED, FAILED, CANCELLED, DEAD_LETTERED }
Pattern matching on the Schedule type determines when a task should fire:
static Instant resolveExecutionTime(Schedule schedule) {
return switch (schedule) {
case OneTime ot -> ot.executeAt();
case Recurring r -> r.nextExecution();
case Delayed d -> d.executeAt();
};
}
Priority Dispatch: The scheduler maintains a PriorityQueue<Task> ordered by (priority, executionTime). Critical tasks preempt normal tasks even if their execution time is slightly later:
Comparator<Task> taskOrder = Comparator
.comparing(Task::priority)
.thenComparing(t -> resolveExecutionTime(t.schedule()));
Scheduling Algorithms
Three approaches exist, each with different performance characteristics:
1. Database Polling (Naive): A loop queries SELECT * FROM tasks WHERE execute_at <= NOW() AND state = 'PENDING' ORDER BY priority, execute_at LIMIT 100. This is O(n) per scan against the task table, creates database load, and introduces latency equal to the polling interval (typically 1–5 seconds). Acceptable for low-volume systems (< 10K tasks/day) but unacceptable at scale.
2. Priority Queue (Min-Heap): Load upcoming tasks into an in-memory min-heap keyed by execution time. Insert is O(log n), extract-min is O(log n). The scheduler sleeps until the earliest task is due, then extracts and dispatches it. This eliminates polling but requires all pending tasks to fit in memory. For 5M near-term tasks at 1KB each, that is ~5GB — feasible for a dedicated scheduler node.
3. Hierarchical Time Wheel: The time wheel achieves O(1) insert and O(1) tick processing, making it the preferred choice at scale.
Time Wheel Structure: A circular buffer of slots, where each slot holds a bucket (linked list) of tasks. The wheel advances one slot per tick (e.g., every 100ms). Tasks are inserted into the slot corresponding to their execution time modulo the wheel size.
Wheel (8 slots, 1-second tick):
┌───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │
└─┬─┴───┴─┬─┴───┴───┴─┬─┴───┴───┘
│ │ │
▼ ▼ ▼
[A,B] [C] [D,E,F]
▲
│ current pointer
For tasks beyond one rotation, a hierarchical wheel uses multiple levels — seconds, minutes, hours — similar to a clock. When the seconds wheel completes a rotation, one slot from the minutes wheel cascades down.
public class TimeWheel {
private final int wheelSize;
private final Duration tickDuration;
private final List<List<Task>>[] wheels; // [seconds, minutes, hours]
private int currentTick = 0;
@SuppressWarnings("unchecked")
public TimeWheel(int wheelSize, Duration tickDuration) {
this.wheelSize = wheelSize;
this.tickDuration = tickDuration;
this.wheels = new List[3]; // 3 levels
for (int level = 0; level < 3; level++) {
wheels[level] = new ArrayList<>(wheelSize);
for (int i = 0; i < wheelSize; i++) {
wheels[level].add(new ArrayList<>());
}
}
}
public void addTask(Task task) {
long ticksUntilExecution = Duration.between(
Instant.now(), resolveExecutionTime(task.schedule())
).dividedBy(tickDuration);
if (ticksUntilExecution < wheelSize) {
int slot = (int) ((currentTick + ticksUntilExecution) % wheelSize);
wheels[0].get(slot).add(task);
} else if (ticksUntilExecution < wheelSize * wheelSize) {
int slot = (int) ((ticksUntilExecution / wheelSize) % wheelSize);
wheels[1].get(slot).add(task);
} else {
int slot = (int) ((ticksUntilExecution / (wheelSize * wheelSize)) % wheelSize);
wheels[2].get(slot).add(task);
}
}
public List<Task> tick() {
List<Task> dueTasks = wheels[0].get(currentTick);
wheels[0].set(currentTick, new ArrayList<>());
currentTick = (currentTick + 1) % wheelSize;
// Cascade from higher wheels on full rotation
if (currentTick == 0) {
cascadeWheel(1);
}
return dueTasks;
}
private void cascadeWheel(int level) {
if (level >= wheels.length) return;
int slot = currentTick; // simplified — production code tracks per-level pointers
List<Task> tasks = wheels[level].get(slot);
wheels[level].set(slot, new ArrayList<>());
for (Task task : tasks) {
addTask(task); // Re-insert into lower wheel
}
}
}
| Algorithm | Insert | Dispatch | Memory | Use Case |
|---|---|---|---|---|
| DB Polling | O(1) write | O(n) scan | Disk | < 10K tasks/day |
| Min-Heap | O(log n) | O(log n) | RAM | < 1M active tasks |
| Time Wheel | O(1) | O(1) amortized | RAM | 100M+ tasks/day |
Distributed Task Execution
Consistent Hashing for Task Partitioning: Tasks are assigned to workers using consistent hashing on the task ID. Each worker owns a range of the hash ring. When a worker joins or leaves, only tasks in the affected range are redistributed — minimizing rebalancing overhead.
Worker Heartbeats: Workers send heartbeats to the scheduler every 5 seconds. If three consecutive heartbeats are missed (15 seconds), the scheduler marks the worker as dead and reassigns its tasks to neighboring workers on the hash ring.
Task Stealing: When a worker’s queue is empty and other workers are overloaded, idle workers “steal” tasks from the tail of busy workers’ queues. This balances load dynamically without centralized coordination.
Virtual Threads for Workers: Java 25 virtual threads allow one virtual thread per task without exhausting OS threads:
public class TaskWorker {
private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
private final TaskHandler handler;
private final ResultStore resultStore;
public void executeTask(Task task) {
executor.submit(() -> {
try {
var result = handler.handle(task.taskType(), task.payload());
resultStore.recordSuccess(task.id(), result);
} catch (Exception e) {
resultStore.recordFailure(task.id(), e.getMessage(), task.retryCount());
}
});
}
}
A single JVM running virtual threads can handle tens of thousands of concurrent task executions, since virtual threads are scheduled on a small pool of platform threads and consume minimal memory (~1KB stack per thread vs. ~1MB for platform threads).
Retry Mechanism
Failed tasks need intelligent retry strategies to avoid overwhelming downstream systems.
public sealed interface RetryPolicy permits FixedDelay, ExponentialBackoff, ExponentialWithJitter {}
public record FixedDelay(Duration delay) implements RetryPolicy {}
public record ExponentialBackoff(Duration baseDelay, double multiplier, Duration maxDelay)
implements RetryPolicy {}
public record ExponentialWithJitter(Duration baseDelay, double multiplier, Duration maxDelay)
implements RetryPolicy {}
public class RetryCalculator {
private static final ThreadLocalRandom random = ThreadLocalRandom.current();
public static Duration calculateNextDelay(RetryPolicy policy, int attemptNumber) {
return switch (policy) {
case FixedDelay fd -> fd.delay();
case ExponentialBackoff eb -> {
long delayMs = (long) (eb.baseDelay().toMillis() *
Math.pow(eb.multiplier(), attemptNumber));
yield Duration.ofMillis(Math.min(delayMs, eb.maxDelay().toMillis()));
}
case ExponentialWithJitter ej -> {
long delayMs = (long) (ej.baseDelay().toMillis() *
Math.pow(ej.multiplier(), attemptNumber));
long capped = Math.min(delayMs, ej.maxDelay().toMillis());
long jittered = ThreadLocalRandom.current().nextLong(capped / 2, capped);
yield Duration.ofMillis(jittered);
}
};
}
}
Why jitter matters: Without jitter, all retries for tasks that failed at the same time fire simultaneously, causing a retry storm. Jitter spreads retries uniformly across the delay window.
Dead Letter Queue: After maxRetries attempts, the task moves to a DLQ — a separate database table or Kafka topic. Operations teams monitor the DLQ and can manually replay, modify, or discard dead-lettered tasks.
Idempotent Execution Requirement: Since at-least-once delivery means a task could execute more than once (e.g., worker crashes after execution but before acknowledgment), every task handler must be idempotent. The result store records completed task IDs; handlers check this before performing side effects.
High Availability
Leader Election: Only one scheduler instance is active at a time to avoid duplicate dispatching. Leader election uses a consensus protocol (Raft or ZAB via ZooKeeper/etcd). The elected leader runs the time wheel and dispatches tasks. Standby nodes replicate state from the task store and are ready to take over within seconds.
Active-Passive Failover: The passive scheduler continuously loads upcoming tasks from the database. On leader failure, the new leader rebuilds its time wheel from the task store. Tasks that were in-flight on the failed leader are detected via heartbeat timeout and re-dispatched.
Task Deduplication: During failover, a task might be dispatched twice — once by the old leader (before it realized it lost leadership) and once by the new leader. To prevent this:
- Each dispatch carries a fencing token — a monotonically increasing epoch number from the leader election.
- Workers reject dispatches with stale fencing tokens.
- The result store uses task ID + attempt number as a unique key, discarding duplicate results.
Fencing Token Example: Leader A has epoch 5. It dispatches task T1. Network partition occurs — Leader B is elected with epoch 6. Leader A recovers and tries to dispatch T1 again with epoch 5. The worker checks: if (epoch < currentEpoch) reject(). Task T1 executes exactly once.
Cron Expression Parser
Recurring tasks use cron expressions. A simplified parser calculates the next execution time:
public record CronSchedule(
int minute, // 0-59 or -1 for wildcard
int hour, // 0-23 or -1 for wildcard
int dayOfMonth,// 1-31 or -1 for wildcard
int month, // 1-12 or -1 for wildcard
int dayOfWeek // 0-6 (Sun-Sat) or -1 for wildcard
) {
public static CronSchedule parse(String expression) {
String[] parts = expression.split("\\s+");
if (parts.length != 5) throw new IllegalArgumentException("Invalid cron: " + expression);
return new CronSchedule(
parseField(parts[0], 0, 59),
parseField(parts[1], 0, 23),
parseField(parts[2], 1, 31),
parseField(parts[3], 1, 12),
parseField(parts[4], 0, 6)
);
}
private static int parseField(String field, int min, int max) {
if ("*".equals(field)) return -1;
int value = Integer.parseInt(field);
if (value < min || value > max) {
throw new IllegalArgumentException("Value " + value + " out of range [" + min + "," + max + "]");
}
return value;
}
public Instant nextExecution(Instant from) {
ZonedDateTime zdt = from.atZone(ZoneOffset.UTC).plusMinutes(1)
.withSecond(0).withNano(0);
for (int i = 0; i < 525960; i++) { // Search up to 1 year
boolean matches =
(minute == -1 || zdt.getMinute() == minute) &&
(hour == -1 || zdt.getHour() == hour) &&
(dayOfMonth == -1 || zdt.getDayOfMonth() == dayOfMonth) &&
(month == -1 || zdt.getMonthValue() == month) &&
(dayOfWeek == -1 || zdt.getDayOfWeek().getValue() % 7 == dayOfWeek);
if (matches) return zdt.toInstant();
zdt = zdt.plusMinutes(1);
}
throw new IllegalStateException("No matching time found within one year");
}
}
This brute-force approach iterates minute-by-minute. A production implementation would skip non-matching months, days, and hours to reach the answer in O(1) calendar arithmetic, but the linear scan illustrates the logic clearly for interview purposes.
Bottlenecks & Scaling
| Bottleneck | Solution |
|---|---|
| DB polling bottleneck | Replace polling with the time wheel. The scheduler loads upcoming tasks into the wheel from the database in batches (e.g., tasks due in the next 5 minutes). The wheel triggers dispatch with O(1) ticks — no repeated queries. |
| Hot scheduler node | Shard by task partition. Run multiple scheduler instances, each responsible for a hash range. Each shard has its own time wheel and leader election. Tasks are routed to the correct shard by hashing the task ID. |
| Worker failure | Heartbeat timeout (15s) triggers task reassignment. The scheduler re-inserts unacknowledged tasks into the wheel with their retry policy applied. Workers must be stateless — all task state lives in the database. |
| Clock skew across nodes | Use NTP synchronization on all scheduler and worker nodes. Accept a tolerance window (e.g., ±500ms). The scheduling accuracy guarantee (< 1s) accounts for this tolerance. |
| Batch submission spike | Rate-limit task submissions at the API level. Buffer incoming tasks in a Kafka topic if the scheduler’s time wheel insertion rate is saturated. Backpressure signals the client SDK to slow down. |
| Large task payloads | Store payloads > 64KB in object storage (S3). The task record stores a reference URL instead of the payload itself. Workers fetch the payload on execution. |
Interviewer Tips
- Clarify execution guarantees first: “At-least-once” vs. “exactly-once” changes the entire design. At-least-once is standard for task schedulers; exactly-once requires idempotent handlers and deduplication — mention this explicitly.
- Time wheel is the differentiator: Most candidates describe database polling or a priority queue. Explaining the hierarchical time wheel with O(1) insert/tick demonstrates deep knowledge.
- Virtual threads are a natural fit: Mention that Java 25 virtual threads eliminate the need for complex thread pool sizing. One virtual thread per task scales to millions of concurrent tasks per JVM.
- Leader election is expected: Interviewers will ask “What happens if the scheduler crashes?” Have the Raft/ZooKeeper-based failover answer ready, including fencing tokens for zombie prevention.
- Draw the state machine: Sketch the task state transitions (PENDING → SCHEDULED → RUNNING → COMPLETED/FAILED) on the whiteboard. This shows you think about edge cases like double-execution and stuck tasks.
- Dead letter queue shows production maturity: Proactively discussing what happens after all retries are exhausted signals real-world experience.